Some content never served from S3 cache #240

aldenstpage · 2020-06-23T16:42:33Z

I use imageproxy to thumbnail some large images served from remote sources in my search engine, and it generally works beautifully for my use case. I have imageproxy connected to an S3 bucket for caching. However, some (not all!) images are never being served from the S3 cache. A hard page refresh always results in the content for these particular URLs being loaded from the origin. This is a problem because the images in question are huge and can take 10 or 15 seconds to download and thumbnail.

Here are the startup parameters I'm using:
imageproxy -verbose -addr localhost:8082 -cache s3://us-east-1/cc-thumbnail-cache-prod

Logs excerpt:

Jun 23 16:41:06 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:06 fetching remote URL: http://ids.si.edu/ids/deliveryService?id=CHSDM-1C3F9215FDF82-000001
Jun 23 16:41:06 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:06 fetching remote URL: http://ids.si.edu/ids/deliveryService?id=CHSDM-807BFDC2F6642-000001
Jun 23 16:41:06 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:06 request: {Method:GET URL:http://ids.si.edu/ids/deliveryService?id=CHSDM-807BFDC2F6642-000001#600x600 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[image/*] User-Agent:[willnorris/imageproxy]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:ids.si.edu Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0000380b0} (served from cache: true)
Jun 23 16:41:06 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:06 request: {Method:GET URL:http://ids.si.edu/ids/deliveryService?id=CHSDM-1C3F9215FDF82-000001#600x600 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[image/*] User-Agent:[willnorris/imageproxy]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:ids.si.edu Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0000380b0} (served from cache: false)
Jun 23 16:41:06 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:06 fetching remote URL: https://ids.si.edu/ids/deliveryService/id/ark:/65665/m35763bab4824f4b1c93ab7c2a2a420206
Jun 23 16:41:07 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:07 fetching remote URL: https://ids.si.edu/ids/deliveryService/id/ark:/65665/m32b3d2178e0e24504966ad6a4fbc65ad5
Jun 23 16:41:08 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:08 request: {Method:GET URL:https://ids.si.edu/ids/deliveryService/id/ark:/65665/m35763bab4824f4b1c93ab7c2a2a420206#600x600 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[image/*] User-Agent:[willnorris/imageproxy]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:ids.si.edu Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0000380b0} (served from cache: false)
Jun 23 16:41:08 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:08 fetching remote URL: https://ids.si.edu/ids/deliveryService/id/ark:/65665/m36af88a4127b04ff6aaeb139db527dc18
Jun 23 16:41:09 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:09 request: {Method:GET URL:https://ids.si.edu/ids/deliveryService/id/ark:/65665/m38c5e91cf342c4f198feb8504e9e69f2d#600x600 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[image/*] User-Agent:[willnorris/imageproxy]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:ids.si.edu Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0000380b0} (served from cache: false)
Jun 23 16:41:09 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:09 request: {Method:GET URL:https://ids.si.edu/ids/deliveryService/id/ark:/65665/m3847bf2c096eb43caa7e23cfc4026d1d0#600x600 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[image/*] User-Agent:[willnorris/imageproxy]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:ids.si.edu Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0000380b0} (served from cache: false)
Jun 23 16:41:09 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:09 fetching remote URL: https://collections.nmnh.si.edu/media/?irn=13639590
Jun 23 16:41:09 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:09 request: {Method:GET URL:https://collections.nmnh.si.edu/media/?irn=13639590#600x600 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[image/*] User-Agent:[willnorris/imageproxy]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:collections.nmnh.si.edu Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0000380b0} (served from cache: false)
Jun 23 16:41:09 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:09 fetching remote URL: https://ids.si.edu/ids/deliveryService/id/ark:/65665/m35e65f0dc197e4d1ab80dbd651e627e15
Jun 23 16:41:09 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:09 fetching remote URL: https://ids.si.edu/ids/deliveryService/id/ark:/65665/m3e710734addbe4ae09e6c5f6b06d4da75
Jun 23 16:41:10 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:10 request: {Method:GET URL:https://ids.si.edu/ids/deliveryService/id/ark:/65665/m32b3d2178e0e24504966ad6a4fbc65ad5#600x600 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[image/*] User-Agent:[willnorris/imageproxy]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:ids.si.edu Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0000380b0} (served from cache: false)
Jun 23 16:41:11 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:11 fetching remote URL: https://collections.nmnh.si.edu/media/?irn=13333397
Jun 23 16:41:12 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:12 request: {Method:GET URL:https://ids.si.edu/ids/deliveryService/id/ark:/65665/m36af88a4127b04ff6aaeb139db527dc18#600x600 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[image/*] User-Agent:[willnorris/imageproxy]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:ids.si.edu Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0000380b0} (served from cache: false)
Jun 23 16:41:12 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:12 fetching remote URL: https://collections.nmnh.si.edu/media/?irn=13074389
Jun 23 16:41:12 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:12 request: {Method:GET URL:https://collections.nmnh.si.edu/media/?irn=13333397#600x600 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Accept:[image/*] User-Agent:[willnorris/imageproxy]] Body:<nil> GetBody:<nil> ContentLength:0 TransferEncoding:[] Close:false Host:collections.nmnh.si.edu Form:map[] PostForm:map[] MultipartForm:<nil> Trailer:map[] RemoteAddr: RequestURI: TLS:<nil> Cancel:<nil> Response:<nil> ctx:0xc0000380b0} (served from cache: false)
Jun 23 16:41:13 ip-172-30-1-130.ec2.internal imageproxy[12575]: 2020/06/23 16:41:13 fetching remote URL: https://collections.nmnh.si.edu/media/?irn=13832774

Is there something about these URLs in particular that makes them uncacheable? Are the characters and query strings the culprit? I'm in the unusual position of proxying content from other origins outside of my control (with permission), so I unfortunately have no way to change what URL scheme is used upstream.

Here is an image that is never cached on refresh: https://collections.nmnh.si.edu/media/?irn=13498184#600x600


willnorris · 2020-06-23T17:33:24Z

My first guess would be that the origin server is not providing any caching headers, and indeed that looks to be the case:

% curl -I -A "" "http://ids.si.edu/ids/deliveryService?id=CHSDM-1C3F9215FDF82-000001"
HTTP/1.1 200 OK
Date: Tue, 23 Jun 2020 17:30:27 GMT
Content-Type: image/jpeg
Content-Length: 728486
X-SI-SIRIS-RUID: 101189091592933427688544
Access-Control-Allow-Origin: *
Set-Cookie: ROUTEID=.ids2-16A; Path=/ids
Set-Cookie: TS01c2db25=01a3504f4c2dd7516139014fa2162241e98d511aa48028038f8c8ba5fe086614cafc546b91f3f8650749691c850b0b726d82303f91; Path=/; Domain=.si.edu; HTTPOnly
Set-Cookie: TS015d2ef6=01a3504f4c8f1a2c9c544b662cd8eda079829932578028038f8c8ba5fe086614cafc546b914eb1d8e54d2bd0955a7467e7d2f258786cbb8de929ffc839a5827f265f91d612; path=/ids; HTTPonly

There are no ETag, Expires, Last-Modified, etc. headers, so imageproxy can't cache it. #208 will solve this for you; I'll try to find time to get it reviewed and merged this week.
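To make the check concrete, here is a minimal Go sketch that tests a set of response headers for the usual caching headers. It is a simplified illustration, not imageproxy's actual code: the function name `hasCachingHeaders` and the exact list of headers it looks at are assumptions made for this example.

```go
package main

import (
	"fmt"
	"net/http"
)

// hasCachingHeaders reports whether h carries any header that an HTTP cache
// could use to judge freshness or to revalidate a stored response.
// The header list is an assumption for illustration, not imageproxy's logic.
func hasCachingHeaders(h http.Header) bool {
	for _, name := range []string{"Cache-Control", "Expires", "ETag", "Last-Modified"} {
		if h.Get(name) != "" {
			return true
		}
	}
	return false
}

func main() {
	// Headers taken from the curl output above (Set-Cookie lines omitted).
	origin := http.Header{}
	origin.Set("Content-Type", "image/jpeg")
	origin.Set("Content-Length", "728486")
	origin.Set("Access-Control-Allow-Origin", "*")
	fmt.Println(hasCachingHeaders(origin)) // prints "false"

	// A hypothetical response that does include a validator.
	cacheable := http.Header{}
	cacheable.Set("ETag", `"abc123"`)
	fmt.Println(hasCachingHeaders(cacheable)) // prints "true"
}
```

Run against the headers from the curl output, the check comes back false, which matches the "served from cache: false" lines in the logs.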

willnorris · 2020-06-23T17:40:27Z

Also, that's really cool to see that @creativecommons is using imageproxy! Are you able/willing to share more about how you're using it? (Just for my own curiosity)

aldenstpage · 2020-06-23T18:05:12Z

That's great news, thanks for the speedy response! I'll keep an eye on that PR.

We're trying to index all of the Creative Commons works online and make them accessible through a single search portal. We've indexed about half a billion images so far, and plan to start indexing other types of content in the near future. Right now we're focusing on getting data to improve the search rankings of culturally significant works, which are currently drowned out by low-quality content posted on social media.

As you can imagine, it is hard to thumbnail half a billion images that you don't have a copy of, so instead of crawling petabytes of images upfront, we use imageproxy to generate thumbnails on the results pages. When you perform a search and load thumbnails from our native URLs, our backend is talking to imageproxy to generate them.
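As a rough sketch of what the backend does when it talks to imageproxy, the snippet below builds a request URL in the `{base}/{options}/{remote URL}` form. The helper name `thumbnailURL` is hypothetical, the base address is taken from the `-addr` flag in my startup parameters, and `600x600` is the size that shows up in the logs; our real backend code is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// thumbnailURL joins an imageproxy base address, an options string such as
// "600x600", and the remote image URL into a single request URL.
// This helper is hypothetical and only illustrates the URL layout.
func thumbnailURL(base, options, remote string) string {
	return strings.TrimRight(base, "/") + "/" + options + "/" + remote
}

func main() {
	u := thumbnailURL(
		"http://localhost:8082",
		"600x600",
		"https://collections.nmnh.si.edu/media/?irn=13498184",
	)
	fmt.Println(u)
	// prints "http://localhost:8082/600x600/https://collections.nmnh.si.edu/media/?irn=13498184"
}
```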

We also use imageproxy to display the high-resolution version of images if they aren't TLS secured, which is common since we are showing content from museums that aren't able to invest a lot of resources into hosting their images properly.

It's safe to say that imageproxy is a key piece of our infrastructure!

aldenstpage changed the title ~~Content not being served from S3 cache~~ Some content never served from S3 cache Jun 23, 2020

aldenstpage mentioned this issue Jun 23, 2020

Verify that thumbnail S3 caching is working properly cc-archive/cccatalog-api#530

Closed
