Some content never served from S3 cache #240
Comments
My first guess would be that the origin server is not providing any caching headers, and indeed that looks to be the case:
There are no ETag, Expires, Last-Modified, or similar headers in the response, so imageproxy can't cache it. #208 will solve this for you; I'll try to find time to get it reviewed and merged this week.
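For anyone who wants to run the same check against another origin, here is a minimal Python sketch. It is not imageproxy's own logic, only an illustration of the idea: it reports which cache-related headers a response carries. The names `CACHE_HEADERS`, `cache_headers_present`, and `fetch_headers` are made up for this example, and it assumes the origin answers HEAD requests.

```python
from typing import List, Mapping
import urllib.request

# Response headers an HTTP cache can use to store or revalidate a response.
CACHE_HEADERS = ("cache-control", "etag", "expires", "last-modified")


def cache_headers_present(headers: Mapping[str, str]) -> List[str]:
    """Return the cache-related header names found in `headers`.

    Matching is case-insensitive, and the result follows the order
    of CACHE_HEADERS.
    """
    names = {name.lower() for name in headers}
    return [h for h in CACHE_HEADERS if h in names]


def fetch_headers(url: str) -> Mapping[str, str]:
    """Fetch response headers with a HEAD request (assumes the origin allows HEAD)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return dict(response.headers.items())


if __name__ == "__main__":
    # URL taken from the report in this issue.
    found = cache_headers_present(
        fetch_headers("https://collections.nmnh.si.edu/media/?irn=13498184")
    )
    print(found or "no caching headers")
```

An empty list means the origin gives a cache nothing to work with. A non-empty list does not by itself guarantee the response is cacheable (for example, `Cache-Control: no-store` forbids storing it), so treat this as a first-pass check only.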
Also, that's really cool to see that @creativecommons is using imageproxy! Are you able/willing to share more about how you're using it? (Just for my own curiosity)
That's great news, thanks for the speedy response! I'll keep an eye on that PR. We're trying to index all of the Creative Commons works online and make them accessible through a single search portal. We've indexed about half a billion images so far, and plan to start indexing other types of content in the near future. Right now we're focusing on getting data to improve the search rankings of culturally significant works, which are currently drowned out by low-quality content posted on social media.
As you can imagine, it is hard to thumbnail half a billion images that you don't have a copy of, so instead of crawling petabytes of images upfront, we use imageproxy to generate thumbnails on the results pages. When you perform a search and load thumbnails from our native URLs, our backend is talking to imageproxy to generate them. We also use imageproxy to display the high-resolution versions of images that aren't served over TLS, which is common since we show content from museums that aren't able to invest a lot of resources into hosting their images properly. It's safe to say that imageproxy is a key piece of our infrastructure!
I use imageproxy to thumbnail some large images served from remote sources in my search engine, and it generally works beautifully for my use case. I have imageproxy connected to an S3 bucket for caching. However, some (not all!) images are never being served from the S3 cache. A hard page refresh always results in the content for these particular URLs being loaded from the origin. This is a problem because the images in question are huge and can take 10 or 15 seconds to download and thumbnail.
Here are the startup parameters I'm using:
imageproxy -verbose -addr localhost:8082 -cache s3://us-east-1/cc-thumbnail-cache-prod
Logs excerpt:
Is there something about these URLs in particular that makes them uncacheable? Are the characters and query strings the culprit? I'm in the unusual position of proxying content from other origins outside of my control (with permission), so I unfortunately have no way to change what URL scheme is used upstream.
Here is an image that is never cached on refresh: https://collections.nmnh.si.edu/media/?irn=13498184#600x600