
GCS, S3 timeouts are not retried #7980

Open
seizethedave opened this issue Apr 25, 2024 · 0 comments
seizethedave commented Apr 25, 2024

Describe the bug

We had an incident that appeared as intermittent network flakiness for 1-2 hours. During the affected window, the compactor's cleanup loop had a number of small object fetches to GCS that errored with "context deadline exceeded" after one minute. Example:

ts=2024-04-10T18:33:29.26007212Z caller=blocks_cleaner.go:407 level=warn component=cleaner run_id=1712773933 task=clean_up_users user=redacted msg="failed blocks cleanup and maintenance" err="read bucket index: retry failed with context deadline exceeded; last error: Get \"https://storage.googleapis.com/bucket_redacted/tenant_id%2Fbucket-index.json.gz\": context deadline exceeded" duration=1m0.001110933s

I saw the mention of "retry failed" in the error and wondered why a retry would not paper over what looks like a transient network issue. It turns out that no retry actually occurred: the standard GCS client retries certain classes of idempotent errors automatically, but when the first request consumes the entire context deadline before failing, no retry is ever attempted. This was also reported in Google's issue tracker some time ago.
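To make the failure mode concrete, here is a simplified sketch (illustrative code, not the actual GCS client implementation) of a retry loop that shares one parent context across attempts: when the first attempt stalls until the deadline, the loop finds the context already expired and never issues a second request, which matches the "retry failed with context deadline exceeded; last error: ..." shape of the log above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetch stands in for a single GCS object read; here it simply stalls
// until the context expires, mimicking the incident above.
func fetch(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// fetchWithRetries mimics a retry loop that shares one parent context
// across attempts. If the first attempt eats the whole deadline, the
// ctx.Err() check fails before a second attempt is ever made.
func fetchWithRetries(ctx context.Context, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("retry failed with %w; last error: %v", err, lastErr)
		}
		lastErr = fetch(ctx)
		if lastErr == nil {
			return nil
		}
	}
	return lastErr
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	err := fetchWithRetries(ctx, 3)
	fmt.Println(err)                                      // retry failed with context deadline exceeded; last error: context deadline exceeded
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
}
```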

A tenant's cleanup loop will fail if any of its GCS operations times out. In our configuration (cleanup every 15 minutes, store-gateway only loads bucket indexes less than 1 hour old), a single tenant has a partial query outage if four of their consecutive cleanup loops hit this kind of bad luck. Tenants with more blob store operations to perform at cleanup time are more exposed.

The store-gateway looks to be a victim of the same issue. When a store-gateway -> GCS fetch stalls in the same way (say, in the Series RPC), the store-gateway waits for the caller's context to expire (by default 2 minutes in the querier). The querier's request context is then canceled, the querier cancels its fan-out requests to the other store-gateways and returns an error to the query-frontend, where a retry is finally performed. That's a lot riding on every GCS fetch not timing out.

(I am assuming in this issue that exhausting a 1-minute timeout on a 6 KiB fetch is a case of network instability and that a retry would help. Networks are unreliable.)

My cursory read of Thanos/minio's S3 interactions is that they have the same behavior: certain HTTP errors are retried, but if the failure is the caller's context being canceled, no retry will occur.

Expected behavior

In the face of transient network blips, I expect individual blob storage operations to be retried.
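One possible shape of the fix, sketched below under assumptions (the helper name retryFetch and the 10-second per-attempt timeout are illustrative, not existing Mimir code or configuration): carve a shorter per-attempt deadline out of the parent context, so a stalled connection is abandoned early and a retry still has deadline budget left to run in.

```go
package main

import (
	"context"
	"time"
)

// perAttemptTimeout is an assumed value for illustration, not a Mimir default.
const perAttemptTimeout = 10 * time.Second

// retryFetch retries doFetch until the parent context expires, giving each
// attempt its own shorter deadline so one stalled request cannot consume
// the whole budget.
func retryFetch(ctx context.Context, doFetch func(context.Context) error) error {
	var lastErr error
	for ctx.Err() == nil {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttemptTimeout)
		lastErr = doFetch(attemptCtx)
		cancel()
		if lastErr == nil {
			return nil
		}
		// Fixed one-second pause between attempts for the sketch; real code
		// would use exponential backoff with jitter.
		select {
		case <-time.After(time.Second):
		case <-ctx.Done():
		}
	}
	return lastErr
}

func main() {
	// The overall operation still gets the usual one-minute budget, but a
	// single hung attempt now gives up after perAttemptTimeout and is retried.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	_ = retryFetch(ctx, func(ctx context.Context) error {
		// Stand-in for a single small object fetch (e.g. the bucket index).
		return nil
	})
}
```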

Environment

  • Infrastructure: kubernetes
