
GCS, S3 timeouts are not retried #7980

Open
seizethedave opened this issue Apr 25, 2024 · 0 comments
seizethedave commented Apr 25, 2024

Describe the bug

We had an incident that appeared as intermittent network flakiness for 1-2 hours. During the affected window, the compactor's cleanup loop had a number of small object fetches to GCS that errored with "context deadline exceeded" after one minute. Example:

ts=2024-04-10T18:33:29.26007212Z caller=blocks_cleaner.go:407 level=warn component=cleaner run_id=1712773933 task=clean_up_users user=redacted msg="failed blocks cleanup and maintenance" err="read bucket index: retry failed with context deadline exceeded; last error: Get \"https://storage.googleapis.com/bucket_redacted/tenant_id%2Fbucket-index.json.gz\": context deadline exceeded" duration=1m0.001110933s

I saw the mention of "retry failed" in the error and wondered why a retry would not paper over what looks like a transient network issue. It turns out that no retry actually occurred: the standard GCS client retries certain classes of idempotent errors automatically, but when the first request consumes the entire context deadline before failing, no retry is ever attempted. This was also reported in Google's issue tracker some time ago.
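To make the failure mode concrete, here is a simplified sketch (illustrative code, not the actual GCS client implementation) of a retry loop that shares one parent context across attempts: when the first attempt stalls until the deadline, the loop finds the context already expired and never issues a second request, which matches the "retry failed with context deadline exceeded; last error: ..." shape of the log above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetch stands in for a single GCS object read; here it simply stalls
// until the context expires, mimicking the incident above.
func fetch(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// fetchWithRetries mimics a retry loop that shares one parent context
// across attempts. If the first attempt eats the whole deadline, the
// ctx.Err() check fails before a second attempt is ever made.
func fetchWithRetries(ctx context.Context, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("retry failed with %w; last error: %v", err, lastErr)
		}
		lastErr = fetch(ctx)
		if lastErr == nil {
			return nil
		}
	}
	return lastErr
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	err := fetchWithRetries(ctx, 3)
	fmt.Println(err)                                      // retry failed with context deadline exceeded; last error: context deadline exceeded
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
}
```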

A tenant's cleanup loop will fail if any of its GCS operations times out. In our configuration (cleanup every 15 minutes, store-gateway only loads bucket indexes less than 1 hour old), a single tenant has a partial query outage if four of their consecutive cleanup loops hit this kind of bad luck. Tenants with more blob store operations to perform at cleanup time are more exposed.

The store-gateway looks to be a victim of the same issue. When a store-gateway -> GCS fetch stalls in the same way (say, in the Series RPC), the store-gateway waits for the caller's context to expire (by default 2 minutes in the querier). The querier's request context is then canceled, the querier cancels its fan-out requests to the other store-gateways and returns an error to the query-frontend, where a retry is finally performed. That's a lot riding on every GCS fetch not timing out.

(I am assuming in this issue that exhausting a 1-minute timeout on a 6 KiB fetch is a case of network instability and that a retry would help. Networks are unreliable.)

My cursory read of Thanos/minio's S3 interactions is that they have the same behavior: certain HTTP errors are retried, but if the failure is the caller's context being canceled, no retry will occur.

Expected behavior

In the face of transient network blips, I expect individual blob storage operations to be retried.
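One possible shape of the fix, sketched below under assumptions (the helper name retryFetch and the 10-second per-attempt timeout are illustrative, not existing Mimir code or configuration): carve a shorter per-attempt deadline out of the parent context, so a stalled connection is abandoned early and a retry still has deadline budget left to run in.

```go
package main

import (
	"context"
	"time"
)

// perAttemptTimeout is an assumed value for illustration, not a Mimir default.
const perAttemptTimeout = 10 * time.Second

// retryFetch retries doFetch until the parent context expires, giving each
// attempt its own shorter deadline so one stalled request cannot consume
// the whole budget.
func retryFetch(ctx context.Context, doFetch func(context.Context) error) error {
	var lastErr error
	for ctx.Err() == nil {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttemptTimeout)
		lastErr = doFetch(attemptCtx)
		cancel()
		if lastErr == nil {
			return nil
		}
		// Fixed one-second pause between attempts for the sketch; real code
		// would use exponential backoff with jitter.
		select {
		case <-time.After(time.Second):
		case <-ctx.Done():
		}
	}
	return lastErr
}

func main() {
	// The overall operation still gets the usual one-minute budget, but a
	// single hung attempt now gives up after perAttemptTimeout and is retried.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	_ = retryFetch(ctx, func(ctx context.Context) error {
		// Stand-in for a single small object fetch (e.g. the bucket index).
		return nil
	})
}
```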

Environment

  • Infrastructure: kubernetes
