Compactor blocks cleaner: retry operations that could interfere with rewriting bucket index #8071
base: main
Conversation
SGTM, but maybe we can retry on temporary object store errors too?
…ons fail. Allow backoff.Config to be passed in.
// the backoff config. Each invocation of f will be given perCallTimeout to
// complete. This is specifically designed to retry timeouts due to flaky
// connectivity with the objstore backend.
func withRetries(ctx context.Context, perCallTimeout time.Duration, bc backoff.Config, logger log.Logger, f func(context.Context) error) error {
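For readers following along, here is a minimal sketch of how a helper with this signature could be built on top of dskit's backoff package. The retry-only-on-`DeadlineExceeded` check and the log message are assumptions for illustration; the actual body in this PR may differ.

```go
package cleaner

import (
	"context"
	"errors"
	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/grafana/dskit/backoff"
)

// Sketch only: retries f with a per-call timeout until it succeeds, returns a
// non-timeout error, or the backoff config is exhausted.
func withRetries(ctx context.Context, perCallTimeout time.Duration, bc backoff.Config, logger log.Logger, f func(context.Context) error) error {
	var lastErr error
	retry := backoff.New(ctx, bc)
	for retry.Ongoing() {
		// Give each attempt its own deadline so one hung objstore call
		// can't consume the whole cleanUser pass.
		callCtx, cancel := context.WithTimeout(ctx, perCallTimeout)
		lastErr = f(callCtx)
		cancel()
		if lastErr == nil {
			return nil
		}
		// Assumption: only deadline-style failures are retried; any other
		// error is surfaced to the caller immediately.
		if !errors.Is(lastErr, context.DeadlineExceeded) {
			return lastErr
		}
		level.Warn(logger).Log("msg", "operation timed out, retrying", "err", lastErr)
		retry.Wait()
	}
	if lastErr != nil {
		return lastErr
	}
	return retry.Err()
}
```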
I like this functional approach towards doing retries. Maybe something like this could go into dskit? Alternatively, for this functional approach to retries, we could use Failsafe-go (which we're using for circuit breaking).
I was just looking at failsafe-go yesterday, randomly. Whoever wrote that has been around the block a few times. :)
> Maybe something like this could go into dskit?
Yeah, I looked to see if it already existed in dskit. There's probably room for something there. For this application, the coupling of per-call timeout contexts and a shouldRetry function that looks for `DeadlineExceeded` seemed a little single-purpose.
Failsafe's backoff looks nice. But there is a plus to sticking with dskit/backoff as it is so pervasive in this codebase.
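As a point of reference, the dskit config being discussed here is just a small struct; the values below are illustrative only, not the ones used in this PR:

```go
// Illustrative only: a backoff.Config as it might be passed to withRetries.
retryConfig := backoff.Config{
	MinBackoff: 100 * time.Millisecond, // first wait between attempts
	MaxBackoff: 10 * time.Second,       // cap on the exponential backoff
	MaxRetries: 3,                      // 0 means retry until the context is done
}
```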
I expect to come back into this file to add similar retries inside of UpdateIndex ("sometime") so maybe we can keep this dialogue open/rule of three and all that?
Thank you. LGTM overall.
var idx *bucketindex.Index
err := withRetries(ctx, 1*time.Minute, c.retryConfig, log.With(userLogger, "op", "readIndex"), func(ctx context.Context) error {
	var err error
	idx, err = c.readIndex(ctx, c.bucketClient, userID, c.cfgProvider, userLogger)
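The quoted diff cuts off mid-closure; the rest of this call site presumably looks something like the following (a hypothetical completion, not copied from the PR):

```go
	// Hypothetical continuation of the snippet above.
	return err
}) // closes the closure and the withRetries call
if err != nil {
	// The cleanUser pass cannot continue without a readable index.
	return err
}
```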
I'd prefer to just call `bucketindex.ReadIndex` here, and the same for `c.writeIndex` vs `bucketindex.WriteIndex` later.

If we want to inject errors from read/write index calls, I'd suggest doing it at the bucket level (see `ErrorInjectedBucketClient`, `deceivingUploadBucket` or `errBucket`), instead of introducing indirection in `BlocksCleaner`.
What this PR does
This PR adds retries to operations in `BlocksCleaner.cleanUser` whose failures could lead to the bucket index failing to be rewritten (`ReadIndex` and `WriteIndex`).

And why:
When the blocks cleaner runs for a tenant, it carries out a series of steps to perform one `cleanUser` pass. Most of these steps involve an objstore invocation (fetching a block index, iterating the paths under a block folder, deleting a marker...). In this series of steps, there are currently two avenues for "retries":
We are currently relying on Avenue 2 to eventually recover from past blocks cleaner failures. But the crux of a recent incident was that everything in `cleanUser` must complete for the updated bucket index to be written. If `cleanUser` fails enough consecutive times, store-gateways will refuse to load the "stale" bucket index, and some queries will begin to fail. In that incident, a larger percentage of object store calls were exceeding their context deadline (which looks like network flakiness), hence the >=4 consecutive `cleanUser` failures leading to a bucket index that stayed stale for an hour or more.
Notes:
- `ReadIndex` and `WriteIndex` already have 1-minute hardcoded deadlines, so the new outer request deadlines I've chosen for those are safe.
- `UpdateIndex` could use a retry, too, because if that method returns an error, the bucket index won't be rewritten. However, I've done some analysis in our logs, and `UpdateIndex` for some tenants can take 5+ minutes while it updates scads of deletion markers. I'm not going to add time-based retries on that method so as not to accidentally break any legitimate work being done. There's room for improvement to come back and add finer-grained retries on the operations inside of `UpdateIndex`.

Which issue(s) this PR fixes or relates to
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.