Compactor blocks cleaner: retry operations that could interfere with rewriting bucket index #8071
base: main
Conversation
SGTM, but maybe we can retry on temporary object store errors too?
…ons fail. Allow backoff.Config to be passed in.
// the backoff config. Each invocation of f will be given perCallTimeout to
// complete. This is specifically designed to retry timeouts due to flaky
// connectivity with the objstore backend.
func withRetries(ctx context.Context, perCallTimeout time.Duration, bc backoff.Config, logger log.Logger, f func(context.Context) error) error {
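For readers following along, here is a minimal sketch of how a helper with this signature could be built on top of dskit's backoff package. The retry-only-on-`DeadlineExceeded` check and the log message are assumptions for illustration; the actual body in this PR may differ.

```go
package cleaner

import (
	"context"
	"errors"
	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/grafana/dskit/backoff"
)

// Sketch only: retries f with a per-call timeout until it succeeds, returns a
// non-timeout error, or the backoff config is exhausted.
func withRetries(ctx context.Context, perCallTimeout time.Duration, bc backoff.Config, logger log.Logger, f func(context.Context) error) error {
	var lastErr error
	retry := backoff.New(ctx, bc)
	for retry.Ongoing() {
		// Give each attempt its own deadline so one hung objstore call
		// can't consume the whole cleanUser pass.
		callCtx, cancel := context.WithTimeout(ctx, perCallTimeout)
		lastErr = f(callCtx)
		cancel()
		if lastErr == nil {
			return nil
		}
		// Assumption: only deadline-style failures are retried; any other
		// error is surfaced to the caller immediately.
		if !errors.Is(lastErr, context.DeadlineExceeded) {
			return lastErr
		}
		level.Warn(logger).Log("msg", "operation timed out, retrying", "err", lastErr)
		retry.Wait()
	}
	if lastErr != nil {
		return lastErr
	}
	return retry.Err()
}
```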
I like this functional approach towards doing retries. Maybe something like this could go into dskit? Alternatively, for this functional approach to retries, we could use Failsafe-go (which we're using for circuit breaking).
I was just looking at failsafe-go yesterday, randomly. Whoever wrote that has been around the block a few times. :)
> Maybe something like this could go into dskit?
Yeah, I looked to see if it already existed in dskit. There's probably room for something there. For this application, the coupling of per-call timeout contexts and a shouldRetry function that looks for `DeadlineExceeded` seemed a little single-purpose.
Failsafe's backoff looks nice. But there is a plus to sticking with dskit/backoff as it is so pervasive in this codebase.
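As a point of reference, the dskit config being discussed here is just a small struct; the values below are illustrative only, not the ones used in this PR:

```go
// Illustrative only: a backoff.Config as it might be passed to withRetries.
retryConfig := backoff.Config{
	MinBackoff: 100 * time.Millisecond, // first wait between attempts
	MaxBackoff: 10 * time.Second,       // cap on the exponential backoff
	MaxRetries: 3,                      // 0 means retry until the context is done
}
```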
I expect to come back into this file to add similar retries inside of UpdateIndex ("sometime") so maybe we can keep this dialogue open/rule of three and all that?
Thank you. LGTM overall.
var idx *bucketindex.Index
err := withRetries(ctx, 1*time.Minute, c.retryConfig, log.With(userLogger, "op", "readIndex"), func(ctx context.Context) error {
	var err error
	idx, err = c.readIndex(ctx, c.bucketClient, userID, c.cfgProvider, userLogger)
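The quoted diff cuts off mid-closure; the rest of this call site presumably looks something like the following (a hypothetical completion, not copied from the PR):

```go
	// Hypothetical continuation of the snippet above.
	return err
}) // closes the closure and the withRetries call
if err != nil {
	// The cleanUser pass cannot continue without a readable index.
	return err
}
```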
I'd prefer to just call `bucketindex.ReadIndex` here, and the same for `c.writeIndex` vs `bucketindex.WriteIndex` later.

If we want to inject errors from read/write index calls, I'd suggest doing it at the bucket level (see `ErrorInjectedBucketClient`, `deceivingUploadBucket` or `errBucket`), instead of introducing indirection in `BlocksCleaner`.
What this PR does
This PR adds retries to operations in `BlocksCleaner.cleanUser` whose failures could lead to the bucket index failing to be rewritten (`ReadIndex` and `WriteIndex`).

And why:
When the blocks cleaner runs for a tenant, it carries out a series of steps to perform one `cleanUser` pass. Most of these steps involve an objstore invocation (fetching a block index, iterating the paths under a block folder, deleting a marker...). In this series of steps, there are currently two avenues for "retries":
We are currently relying on Avenue 2 to eventually recover from past blocks cleaner failures. But the crux of a recent incident was that everything in `cleanUser` must complete for the updated bucket index to be written. If `cleanUser` fails enough consecutive times, store-gateways will refuse to load the "stale" bucket index, and some queries will begin to fail. In that incident, a larger percentage of object store calls were exceeding their context deadline (which looks like network flakiness), hence the >=4 consecutive `cleanUser` failures leading to a bucket index that stayed stale for an hour or more.
Notes:
- `ReadIndex` and `WriteIndex` already have 1-minute hardcoded deadlines, so the new outer request deadlines I've chosen for those are safe.
- `UpdateIndex` could use a retry, too, because if that method returns an error, the bucket index won't be rewritten. However, I've done some analysis in our logs, and `UpdateIndex` for some tenants can take 5+ minutes while it updates scads of deletion markers. I'm not going to add time-based retries on that method so as not to accidentally break any legitimate work being done. There's room for improvement to come back and add finer-grained retries on the operations inside of `UpdateIndex`.

Which issue(s) this PR fixes or relates to
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.