
[ingester] Ingester service state and lifecycler ring state not synchronized #8097

Open
pr00se opened this issue May 10, 2024 · 4 comments

@pr00se
Contributor

pr00se commented May 10, 2024

Background

The ingester runs as a BasicService and moves to the services.Running state after the starting() function completes.

As part of its starting() function, the ingester starts a ring.Lifecycler. Once started, the lifecycler auto-joins the ring, and moves the ingester's ring state to ring.ACTIVE as soon as it can.
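To make the ordering concrete, here is a minimal toy model of the two state machines. It does not use the real services or ring packages: toyIngester, startLifecycler, and the state constants are hypothetical stand-ins, and the channel wait simply forces the case where the remaining starting() work outlasts the auto-join.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for the service and ring states named in this issue.
type ServiceState string
type RingState string

const (
	Starting ServiceState = "Starting"
	Running  ServiceState = "Running"

	Pending RingState = "PENDING"
	Active  RingState = "ACTIVE"
)

// toyIngester tracks the two states independently of each other.
type toyIngester struct {
	mu      sync.Mutex
	service ServiceState
	ring    RingState
}

func (i *toyIngester) snapshot() (ServiceState, RingState) {
	i.mu.Lock()
	defer i.mu.Unlock()
	return i.service, i.ring
}

// startLifecycler mimics auto-join: it flips the ring state to ACTIVE on its
// own goroutine without consulting the service state.
func (i *toyIngester) startLifecycler() <-chan struct{} {
	joined := make(chan struct{})
	go func() {
		i.mu.Lock()
		i.ring = Active
		i.mu.Unlock()
		close(joined)
	}()
	return joined
}

// start models starting() followed by the switch to Running. The returned
// snapshot is taken while starting() is still executing.
func (i *toyIngester) start() (ServiceState, RingState) {
	i.mu.Lock()
	i.service, i.ring = Starting, Pending
	i.mu.Unlock()

	joined := i.startLifecycler()

	// Stand-in for remaining starting() work that outlasts the join.
	<-joined
	midService, midRing := i.snapshot()

	i.mu.Lock()
	i.service = Running
	i.mu.Unlock()
	return midService, midRing
}

func main() {
	ing := &toyIngester{}
	s, r := ing.start()
	fmt.Printf("during starting(): service=%s ring=%s\n", s, r)
	s, r = ing.snapshot()
	fmt.Printf("after starting():  service=%s ring=%s\n", s, r)
}
```

In this model the snapshot taken inside start() shows the service still in Starting while the ring state is already ACTIVE.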

The Problem

  • Once an ingester's ring state is ring.ACTIVE it becomes available for read requests.
  • When the ingester service is not in the services.Running state, the ingester rejects read requests.

Because of the above, starting the lifecycler essentially starts a timer on the ingester service getting to the services.Running state. If the ingester's starting() function is still executing when the ring state becomes ring.ACTIVE, the ingester will start receiving read requests, but reject them all with the error "ingester is unavailable (current state: Starting)".
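The mismatch comes from two checks that look at different state. A rough sketch, with hypothetical helper names (routable and checkRead are not functions from the codebase; only the error text is taken from the message above):

```go
package main

import "fmt"

// Toy stand-ins for the real state types.
type serviceState string
type ringState string

const (
	serviceStarting serviceState = "Starting"
	serviceRunning  serviceState = "Running"

	ringJoining ringState = "JOINING"
	ringActive  ringState = "ACTIVE"
)

// routable models the caller's side: reads are sent to any instance whose
// ring state is ACTIVE, regardless of its service state.
func routable(r ringState) bool {
	return r == ringActive
}

// checkRead models the ingester's side: reads are rejected unless the
// service itself has reached Running.
func checkRead(s serviceState) error {
	if s != serviceRunning {
		return fmt.Errorf("ingester is unavailable (current state: %s)", s)
	}
	return nil
}

func main() {
	cases := []struct {
		service serviceState
		ring    ringState
	}{
		{serviceStarting, ringJoining},
		{serviceStarting, ringActive}, // the problematic window
		{serviceRunning, ringActive},
	}
	for _, c := range cases {
		fmt.Printf("service=%s ring=%s routable=%v err=%v\n",
			c.service, c.ring, routable(c.ring), checkRead(c.service))
	}
}
```

Only the middle case is a problem: the instance is routable, yet every read it receives fails.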

This isn't much of an issue if a single ingester enters this state, since reads are able to complete using other zones to achieve quorum. However, when ingesters are scaled up horizontally, instances are added to all zones at the same time. If instances in multiple zones are rejecting reads while in the services.Starting state, quorum can't be achieved, and we suffer a read outage.
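As a back-of-the-envelope illustration of the quorum argument, assume three zones with one replica each and a simple-majority quorum. Both are assumptions made for this sketch, and the real zone-aware read path is more involved than this:

```go
package main

import "fmt"

// quorum returns the number of successful responses needed, assuming a
// simple majority of the replicas.
func quorum(replicas int) int {
	return replicas/2 + 1
}

// readSucceeds reports whether enough zones answered. Each entry is true
// when that zone's replica accepted the read.
func readSucceeds(zoneAccepted []bool) bool {
	ok := 0
	for _, accepted := range zoneAccepted {
		if accepted {
			ok++
		}
	}
	return ok >= quorum(len(zoneAccepted))
}

func main() {
	fmt.Println("quorum for 3 replicas:", quorum(3))
	// One zone's new ingester is still Starting: the other two carry the read.
	fmt.Println("one zone rejecting: ", readSucceeds([]bool{false, true, true}))
	// New ingesters in two zones are still Starting: quorum is lost.
	fmt.Println("two zones rejecting:", readSucceeds([]bool{false, false, true}))
}
```

Under these assumptions one rejecting zone is tolerated, while two rejecting zones fail the read, which matches the scale-up scenario described above.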

Solution

Ideally, moving the ring state to ring.ACTIVE should be the last thing done in the ingester's starting() function (or the first thing done in its running() function) -- no other code should run in between those two events.

Unfortunately the existing ring.Lifecycler used by the ingester doesn't offer much control over when the switch to ring.ACTIVE occurs, since it auto-joins the ring.
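For illustration only, this is roughly the ordering I have in mind. manualLifecycler and its MarkActive method are hypothetical and are not part of the existing ring.Lifecycler API. The sketch only shows where the ACTIVE switch would sit relative to the other startup steps.

```go
package main

import "fmt"

type ringState string

const (
	ringJoining ringState = "JOINING"
	ringActive  ringState = "ACTIVE"
)

// manualLifecycler is a hypothetical lifecycler that registers the instance
// in the ring but leaves the switch to ACTIVE to its caller.
type manualLifecycler struct {
	state ringState
}

func (l *manualLifecycler) Start()      { l.state = ringJoining }
func (l *manualLifecycler) MarkActive() { l.state = ringActive }

// toyIngester records the order of startup events so it can be inspected.
type toyIngester struct {
	lifecycler manualLifecycler
	events     []string
}

// starting runs every startup step and flips the ring state last.
func (i *toyIngester) starting() {
	i.lifecycler.Start()
	i.events = append(i.events, "lifecycler started, ring="+string(i.lifecycler.state))

	// Subservices that need a running lifecycler can start here.
	i.events = append(i.events, "subservices started")

	// Any slow initialization stays ahead of the ACTIVE switch.
	i.events = append(i.events, "remaining startup work done")

	i.lifecycler.MarkActive()
	i.events = append(i.events, "ring="+string(i.lifecycler.state))
}

func main() {
	ing := &toyIngester{}
	ing.starting()
	ing.events = append(ing.events, "service=Running")
	for n, e := range ing.events {
		fmt.Printf("%d. %s\n", n+1, e)
	}
}
```

With this shape, subservices can still depend on a started lifecycler, and nothing runs between the ACTIVE switch and the move to Running.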

@dimitarvdimitrov
Contributor

shouldn't the lifecycler be started only after all of its submodules are in a Running state? What is holding up the startup of the ingester?

@pr00se
Contributor Author

pr00se commented May 22, 2024

> shouldn't the lifecycler be started only after all of its submodules are in a Running state? What is holding up the startup of the ingester?

Do you mean the ingester's submodules? There are several ingester subservices that require the lifecycler to be running first, at least according to the comment in that file. That said, maybe the comment is wrong and we could just move the lifecycler start to the end of starting()?

@dimitarvdimitrov
Contributor

> Do you mean the ingester's submodules? There are several ingester subservices that require the lifecycler to be running first, at least according to the comment in that file

ah, yes, that's what I was looking for. From those services only the ingestPartitionLifecycler has any starting procedure; the rest are timer services which start ~immediately. I still don't understand why that would hold up the ingester from starting for that long. Is it possible that some components haven't yet received the ring update that the ingester is shutting down and entering LEAVING state and still send it queries?

@pr00se
Contributor Author

pr00se commented May 22, 2024

@dimitarvdimitrov as the code stands currently, we shouldn't hit this issue, because (as you point out) all of the code that runs after the lifecycler is started should execute quickly, and not block the ingester's starting(). However, during development of the owned series service this behavior caused issues (#7087) and was confusing to navigate around.

So, this issue is less about fixing an active problem and more about fixing non-deterministic behavior that can cause problems in non-obvious ways in the future.


2 participants