[ingester] Ingester service state and lifecycler ring state not synchronized #8097
Comments
Shouldn't the lifecycler be started only after all of its submodules are in a Running state? What is holding up the startup of the ingester?
Do you mean the ingester's submodules? There are several ingester subservices that require the lifecycler to be running first, at least according to the comment in that file. That said, maybe the comment is wrong and we could just move the lifecycler start to the end of |
ah, yes, that's what I was looking for. From those services only the |
@dimitarvdimitrov as the code stands currently, we shouldn't hit this issue, because (as you point out) all of the code that runs after the lifecycler is started should execute quickly, and not block the ingester's So, this issue is less about fixing an active problem and more about fixing non-deterministic behavior that can cause problems in non-obvious ways in the future.
Background
The ingester runs as a `BasicService` and moves to the `services.Running` state after the `starting()` function completes.
As part of its `starting()` function, the ingester starts a `ring.Lifecycler`. Once started, the lifecycler auto-joins the ring, and moves the ingester's ring state to `ring.ACTIVE` as soon as it can.
The Problem
Once an ingester's ring state is `ring.ACTIVE`, it becomes available for read requests.
Until the ingester service reaches the `services.Running` state, the ingester will reject read requests.
Because of the above, starting the `Lifecycler` essentially starts a timer on the ingester service getting to the `services.Running` state. If the ingester's `starting()` function is still being executed when the ring state becomes `ring.ACTIVE`, the ingester will start receiving read requests, but reject them all with the error `ingester is unavailable (current state: Starting)`.
This isn't much of an issue if a single ingester enters this state, since reads are able to complete using other zones to achieve quorum. However, when ingesters are scaled up horizontally, instances are added to all zones at the same time. If instances in multiple zones are rejecting reads while in the `services.Starting` state, quorum can't be achieved, and we suffer a read outage.
Solution
Ideally, moving the ring state to `ring.ACTIVE` should be the last thing done in the ingester's `starting()` function (or the first thing done in its `running()` function). No other code should run in between those two events.
Unfortunately, the existing `ring.Lifecycler` used by the ingester doesn't offer much control over when the switch to `ring.ACTIVE` occurs, since it auto-joins the ring.
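To make the ordering problem concrete, here is a minimal, self-contained sketch. It does not use the real dskit `services` or `ring` packages: the `ingester` type, the state constants, and the step names are hypothetical stand-ins chosen for illustration. It replays two startup orders and reports whether there is ever a moment when the instance is routable via the ring but would still reject reads. The second order models the "first thing done in `running()`" variant of the proposal.

```go
package main

import "fmt"

// Hypothetical stand-ins for the service and ring states discussed above.
// These are NOT the real dskit types.
type serviceState string
type ringState string

const (
	svcStarting serviceState = "Starting"
	svcRunning  serviceState = "Running"

	ringJoining ringState = "JOINING"
	ringActive  ringState = "ACTIVE"
)

type ingester struct {
	svc  serviceState
	ring ringState
}

// routable reports whether readers would pick this instance from the ring.
func (i *ingester) routable() bool { return i.ring == ringActive }

// read mimics the ingester rejecting reads until its service is Running.
func (i *ingester) read() error {
	if i.svc != svcRunning {
		return fmt.Errorf("ingester is unavailable (current state: %s)", i.svc)
	}
	return nil
}

// startup applies the steps in order and reports whether, after any step,
// the instance was routable while still rejecting reads.
func startup(order []string) (badWindow bool) {
	ing := &ingester{svc: svcStarting, ring: ringJoining}
	for _, step := range order {
		switch step {
		case "ring-active":
			ing.ring = ringActive
		case "service-running":
			ing.svc = svcRunning
		case "other-startup-work":
			// Placeholder for whatever else starting() does.
		}
		if ing.routable() && ing.read() != nil {
			badWindow = true
		}
	}
	return badWindow
}

func main() {
	// Lifecycler auto-joins early, before the rest of startup completes.
	current := []string{"ring-active", "other-startup-work", "service-running"}
	// Ring state flips to ACTIVE only once the service is Running.
	proposed := []string{"other-startup-work", "service-running", "ring-active"}

	fmt.Println("current order has reject window:", startup(current))
	fmt.Println("proposed order has reject window:", startup(proposed))
}
```

In this model the first order exposes a reject window and the second does not. A real fix would still depend on how much control the lifecycler gives over the moment of joining the ring.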