[Server] Stuck consumer task would kill SITs having issues post online #981

Closed
wants to merge 2 commits into from

adamxchen
Collaborator

Summary

Background

Today, if an unrecoverable exception occurs during current-version RT ingestion, the ingest task unsubscribes and then sits idle. This is a terminal stale state. That is the state-model semantic we have today: we don't put a replica into the ERROR state once it is online.
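
To make the current behavior concrete, here is a minimal sketch of that state model. It is not the actual server code: the `Replica` class, the `BOOTSTRAPPING` state name, and the `on_unrecoverable_exception` function are hypothetical names used for illustration, and the pre-online branch (moving to ERROR) is an assumption inferred from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicaState(Enum):
    BOOTSTRAPPING = "BOOTSTRAPPING"  # hypothetical name for "not yet online"
    COMPLETED = "COMPLETED"          # terminal state mentioned in this PR
    ERROR = "ERROR"                  # error state mentioned in this PR


@dataclass
class Replica:
    state: ReplicaState
    subscribed: bool = True
    stale: bool = False


def on_unrecoverable_exception(replica: Replica) -> Replica:
    """Model how a replica reacts to an unrecoverable ingestion exception."""
    # Assumption: the ingest task unsubscribes in both branches.
    replica.subscribed = False
    if replica.state is ReplicaState.COMPLETED:
        # Already online: the state stays terminal, so nothing acts on it
        # and the replica just remains stale.
        replica.stale = True
    else:
        # Assumption: before going online, the replica can be moved to ERROR.
        replica.state = ReplicaState.ERROR
    return replica
```

In this model an online replica ends up unsubscribed and stale but still COMPLETED, which is the situation that currently needs a manual restart.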

Recently, we observed replicas that stopped ingesting after going online because of a dependency outage. To recover, our oncall had to do a manual host deployment or a fleet-wide restart so that these hosts could restart the workflow and ingest data again.

Goal

In this PR, we add an auto-recovery mechanism for such tasks so that oncalls don't have to recover them manually.

Technical details

We have several options to implement this.

  • [Implemented by this PR] Relocate the replica to another one by killing the SIT.
    • The idea is to piggyback on the existing stuck-consumer-task implementation to kill a task that is producing to a non-existent topic or is marked as stale after going online. Technically, we can consider a stale leader as having its "consumer" stuck 😃. The biggest drawback is that during an outage, these leaders would kill themselves without considering the readiness of other replicas. This could also cause availability issues, but it is acceptable given that followers still serve the traffic.
  • Relocate the replica to another one by disabling this replica
    • The idea is to leverage the existing logic for disabling replicas of a future version and make it work for the current version. However, there are a few potential issues with this approach.
      • It breaks our state transition semantics. To leverage the logic, we need to use either ERROR or a new state after COMPLETED so that Helix can act on it, but then COMPLETED will no longer be a terminal state. It's okay to change this, but the cost outweighs the gain (unless we foresee doing this soon), given that it's relatively rare to get into an unrecoverable state after going online.
      • We would need to keep listening to more states on ZK. We have had scaling challenges with ZK listeners. Today, when a node enters a terminal state, we unsubscribe the listener. With this approach, we would have to keep listening even though 90% of the time there would be no change. That's a waste.
  • Unsubscribe from and re-subscribe to the topic, retrying with backoff, to see if we can recover. This is a workable solution, but during an outage it can loop indefinitely and eventually end up in the same state.
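
For reference, the retry-with-backoff option in the last bullet could look roughly like the sketch below. This is only an illustration of the pattern, not code from this PR: `backoff_delays`, `resubscribe_with_backoff`, and their parameters are hypothetical, and `resubscribe` stands in for whatever unsubscribe/re-subscribe call the ingestion task would make. The attempt cap is an added assumption to keep the loop from running forever.

```python
from typing import Callable, List


def backoff_delays(base_s: float, cap_s: float, max_attempts: int) -> List[float]:
    """Capped exponential delays: base, 2*base, 4*base, ... limited to cap_s."""
    return [min(cap_s, base_s * (2 ** i)) for i in range(max_attempts)]


def resubscribe_with_backoff(
    resubscribe: Callable[[], bool],
    sleep: Callable[[float], None],
    base_s: float = 1.0,
    cap_s: float = 60.0,
    max_attempts: int = 5,
) -> bool:
    """Return True once resubscribe() succeeds, False when attempts run out."""
    for delay in backoff_delays(base_s, cap_s, max_attempts):
        sleep(delay)        # wait before each attempt
        if resubscribe():   # hypothetical unsubscribe + re-subscribe call
            return True
    # Attempts exhausted: the replica is back in the same stale state.
    return False
```

With `time.sleep` passed as `sleep`, a dependency that stays down for longer than the total delay makes the function return `False`, which matches the concern above that an outage leaves the replica where it started.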

How was this PR tested?

Unit tests.

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

@sushantmane
Contributor

The idea is to piggyback on the existing stuck-consumer-task implementation to kill a task that is producing to a non-existent topic or is marked as stale after going online. Technically, we can consider a stale leader as having its "consumer" stuck 😃. The biggest drawback is that during an outage, these leaders would kill themselves without considering the readiness of other replicas. This could also cause availability issues, but it is acceptable given that followers still serve the traffic.

Say 10 replicas of a store are hosted on a given node and only one of them is stale, then killing SIT will affect all of them, right?

@sushantmane
Contributor

sushantmane commented May 13, 2024

I also think we should have a design proposal for this task and review it internally.

@adamxchen
Collaborator Author

The idea is to piggyback on the existing stuck-consumer-task implementation to kill a task that is producing to a non-existent topic or is marked as stale after going online. Technically, we can consider a stale leader as having its "consumer" stuck 😃. The biggest drawback is that during an outage, these leaders would kill themselves without considering the readiness of other replicas. This could also cause availability issues, but it is acceptable given that followers still serve the traffic.

Say 10 replicas of a store are hosted on a given node and only one of them is stale, then killing SIT will affect all of them, right?

That's right... I somehow missed this. In the event of an outage, all partitions had this issue, so I thought it was okay to kill the entire SIT. That's a good point. We should limit the impact if possible. Maybe adding a new scheduled task would be feasible. Let me start a conversation on this.
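
A rough way to picture the difference in impact, as a sketch only: the function names, the `hosted` mapping, and the store name are hypothetical, and "targeted recovery" is just a placeholder for whatever per-partition mechanism comes out of the discussion.

```python
from typing import Dict, Set


def partitions_hit_by_task_kill(hosted: Dict[str, Set[int]], store: str) -> Set[int]:
    """Killing the store-level task stops every partition it hosts on the node."""
    return set(hosted.get(store, set()))


def partitions_hit_by_targeted_recovery(
    hosted: Dict[str, Set[int]], store: str, stale: Set[int]
) -> Set[int]:
    """A per-partition recovery touches only the hosted partitions that are stale."""
    return hosted.get(store, set()) & stale


# Example from the review comment: 10 replicas on the node, one of them stale.
hosted = {"example_store": set(range(10))}  # hypothetical store name
all_hit = partitions_hit_by_task_kill(hosted, "example_store")
only_stale = partitions_hit_by_targeted_recovery(hosted, "example_store", {3})
```

Here `all_hit` contains all 10 partitions while `only_stale` contains just partition 3, which is the gap a more targeted mechanism would need to close.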

@adamxchen
Collaborator Author

Pending some discussions now. Let me close this one.

@adamxchen adamxchen closed this May 13, 2024