[Server] Stuck consumer task would kill SITs having issues post online #981

adamxchen · 2024-05-10T22:53:11Z

Summary

Background

Today, if an unrecoverable exception occurs during current-version RT ingestion, the ingestion task unsubscribes from the topic and then sits idle. This is a terminal stale state. That is the state model semantic we have today: we don't put a replica into the ERROR state once it is online.
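
The behavior described above can be illustrated with a minimal sketch. All names here (`ReplicaStateSketch`, `Status`, `markOnline`, `onUnrecoverableException`) are hypothetical and are not the project's actual classes. It also assumes that a failure before the replica is online is reported as ERROR, which this description implies but does not state outright.

```java
/** Hypothetical model of a replica's status; names are illustrative only. */
final class ReplicaStateSketch {
  enum Status { STARTED, COMPLETED, ERROR }

  private Status status = Status.STARTED;
  private boolean subscribed = true;
  private boolean stale = false;

  /** The replica caught up and was reported online (COMPLETED). */
  void markOnline() {
    status = Status.COMPLETED;
  }

  /** Models an unrecoverable exception during ingestion. */
  void onUnrecoverableException() {
    subscribed = false; // the task stops consuming the topic
    if (status == Status.COMPLETED) {
      // Already online: the status stays COMPLETED, so nothing reacts to the failure.
      stale = true;
    } else {
      // Assumption: before the replica is online, the failure is reported as ERROR.
      status = Status.ERROR;
    }
  }

  Status status() { return status; }
  boolean isSubscribed() { return subscribed; }
  boolean isStale() { return stale; }

  public static void main(String[] args) {
    ReplicaStateSketch replica = new ReplicaStateSketch();
    replica.markOnline();
    replica.onUnrecoverableException();
    System.out.println(replica.status() + " stale=" + replica.isStale());
  }
}
```

In this model the online replica ends up unsubscribed and stale while still reporting COMPLETED, which is why nothing recovers it automatically.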

Recently, we observed replicas that stopped ingesting after coming online because of a dependency outage. To recover, our on-call had to perform manual host deployments or a fleet-wide restart so that these hosts could restart the workflow and ingest data again.

Goal

In this PR, we are trying to add an auto-recovery mechanism for such tasks so that on-calls don't have to recover them manually.

Technical details

We have several options to implement this.

- [Implemented by this PR] Relocate the replica to another node by killing the SIT.
  - The idea is to piggyback on the existing stuck consumer task implementation to kill a task that is producing to a non-existent topic or is marked as stale after being online. Technically, we can consider a stale leader as having its "consumer" stuck 😃. The biggest drawback is that during an outage these leaders would kill themselves without considering the readiness of other replicas, which could cause availability issues. This is acceptable given that followers still serve the traffic.
- Relocate the replica to another node by disabling this replica.
  - The idea is to leverage the existing logic for disabling replicas of a future version and make it work for the current version. However, there are a few potential issues with this approach.
    - It breaks our state transition semantic. To leverage the logic, we need to use either ERROR or a new state after COMPLETED so that Helix can act on it, but then COMPLETED would no longer be a terminal state. It's okay to change this, but the cost outweighs the gain (unless we foresee doing this soon), given that it's relatively rare to get into an unrecoverable state after being online.
    - We would need to keep listening to more states on ZK. We have had scaling challenges with ZK listeners. When a node enters a terminal state, we unsubscribe the listener. With this approach, we would have to keep listening even though 90% of the time there would be no change, which is wasteful.
- Unsubscribe from and re-subscribe to the topic using a retry-with-backoff pattern to see if we can recover. This is a workable solution, but during an outage it can loop indefinitely and eventually end up in the same state.
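
For reference, the last option above (re-subscribe with retry and backoff) could look roughly like the following. This is a sketch under stated assumptions, not what this PR implements: `ResubscribeWithBackoffSketch`, `backoffMillis`, and `resubscribeWithBackoff` are hypothetical names, and the `BooleanSupplier` stands in for whatever call would actually unsubscribe and re-subscribe the topic partition.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical retry helper; not part of the code base. */
final class ResubscribeWithBackoffSketch {

  /** Exponential backoff: base * 2^attempt (attempt is 0-based), capped at maxMillis. */
  static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    long delay = baseMillis;
    for (int i = 0; i < attempt && delay < maxMillis; i++) {
      delay *= 2;
    }
    return Math.min(delay, maxMillis);
  }

  /**
   * Calls the re-subscribe action until it reports success or maxAttempts is reached.
   * Returns the number of attempts used, or -1 if every attempt failed (or the
   * thread was interrupted), in which case the replica is still stale.
   */
  static int resubscribeWithBackoff(
      BooleanSupplier resubscribe, int maxAttempts, long baseMillis, long maxMillis) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      if (resubscribe.getAsBoolean()) {
        return attempt + 1;
      }
      if (attempt + 1 < maxAttempts) {
        try {
          Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt for the caller
          return -1;
        }
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Stand-in action that only succeeds on the third call.
    int attempts = resubscribeWithBackoff(() -> ++calls[0] >= 3, 5, 10, 1000);
    System.out.println("attempts used: " + attempts);
  }
}
```

Bounding the attempts avoids the infinite loop mentioned above, but when the dependency stays down for longer than the retry budget the helper returns -1 and the replica is back in the same stale state.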

How was this PR tested?

unit tests

Does this PR introduce any user-facing changes?

No. You can skip the rest of this section.
Yes. Make sure to explain your proposed changes and call out the behavior change.

sushantmane · 2024-05-13T18:40:35Z

> The idea is to piggyback on the existing stuck consumer task implementation to kill a task that is producing to a non-existent topic or is marked as stale after being online. Technically, we can consider a stale leader as having its "consumer" stuck 😃. The biggest drawback is that during an outage these leaders would kill themselves without considering the readiness of other replicas, which could cause availability issues. This is acceptable given that followers still serve the traffic.

Say 10 replicas of a store are hosted on a given node and only one of them is stale, then killing the SIT will affect all of them, right?

sushantmane · 2024-05-13T18:41:27Z

I also think we should have a design proposal for this task and review it internally.

adamxchen · 2024-05-13T21:47:57Z

> > The idea is to piggyback on the existing stuck consumer task implementation to kill a task that is producing to a non-existent topic or is marked as stale after being online. Technically, we can consider a stale leader as having its "consumer" stuck 😃. The biggest drawback is that during an outage these leaders would kill themselves without considering the readiness of other replicas, which could cause availability issues. This is acceptable given that followers still serve the traffic.
>
> Say 10 replicas of a store are hosted on a given node and only one of them is stale, then killing the SIT will affect all of them, right?

That's right... I somehow missed this. In the outage we observed, all partitions had this issue, so I thought it was okay to kill the entire SIT. That's a good point. We should limit the impact if possible. Maybe a new scheduled task would be feasible. Let me start a conversation on this.
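
To make the concern concrete, here is a rough sketch comparing which replicas are disrupted when the whole task is killed versus when recovery is scoped to the stale partitions. Everything in it is hypothetical: `StaleReplicaImpactSketch` and its methods are illustrative names, and the map of partition id to stale flag is an assumed input rather than an existing data structure.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical comparison of the blast radius of two recovery strategies. */
final class StaleReplicaImpactSketch {

  /** Killing the whole task: one stale partition disrupts every hosted partition. */
  static Set<Integer> affectedByKillingTask(Map<Integer, Boolean> staleByPartition) {
    boolean anyStale = staleByPartition.containsValue(Boolean.TRUE);
    return anyStale ? new TreeSet<>(staleByPartition.keySet()) : new TreeSet<>();
  }

  /** Per-partition recovery: only the stale partitions are disrupted. */
  static Set<Integer> affectedByPerPartitionRecovery(Map<Integer, Boolean> staleByPartition) {
    Set<Integer> affected = new TreeSet<>();
    for (Map.Entry<Integer, Boolean> entry : staleByPartition.entrySet()) {
      if (entry.getValue()) {
        affected.add(entry.getKey());
      }
    }
    return affected;
  }

  public static void main(String[] args) {
    // 10 replicas of one store on a node; only partition 3 is stale.
    Map<Integer, Boolean> staleByPartition = new HashMap<>();
    for (int partition = 0; partition < 10; partition++) {
      staleByPartition.put(partition, partition == 3);
    }
    System.out.println("kill task:     " + affectedByKillingTask(staleByPartition));
    System.out.println("per partition: " + affectedByPerPartitionRecovery(staleByPartition));
  }
}
```

With one stale replica out of 10, killing the task touches all 10 partitions, while a per-partition approach would touch only the stale one.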

adamxchen · 2024-05-13T22:36:44Z

This is pending some discussions now. Let me close this one.

adamxchen added 2 commits May 10, 2024 15:02

Stuck consumer task would kill SITs having issues post online

d121edf

remove unused codes

e71bd34

adamxchen closed this May 13, 2024
