Remove faulty Host Detection check and replace it with gRPC keepalive mechanism
Current Behavior:
The current implementation of faultyHostDetection in Dapr keeps track of every host's update message. It involves locking the store and looping through all the members in the raft state to identify and disconnect hosts that haven't reported their status in a while.
This approach, while functional, is inefficient and can lead to reliability issues due to the overhead of maintaining and processing these state checks.
Proposed Change:
I propose removing the faultyHostDetection check and replacing it with the gRPC keepalive mechanism. gRPC keepalive is a built-in feature designed to handle connection liveness, making it a more efficient and reliable solution for detecting and handling unresponsive hosts.
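As a rough illustration of what enabling keepalive on the placement gRPC server could look like, here is a minimal sketch using grpc-go's `keepalive` package. The interval values and the `keepaliveServerOptions` helper are assumptions made for this example, not values or names taken from the Dapr codebase.

```go
package main

import (
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// keepaliveServerOptions is a hypothetical helper; the durations are
// illustrative only and would need tuning for a real placement deployment.
func keepaliveServerOptions() []grpc.ServerOption {
	return []grpc.ServerOption{
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Ping the client if the connection has seen no activity for this long.
			Time: 2 * time.Second,
			// Close the connection if the ping is not acknowledged within this time.
			Timeout: 3 * time.Second,
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			// Minimum interval the server tolerates between client pings.
			MinTime: 1 * time.Second,
			// Allow client pings even when no stream is active.
			PermitWithoutStream: true,
		}),
	}
}

func main() {
	srv := grpc.NewServer(keepaliveServerOptions()...)
	defer srv.Stop()
	fmt.Println("gRPC server configured with keepalive")
}
```

With settings like these, an unresponsive sidecar's connection is closed by the transport, the stream handler's next `Recv` returns an error, and the host can be removed at that point without a periodic scan over all members.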
Dapr sidecars would keep sending status messages, for two reasons: they are needed to determine which hosts have not connected to the new leader after a placement server failover, and they back a metric that we expose.
On placement service failover, the new leader should wait for `faultyHostDetectInitialDuration` (currently 6 seconds) to give all sidecars enough time to connect to the placement service. After that, it should run the faulty host detection check exactly once, to remove from the placement table any hosts that were unable to connect.
/area placement