round_robin can continually be CONNECTING #6650
Does this rule apply to all other load balancers, especially multi-locality or multi-cluster balancers? Then again, if all other balancers apply this rule, multi-locality and multi-cluster are not a big concern for now: since all their underlying child policies follow the rule, the parent balancer will automatically follow it as well.
This issue is only referring to round_robin. But if you ask me or Doug, yes, this should apply across the board (including the Channel). Short-term that may mean the policy handles it manually, but really I think we should move the behavior into the subchannel itself. That will take a substantial redesign of pick_first, however.
Fixed by #6657.
Wait. Isn't this issue intended to be open for tracking the long-term fix (as mentioned above)? I intentionally did not close it in #6657.
This issue is only for round robin. |
If you have enough backends and they are all unavailable, then it becomes likely that at least one of them will be CONNECTING at any given moment. That delays RPCs instead of failing them with a clear error message.
When a subchannel becomes TRANSIENT_FAILURE, we want RR to continue considering it (for channel state and picking logic) TRANSIENT_FAILURE until the subchannel becomes READY. That means it would "ignore" CONNECTING subchannels, except for new and recently-READY subchannels.
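The rule above can be sketched as a small state tracker. This is a hypothetical illustration, not grpc-java's actual implementation; the enum and class names here are invented for the example (the real `io.grpc.ConnectivityState` also includes IDLE and SHUTDOWN handling not shown):

```java
// Sketch of the "sticky TRANSIENT_FAILURE" rule proposed in this issue:
// once a subchannel reports TRANSIENT_FAILURE, keep treating it as
// TRANSIENT_FAILURE (for aggregation and picking) until it actually
// becomes READY, so intermediate CONNECTING reports don't mask the failure.
enum ConnectivityState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

final class StickyStateTracker {
    private ConnectivityState effective = ConnectivityState.IDLE;

    // Called with each raw state update from the subchannel; returns the
    // state the balancer should use when computing channel state and picks.
    ConnectivityState onRawState(ConnectivityState raw) {
        if (effective == ConnectivityState.TRANSIENT_FAILURE
                && raw != ConnectivityState.READY) {
            // Ignore CONNECTING/IDLE while recovering from a failure.
            return effective;
        }
        effective = raw;
        return effective;
    }
}
```

New and not-yet-failed subchannels pass their raw state through unchanged, so a freshly created subchannel still counts as CONNECTING; only subchannels that have already failed are pinned to TRANSIENT_FAILURE until READY.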
This was done in C core in grpc/grpc#20245
This problem really impacts all LBs, even pick_first. However, round_robin is hit particularly hard compared to pick_first. @dfawley and I are quite interested in expanding the scope of this change to more parts of grpc, but there are some issues it creates that would need to be resolved, mainly around when we choose to reconnect.