round_robin can continually be CONNECTING #6650
Does this rule apply to all other load balancers, especially multi-locality or multi-cluster balancers? Then again, if all other balancers apply this rule, multi-locality and multi-cluster are not a big concern for now: since all their underlying child policies follow the rule, the parent balancer will automatically follow it as well.
This issue is only referring to round_robin. But if you ask me or Doug, yes, this should apply across the board (including the Channel). Short-term that may mean the policy handles it manually, but really I think we should move the behavior into the subchannel itself. That will take a substantial redesign of pick_first, however.
Fixed by #6657.
Wait. Isn't this issue intended to be open for tracking the long-term fix (as mentioned above)? I intentionally did not close it in #6657.
This issue is only for round robin. |
If you have enough backends and they are all unavailable, then it becomes likely that at least one of them will be CONNECTING at any given moment. That delays RPCs instead of failing them with a clear error message.
When a subchannel becomes TRANSIENT_FAILURE, we want RR to continue considering it (for channel state and picking logic) TRANSIENT_FAILURE until the subchannel becomes READY. That means it would "ignore" CONNECTING subchannels, except for new and recently-READY subchannels.
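The rule above can be sketched as a small state tracker. This is a hypothetical illustration, not grpc-java's actual implementation; the enum and class names here are invented for the example (the real `io.grpc.ConnectivityState` also includes IDLE and SHUTDOWN handling not shown):

```java
// Sketch of the "sticky TRANSIENT_FAILURE" rule proposed in this issue:
// once a subchannel reports TRANSIENT_FAILURE, keep treating it as
// TRANSIENT_FAILURE (for aggregation and picking) until it actually
// becomes READY, so intermediate CONNECTING reports don't mask the failure.
enum ConnectivityState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

final class StickyStateTracker {
    private ConnectivityState effective = ConnectivityState.IDLE;

    // Called with each raw state update from the subchannel; returns the
    // state the balancer should use when computing channel state and picks.
    ConnectivityState onRawState(ConnectivityState raw) {
        if (effective == ConnectivityState.TRANSIENT_FAILURE
                && raw != ConnectivityState.READY) {
            // Ignore CONNECTING/IDLE while recovering from a failure.
            return effective;
        }
        effective = raw;
        return effective;
    }
}
```

New and not-yet-failed subchannels pass their raw state through unchanged, so a freshly created subchannel still counts as CONNECTING; only subchannels that have already failed are pinned to TRANSIENT_FAILURE until READY.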
This was done in C core in grpc/grpc#20245
This problem really impacts all LBs, even pick_first. However, round_robin is hit particularly hard compared to pick_first. @dfawley and I are quite interested in expanding the scope of this change to more parts of grpc, but there are some issues it creates that would need to be resolved, mainly around when we choose to reconnect.