round_robin can continually be CONNECTING #6650

Closed
ejona86 opened this issue Jan 28, 2020 · 6 comments

@ejona86
Member

ejona86 commented Jan 28, 2020

If you have enough backends and they are all unavailable, it becomes likely that at least one of them will be CONNECTING at any given moment. That delays RPCs and keeps them from failing with a clear error message.
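
As a rough illustration of why this becomes likely as the backend count grows, here is a minimal sketch. It assumes each unavailable backend is independently in CONNECTING with some fixed probability at any instant. Both the independence and the 5% figure are assumptions made up for the example, not measurements, and `prob_any_connecting` is a hypothetical helper.

```python
def prob_any_connecting(num_backends: int, p_connecting: float) -> float:
    """Probability that at least one backend is CONNECTING.

    Assumes (for illustration only) that each backend is independently
    CONNECTING with probability p_connecting at any given instant.
    """
    return 1.0 - (1.0 - p_connecting) ** num_backends


# Hypothetical figure: each backend spends 5% of its time in CONNECTING.
for n in (1, 10, 50, 100):
    print(n, round(prob_any_connecting(n, 0.05), 3))
```

Under these assumptions the probability climbs quickly with the number of backends, which is the effect described above.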

When a subchannel becomes TRANSIENT_FAILURE, we want RR to continue considering it (for channel state and picking logic) TRANSIENT_FAILURE until the subchannel becomes READY. That means it would "ignore" CONNECTING subchannels, except for new and recently-READY subchannels.
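
The rule above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the gRPC implementation: `State`, `StickySubchannelState`, and `aggregate` are hypothetical names, and the aggregation rule (READY if any subchannel is READY, otherwise CONNECTING if any is CONNECTING or IDLE, otherwise TRANSIENT_FAILURE) is a simplifying assumption.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    CONNECTING = "CONNECTING"
    READY = "READY"
    TRANSIENT_FAILURE = "TRANSIENT_FAILURE"


class StickySubchannelState:
    """Tracks the state round_robin *considers* for one subchannel."""

    def __init__(self) -> None:
        self.effective = State.IDLE

    def update(self, reported: State) -> State:
        # Once in TRANSIENT_FAILURE, stay there until READY is reported,
        # so reconnect attempts (CONNECTING/IDLE) are "ignored".
        if self.effective is State.TRANSIENT_FAILURE and reported is not State.READY:
            return self.effective
        self.effective = reported
        return self.effective


def aggregate(states) -> State:
    """Assumed aggregation rule for the overall balancer state."""
    states = list(states)
    if State.READY in states:
        return State.READY
    if State.CONNECTING in states or State.IDLE in states:
        return State.CONNECTING
    return State.TRANSIENT_FAILURE


# Three backends, all down; one of them is retrying its connection.
trackers = [StickySubchannelState() for _ in range(3)]
for t in trackers:
    t.update(State.CONNECTING)
    t.update(State.TRANSIENT_FAILURE)
reported = [State.CONNECTING, State.TRANSIENT_FAILURE, State.TRANSIENT_FAILURE]
considered = [t.update(s) for t, s in zip(trackers, reported)]

print("raw aggregate:   ", aggregate(reported).value)
print("sticky aggregate:", aggregate(considered).value)
```

With the raw reported states, a single retrying subchannel is enough to make the aggregate CONNECTING. With the sticky tracking, the aggregate stays TRANSIENT_FAILURE until some subchannel actually becomes READY, while new and recently-READY subchannels still count as CONNECTING.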

This was done in C core in grpc/grpc#20245.

This problem really affects all LB policies, including pick_first. However, round_robin is hit considerably harder than pick_first. @dfawley and I are quite interested in expanding the scope of this change to more parts of grpc, but it creates some issues that would need to be resolved, mainly around when we choose to reconnect.

@dapengzhang0
Member

dapengzhang0 commented Feb 7, 2020

Does this rule apply to all other load balancers, especially multi-locality or multi-cluster balancers? That said, if all other balancers apply this rule, then multi-locality and multi-cluster are not a big concern for now: since all of their underlying child policies apply the rule, the parent balancer will automatically satisfy it.

@ejona86
Member Author

ejona86 commented Feb 7, 2020

This issue refers only to round_robin. But if you ask me or Doug, yes, this should apply across the board (including the Channel). In the short term that may mean each policy handles it manually, but I really think we should move the behavior into the subchannel itself. That will take a substantial redesign of pick_first, however.

@dapengzhang0
Member

Fixed by #6657.

@voidzcy
Contributor

voidzcy commented Apr 7, 2020

Wait. Isn't this issue intended to stay open for tracking the long-term fix (as mentioned above)? I intentionally did not close it in #6657.

@ejona86
Member Author

ejona86 commented Apr 7, 2020

This issue is only for round robin.

@dapengzhang0
Member

@voidzcy I filed a tracking issue: #6906.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 5, 2021