Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A76: Improvements to the Ring Hash LB Policy #412
base: master
Are you sure you want to change the base?
A76: Improvements to the Ring Hash LB Policy #412
Changes from 2 commits
e558476
5407438
cc3af7f
01449d7
cf65a14
202a225
0de42f2
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably also worth referencing A61, which significantly simplified the ring_hash picker logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks for pointing this out, I wasn't aware of A61's impact on ring hash. I spent some time to understand how that affects the implementation I started in Go. Go implements A62 (pick first sticky transient failure) but not A61, so it cannot take advantage of this without further refactoring the ring hash policy. I'll sync up with the Go team to see what the best path forward is, but I imagine it'd be best to start with implementing at least part of A61 (delegating to
pick_first
).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a quick discussion with @dfawley I decided to try to implement this before A61. I think the proposal could be more clear in this case, at least for Go, because there is a difference between immediately attempting to connect (which happens only for subchannels in "idle" state), and queuing a connect for subchannels in transient failure, where a connection attempt will be triggered when the subchannel returns to
idle
state after the connection backoff.I implemented the new behavior of scanning forward for a ready subchannel, either when:
I think it has the desired behaviour of making sure we don't trigger more than one connection attempt on each pick if there is a ready subchannel, not adding latency if there is a ready subchannel, and converge to random. But I also considered only implementing the scanning forward for ready subchannel when we triggered an immediate connection attempt, which happens when either:
This ambiguity will disappear when all implementations have A61 implemented. This is planned for this quarter for Go, IIUC, but I imagine there may be reasons for some implementations to want to implement A76 before A61. My question is whether you think it's worth adding a note about it in this gRFC to lift this ambiguity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My inclination is that this logic is already complicated enough to understand, so I'd prefer to avoid muddying the waters by trying to do this without having first implemented the relevant parts of A61. From first glance, what you describe here seems reasonable, but I think I'd have to stare at it for a lot longer to convince myself it didn't have any potential pitfalls.
@dfawley, is it not feasible to implement just the ring_hash part of A61 before implementing this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDS responses are handled in a load balancing policy, so I'm not sure if "name resolver attribute" is the right terminology here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I interpreted your comment as two things:
1 seems to be in motion with A74, which removes the cluster resolver LB policy in favor of the "xDS resolver".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To get protobuf syntax highlighting, instead of indenting the code, use a fenced code block, with the language name
proto
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context of a Protobuf 3 message, an unset string field is the same as an empty string value, so stating that the field "defaults to the empty string" is redundant with the default Protobuf behavior. In addition, "defaults to the empty string" seems to imply that the empty string value will be used in some way, but earlier it says that "not set" and "empty" are treated the same. I think it would be clearer to say something like "Optional, unused if unset".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, replaced with "Optional, unused if unset.". I also removed mentions of "not set" in the implementation section, since there is no way to distinguish not set from empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deploying a new version of gRPC will clear all ring_hash state anyway, so I don't understand how this environment variable would do anything to prevent churn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deploying with the environment variable unset will keep the current behavior of hashing the IP address instead of the new behavior of using the value from LbEndpoint.Metadata. As a result, the locations of existing endpoints on the ring will be guaranteed to be the same as before, and requests will continue being routed to the same endpoint. It will churn connections, if that's what you mean, but not endpoint locations on the ring. Suddenly changing all locations on the ring may have unintended consequences, removing the locality that ring hash may have been used for in the first place.
I'm thinking of a better phrasing, and the best I could think of was to replace "churn" with "change". Let me know if this is still unclear.