A42 update: fix ring_hash connectivity state aggregation rules #296
@@ -347,8 +347,10 @@ ring will be stored within the picker, so any time a new ring is
generated, a new picker will be returned.

The picker will contain not just the ring but also the current state of
-every subchannel in the ring. Whenever a subchannel's state changes, a
-new picker will need to be generated.
+every subchannel in the ring. Implementations may either generate a new
+picker every time a subchannel's state changes, or they may provide a
+mechanism to synchronize subchannel state data between the LB policy
+itself and the picker.
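
As an illustration of the first option above (regenerating an immutable picker whenever a subchannel's state changes), here is a minimal C++ sketch. All names in it (`RingEntry`, `Picker`, `RingHashLbPolicy`, `OnSubchannelStateUpdate`) are hypothetical, not identifiers from the gRFC or any gRPC implementation:

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical connectivity-state enum mirroring the gRPC states.
enum class ConnectivityState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE };

struct Subchannel;  // Provided by the client channel; opaque here.

// One ring entry: a subchannel plus a snapshot of its connectivity state.
struct RingEntry {
  Subchannel* subchannel;
  ConnectivityState state;
};

// An immutable picker: it owns a snapshot of the ring, including each
// subchannel's state at the moment the picker was created.
class Picker {
 public:
  explicit Picker(std::vector<RingEntry> ring) : ring_(std::move(ring)) {}
  const std::vector<RingEntry>& ring() const { return ring_; }

 private:
  const std::vector<RingEntry> ring_;
};

// In the LB policy: whenever a subchannel reports a new state, update the
// cached copy and hand a fresh snapshot to a brand-new picker.
class RingHashLbPolicy {
 public:
  void OnSubchannelStateUpdate(size_t index, ConnectivityState new_state) {
    ring_[index].state = new_state;
    picker_ = std::make_shared<Picker>(ring_);  // New picker, new snapshot.
    // ...report picker_ (and the aggregated state) up to the channel...
  }

 private:
  std::vector<RingEntry> ring_;
  std::shared_ptr<Picker> picker_;
};
```

The second option mentioned in the new text, synchronizing state between the policy and a long-lived picker, would instead require a thread-safe shared structure; the snapshot approach trades some allocation cost for lock-free reads in the pick path.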

##### LB Policy Config

@@ -406,36 +408,85 @@ parent.

##### Aggregated Connectivity State

-The `ring_hash` policy will use the normal aggregation rules for reporting
-the overall connectivity state to its parent (i.e., the same rules used
-by `weighted_target`), but with one special case. If there are 2 or more
-subchannels in state `TRANSIENT_FAILURE` and all other subchannels are in
-state `IDLE` or `CONNECTING`, then the policy will report `TRANSIENT_FAILURE`.
-This heuristic ensures that the priority policy will fail over to the
-next priority quickly when there's an outage.
-
-So the overall aggregation rules here are:
-1. If there is at least one subchannel in `READY` state, report `READY`.
-2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report
-   `TRANSIENT_FAILURE`.
-3. If there is at least one subchannel in `CONNECTING` state, report
-   `CONNECTING`.
-4. If there is at least one subchannel in `IDLE` state, report `IDLE`.
-5. Otherwise, report `TRANSIENT_FAILURE`.
-
-While the `ring_hash` policy is reporting `TRANSIENT_FAILURE`, it will not be
-getting any pick requests from the priority policy. However, because the
-`ring_hash` policy does not attempt to reconnect to subchannels unless it
-is getting pick requests, it will need special handling to ensure that it
-will eventually recover from `TRANSIENT_FAILURE` state once the problem is
-resolved. Specifically, it will make sure that it is attempting to
+Because the `ring_hash` policy does not proactively connect to
+subchannels but rather triggers connection attempts based on picks, it
+cannot use the aggregation rules used by most LB policies (e.g., those used
+by the `weighted_target` policy) as-is. Instead, it has two differences
+in how it handles `TRANSIENT_FAILURE`, both of which are motivated by
+ensuring that the `ring_hash` policy will work properly as a child of the
+`priority` policy.
+
+The `priority` policy essentially has an ordered list of child policies and
+will send picks to the highest priority child that is currently reporting
+`READY` or `IDLE`. This means that for `ring_hash` to function as a child
+of the `priority` policy, it needs to report `TRANSIENT_FAILURE` when its
+subchannels are not reachable. However, because `ring_hash` attempts to
+connect only to those subchannels that pick requests hash to, there are
+likely to be some subchannels that never attempt to connect and are reporting
+state `IDLE`, which means that under the normal aggregation rules, the
+`ring_hash` policy would never report `TRANSIENT_FAILURE`. And more
+generally, if `ring_hash` has only attempted to connect to a small
+subset of its subchannels, it cannot truly know the overall reachability
+of all of the subchannels. To address these problems, the `ring_hash`
+policy will use a heuristic that if there are two or more subchannels
+reporting `TRANSIENT_FAILURE` and none in state `READY`, it will report
+an aggregated connectivity state of `TRANSIENT_FAILURE`. This heuristic
+is an attempt to balance the need to allow the `priority` policy to
+quickly fail over to the next priority and the desire to avoid reporting
+the entire `ring_hash` policy as having failed when the problem is just
+one individual subchannel that happens to be unreachable.
+
+In addition, once the `ring_hash` policy reports `TRANSIENT_FAILURE`, it
+needs some way to recover from that state. The `ring_hash` policy
+normally requires pick requests to trigger subchannel connection
+attempts, but if it is being used as a child of the `priority` policy,
+it will not be getting any picks once it reports `TRANSIENT_FAILURE`.
+To work around this, it will make sure that it is attempting to
connect (after applicable backoff period) to at least one subchannel at
any given time. After a given subchannel fails a connection attempt, it
will move on to the next subchannel in the ring. It will keep doing this
until one of the subchannels successfully connects, at which point it
-will report `READY` and stop proactively trying to connect. The policy
-will remain in `TRANSIENT_FAILURE` until at least one subchannel becomes
-connected, even if subchannels are in state `CONNECTING` during that time.
+will report `READY` and stop proactively trying to connect.

+Another issue is the way that the `priority` policy handles its failover
+timer. The failover timer is used to apply an upper bound to the amount
+of time that the `priority` policy waits for a child policy to become
+connected before it gives up and creates the child policy for the next
+priority. The failover timer is started when a child is first created
+and is cancelled when the child reports any state other than `CONNECTING`.
+To allow this timer to work properly, the `ring_hash` policy should
+remain in state `CONNECTING` until it transitions to either
+`TRANSIENT_FAILURE` or `READY`. Specifically, after the first subchannel
+reports `TRANSIENT_FAILURE` and all other subchannels are in `IDLE`, it
+should continue to report `CONNECTING` instead of `IDLE`. In this case,
+just as in the `TRANSIENT_FAILURE` case above, it will proactively attempt
+to connect to at least one subchannel at all times while it is reporting
+`CONNECTING`, so that it does not stay in state `CONNECTING` indefinitely
+if it is not receiving picks (e.g., if the application is only occasionally
+starting RPCs and giving them very short deadlines).
+
+Note that when the `ring_hash` policy first starts up with a completely
+new set of subchannels that are all in state `IDLE`, it will report `IDLE`
+as a consequence of the aggregation rules shown below. This is different
+from most policies, which start in state `CONNECTING`, and it will
+prevent the failover timer in the `priority` policy from working
+correctly, because the timer will be started when the child is created
+but then immediately cancelled when it reports `IDLE`. To address this,
+we will change the `priority` policy to restart the failover timer when a
+child reports `CONNECTING`, if that child has not reported `TRANSIENT_FAILURE`
+more recently than it reported `READY` or `IDLE`.

Review comment:
Just to confirm, consider the following case: […] Does this mean we can now rely on the […]

Reply:
Yes, it does mean that we'll fail over faster in the case where all backends are blackholed. The down-side of this is that we may wind up failing over without ever having tried more than one address. So while this will fail over more quickly in the case where all of the addresses are blackholed, if there is a case where only the first address we try is blackholed but all of the other addresses are working fine, we will wind up failing over when we ideally shouldn't. If that becomes a problem, and if we can ever agree on the semantics for happy eyeballs, we could potentially use something like that to improve this in the future.

+Taking all of the above into account, the aggregation rules for
+the `ring_hash` policy are as follows:
+1. If there is at least one subchannel in `READY` state, report `READY`.
+2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report
+   `TRANSIENT_FAILURE`.
+3. If there is at least one subchannel in `TRANSIENT_FAILURE` and there
+   is more than one subchannel, report state `CONNECTING`.
+4. If there is at least one subchannel in `CONNECTING` state, report
+   `CONNECTING`.
+5. If there is at least one subchannel in `IDLE` state, report `IDLE`.
+6. Otherwise, report `TRANSIENT_FAILURE`.

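Expressed as code, the six rules above might look like the following sketch. This is an editor's illustration of the rules as listed, not normative pseudocode from the gRFC; the names (`StateCounts`, `AggregateRingHashState`) are hypothetical:

```cpp
#include <cstddef>

enum class ConnectivityState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE };

// Counts of subchannels currently in each state.
struct StateCounts {
  size_t num_ready = 0;
  size_t num_connecting = 0;
  size_t num_idle = 0;
  size_t num_transient_failure = 0;

  size_t total() const {
    return num_ready + num_connecting + num_idle + num_transient_failure;
  }
};

// Applies the six aggregation rules, in order.
ConnectivityState AggregateRingHashState(const StateCounts& c) {
  // Rule 1: any READY subchannel means the ring can serve picks.
  if (c.num_ready > 0) return ConnectivityState::READY;
  // Rule 2: two or more failures -> treat the whole ring as failed so
  // that the priority policy can fail over quickly.
  if (c.num_transient_failure >= 2) {
    return ConnectivityState::TRANSIENT_FAILURE;
  }
  // Rule 3: a single failure (with more than one subchannel) keeps the
  // policy in CONNECTING, so the priority failover timer keeps running.
  if (c.num_transient_failure >= 1 && c.total() > 1) {
    return ConnectivityState::CONNECTING;
  }
  // Rule 4: at least one connection attempt is in flight.
  if (c.num_connecting > 0) return ConnectivityState::CONNECTING;
  // Rule 5: nothing has been attempted yet.
  if (c.num_idle > 0) return ConnectivityState::IDLE;
  // Rule 6: e.g., a single subchannel in TRANSIENT_FAILURE.
  return ConnectivityState::TRANSIENT_FAILURE;
}
```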

##### Picker Behavior

@@ -504,6 +555,7 @@ if (ring[first_index].state == CONNECTING) {
  return PICK_QUEUE;
}
if (ring[first_index].state == TRANSIENT_FAILURE) {
+ ring[first_index].subchannel.TriggerConnectionAttemptInControlPlane();
  first_subchannel = ring[first_index].subchannel;
  found_second_subchannel = false;
  found_first_non_failed = false;

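The recovery behavior described in the Aggregated Connectivity State text above, keeping exactly one connection attempt in flight and moving on around the ring after each failure until some subchannel becomes `READY`, could look roughly like the sketch below. Again, this is an editor's illustration under assumed names (`ProactiveConnector`, `RequestConnection`), not the gRFC's pseudocode:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical subchannel handle; RequestConnection() is assumed to
// honor the applicable backoff period internally.
struct Subchannel {
  void RequestConnection() {}  // Stubbed out for the sketch.
};

// While ring_hash reports TRANSIENT_FAILURE (or CONNECTING without
// receiving picks), keep exactly one subchannel attempting to connect.
class ProactiveConnector {
 public:
  explicit ProactiveConnector(std::vector<Subchannel*> ring)
      : ring_(std::move(ring)) {}

  // Start (or continue) the proactive connection attempts.
  void Start() {
    if (ring_.empty() || !active_) return;
    ring_[index_]->RequestConnection();
  }

  // Called when the subchannel currently being attempted fails: move on
  // to the next subchannel in the ring and try again.
  void OnAttemptFailed() {
    index_ = (index_ + 1) % ring_.size();
    Start();
  }

  // Called when any subchannel becomes READY: stop proactive connecting.
  void OnSubchannelReady() { active_ = false; }

 private:
  std::vector<Subchannel*> ring_;
  size_t index_ = 0;
  bool active_ = true;
};
```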
Review comment:
Is this accurate, that there can be addresses which ring hash will never attempt? I might be wrong, but I thought that if every connection attempt failed, ring hash would eventually circle around and try every address. I'm asking this because, if ring hash will eventually attempt every address, then considering the new priority changes to properly handle the failover timeout with ring hash, I'm wondering if this heuristic to enter TRANSIENT_FAILURE after encountering two connection failures is still worth having -- i.e., what problem would it solve that isn't solved by priority failover?

Reply:
Yes, you're right, it will eventually try all of the subchannels. However, this change is still needed for two reasons:
1. The `priority` policy uses an idempotent algorithm for choosing a priority that is triggered every time a child's state changes, and that algorithm treats `IDLE` the same as `READY` -- i.e., the state of the failover timer matters only when the child is reporting `CONNECTING`. So if `ring_hash` continues to report `IDLE` after the failover timer fires, we'll wind up re-selecting it as soon as the newly created next priority reports `CONNECTING`, which is not what we want. (To say this another way, the `priority` policy assumes that a policy in `IDLE` will transition to `CONNECTING` as soon as it gets a pick.)
2. The `ring_hash` policy is designed to not be xDS-specific; eventually, we'd like to be able to use it as a top-level policy even without xDS. At that point, it would still be necessary to have the same heuristic here so that the channel will properly go into `TRANSIENT_FAILURE` when things are not working, rather than staying in `IDLE` indefinitely. If things are not working, we want non-wait_for_ready RPCs to fail quickly.

I've updated the wording here to try to clarify this a bit.
Review comment:
Copying over context from chat FTR: when `ring_hash` is under `priority`, it could stay in CONNECTING for however long it took until reaching TF, and priority would do the right thing. However, when ring_hash is a top-level policy, we need to balance the goals of failing fast in some cases and using all failover addresses in others.

My only thought about waiting until every subchannel enters TF is that it may be easier for users to understand -- the time taken for ring hash to reach TF would be proportional to ring hash's address list size (i.e., we remove the need to understand this heuristic). But OTOH, taking forever to reach TF introduces its own problems.
Reply:
Yeah, we really can't wait until every single subchannel has been attempted before going TF, since that would basically turn every RPC into a wait_for_ready RPC, which is definitely not what we want.

This is a heuristic, and like any heuristic, it's an imperfect attempt to balance competing objectives. No matter what value we choose, someone can argue that a different value would be better in their case, but we can't choose the value they want without making it worse in some other case.

At the end of the day, I think two is the right number here. We definitely don't want to use just one, because having one individual backend unreachable is probably not uncommon, and we don't want to fail all RPCs in that case. But it's fairly unlikely that the first two backends we happen to try are both independently down at the same time without there being some broader reachability issue that will also affect any other backend we might try. Increasing the number to three might give us slightly more confidence that there is a broader reachability issue, but that slight boost in confidence doesn't seem like enough to justify the additional delay in failing non-wait_for_ready RPCs. And the trade-off gets dramatically worse as you go up from three: the increased confidence becomes smaller and smaller, and the delay before failing non-wait_for_ready RPCs becomes larger and larger. So I think two is the right sweet spot here.
At the end of the day, I think two is the right number here. We definitely don't want to do just one, because having one individual backend unreachable is probably not uncommon, and we don't want to fail all RPCs in that case. But it's fairly unlikely that the first two backends that we happen to try are both independently down at the same time without there being some broader reachability issue that will also affect any other backend we might try. Increasing the number to three might give us slightly more confidence that there is a broader reachability issue that will also affect any other backend, but that slight boost in confidence doesn't seem like enough to justify the additional delay in failing non-wait_for_ready RPCs. And the trade-off gets dramatically worse as you go up from three: the increased confidence becomes smaller and smaller, and the delay before failing non-wait_for_ready RPCs becomes larger and larger. So I think two is the right sweet-spot here.