A42 update: fix ring_hash connectivity state aggregation rules #296

Merged · 7 commits · Apr 14, 2022
A42-xds-ring-hash-lb-policy.md: 109 changes (82 additions, 27 deletions)

@@ -347,8 +347,10 @@

ring will be stored within the picker, so any time a new ring is
generated, a new picker will be returned.

The picker will contain not just the ring but also the current state of
every subchannel in the ring. Whenever a subchannel's state changes, a
new picker will need to be generated.
every subchannel in the ring. Implementations may either generate a new
picker every time a subchannel's state changes, or they may provide a
mechanism to synchronize subchannel state data between the LB policy
itself and the picker.
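
For illustration only, one possible shape for that picker state is sketched below, in the same pseudo-code style used for the picker logic later in this document; the type and field names are not part of the design:

```
// Snapshot of one entry on the ring, captured when the picker is created.
struct RingEntry {
  uint64 hash;              // this entry's position on the ring
  Subchannel subchannel;    // the subchannel this entry maps to
  ConnectivityState state;  // that subchannel's state as known to the picker
};

// The picker holds the ring entries plus per-entry connectivity state.
// Under the "new picker per state change" approach, the LB policy builds a
// fresh RingHashPicker from the latest subchannel states on every state
// update; under the "synchronize" approach, the state fields are instead
// shared with the LB policy and updated in place under synchronization.
struct RingHashPicker {
  RingEntry ring[];
};
```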

##### LB Policy Config

@@ -406,36 +408,88 @@ parent.

##### Aggregated Connectivity State

Because the `ring_hash` policy does not proactively connect to
subchannels but rather triggers connection attempts based on picks, it
cannot use the aggregation rules used by most LB policies (e.g., those used
by the `weighted_target` policy) as-is. Instead, it has two differences
in how it handles `TRANSIENT_FAILURE`, both of which are motivated by
ensuring that the `ring_hash` policy will report reasonable connectivity
state transitions to its parent (which today is always the `priority`
policy but might at some point in the future be the channel itself).

The `priority` policy essentially has an ordered list of child policies and
will send picks to the highest priority child that is currently reporting
`READY` or `IDLE`. This means that for `ring_hash` to function as a child
of the `priority` policy, it needs to report `TRANSIENT_FAILURE` when its
subchannels are not reachable. However, `ring_hash` attempts to connect
only to those subchannels that pick requests hash to, and if the first
subchannel fails to connect, it then sequentially attempts to connect to
subsequent subchannels in ring order, so it may take a very long time
for all of the subchannels to report `TRANSIENT_FAILURE` instead of `IDLE`.
Under the normal aggregation rules, that means that the `ring_hash` policy
would take far too long to report `TRANSIENT_FAILURE`. And more
generally, if `ring_hash` has only attempted to connect to a small
subset of its subchannels, it cannot truly know the overall reachability
of all of the subchannels. To address these problems, the `ring_hash`
policy will use the following heuristic: if there are two or more subchannels
reporting `TRANSIENT_FAILURE` and none in state `READY`, it will report
an aggregated connectivity state of `TRANSIENT_FAILURE`. This heuristic
is an attempt to balance the need to allow the `priority` policy to
fail over quickly to the next priority with the desire to avoid reporting
the entire `ring_hash` policy as having failed when the problem is just
one individual subchannel that happens to be unreachable.

In addition, once the `ring_hash` policy reports `TRANSIENT_FAILURE`, it
needs some way to recover from that state. The `ring_hash` policy
normally requires pick requests to trigger subchannel connection
attempts, but if it is being used as a child of the `priority` policy,
it will not be getting any picks once it reports `TRANSIENT_FAILURE`.
To work around this, it will make sure that it is attempting to
connect (after the applicable backoff period) to at least one subchannel at
any given time. After a given subchannel fails a connection attempt, it
will move on to the next subchannel in the ring. It will keep doing this
until one of the subchannels successfully connects, at which point it
will report `READY` and stop proactively trying to connect.
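
The following sketch, in the same pseudo-code style as the picker logic later in this document, shows one way this proactive connection management could work; the helper and variable names are illustrative, not prescribed by this design:

```
// Called on each subchannel state update while ring_hash is reporting
// TRANSIENT_FAILURE (and, as described below, CONNECTING) and is therefore
// not receiving picks from its parent.
void EnsureOneSubchannelConnecting() {
  // If some subchannel is already attempting to connect, there is nothing
  // to do; we only ever force one connection attempt at a time.
  for (entry in ring) {
    if (entry.state == CONNECTING) return;
  }
  // Otherwise, ask the subchannel at the next ring position to connect.
  // A subchannel in TRANSIENT_FAILURE will wait out its backoff period
  // before actually starting the new connection attempt.
  last_attempted_index = (last_attempted_index + 1) % ring.size();
  ring[last_attempted_index].subchannel.RequestConnection();
}
```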

Another issue is the way that the `priority` policy handles its failover
timer. The failover timer is used to apply an upper bound to the amount
of time that the `priority` policy waits for a child policy to become
connected before it gives up and creates the child policy for the next
priority. The failover timer is started when a child is first created
and is cancelled when the child reports any state other than `CONNECTING`.
To allow this timer to work properly, the `ring_hash` policy should
remain in state `CONNECTING` until it transitions to either
`TRANSIENT_FAILURE` or `READY`. Specifically, after the first subchannel
reports `TRANSIENT_FAILURE` and all other subchannels are in `IDLE`, it
should continue to report `CONNECTING` instead of `IDLE`. In this case,
just as in the `TRANSIENT_FAILURE` case above, it will proactively attempt
to connect to at least one subchannel at all times while it is reporting
`CONNECTING`, so that it does not stay in state `CONNECTING` indefinitely
if it is not receiving picks (e.g., if the application is only occasionally
starting RPCs and giving them very short deadlines).

Note that when the `ring_hash` policy first starts up with a completely
new set of subchannels that are all in state `IDLE`, it will report `IDLE`
as a consequence of the aggregation rules shown below. This is different
from most policies, which start in state `CONNECTING`, and it will
prevent the failover timer in the `priority` policy from working
correctly, because the timer will be started when the child is created
but then immediately cancelled when it reports `IDLE`. To address this,
we will change the `priority` policy to restart the failover timer when a
child reports `CONNECTING`, if that child has not reported `TRANSIENT_FAILURE`
more recently than it reported `READY` or `IDLE`.

> **Review comment (Contributor):** Just to confirm, consider the following case:
>
> - ring hash subchannels are all unreachable in an unresponsive way (so it takes e.g. 30 seconds to notice each connection failure)
> - ring hash is first under a priority policy, and a working failover priority follows it
> - the channel and the LB policy tree are just starting up, i.e. the first RPC is made on a new channel
>
> Does this mean we can now rely on the priority failover timer to fail over after 10 seconds? (where before we needed to wait for ring hash to enter TF)
>
> **@markdroth (Member, Author), Apr 11, 2022:** Yes, it does mean that we'll fail over faster in the case where all backends are blackholed.
>
> The down-side of this is that we may wind up failing over without ever having tried more than one address. So while this will fail over more quickly in the case where all of the addresses are blackholed, if there is a case where only the first address we try is blackholed but all of the other addresses are working fine, we will wind up failing over when we ideally shouldn't.
>
> If that becomes a problem, and if we can ever agree on the semantics for happy eyeballs, we could potentially use something like that to improve this in the future.
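
The sketch below shows how the `priority` policy's handling of child state updates might implement this failover-timer behavior; it is illustrative pseudo-code in the same style as the rest of this document, not the actual `priority` policy implementation:

```
// Called when a child policy reports a new connectivity state to priority.
void OnChildStateUpdate(child, new_state) {
  if (new_state == CONNECTING) {
    // Restart the failover timer, but only if the child has not reported
    // TRANSIENT_FAILURE more recently than it reported READY or IDLE.
    if (!child.seen_tf_since_ready_or_idle && !child.failover_timer_pending) {
      child.StartFailoverTimer();
    }
  } else {
    // Any state other than CONNECTING cancels the failover timer.
    child.CancelFailoverTimer();
    if (new_state == TRANSIENT_FAILURE) {
      child.seen_tf_since_ready_or_idle = true;
    } else {  // READY or IDLE
      child.seen_tf_since_ready_or_idle = false;
    }
  }
  // ...normal priority child selection logic follows...
}
```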

Taking all of the above into account, the aggregation rules for
the `ring_hash` policy are as follows:
1. If there is at least one subchannel in `READY` state, report `READY`.
2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report
`TRANSIENT_FAILURE`.
3. If there is at least one subchannel in `CONNECTING` state, report
`CONNECTING`.
4. If there is one subchannel in `TRANSIENT_FAILURE` and there is more than
one subchannel, report state `CONNECTING`.
5. If there is at least one subchannel in `IDLE` state, report `IDLE`.
6. Otherwise, report `TRANSIENT_FAILURE`.
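
Expressed in the same pseudo-code style as the picker logic below, the aggregation could look like the following sketch; the counting helpers are illustrative only:

```
ConnectivityState AggregatedRingHashState() {
  if (NumSubchannelsInState(READY) > 0) return READY;
  if (NumSubchannelsInState(TRANSIENT_FAILURE) >= 2) return TRANSIENT_FAILURE;
  if (NumSubchannelsInState(CONNECTING) > 0) return CONNECTING;
  if (NumSubchannelsInState(TRANSIENT_FAILURE) == 1 && NumSubchannels() > 1) {
    return CONNECTING;
  }
  if (NumSubchannelsInState(IDLE) > 0) return IDLE;
  return TRANSIENT_FAILURE;
}
```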

##### Picker Behavior

@@ -504,6 +558,7 @@

if (ring[first_index].state == CONNECTING) {
  return PICK_QUEUE;
}
if (ring[first_index].state == TRANSIENT_FAILURE) {
  ring[first_index].subchannel.TriggerConnectionAttemptInControlPlane();
  first_subchannel = ring[first_index].subchannel;
  found_second_subchannel = false;
  found_first_non_failed = false;