A42 update: fix ring_hash connectivity state aggregation rules #296

Merged
merged 7 commits on Apr 14, 2022
Changes from 2 commits
100 changes: 72 additions & 28 deletions A42-xds-ring-hash-lb-policy.md
@@ -347,8 +347,10 @@ ring will be stored within the picker, so any time a new ring is
generated, a new picker will be returned.

The picker will contain not just the ring but also the current state of
every subchannel in the ring. Implementations may either generate a new
picker every time a subchannel's state changes, or they may provide a
mechanism to synchronize subchannel state data between the LB policy
itself and the picker.
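
As a concrete illustration of the first option (regenerating the picker whenever a subchannel's state changes), here is a minimal C++ sketch; the type names (`RingEntry`, `Picker`, `Subchannel`) and fields are hypothetical and not part of this design:

```
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Connectivity states as defined by gRPC.
enum class ConnectivityState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE };

// Placeholder for the real subchannel type.
struct Subchannel {};

// One ring entry: a hash point, the subchannel it maps to, and a snapshot of
// that subchannel's connectivity state taken when the picker was built.
struct RingEntry {
  uint64_t hash;
  std::shared_ptr<Subchannel> subchannel;
  ConnectivityState state;
};

// The picker owns an immutable copy of the ring plus the state snapshots, so
// the LB policy simply builds a new Picker each time any subchannel's state
// changes (the alternative is to share mutable state under synchronization).
class Picker {
 public:
  explicit Picker(std::vector<RingEntry> ring) : ring_(std::move(ring)) {}
  const std::vector<RingEntry>& ring() const { return ring_; }

 private:
  const std::vector<RingEntry> ring_;
};
```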

##### LB Policy Config

@@ -406,36 +408,77 @@ parent.

##### Aggregated Connectivity State

Because the `ring_hash` policy does not proactively connect to
subchannels but rather triggers connection attempts based on picks, it
cannot use the aggregation rules used by most LB policies (e.g., those used
by the `weighted_target` policy) as-is. Instead, it has two differences
in how it handles `TRANSIENT_FAILURE`, both of which are motivated by
ensuring that the `ring_hash` policy will work properly as a child of the
`priority` policy.

The `priority` policy essentially has an ordered list of child policies and
will send picks to the highest priority child that is currently reporting
`READY` or `IDLE`. This means that for `ring_hash` to function as a child
of the `priority` policy, it needs to report `TRANSIENT_FAILURE` when its
subchannels are not reachable. However, because `ring_hash` attempts to
connect only to those subchannels that pick requests hash to, there are

Contributor:

Is this accurate, that there can be addresses which ring hash will never attempt?

I might be wrong, but I thought that if every connection attempt failed, that ring hash will eventually circle around and try every address?

I'm asking this because, if ring hash will eventually attempt every address, then considering the new priority changes to properly handle the fail over timeout with ring hash, I'm wondering if this heuristic to enter TRANSIENT_FAILURE after encountering two connection failures is still worth having - i.e. what problem would it solve that isn't solved by priority fail over.

Member Author:

Yes, you're right, it will eventually try all of the subchannels. However, this change is still needed for two reasons:

  1. The priority policy uses an idempotent algorithm for choosing a priority that is triggered every time a child's state changes, and that algorithm treats IDLE the same as READY -- i.e., the state of the failover timer matters only when the child is reporting CONNECTING, so if ring_hash continues to report IDLE after the failover timer fires, we'll wind up re-selecting it as soon as the newly created next priority reports CONNECTING, which is not what we want. (To say this another way, the priority policy assumes that a policy in IDLE will transition to CONNECTING as soon as it gets a pick.)
  2. The ring_hash policy is designed to not be xDS-specific; eventually, we'd like to be able to use it as a top-level policy even without xDS. At that point, it would still be necessary to have the same heuristic here so that the channel will properly go into TRANSIENT_FAILURE when things are not working, rather than staying in IDLE indefinitely. If things are not working, we want non-wait_for_ready RPCs to fail quickly.

I've updated the wording here to try to clarify this a bit.

Contributor:

Copying over context from chat FTR:

When ring_hash is under priority, it could stay in connecting for however long it took until reaching TD, and priority would do the right thing.

However, when ring_hash is a top-level policy, we need to balance the goals of failing fast in some cases and using all failover addresses in others.

My only thought about waiting until every subchannel enters TF is that it may be easier for users to understand - that time taken for ring hash to reach TF is proportional to ring hash's address list size (i.e. we remove the need to understand this heuristic). But OTOH, taking forever to reach TF introduces its own problems.

Member Author:

Yeah, we really can't wait until every single subchannel has been attempted before going TF, since that would basically turn every RPC into a wait_for_ready RPC, which is definitely not what we want.

This is a heuristic, and like any heuristic, it's an imperfect attempt to balance between competing objectives. No matter what value we choose, someone can argue that a different value would be better in their case, but we can't choose the value they want without making it worse in some other case.

At the end of the day, I think two is the right number here. We definitely don't want to do just one, because having one individual backend unreachable is probably not uncommon, and we don't want to fail all RPCs in that case. But it's fairly unlikely that the first two backends that we happen to try are both independently down at the same time without there being some broader reachability issue that will also affect any other backend we might try. Increasing the number to three might give us slightly more confidence that there is a broader reachability issue that will also affect any other backend, but that slight boost in confidence doesn't seem like enough to justify the additional delay in failing non-wait_for_ready RPCs. And the trade-off gets dramatically worse as you go up from three: the increased confidence becomes smaller and smaller, and the delay before failing non-wait_for_ready RPCs becomes larger and larger. So I think two is the right sweet-spot here.

likely to be some subchannels that never attempt to connect and are reporting
state `IDLE`, which means that under the normal aggregation rules, the
`ring_hash` policy would never report `TRANSIENT_FAILURE`. And more
generally, if `ring_hash` has only attempted to connect to a small
subset of its subchannels, it cannot truly know the overall reachability
of all of the subchannels. To address these problems, the `ring_hash`
policy will use a heuristic that if there are two or more subchannels
reporting `TRANSIENT_FAILURE` and none in state `READY`, it will report
an aggregated connectivity state of `TRANSIENT_FAILURE`. This heuristic
is an attempt to balance the need to allow the `priority` policy to
quickly fail over to the next priority and the desire to avoid reporting
the entire `ring_hash` policy as having failed when the problem is just
one individual subchannel that happens to be unreachable.

In addition, once the `ring_hash` policy reports `TRANSIENT_FAILURE`, it
needs some way to recover from that state. The `ring_hash` policy
normally requires pick requests to trigger subchannel connection
attempts, but if it is being used as a child of the `priority` policy,
it will not be getting any picks once it reports `TRANSIENT_FAILURE`.
To work around this, it will make sure that it is attempting to
connect (after applicable backoff period) to at least one subchannel at
any given time. After a given subchannel fails a connection attempt, it
will move on to the next subchannel in the ring. It will keep doing this
until one of the subchannels successfully connects, at which point it
will report `READY` and stop proactively trying to connect.
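
A minimal sketch of that recovery behavior, assuming hypothetical names (`Subchannel`, `RequestConnection`) that are not defined in this document:

```
#include <cstddef>
#include <vector>

// Stand-in for the real subchannel type; RequestConnection() is assumed to
// start a connection attempt, honoring any backoff that is in effect.
struct Subchannel {
  void RequestConnection() {}
};

// Called when the subchannel at `failed_index` finishes a failed connection
// attempt while the policy is in TRANSIENT_FAILURE and therefore receiving
// no picks.  Moving on to the next subchannel keeps exactly one connection
// attempt in flight until some subchannel becomes READY.
void OnConnectionAttemptFailed(std::vector<Subchannel>& subchannels,
                               size_t failed_index) {
  size_t next_index = (failed_index + 1) % subchannels.size();
  subchannels[next_index].RequestConnection();
}
```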

Another issue is the way that the `priority` policy handles its failover
timer. The failover timer is used to apply an upper bound to the amount
of time that `priority` policy waits for a child policy to become
connected before it gives up and creates the child policy for the next
priority. The failover timer is started when the child reports state
`CONNECTING` and is cancelled when the child reports any other state.
To allow this timer to work properly, the `ring_hash` policy should
remain in state `CONNECTING` until it transitions to either
`TRANSIENT_FAILURE` or `READY`. And, just as in the `TRANSIENT_FAILURE`
case above, it will proactively attempt to connect to at least one
subchannel at all times while it is reporting `CONNECTING`, so that it
does not stay in state `CONNECTING` indefinitely if it is not receiving
picks (e.g., if the application is only occasionally starting RPCs and
giving them very short deadlines).
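
For context, a rough sketch of the failover-timer handling described above, as it might look on the `priority` side; the class and method names here are invented for illustration, and the actual timer plumbing is elided:

```
// Connectivity states as defined by gRPC.
enum class ConnectivityState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE };

// Reacts to state updates from one child policy (e.g., ring_hash).  The
// failover timer runs only while the child reports CONNECTING; READY, IDLE,
// and TRANSIENT_FAILURE all cancel it, because the priority policy reacts to
// those states immediately.
class PriorityChild {
 public:
  void OnChildStateUpdate(ConnectivityState child_state) {
    if (child_state == ConnectivityState::CONNECTING) {
      if (!failover_timer_running_) {
        failover_timer_running_ = true;
        // StartFailoverTimer();  // on expiry, create the next priority's child
      }
    } else {
      failover_timer_running_ = false;
      // CancelFailoverTimer();
    }
  }

 private:
  bool failover_timer_running_ = false;
};
```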

Taking all of the above into account, the aggregation rules for
the `ring_hash` policy are as follows:
1. If there is at least one subchannel in `READY` state, report `READY`.
2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report
`TRANSIENT_FAILURE`.
3. If there is at least one subchannel in `TRANSIENT_FAILURE` and there
is more than one subchannel, report state `CONNECTING`.
4. If there is at least one subchannel in `CONNECTING` state, report
`CONNECTING`.
5. If there is at least one subchannel in `IDLE` state, report `IDLE`.
6. Otherwise, report `TRANSIENT_FAILURE`.

Note that when the `ring_hash` policy first starts up with a completely
new set of subchannels that are all in state `IDLE`, it will report `IDLE`
as a consequence of the above aggregation rules. This is different from
most policies, which start in state `CONNECTING`.
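
Putting the rules above together, here is a small C++ sketch of the aggregation function. It is illustrative only; a real implementation would typically maintain running counters rather than rescanning every subchannel on each update.

```
#include <cstddef>
#include <vector>

// Connectivity states as defined by gRPC.
enum class ConnectivityState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE };

ConnectivityState AggregateRingHashState(
    const std::vector<ConnectivityState>& states) {
  size_t num_ready = 0, num_connecting = 0, num_idle = 0, num_tf = 0;
  for (ConnectivityState s : states) {
    switch (s) {
      case ConnectivityState::READY: ++num_ready; break;
      case ConnectivityState::CONNECTING: ++num_connecting; break;
      case ConnectivityState::IDLE: ++num_idle; break;
      case ConnectivityState::TRANSIENT_FAILURE: ++num_tf; break;
    }
  }
  // Rule 1: any READY subchannel means the policy can serve picks.
  if (num_ready > 0) return ConnectivityState::READY;
  // Rule 2: two or more failures trigger the TRANSIENT_FAILURE heuristic.
  if (num_tf >= 2) return ConnectivityState::TRANSIENT_FAILURE;
  // Rule 3: a single failure (with more than one subchannel) reports
  // CONNECTING, so the priority policy's failover timer keeps running.
  if (num_tf == 1 && states.size() > 1) return ConnectivityState::CONNECTING;
  // Rules 4 and 5: otherwise fall back on CONNECTING, then IDLE.
  if (num_connecting > 0) return ConnectivityState::CONNECTING;
  if (num_idle > 0) return ConnectivityState::IDLE;
  // Rule 6: e.g., a single subchannel in TRANSIENT_FAILURE.
  return ConnectivityState::TRANSIENT_FAILURE;
}
```

With a brand-new subchannel list in which every entry is `IDLE`, this function returns `IDLE`, matching the startup behavior noted above.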

##### Picker Behavior

@@ -504,6 +547,7 @@ if (ring[first_index].state == CONNECTING) {
  return PICK_QUEUE;
}
if (ring[first_index].state == TRANSIENT_FAILURE) {
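  // Proactively trigger a connection attempt for this entry via the control
  // plane, then keep scanning the ring for an entry that can serve the pick.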
  ring[first_index].subchannel.TriggerConnectionAttemptInControlPlane();
  first_subchannel = ring[first_index].subchannel;
  found_second_subchannel = false;
  found_first_non_failed = false;