diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md
index fd70f15c1..88ec463f3 100644
--- a/A42-xds-ring-hash-lb-policy.md
+++ b/A42-xds-ring-hash-lb-policy.md
@@ -347,8 +347,10 @@ ring will be stored within the picker, so any time a new ring is
 generated, a new picker will be returned.
 
 The picker will contain not just the ring but also the current state of
-every subchannel in the ring. Whenever a subchannel's state changes, a
-new picker will need to be generated.
+every subchannel in the ring. Implementations may either generate a new
+picker every time a subchannel's state changes, or they may provide a
+mechanism to synchronize subchannel state data between the LB policy
+itself and the picker.
 
 ##### LB Policy Config
 
@@ -406,36 +408,167 @@ parent.
 
 ##### Aggregated Connectivity State
 
-The `ring_hash` policy will use the normal aggregation rules for reporting
-the overall connectivity state to its parent (i.e., the same rules used
-by `weighted_target`), but with one special case. If there are 2 or more
-subchannels in state `TRANSIENT_FAILURE` and all other subchannels are in
-state `IDLE` or `CONNECTING`, then the policy will report `TRANSIENT_FAILURE`.
-This heuristic ensures that the priority policy will fail over to the
-next priority quickly when there's an outage.
-
-So the overall aggregation rules here are:
+Because the `ring_hash` policy does not proactively connect to
+subchannels but rather triggers connection attempts based on picks, it
+cannot use the aggregation rules used by most LB policies (e.g., those
+of the `weighted_target` policy) as-is. Instead, it has two differences
+in how it handles `TRANSIENT_FAILURE`, both of which are motivated by
+ensuring that the `ring_hash` policy will report reasonable connectivity
+state transitions to its parent (which today is always the `priority`
+policy but might at some point in the future be the channel itself).
+
+The `priority` policy essentially has an ordered list of child policies and
+will send picks to the highest priority child that is currently reporting
+`READY` or `IDLE`. This means that for `ring_hash` to function as a child
+of the `priority` policy, it needs to report `TRANSIENT_FAILURE` when its
+subchannels are not reachable. However, `ring_hash` attempts to connect
+only to those subchannels that pick requests hash to, and if the first
+subchannel fails to connect, it then sequentially attempts to connect to
+subsequent subchannels in ring order, so it may take a very long time
+for all of the subchannels to report `TRANSIENT_FAILURE` instead of `IDLE`.
+Under the normal aggregation rules, that means that the `ring_hash` policy
+would take far too long to report `TRANSIENT_FAILURE`. And more
+generally, if `ring_hash` has only attempted to connect to a small
+subset of its subchannels, it cannot truly know the overall reachability
+of all of the subchannels. To address these problems, the `ring_hash`
+policy will use the following heuristic: if there are two or more
+subchannels reporting `TRANSIENT_FAILURE` and none in state `READY`, it
+will report an aggregated connectivity state of `TRANSIENT_FAILURE`.
+This heuristic is an attempt to balance the need to allow the `priority`
+policy to quickly fail over to the next priority against the desire to
+avoid reporting the entire `ring_hash` policy as having failed when the
+problem is just one individual subchannel that happens to be unreachable.
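+
+The following sketch shows how an implementation might apply this
+heuristic when computing the aggregate state. The names used here
+(`SubchannelInfo`, `AggregateState`) are hypothetical, not part of any
+existing gRPC API; the complete set of aggregation rules appears at the
+end of this section.
+
+```
+#include <grpc/grpc.h>
+
+#include <cstddef>
+#include <vector>
+
+// Hypothetical per-subchannel view maintained by the ring_hash policy.
+struct SubchannelInfo {
+  grpc_connectivity_state state;
+};
+
+grpc_connectivity_state AggregateState(
+    const std::vector<SubchannelInfo>& subchannels) {
+  size_t num_transient_failure = 0;
+  bool any_ready = false;
+  for (const SubchannelInfo& sc : subchannels) {
+    if (sc.state == GRPC_CHANNEL_READY) any_ready = true;
+    if (sc.state == GRPC_CHANNEL_TRANSIENT_FAILURE) ++num_transient_failure;
+  }
+  // Heuristic: two or more subchannels reporting TRANSIENT_FAILURE with
+  // none READY causes the policy itself to report TRANSIENT_FAILURE, even
+  // if the remaining subchannels are still IDLE or CONNECTING.
+  if (!any_ready && num_transient_failure >= 2) {
+    return GRPC_CHANNEL_TRANSIENT_FAILURE;
+  }
+  // ... apply the remaining aggregation rules listed below ...
+  return GRPC_CHANNEL_CONNECTING;
+}
+```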
+
+In addition, once the `ring_hash` policy reports `TRANSIENT_FAILURE`, it
+needs some way to recover from that state. The `ring_hash` policy
+normally requires pick requests to trigger subchannel connection
+attempts, but if it is being used as a child of the `priority` policy,
+it will not be getting any picks once it reports `TRANSIENT_FAILURE`.
+To work around this, it will make sure that it is attempting to connect
+(after the applicable backoff period) to at least one subchannel at any
+given time. After a given subchannel fails a connection attempt, it
+will move on to the next subchannel in the ring. It will keep doing this
+until one of the subchannels successfully connects, at which point it
+will report `READY` and stop proactively trying to connect. (A sketch
+of this behavior appears after the aggregation rules below.)
+
+Another issue is the way that the `priority` policy handles its failover
+timer. The failover timer is used to apply an upper bound to the amount
+of time that the `priority` policy waits for a child policy to become
+connected before it gives up and creates the child policy for the next
+priority. The failover timer is started when a child is first created
+and is cancelled when the child reports any state other than `CONNECTING`.
+To allow this timer to work properly, the `ring_hash` policy should
+remain in state `CONNECTING` until it transitions to either
+`TRANSIENT_FAILURE` or `READY`. Specifically, after the first subchannel
+reports `TRANSIENT_FAILURE` and all other subchannels are in `IDLE`, it
+should continue to report `CONNECTING` instead of `IDLE`. In this case,
+just as in the `TRANSIENT_FAILURE` case above, it will proactively attempt
+to connect to at least one subchannel at all times while it is reporting
+`CONNECTING`, so that it does not stay in state `CONNECTING` indefinitely
+if it is not receiving picks (e.g., if the application is only occasionally
+starting RPCs and giving them very short deadlines).
+
+Note that when the `ring_hash` policy first starts up with a completely
+new set of subchannels that are all in state `IDLE`, it will report `IDLE`
+as a consequence of the aggregation rules shown below. This is different
+from most policies, which start in state `CONNECTING`, and it will
+prevent the failover timer in the `priority` policy from working
+correctly, because the timer will be started when the child is created
+but then immediately cancelled when it reports `IDLE`. To address this,
+we will change the `priority` policy to restart the failover timer when a
+child reports `CONNECTING`, if that child has not reported `TRANSIENT_FAILURE`
+more recently than it reported `READY` or `IDLE`.
+
+Taking all of the above into account, the aggregation rules for
+the `ring_hash` policy are as follows:
 1. If there is at least one subchannel in `READY` state, report `READY`.
 2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report
    `TRANSIENT_FAILURE`.
 3. If there is at least one subchannel in `CONNECTING` state, report
    `CONNECTING`.
-4. If there is at least one subchannel in `IDLE` state, report `IDLE`.
-5. Otherwise, report `TRANSIENT_FAILURE`.
-
-While the `ring_hash` policy is reporting `TRANSIENT_FAILURE`, it will not be
-getting any pick requests from the priority policy. However, because the
-`ring_hash` policy does not attempt to reconnect to subchannels unless it
-is getting pick requests, it will need special handling to ensure that it
-will eventually recover from `TRANSIENT_FAILURE` state once the problem is
-resolved. Specifically, it will make sure that it is attempting to
-connect (after applicable backoff period) to at least one subchannel at
-any given time. After a given subchannel fails a connection attempt, it
-will move on to the next subchannel in the ring. It will keep doing this
-until one of the subchannels successfully connects, at which point it
-will report `READY` and stop proactively trying to connect. The policy
-will remain in `TRANSIENT_FAILURE` until at least one subchannel becomes
-connected, even if subchannels are in state `CONNECTING` during that time.
+4. If there is one subchannel in `TRANSIENT_FAILURE` and there is more than
+   one subchannel, report `CONNECTING`.
+5. If there is at least one subchannel in `IDLE` state, report `IDLE`.
+6. Otherwise, report `TRANSIENT_FAILURE`.
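+
+The proactive connection management described above might look something
+like the following sketch. The names used here (`RingHashLb`,
+`Subchannel::RequestConnection`) are illustrative stand-ins, not an
+existing gRPC API:
+
+```
+#include <cstddef>
+#include <memory>
+#include <vector>
+
+// Hypothetical subchannel wrapper; RequestConnection() stands in for
+// whatever mechanism an implementation uses to trigger a connection
+// attempt (subject to the subchannel's own backoff timer).
+class Subchannel {
+ public:
+  void RequestConnection();
+};
+
+// Illustrative fragment of a ring_hash LB policy, showing only the
+// proactive connection management used while the policy is reporting
+// TRANSIENT_FAILURE (or CONNECTING) and is not receiving picks.
+class RingHashLb {
+ public:
+  // Called when a subchannel fails a connection attempt: move on to the
+  // next subchannel in ring order, so that exactly one subchannel is
+  // attempting to connect at any given time.
+  void OnConnectionAttemptFailed(size_t index) {
+    if (!proactively_connecting_) return;
+    size_t next = (index + 1) % subchannels_.size();
+    subchannels_[next]->RequestConnection();
+  }
+
+  // Called when any subchannel becomes READY: the policy reports READY
+  // and stops proactively trying to connect.
+  void OnSubchannelReady() { proactively_connecting_ = false; }
+
+ private:
+  std::vector<std::unique_ptr<Subchannel>> subchannels_;
+  bool proactively_connecting_ = true;
+};
+```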
 
 ##### Picker Behavior
 
@@ -504,6 +637,7 @@ if (ring[first_index].state == CONNECTING) {
   return PICK_QUEUE;
 }
 if (ring[first_index].state == TRANSIENT_FAILURE) {
+  ring[first_index].subchannel.TriggerConnectionAttemptInControlPlane();
   first_subchannel = ring[first_index].subchannel;
   found_second_subchannel = false;
   found_first_non_failed = false;