From 8ea1977469b6f1688a21082582d3bb25c7502970 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 11 Apr 2022 22:07:00 +0000 Subject: [PATCH 1/7] A42 update: fix ring_hash connectivity state aggregation rules --- A42-xds-ring-hash-lb-policy.md | 95 ++++++++++++++++++++++++---------- 1 file changed, 67 insertions(+), 28 deletions(-) diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md index fd70f15c1..0b484a283 100644 --- a/A42-xds-ring-hash-lb-policy.md +++ b/A42-xds-ring-hash-lb-policy.md @@ -347,8 +347,10 @@ ring will be stored within the picker, so any time a new ring is generated, a new picker will be returned. The picker will contain not just the ring but also the current state of -every subchannel in the ring. Whenever a subchannel's state changes, a -new picker will need to be generated. +every subchannel in the ring. Implementations may either generate a new +picker every time a subchannel's state changes, or they may provide a +mechanism to synchronize subchannel state data between the LB policy +itself and the picker. ##### LB Policy Config @@ -406,36 +408,72 @@ parent. ##### Aggregated Connectivity State -The `ring_hash` policy will use the normal aggregation rules for reporting -the overall connectivity state to its parent (i.e., the same rules used -by `weighted_target`), but with one special case. If there are 2 or more -subchannels in state `TRANSIENT_FAILURE` and all other subchannels are in -state `IDLE` or `CONNECTING`, then the policy will report `TRANSIENT_FAILURE`. -This heuristic ensures that the priority policy will fail over to the -next priority quickly when there's an outage. - -So the overall aggregation rules here are: -1. If there is at least one subchannel in `READY` state, report `READY`. -2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report - `TRANSIENT_FAILURE`. -3. If there is at least one subchannel in `CONNECTING` state, report - `CONNECTING`. -4. If there is at least one subchannel in `IDLE` state, report `IDLE`. -5. Otherwise, report `TRANSIENT_FAILURE`. - -While the `ring_hash` policy is reporting `TRANSIENT_FAILURE`, it will not be -getting any pick requests from the priority policy. However, because the -`ring_hash` policy does not attempt to reconnect to subchannels unless it -is getting pick requests, it will need special handling to ensure that it -will eventually recover from `TRANSIENT_FAILURE` state once the problem is -resolved. Specifically, it will make sure that it is attempting to +Because the `ring_hash` policy does not proactively connect to +subchannels but rather triggers connection attempts based on picks, it +cannot use the aggregation rules used by most LB policies (e.g., those used +by the `weighted_target` policy) as-is. Instead, it has two differences +in how it handles `TRANSIENT_FAILURE`, both of which are motivated by +ensuring that the `ring_hash` policy will work properly as a child of the +`priority` policy. + +The `priority` policy essentially has an ordered list of child policies and +will send picks to the highest priority child that is currently reporting +`READY` or `IDLE`. This means that for `ring_hash` to function as a child +of the `priority` policy, it needs to report `TRANSIENT_FAILURE` when its +subchannels are not reachable. However, because `ring_hash` attempts to +connect only to those subchannels that pick requests hash to, there are +likely to be some subchannels that never attempt to connect and are reporting +state `IDLE`, which means that under the normal aggregation rules, the +`ring_hash` policy would never report `TRANSIENT_FAILURE`. And more +generally, if `ring_hash` has only attempted to connect to a small +subset of its subchannels, it cannot truly know the overall reachability +of all of the subchannels. To address these problems, the `ring_hash` +policy will use a heuristic that if there are two or more subchannels +reporting `TRANSIENT_FAILURE` and none in state `READY`, it will report +an aggregated connectivity state of `TRANSIENT_FAILURE`. This heuristic +is an attempt to balance the need to allow the `priority` policy to +quickly failover to the next priority and the desire to avoid reporting +the entire `ring_hash` policy as having failed when the problem is just +one individual subchannel that happens to be unreachable. + +In addition, once the `ring_hash` policy reports `TRANSIENT_FAILURE`, it +needs some way to recover from that state. The `ring_hash` policy +normally requires pick requests to trigger subchannel connection +attempts, but if it is being used as a child of the `priority` policy, +it will not be getting any picks once it reports `TRANSIENT_FAILURE`. +To work around this, it will make sure that it is attempting to connect (after applicable backoff period) to at least one subchannel at any given time. After a given subchannel fails a connection attempt, it will move on to the next subchannel in the ring. It will keep doing this until one of the subchannels successfully connects, at which point it -will report `READY` and stop proactively trying to connect. The policy -will remain in `TRANSIENT_FAILURE` until at least one subchannel becomes -connected, even if subchannels are in state `CONNECTING` during that time. +will report `READY` and stop proactively trying to connect. + +Another issue is the way that the `priority` policy handles its failover +timer. The failover timer is used to apply an upper bound to the amount +of time that `priority` policy waits for a child policy to become +connected before it gives up and creates the child policy for the next +priority. The failover timer is started when the child reports state +`CONNECTING` and is cancelled when the child reports any other state. +To allow this timer to work properly, the `ring_hash` policy should +remain in state `CONNECTING` until it transitions to either +`TRANSIENT_FAILURE` or `READY`. And, just as in the `TRANSIENT_FAILURE` +case above, it will proactively attempt to connect to at least one +subchannel at all times while it is reporting `CONNECTING`, so that it +does not stay in state `CONNECTING` indefinitely if it is not receiving +picks (e.g., if the application is only occassionally starting RPCs and +giving them very short deadlines). + +Taking all of the above into account, the aggregation rules for +the `ring_hash` policy are as follows: +1. If there is at least one subchannel in `READY` state, report `READY`. +2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report + `TRANSIENT_FAILURE`. +3. If there is at least one subchannel in `TRANSIENT_FAILURE` and there + is more than one subchannel, report state `CONNECTING`. +4. If there is at least one subchannel in `CONNECTING` state, report + `CONNECTING`. +5. If there is at least one subchannel in `IDLE` state, report `IDLE`. +6. Otherwise, report `TRANSIENT_FAILURE`. ##### Picker Behavior @@ -504,6 +542,7 @@ if (ring[first_index].state == CONNECTING) { return PICK_QUEUE; } if (ring[first_index].state == TRANSIENT_FAILURE) { + ring[first_index].subchannel.TriggerConnectionAttemptInControlPlane(); first_subchannel = ring[first_index].subchannel; found_second_subchannel = false; found_first_non_failed = false; From 52895a05f288e710f5816413210b1a0dd3aa78d3 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 11 Apr 2022 22:25:30 +0000 Subject: [PATCH 2/7] clarify that ring_hash usually starts in IDLE --- A42-xds-ring-hash-lb-policy.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md index 0b484a283..94b0cee74 100644 --- a/A42-xds-ring-hash-lb-policy.md +++ b/A42-xds-ring-hash-lb-policy.md @@ -475,6 +475,11 @@ the `ring_hash` policy are as follows: 5. If there is at least one subchannel in `IDLE` state, report `IDLE`. 6. Otherwise, report `TRANSIENT_FAILURE`. +Note that when the `ring_hash` policy first starts up with a completely +new set of subchannels that are all in state `IDLE`, it will report `IDLE` +as a consequence of the above aggregation rules. This is different from +most policies, which start in state `CONNECTING`. + ##### Picker Behavior The picker will start by finding the subchannel on the ring that is From d2efa4ae31713cc20515dfdaa409243e62d77e89 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 11 Apr 2022 22:46:18 +0000 Subject: [PATCH 3/7] clarify behavior change in priority policy --- A42-xds-ring-hash-lb-policy.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md index 94b0cee74..5072c68ea 100644 --- a/A42-xds-ring-hash-lb-policy.md +++ b/A42-xds-ring-hash-lb-policy.md @@ -452,8 +452,8 @@ Another issue is the way that the `priority` policy handles its failover timer. The failover timer is used to apply an upper bound to the amount of time that `priority` policy waits for a child policy to become connected before it gives up and creates the child policy for the next -priority. The failover timer is started when the child reports state -`CONNECTING` and is cancelled when the child reports any other state. +priority. The failover timer is started when a child is first created +and is cancelled when the child reports any state other than `CONNECTING`. To allow this timer to work properly, the `ring_hash` policy should remain in state `CONNECTING` until it transitions to either `TRANSIENT_FAILURE` or `READY`. And, just as in the `TRANSIENT_FAILURE` @@ -461,7 +461,10 @@ case above, it will proactively attempt to connect to at least one subchannel at all times while it is reporting `CONNECTING`, so that it does not stay in state `CONNECTING` indefinitely if it is not receiving picks (e.g., if the application is only occassionally starting RPCs and -giving them very short deadlines). +giving them very short deadlines). Additionally, we will change the +`priority` policy to restart the failover timer when a child reports +`CONNECTING`, if that child has not reported `TRANSIENT_FAILURE` more +recently than it reported `READY` or `IDLE`. Taking all of the above into account, the aggregation rules for the `ring_hash` policy are as follows: From 9e8ee0d43153840c76e4f8701da72abed1096f04 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 11 Apr 2022 22:59:28 +0000 Subject: [PATCH 4/7] further clarify the priority change --- A42-xds-ring-hash-lb-policy.md | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md index 5072c68ea..8df5e880a 100644 --- a/A42-xds-ring-hash-lb-policy.md +++ b/A42-xds-ring-hash-lb-policy.md @@ -461,10 +461,18 @@ case above, it will proactively attempt to connect to at least one subchannel at all times while it is reporting `CONNECTING`, so that it does not stay in state `CONNECTING` indefinitely if it is not receiving picks (e.g., if the application is only occassionally starting RPCs and -giving them very short deadlines). Additionally, we will change the -`priority` policy to restart the failover timer when a child reports -`CONNECTING`, if that child has not reported `TRANSIENT_FAILURE` more -recently than it reported `READY` or `IDLE`. +giving them very short deadlines). + +Note that when the `ring_hash` policy first starts up with a completely +new set of subchannels that are all in state `IDLE`, it will report `IDLE` +as a consequence of the aggregation rules shown below. This is different +from most policies, which start in state `CONNECTING`, and it will +prevent the failover timer in the `priority` policy from working +correctly, because the timer will be started when the child is created +but then immediately cancelled when it reports `IDLE`. To address this, +we will change the `priority` policy to restart the failover timer when a +child reports `CONNECTING`, if that child has not reported `TRANSIENT_FAILURE` +more recently than it reported `READY` or `IDLE`. Taking all of the above into account, the aggregation rules for the `ring_hash` policy are as follows: @@ -478,11 +486,6 @@ the `ring_hash` policy are as follows: 5. If there is at least one subchannel in `IDLE` state, report `IDLE`. 6. Otherwise, report `TRANSIENT_FAILURE`. -Note that when the `ring_hash` policy first starts up with a completely -new set of subchannels that are all in state `IDLE`, it will report `IDLE` -as a consequence of the above aggregation rules. This is different from -most policies, which start in state `CONNECTING`. - ##### Picker Behavior The picker will start by finding the subchannel on the ring that is From 7bc65fa683be4de60f6b332d49f66e426925455a Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 11 Apr 2022 23:10:06 +0000 Subject: [PATCH 5/7] clarify wording --- A42-xds-ring-hash-lb-policy.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md index 8df5e880a..7480534c6 100644 --- a/A42-xds-ring-hash-lb-policy.md +++ b/A42-xds-ring-hash-lb-policy.md @@ -456,12 +456,14 @@ priority. The failover timer is started when a child is first created and is cancelled when the child reports any state other than `CONNECTING`. To allow this timer to work properly, the `ring_hash` policy should remain in state `CONNECTING` until it transitions to either -`TRANSIENT_FAILURE` or `READY`. And, just as in the `TRANSIENT_FAILURE` -case above, it will proactively attempt to connect to at least one -subchannel at all times while it is reporting `CONNECTING`, so that it -does not stay in state `CONNECTING` indefinitely if it is not receiving -picks (e.g., if the application is only occassionally starting RPCs and -giving them very short deadlines). +`TRANSIENT_FAILURE` or `READY`. Specifically, after the first subchannel +reports `TRANSIENT_FAILURE` and all other subchannels are in `IDLE`, it +should continue to report `CONNECTING` instead of `IDLE`. In this case, +just as in the `TRANSIENT_FAILURE` case above, it will proactively attempt +to connect to at least one subchannel at all times while it is reporting +`CONNECTING`, so that it does not stay in state `CONNECTING` indefinitely +if it is not receiving picks (e.g., if the application is only occassionally +starting RPCs and giving them very short deadlines). Note that when the `ring_hash` policy first starts up with a completely new set of subchannels that are all in state `IDLE`, it will report `IDLE` From 183057b962ff3b34110c24c61be6ea65dca46c75 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 11 Apr 2022 23:20:56 +0000 Subject: [PATCH 6/7] swap the order of aggregation rules 3 and 4 --- A42-xds-ring-hash-lb-policy.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md index 7480534c6..530907e31 100644 --- a/A42-xds-ring-hash-lb-policy.md +++ b/A42-xds-ring-hash-lb-policy.md @@ -481,10 +481,10 @@ the `ring_hash` policy are as follows: 1. If there is at least one subchannel in `READY` state, report `READY`. 2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report `TRANSIENT_FAILURE`. -3. If there is at least one subchannel in `TRANSIENT_FAILURE` and there - is more than one subchannel, report state `CONNECTING`. -4. If there is at least one subchannel in `CONNECTING` state, report +3. If there is at least one subchannel in `CONNECTING` state, report `CONNECTING`. +4. If there is one subchannel in `TRANSIENT_FAILURE` and there is more than + one subchannel, report state `CONNECTING`. 5. If there is at least one subchannel in `IDLE` state, report `IDLE`. 6. Otherwise, report `TRANSIENT_FAILURE`. From 8265ea3f386232c132a41f87da3358ab0c074213 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Wed, 13 Apr 2022 22:38:01 +0000 Subject: [PATCH 7/7] clarify wording --- A42-xds-ring-hash-lb-policy.md | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md index 530907e31..88ec463f3 100644 --- a/A42-xds-ring-hash-lb-policy.md +++ b/A42-xds-ring-hash-lb-policy.md @@ -413,18 +413,21 @@ subchannels but rather triggers connection attempts based on picks, it cannot use the aggregation rules used by most LB policies (e.g., those used by the `weighted_target` policy) as-is. Instead, it has two differences in how it handles `TRANSIENT_FAILURE`, both of which are motivated by -ensuring that the `ring_hash` policy will work properly as a child of the -`priority` policy. +ensuring that the `ring_hash` policy will report reasonable connectivity +state transitions to its parent (which today is always the `priority` +policy but might at some point in the future be the channel itself). The `priority` policy essentially has an ordered list of child policies and will send picks to the highest priority child that is currently reporting `READY` or `IDLE`. This means that for `ring_hash` to function as a child of the `priority` policy, it needs to report `TRANSIENT_FAILURE` when its -subchannels are not reachable. However, because `ring_hash` attempts to -connect only to those subchannels that pick requests hash to, there are -likely to be some subchannels that never attempt to connect and are reporting -state `IDLE`, which means that under the normal aggregation rules, the -`ring_hash` policy would never report `TRANSIENT_FAILURE`. And more +subchannels are not reachable. However, `ring_hash` attempts to connect +only to those subchannels that pick requests hash to, and if the first +subchannel fails to connect, it then sequentially attempts to connecto to +subsequent subchannels in ring order, so it may take a very long time +for all of the subchannels to report `TRANSIENT_FAILURE` instead of `IDLE`. +Under the normal aggregation rules, that means that the `ring_hash` policy +would take far too long to report `TRANSIENT_FAILURE`. And more generally, if `ring_hash` has only attempted to connect to a small subset of its subchannels, it cannot truly know the overall reachability of all of the subchannels. To address these problems, the `ring_hash`