From 8ea1977469b6f1688a21082582d3bb25c7502970 Mon Sep 17 00:00:00 2001
From: "Mark D. Roth" <roth@google.com>
Date: Mon, 11 Apr 2022 22:07:00 +0000
Subject: [PATCH 1/7] A42 update: fix ring_hash connectivity state aggregation
 rules

---
 A42-xds-ring-hash-lb-policy.md | 95 ++++++++++++++++++++++++----------
 1 file changed, 67 insertions(+), 28 deletions(-)

diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md
index fd70f15c1..0b484a283 100644
--- a/A42-xds-ring-hash-lb-policy.md
+++ b/A42-xds-ring-hash-lb-policy.md
@@ -347,8 +347,10 @@ ring will be stored within the picker, so any time a new ring is
 generated, a new picker will be returned.
 
 The picker will contain not just the ring but also the current state of
-every subchannel in the ring.  Whenever a subchannel's state changes, a
-new picker will need to be generated.
+every subchannel in the ring.  Implementations may either generate a new
+picker every time a subchannel's state changes, or they may provide a
+mechanism to synchronize subchannel state data between the LB policy
+itself and the picker.
 
 ##### LB Policy Config
 
@@ -406,36 +408,72 @@ parent.
 
 ##### Aggregated Connectivity State
 
-The `ring_hash` policy will use the normal aggregation rules for reporting
-the overall connectivity state to its parent (i.e., the same rules used
-by `weighted_target`), but with one special case.  If there are 2 or more
-subchannels in state `TRANSIENT_FAILURE` and all other subchannels are in
-state `IDLE` or `CONNECTING`, then the policy will report `TRANSIENT_FAILURE`.
-This heuristic ensures that the priority policy will fail over to the
-next priority quickly when there's an outage.
-
-So the overall aggregation rules here are:
-1. If there is at least one subchannel in `READY` state, report `READY`.
-2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report
-   `TRANSIENT_FAILURE`.
-3. If there is at least one subchannel in `CONNECTING` state, report
-   `CONNECTING`.
-4. If there is at least one subchannel in `IDLE` state, report `IDLE`.
-5. Otherwise, report `TRANSIENT_FAILURE`.
-
-While the `ring_hash` policy is reporting `TRANSIENT_FAILURE`, it will not be
-getting any pick requests from the priority policy.  However, because the
-`ring_hash` policy does not attempt to reconnect to subchannels unless it
-is getting pick requests, it will need special handling to ensure that it
-will eventually recover from `TRANSIENT_FAILURE` state once the problem is
-resolved.  Specifically, it will make sure that it is attempting to
+Because the `ring_hash` policy does not proactively connect to
+subchannels but rather triggers connection attempts based on picks, it
+cannot use the aggregation rules used by most LB policies (e.g., those used
+by the `weighted_target` policy) as-is.  Instead, it has two differences
+in how it handles `TRANSIENT_FAILURE`, both of which are motivated by
+ensuring that the `ring_hash` policy will work properly as a child of the
+`priority` policy.
+
+The `priority` policy essentially has an ordered list of child policies and
+will send picks to the highest priority child that is currently reporting
+`READY` or `IDLE`.  This means that for `ring_hash` to function as a child
+of the `priority` policy, it needs to report `TRANSIENT_FAILURE` when its
+subchannels are not reachable.  However, because `ring_hash` attempts to
+connect only to those subchannels that pick requests hash to, there are
+likely to be some subchannels that never attempt to connect and are reporting
+state `IDLE`, which means that under the normal aggregation rules, the
+`ring_hash` policy would never report `TRANSIENT_FAILURE`.  And more
+generally, if `ring_hash` has only attempted to connect to a small
+subset of its subchannels, it cannot truly know the overall reachability
+of all of the subchannels.  To address these problems, the `ring_hash`
+policy will use a heuristic that if there are two or more subchannels
+reporting `TRANSIENT_FAILURE` and none in state `READY`, it will report
+an aggregated connectivity state of `TRANSIENT_FAILURE`.  This heuristic
+is an attempt to balance the need to allow the `priority` policy to
+quickly failover to the next priority and the desire to avoid reporting
+the entire `ring_hash` policy as having failed when the problem is just
+one individual subchannel that happens to be unreachable.
+
+In addition, once the `ring_hash` policy reports `TRANSIENT_FAILURE`, it
+needs some way to recover from that state.  The `ring_hash` policy
+normally requires pick requests to trigger subchannel connection
+attempts, but if it is being used as a child of the `priority` policy,
+it will not be getting any picks once it reports `TRANSIENT_FAILURE`.
+To work around this, it will make sure that it is attempting to
 connect (after applicable backoff period) to at least one subchannel at
 any given time.  After a given subchannel fails a connection attempt, it
 will move on to the next subchannel in the ring.  It will keep doing this
 until one of the subchannels successfully connects, at which point it
-will report `READY` and stop proactively trying to connect.  The policy
-will remain in `TRANSIENT_FAILURE` until at least one subchannel becomes
-connected, even if subchannels are in state `CONNECTING` during that time.
+will report `READY` and stop proactively trying to connect.
+
+Another issue is the way that the `priority` policy handles its failover
+timer.  The failover timer is used to apply an upper bound to the amount
+of time that `priority` policy waits for a child policy to become
+connected before it gives up and creates the child policy for the next
+priority.  The failover timer is started when the child reports state
+`CONNECTING` and is cancelled when the child reports any other state.
+To allow this timer to work properly, the `ring_hash` policy should
+remain in state `CONNECTING` until it transitions to either
+`TRANSIENT_FAILURE` or `READY`.  And, just as in the `TRANSIENT_FAILURE`
+case above, it will proactively attempt to connect to at least one
+subchannel at all times while it is reporting `CONNECTING`, so that it
+does not stay in state `CONNECTING` indefinitely if it is not receiving
+picks (e.g., if the application is only occassionally starting RPCs and
+giving them very short deadlines).
+
+Taking all of the above into account, the aggregation rules for
+the `ring_hash` policy are as follows:
+1. If there is at least one subchannel in `READY` state, report `READY`.
+2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report
+   `TRANSIENT_FAILURE`.
+3. If there is at least one subchannel in `TRANSIENT_FAILURE` and there
+   is more than one subchannel, report state `CONNECTING`.
+4. If there is at least one subchannel in `CONNECTING` state, report
+   `CONNECTING`.
+5. If there is at least one subchannel in `IDLE` state, report `IDLE`.
+6. Otherwise, report `TRANSIENT_FAILURE`.
 
 ##### Picker Behavior
 
@@ -504,6 +542,7 @@ if (ring[first_index].state == CONNECTING) {
   return PICK_QUEUE;
 }
 if (ring[first_index].state == TRANSIENT_FAILURE) {
+  ring[first_index].subchannel.TriggerConnectionAttemptInControlPlane();
   first_subchannel = ring[first_index].subchannel;
   found_second_subchannel = false;
   found_first_non_failed = false;

From 52895a05f288e710f5816413210b1a0dd3aa78d3 Mon Sep 17 00:00:00 2001
From: "Mark D. Roth" <roth@google.com>
Date: Mon, 11 Apr 2022 22:25:30 +0000
Subject: [PATCH 2/7] clarify that ring_hash usually starts in IDLE

---
 A42-xds-ring-hash-lb-policy.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md
index 0b484a283..94b0cee74 100644
--- a/A42-xds-ring-hash-lb-policy.md
+++ b/A42-xds-ring-hash-lb-policy.md
@@ -475,6 +475,11 @@ the `ring_hash` policy are as follows:
 5. If there is at least one subchannel in `IDLE` state, report `IDLE`.
 6. Otherwise, report `TRANSIENT_FAILURE`.
 
+Note that when the `ring_hash` policy first starts up with a completely
+new set of subchannels that are all in state `IDLE`, it will report `IDLE`
+as a consequence of the above aggregation rules.  This is different from
+most policies, which start in state `CONNECTING`.
+
 ##### Picker Behavior
 
 The picker will start by finding the subchannel on the ring that is

From d2efa4ae31713cc20515dfdaa409243e62d77e89 Mon Sep 17 00:00:00 2001
From: "Mark D. Roth" <roth@google.com>
Date: Mon, 11 Apr 2022 22:46:18 +0000
Subject: [PATCH 3/7] clarify behavior change in priority policy

---
 A42-xds-ring-hash-lb-policy.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md
index 94b0cee74..5072c68ea 100644
--- a/A42-xds-ring-hash-lb-policy.md
+++ b/A42-xds-ring-hash-lb-policy.md
@@ -452,8 +452,8 @@ Another issue is the way that the `priority` policy handles its failover
 timer.  The failover timer is used to apply an upper bound to the amount
 of time that `priority` policy waits for a child policy to become
 connected before it gives up and creates the child policy for the next
-priority.  The failover timer is started when the child reports state
-`CONNECTING` and is cancelled when the child reports any other state.
+priority.  The failover timer is started when a child is first created
+and is cancelled when the child reports any state other than `CONNECTING`.
 To allow this timer to work properly, the `ring_hash` policy should
 remain in state `CONNECTING` until it transitions to either
 `TRANSIENT_FAILURE` or `READY`.  And, just as in the `TRANSIENT_FAILURE`
@@ -461,7 +461,10 @@ case above, it will proactively attempt to connect to at least one
 subchannel at all times while it is reporting `CONNECTING`, so that it
 does not stay in state `CONNECTING` indefinitely if it is not receiving
 picks (e.g., if the application is only occassionally starting RPCs and
-giving them very short deadlines).
+giving them very short deadlines).  Additionally, we will change the
+`priority` policy to restart the failover timer when a child reports
+`CONNECTING`, if that child has not reported `TRANSIENT_FAILURE` more
+recently than it reported `READY` or `IDLE`.
 
 Taking all of the above into account, the aggregation rules for
 the `ring_hash` policy are as follows:

From 9e8ee0d43153840c76e4f8701da72abed1096f04 Mon Sep 17 00:00:00 2001
From: "Mark D. Roth" <roth@google.com>
Date: Mon, 11 Apr 2022 22:59:28 +0000
Subject: [PATCH 4/7] further clarify the priority change

---
 A42-xds-ring-hash-lb-policy.md | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md
index 5072c68ea..8df5e880a 100644
--- a/A42-xds-ring-hash-lb-policy.md
+++ b/A42-xds-ring-hash-lb-policy.md
@@ -461,10 +461,18 @@ case above, it will proactively attempt to connect to at least one
 subchannel at all times while it is reporting `CONNECTING`, so that it
 does not stay in state `CONNECTING` indefinitely if it is not receiving
 picks (e.g., if the application is only occassionally starting RPCs and
-giving them very short deadlines).  Additionally, we will change the
-`priority` policy to restart the failover timer when a child reports
-`CONNECTING`, if that child has not reported `TRANSIENT_FAILURE` more
-recently than it reported `READY` or `IDLE`.
+giving them very short deadlines).
+
+Note that when the `ring_hash` policy first starts up with a completely
+new set of subchannels that are all in state `IDLE`, it will report `IDLE`
+as a consequence of the aggregation rules shown below.  This is different
+from most policies, which start in state `CONNECTING`, and it will
+prevent the failover timer in the `priority` policy from working
+correctly, because the timer will be started when the child is created
+but then immediately cancelled when it reports `IDLE`.  To address this,
+we will change the `priority` policy to restart the failover timer when a
+child reports `CONNECTING`, if that child has not reported `TRANSIENT_FAILURE`
+more recently than it reported `READY` or `IDLE`.
 
 Taking all of the above into account, the aggregation rules for
 the `ring_hash` policy are as follows:
@@ -478,11 +486,6 @@ the `ring_hash` policy are as follows:
 5. If there is at least one subchannel in `IDLE` state, report `IDLE`.
 6. Otherwise, report `TRANSIENT_FAILURE`.
 
-Note that when the `ring_hash` policy first starts up with a completely
-new set of subchannels that are all in state `IDLE`, it will report `IDLE`
-as a consequence of the above aggregation rules.  This is different from
-most policies, which start in state `CONNECTING`.
-
 ##### Picker Behavior
 
 The picker will start by finding the subchannel on the ring that is

From 7bc65fa683be4de60f6b332d49f66e426925455a Mon Sep 17 00:00:00 2001
From: "Mark D. Roth" <roth@google.com>
Date: Mon, 11 Apr 2022 23:10:06 +0000
Subject: [PATCH 5/7] clarify wording

---
 A42-xds-ring-hash-lb-policy.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md
index 8df5e880a..7480534c6 100644
--- a/A42-xds-ring-hash-lb-policy.md
+++ b/A42-xds-ring-hash-lb-policy.md
@@ -456,12 +456,14 @@ priority.  The failover timer is started when a child is first created
 and is cancelled when the child reports any state other than `CONNECTING`.
 To allow this timer to work properly, the `ring_hash` policy should
 remain in state `CONNECTING` until it transitions to either
-`TRANSIENT_FAILURE` or `READY`.  And, just as in the `TRANSIENT_FAILURE`
-case above, it will proactively attempt to connect to at least one
-subchannel at all times while it is reporting `CONNECTING`, so that it
-does not stay in state `CONNECTING` indefinitely if it is not receiving
-picks (e.g., if the application is only occassionally starting RPCs and
-giving them very short deadlines).
+`TRANSIENT_FAILURE` or `READY`.  Specifically, after the first subchannel
+reports `TRANSIENT_FAILURE` and all other subchannels are in `IDLE`, it
+should continue to report `CONNECTING` instead of `IDLE`.  In this case,
+just as in the `TRANSIENT_FAILURE` case above, it will proactively attempt
+to connect to at least one subchannel at all times while it is reporting
+`CONNECTING`, so that it does not stay in state `CONNECTING` indefinitely
+if it is not receiving picks (e.g., if the application is only occassionally
+starting RPCs and giving them very short deadlines).
 
 Note that when the `ring_hash` policy first starts up with a completely
 new set of subchannels that are all in state `IDLE`, it will report `IDLE`

From 183057b962ff3b34110c24c61be6ea65dca46c75 Mon Sep 17 00:00:00 2001
From: "Mark D. Roth" <roth@google.com>
Date: Mon, 11 Apr 2022 23:20:56 +0000
Subject: [PATCH 6/7] swap the order of aggregation rules 3 and 4

---
 A42-xds-ring-hash-lb-policy.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md
index 7480534c6..530907e31 100644
--- a/A42-xds-ring-hash-lb-policy.md
+++ b/A42-xds-ring-hash-lb-policy.md
@@ -481,10 +481,10 @@ the `ring_hash` policy are as follows:
 1. If there is at least one subchannel in `READY` state, report `READY`.
 2. If there are 2 or more subchannels in `TRANSIENT_FAILURE` state, report
    `TRANSIENT_FAILURE`.
-3. If there is at least one subchannel in `TRANSIENT_FAILURE` and there
-   is more than one subchannel, report state `CONNECTING`.
-4. If there is at least one subchannel in `CONNECTING` state, report
+3. If there is at least one subchannel in `CONNECTING` state, report
    `CONNECTING`.
+4. If there is one subchannel in `TRANSIENT_FAILURE` and there is more than
+   one subchannel, report state `CONNECTING`.
 5. If there is at least one subchannel in `IDLE` state, report `IDLE`.
 6. Otherwise, report `TRANSIENT_FAILURE`.
 

From 8265ea3f386232c132a41f87da3358ab0c074213 Mon Sep 17 00:00:00 2001
From: "Mark D. Roth" <roth@google.com>
Date: Wed, 13 Apr 2022 22:38:01 +0000
Subject: [PATCH 7/7] clarify wording

---
 A42-xds-ring-hash-lb-policy.md | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/A42-xds-ring-hash-lb-policy.md b/A42-xds-ring-hash-lb-policy.md
index 530907e31..88ec463f3 100644
--- a/A42-xds-ring-hash-lb-policy.md
+++ b/A42-xds-ring-hash-lb-policy.md
@@ -413,18 +413,21 @@ subchannels but rather triggers connection attempts based on picks, it
 cannot use the aggregation rules used by most LB policies (e.g., those used
 by the `weighted_target` policy) as-is.  Instead, it has two differences
 in how it handles `TRANSIENT_FAILURE`, both of which are motivated by
-ensuring that the `ring_hash` policy will work properly as a child of the
-`priority` policy.
+ensuring that the `ring_hash` policy will report reasonable connectivity
+state transitions to its parent (which today is always the `priority`
+policy but might at some point in the future be the channel itself).
 
 The `priority` policy essentially has an ordered list of child policies and
 will send picks to the highest priority child that is currently reporting
 `READY` or `IDLE`.  This means that for `ring_hash` to function as a child
 of the `priority` policy, it needs to report `TRANSIENT_FAILURE` when its
-subchannels are not reachable.  However, because `ring_hash` attempts to
-connect only to those subchannels that pick requests hash to, there are
-likely to be some subchannels that never attempt to connect and are reporting
-state `IDLE`, which means that under the normal aggregation rules, the
-`ring_hash` policy would never report `TRANSIENT_FAILURE`.  And more
+subchannels are not reachable.  However, `ring_hash` attempts to connect
+only to those subchannels that pick requests hash to, and if the first
+subchannel fails to connect, it then sequentially attempts to connecto to
+subsequent subchannels in ring order, so it may take a very long time
+for all of the subchannels to report `TRANSIENT_FAILURE` instead of `IDLE`.
+Under the normal aggregation rules, that means that the `ring_hash` policy
+would take far too long to report `TRANSIENT_FAILURE`.  And more
 generally, if `ring_hash` has only attempted to connect to a small
 subset of its subchannels, it cannot truly know the overall reachability
 of all of the subchannels.  To address these problems, the `ring_hash`