
A68: Deterministic Subsetting LB policy #383

Closed

Conversation

@s-matyukevich commented Jul 31, 2023

Here is a POC for this policy in Go with working e2e tests grpc/grpc-go#6488


@s-matyukevich (Author)

cc @markdroth can you please take a look?

@markdroth (Member)

Sorry for the delay -- I'm still trying to get caught up from having been out on vacation for a few weeks, so it might take me another week or two to get to this. But this is on my list, and I'll get to it as soon as I can. Again, sorry for the delay!

@markdroth self-assigned this Aug 23, 2023

@markdroth left a comment (Member)

This looks like a good start!

I think there is one significant design problem here, which is my comment about the client_id field. I really don't think the current design is workable in most environments. I suggest taking a step back and rethinking this a bit.

Please let me know if you have any questions. Thanks!

A68: `deterministic_subsetting` LB policy.
----
* Author(s): @s-matyukevich, @joybestourous
* Approver:
Member:

You can list me as the approver.

* Status: Draft
* Implemented in: Go, Java
* Last updated: [Date]
* Discussion at:
Member:

As per the gRFC process, please start a thread on the grpc.io mailing list for this gRFC and then add a link to that thread here.


## Background

Currently, gRPC is lacking a way to select a subset of endpoints available from the resolver and load-balance requests between them. Out of the box, users have the choice between two extremes: `pick_first` which sends all requests to one random backend, and `round_robin` which sends requests to all available backends. `pick_first` has poor connection balancing when the number of client is not much higher than the number of servers because of the birthday paradox. The problem is exacerbated during rollouts because `pick_first` does not change endpoint on resolver updates if the current subchannel remains `READY`. `round_robin` results in every servers having as many connections open as there are clients, which is unnecessarily costly when there are many clients, and makes local decisions load balancing (such as outlier detection) less precise.
Member:

I had not actually heard of the birthday paradox before. Suggest making that a link to https://en.wikipedia.org/wiki/Birthday_problem. :)


Member:

s/every servers/every server/


Member:

s/local decisions load balancing/local load balancing decisions/


### Handling Parent/Resolver Updates

When the resolver updates the list of addresses, or the LB config changes, Deterministic subsetting LB will run the subsetting algorithm, described above, to filter the endpoint list. Then it will create a new resolver state with the filtered list of the addresses and pass it to the child LB. Attributes and service config from the old resolver state will be copied to the new one.
Member:

I think the last sentence of this paragraph is incorrect and should be removed. Every update the LB policy gets will include all necessary attributes and LB policy config; there is no need to copy anything from the previous state. And doing that kind of copying could result in incorrect behavior if any of that information was intentionally changed between the two updates.

// client_index is an index within the
// interval [0..N-1], where N is the total number of clients.
// Every client must have a unique index.
google.protobuf.UInt32Value client_index = 1;
Member:

I think this client_index parameter is going to be a problem. You mention below that the control plane will need to figure out how to correctly maintain the right value for each client and that there are ways to do that that are outside the scope of this gRFC. However, I think that this is actually not likely to be very easy to solve, and I think it will actually affect any large-scale system for distributing service configs, not just xDS.

The client_index needs to be assigned such that (a) the value as seen by a given client instance does not change over the lifetime of that client instance and (b) the set of client IDs is evenly distributed across all clients to ensure that all subsets see a roughly equal set of connections. However, at large scale, the control plane itself will need to be scaled horizontally, which means that no one control plane instance will be able to assign IDs to all clients. Instead, it will be necessary to create a distributed service just to assign these IDs (which is a heavy lift by itself), and once you build that, it's not clear that it makes sense to integrate this into the control plane.

For xDS, even if you have built this functionality into your control plane, the control plane cannot possibly assign a different value to each client without making the xDS resources uncacheable, which is a major problem for scalability, because it means that you can't scale via the use of caching xDS proxies. (The ability to use such proxies was one of the main motivating factors for the move to xdstp-style xDS resource names, as per xRFC TP1.)

Also, it's worth noting that in xDS today, control planes typically have very different infrastructure for EDS than they do for the other resource types, because the EDS data changes very dynamically based on deployment changes, auto-scaling, network reachability, etc, whereas the other resource types are static configuration explicitly changed by humans. This design is adding fields that will be in CDS, a resource type that is generally static configuration, but the client_id field needs to be populated dynamically, much like the EDS data. I think many control planes would have trouble supporting this.

Even if you're not using xDS, I think the same issue applies; I think it will occur in basically any environment where application instances are dynamically scheduled. The service config is really intended to be injected dynamically into the client via the resolver, which gets the config from some control plane. Before we pivoted to xDS, we had been working on a design to distribute gRPC service configs across our network in a dynamically updatable way, but it was primarily designed to distribute static configuration from the service owner, with very little intelligence in the control plane itself. I'm not actually aware of any config-distribution mechanism that would easily support this kind of thing.

(Note that this is the exact issue that I mentioned when we discussed this previously. Internally at Google, these indexes are assigned as an inherent part of the cluster system; every binary running on the cluster knows its own index, so it doesn't need to be configured via a control plane. And I don't think it's a good fit for the control plane at all, which is why I said that I thought it would be hard to build deterministic subsetting for gRPC in OSS.)

Author:

So first of all I would like to mention that if we want to do better than random subsetting we must use client_id, or some other parameter(s) that are client-specific.

Second, implementing subsetting via EDS (like we discussed here) has exactly the same problems, so we can’t block this proposal and at the same time recommend people to do that. Leaving people without guidance on how to do subsetting is also a bad option.

Now let’s try to do some brainstorming:

Option 1: Relax the requirement to have an immutable index per client.

If we do that the client_id calculation becomes trivial:

  • Query own DNS name
  • Sort the response by IP
  • Use index in the resulting array as client_id
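For illustration, a minimal Go sketch of that derivation (assuming the client knows the DNS name that covers all of its replicas and its own IP; the DNS name and the env var below are hypothetical):

```go
package main

import (
	"fmt"
	"net"
	"os"
	"sort"
)

// clientIndexFromDNS resolves the DNS name that covers all client replicas
// (e.g. a Kubernetes headless service), sorts the returned IPs so every
// client sees the same order, and returns the position of this client's own
// IP. Note the caveat above: the index is not stable across scale events.
func clientIndexFromDNS(clientsDNSName, ownIP string) (int, error) {
	ips, err := net.LookupHost(clientsDNSName)
	if err != nil {
		return 0, err
	}
	sort.Strings(ips) // any ordering works, as long as it is the same everywhere
	for i, ip := range ips {
		if ip == ownIP {
			return i, nil
		}
	}
	return 0, fmt.Errorf("own IP %s not found in lookup of %s", ownIP, clientsDNSName)
}

func main() {
	// MY_POD_IP is a hypothetical env var (e.g. injected via the k8s downward API).
	idx, err := clientIndexFromDNS("my-clients.example.svc.cluster.local", os.Getenv("MY_POD_IP"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("client_index:", idx)
}
```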

Of course this implementation will lead to a lot of connection churn, but I can provide some reasons why this may be acceptable in practice for many use-cases:

  • connection churn should not lead to errors on the client side. Even if a client completely changes its subset, the graceful switch LB should make sure there won’t be any errors; it works the same way with pick_first. The same argument applies to the additional latency that might be added to requests that have to wait on new connections.
  • connection churn causes some resource overhead, but I can argue that it still will be orders of magnitude better than with round_robin because of how drastically subsetting helps to reduce the overall number of connections.
One implication of this solution is that we don’t really need xDS in this case and can use a simple resolver to update service config, so if you feel strongly about not adding a field to CDS that could negatively impact xDS cacheability, we can simply remove the xDS part from this proposal. Right now I am working on a benchmark that is going to test this solution and I can provide some results soon if this helps to make a better decision.

Option 2: Keep the proposal as-is and rethink the requirements for xDS cacheability

You probably won’t accept this option, but I still want to mention it for the sake of completeness and provide some arguments in support of it.

  • In many cases xDS is already non-cacheable. For example, we use CDS metadata filters to implement sharding, which makes our CDS responses already non-cacheable. Even our LDS is non-cacheable because for Envoys that run on the k8s host network we provide a different listener IP to bind to, since we can’t use 0.0.0.0 due to some other constraints. I can provide more legitimate cases where xDS objects may be different per client, and therefore non-cacheable.
  • Even without caching we can still easily scale the control plane horizontally. The control plane still needs to access some persistent storage to serve EDS and, potentially, to generate client_id.
  • If the control plane has access to some persistent storage with information about all endpoints (I assume this is the case for most control planes, as otherwise I can’t see how it can serve EDS), I can argue that generating immutable, or at least mostly stable, contiguous client ids is not that hard. We can just store the mapping between IP and client_id in the storage and backfill missing spots when clients from the middle of the list are deleted. If there are not enough new IPs to completely backfill the missing spots for all deletions, we can take IPs from the end of the list, which minimizes the number of updates. We can also backfill gaps in the list with some delay, which should help to amortize the effect of k8s progressive rollouts.
All of this is necessary because k8s doesn’t support a stable integer id per pod and because beyond a certain scale people inevitably start using multiple k8s clusters, but I don’t think we can realistically change that.
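Going back to the IP-to-client_id bookkeeping in the bullet above, a rough in-memory Go sketch of that idea (a real control plane would persist this state, and this omits the delayed-backfill and take-from-the-end refinements):

```go
// idAllocator hands out mostly-stable, contiguous client ids keyed by IP.
type idAllocator struct {
	ids  map[string]uint32 // IP -> client_id
	free []uint32          // slots released by deleted clients, to be backfilled
	next uint32            // next never-used id
}

func newIDAllocator() *idAllocator {
	return &idAllocator{ids: make(map[string]uint32)}
}

// add returns the id for ip, reusing a freed slot first so the id space
// stays contiguous and existing clients keep their ids.
func (a *idAllocator) add(ip string) uint32 {
	if id, ok := a.ids[ip]; ok {
		return id
	}
	var id uint32
	if len(a.free) > 0 {
		id, a.free = a.free[0], a.free[1:]
	} else {
		id = a.next
		a.next++
	}
	a.ids[ip] = id
	return id
}

// remove releases the id of a deleted client so a later add can backfill it.
func (a *idAllocator) remove(ip string) {
	if id, ok := a.ids[ip]; ok {
		delete(a.ids, ip)
		a.free = append(a.free, id)
	}
}
```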

Option 3: Use RTDS to store client_id.

If we do that, the LB definition in CDS will contain a reference to the value defined in RTDS. This will keep CDS cacheable while we can make a convention that RTDS should never be cached (which I assume is already the case). Conceptually this is also similar to SDS (secret discovery service): secrets, such as certificates, are never cacheable and are always different per client.

Option 4: Add some pluggable architecture to get client_id

We can assume that every client will run a separate plugin which will be responsible for providing client_id. This plugin can take the ID either from the deployment system, from DNS, or from some other source. Conceptually this is similar to what gRPC does for managing certificates.

Member:

Second, implementing subsetting via EDS (like we discussed here) has exactly the same problems, so we can’t block this proposal and at the same time recommend people to do that.

That's true, but in that case the problem is limited to EDS; it does not affect the other resource types. Putting a parameter that needs to be generated dynamically into CDS makes the problem worse. So I agree that this is an existing problem, but it's also one that I want to try to solve in the long term, and I don't want to make it worse here, because that will make the problem harder to solve later.

Option 1: Relax the requirement to have an immutable index per client.

If we do that the client_id calculation becomes trivial:

  • Query own DNS name
  • Sort the response by IP
  • Use index in the resulting array as client_id

I don't think I understand this proposal. This seems to assume that (a) the client knows its own DNS name, (b) the client can do a DNS lookup, (c) that the name will return a large list of IP addresses, and (d) that the order of the addresses in the DNS result will be random enough that this will yield a useful ID. While it is certainly possible to construct an environment where those assumptions are true, I don't think they are likely to be true in the common case -- and if any one of them is false, it seems like this won't work properly.

If the real goal here is just to try to randomize the clients, we could just as easily have each client generate a random number at startup and use that. That would probably work reasonably if there are large numbers of clients, but I think it would not be sufficient in cases with a relatively small number of clients.

Of course this implementation will lead to a lot of connection churn

I don't see how this will lead to connection churn. In fact, regardless of what algorithm is used to compute the client_id, I don't see why it would lead to connection churn. I think we can just have the LB policy compute the client_id when it is created and then stick with that value for its entire lifetime, so there's no need for any churn. Or am I missing something here?

  • connection churn causes some resource overhead, but I can argue that it still will be orders of magnitude better than with round_robin because of how drastically subsetting helps to reduce the overall number of connections.

I'm not sure that's true in all cases. We're talking about two different types of overhead here: the overhead of maintaining an extra connection that you may not need is mainly memory usage, whereas the overhead of establishing new connections is mainly CPU and network usage (e.g., SSL handshaking). Depending on how often the connection churn happens and which resources are more expensive in a given environment, this could actually be quite a bit worse than just dealing with the additional connections.

One implication of this solution is that we don’t really need xDS in this case and can use a simple resolver to update service config, so if you feel strongly about not adding a field to CDS that could negatively impact xDS cacheability, we can simply remove the xDS part from this proposal.

I don't think this is an xDS vs. non-xDS question. I think the fundamental principle here is that we should not mix static config data with dynamic endpoint assignment data.

Note that even the resolver API enshrines this separation: the resolver returns the list of addresses and the service config as two separate pieces of data. So even if you were going to do something like this in a custom resolver, I think the right way to do it would be to have the resolver apply the subsetting and return only the chosen addresses, in which case there would be no need for a subsetting LB policy in the first place.

So yes, I do feel strongly that we should not have such a field in CDS, at least not as the only way to do this. But I don't think that removing the xDS part of this design actually solves the problem, because I think the same problem exists in a custom resolver that injects the service config.

Option 2: Keep the proposal as-is and rethink the requirements for xDS cacheability

You probably won’t accept this option, but I still want to mention it for the sake of completeness and provide some arguments in support of it.

  • In many cases xDS is already non-cacheable. For example, we use CDS metadata filters to implement sharding, which makes our CDS responses already non-cacheable. Even our LDS is non-cacheable because for Envoys that run on the k8s host network we provide a different listener IP to bind to, since we can’t use 0.0.0.0 due to some other constraints. I can provide more legitimate cases where xDS objects may be different per client, and therefore non-cacheable.

I think those cases can be made cacheable by using dynamic parameters, as described in xRFC TP2.

  • Even without caching we can still easily scale the control plane horizontally. The control plane still needs to access some persistent storage to serve EDS and, potentially, to generate client_id.

At truly massive scale, I'm not sure this is sufficient. I think you wind up eventually needing to introduce read-only caching xDS proxies, and you can't do that without cacheability.

  • If the control plane has access to some persistent storage with information about all endpoints (I assume this is the case for most control planes, as otherwise I can’t see how it can serve EDS), I can argue that generating immutable, or at least mostly stable, contiguous client ids is not that hard. We can just store the mapping between IP and client_id in the storage and backfill missing spots when clients from the middle of the list are deleted. If there are not enough new IPs to completely backfill the missing spots for all deletions, we can take IPs from the end of the list, which minimizes the number of updates. We can also backfill gaps in the list with some delay, which should help to amortize the effect of k8s progressive rollouts.

This requires storing info about each individual client. Given a sufficiently large number of clients, I don't think this scales.

Option 3: Use RTDS to store client_id.

If we do that, the LB definition in CDS will contain a reference to the value defined in RTDS. This will keep CDS cacheable while we can make a convention that RTDS should never be cached (which I assume is already the case). Conceptually this is also similar to SDS (secret discovery service): secrets, such as certificates, are never cacheable and are always different per client.

gRPC doesn't support RTDS -- that's an Envoy-only concept. (In fact, "runtime" as implemented in Envoy does not exist in gRPC and probably never will.)

Option 4: Add some pluggable architecture to get client_id

We can assume that every client will run a separate plugin which will be responsible for providing client_id. This plugin can take the ID either from the deployment system, from DNS, or from some other source. Conceptually this is similar to what gRPC does for managing certificates.

This option is feasible, but it would require a lot of machinery. We'd basically need a new extension point with its own registry just for the plugins that could be used in this one LB policy. That's doable but not ideal.

Another option would be to define a few built-in mechanisms for setting the client_id. I think I could live with "set this explicit value" (your current proposal) as one of the options, as long as (a) there are other viable options, (b) this option is not the default, and (c) we document all of the shortcomings of this approach.

Unfortunately, the only other even semi-viable option I see is the one I mentioned above, where the client just generates a random number, which won't work right with a small enough set of clients.

I'd like to hear input from @ejona86 and @dfawley on this as well.

Member:

Second, implementing subsetting via EDS (like we discussed grpc/grpc-go#6370) has exactly the same problems, so we can’t block this proposal and at the same time recommend people to do that.

What is better about this proposal than doing subsetting via EDS? If the control plane can manage the client_index here, why couldn't it do it in the same way for EDS?

Author:

I don't think I understand this proposal. This seems to assume that (a) the client knows its own DNS name, (b) the client can do a DNS lookup, (c) that the name will return a large list of IP addresses, and (d) that the order of the addresses in the DNS result will be random enough that this will yield a useful ID. While it is certainly possible to construct an environment where those assumptions are true, I don't think they are likely to be true in the common case -- and if any one of them is false, it seems like this won't work properly.

d) that the order of the addresses in the DNS result will be random enough that this will yield a useful ID - this is not a requirement; on the contrary, the suggestion is to sort the result so that the resulting order, as seen by different clients, is identical. Then iterate over the list and find the index of the current client IP (the client can always get its own IP from the OS). This algorithm produces a different integer id per client from the range [0..N] where N is the total number of clients at a given time. But this index is not stable, because when the clients are scaled the newest IPs won't always be added to the end of the sorted list, and therefore they will shift the indexes of many existing clients. That's why we will have some additional client connection churn during client rollouts.

I'm not sure that's true in all cases. We're talking about two different types of overhead here: the overhead of maintaining an extra connection that you may not need is mainly memory usage, whereas the overhead of establishing new connections is mainly CPU and network usage (e.g., SSL handshaking). Depending on how often the connection churn happens and which resources are more expensive in a given environment, this could actually be quite a bit worse than just dealing with the additional connections.

Agree.

I don't think this is an xDS vs. non-xDS question. I think the fundamental principle here is that we should not mix static config data with dynamic endpoint assignment data.

I agree with this principle; that's why I was trying to suggest alternative approaches, such as using a separate xDS resource type (like RTDS) for dynamic data, or externalizing dynamic data provisioning to some sort of plugin (like certificate plugins). I don't think we should turn this principle into "never serve dynamic data through xDS", because IMO that is very limiting and many other use-cases (like, for example, sharding) may need such dynamic data.

I think those cases can be made cacheable by using dynamic parameters, as described in xRFC TP2.

Most of our clients are Envoy proxies, so until dynamic parameter support is added to Envoy it is not possible. But I agree that fundamentally it solves the problem.

This requires storing info about each individual client. Given a sufficiently large number of clients, I don't think this scales.

Not sure I fully agree here. If a control plane can serve EDS it should have access to a datastore with all endpoints anyway. Every client is usually a server for some other downstream clients, so in practice the requirement to store an IP per client is already satisfied by all control planes that serve EDS.

gRPC doesn't support RTDS -- that's an Envoy-only concept. (In fact, "runtime" as implemented in Envoy does not exist in gRPC and probably never will.)

I can generalize the idea: we can create a new xDS type that is specifically designed to store non-cacheable dynamic data. Every other resource that needs to reference such data will add a pointer to this new resource type instead of embedding the dynamic data directly, which solves the problem of cacheability for CDS and other static resources.

This option is feasible, but it would require a lot of machinery. We'd basically need a new extension point with its own registry just for the plugins that could be used in this one LB policy. That's doable but not ideal.

Maybe we can generalize "certificate provider plugins" to "dynamic config plugins", and design them in a way that covers multiple use-cases? The plugin can return opaque data which will be interpreted by consumers. I am mostly brainstorming here, so feel free to dismiss these ideas if they are not feasible.

Another option would be to define a few built-in mechanisms for setting the client_id. I think I could live with "set this explicit value" (your current proposal) as one of the options, as long as (a) there are other viable options, (b) this option is not the default, and (c) we document all of the shortcomings of this approach.

This sounds good to me. I can document and implement the options after we agree on the option list.

Unfortunately, the only other even semi-viable option I see is the one I mentioned above, where the client just generates a random number, which won't work right with a small enough set of clients.

I can test this out, but my intuition is that this will work no better than simple random subsetting (where every client simply selects N random backends). This option most likely won't work for all our use-cases, because in practice it results in a big imbalance of requests on the backend side (this is explained in more detail in the Google subsetting paper that I referenced in this gRFC). However, this is a very simple and practical option and I am not at all against implementing it as well.

Author:

What is better about this proposal than doing subsetting via EDS? If the control plane can manage the client_index here, why couldn't it do it in the same way for EDS?

Here are a few reasons:

  • The subsetting code will be shared. Other companies that need subsetting will reuse the subsetting balancer instead of re-implementing it in their own control planes. We can also benefit from the contributions made by other people.
  • The amount of different EDS responses will increase a lot. Consider the following case: we have 1k clients that call 1k servers. Currently we have a single EDS object with all 1k servers. We use https://github.com/envoyproxy/go-control-plane and store it as a single object in the LinearCache, which is reused by all 1k clients. After this change we will have to store 1k different EDS objects in the LinearCache, and every such object will be consumed by a single client. This puts quite a lot of additional pressure on our control plane. Alternatively, we can get rid of go-control-plane with all its abstractions (including LinearCache) and try to generate EDS responses on the fly, which is doable, but requires a lot of refactoring, and would force us to use a custom xDS control-plane server instead of reusing the open-source solution written by the Envoy maintainers.

Mostly my reasoning is that it is always worth trying to push a solution for some common problem upstream before implementing a custom one.

Member:

d) that the order of the addresses in the DNS result will be random enough that this will yield a useful ID - this is not a requirement; on the contrary, the suggestion is to sort the result so that the resulting order, as seen by different clients, is identical. Then iterate over the list and find the index of the current client IP (the client can always get its own IP from the OS). This algorithm produces a different integer id per client from the range [0..N] where N is the total number of clients at a given time. But this index is not stable, because when the clients are scaled the newest IPs won't always be added to the end of the sorted list, and therefore they will shift the indexes of many existing clients. That's why we will have some additional client connection churn during client rollouts.

Okay, thanks, I understand the intent of this proposal better now. However, I think it still rests on a bunch of assumptions that won't be true in many cases. For example, what if the client has multiple NICs, each with a different IP address? Also, this assumes the use of DNS -- what if the environment uses a name service other than DNS?

If a control plane can serve EDS it should have access to a datastore with all endpoints anyway. Every client is usually a server for some other downstream clients, so in practice the requirement to store an IP per client is already satisfied by all control planes that serve EDS.

I think you're assuming use of a service mesh topology, but that's not the only use-case to consider. There are many environments in which the clients are distributed and are not themselves servers, so the control plane would not otherwise be aware of them.

I can generalize the idea: we can create a new xDS type that is specifically designed to store non-cacheable dynamic data. Every other resource that needs to reference such data will add a pointer to this new resource type instead of embedding the dynamic data directly, which solves the problem of cacheability for CDS and other static resources.

That would certainly be better than embedding it in the CDS resource. However, this would require us to make the mechanism for determining the client_id pluggable rather than simply supporting a fixed set of built-in options, because I don't think we want this LB policy to depend on xDS.

Maybe we can generalize "certificate provider plugins" to "dynamic config plugins"

The certificate provider plugin APIs are all fairly specific to what it's doing, so I don't think there's a good way to generalize the mechanism to serve both use-cases. But we could certainly duplicate that paradigm if it provided some advantage -- although I'm not sure that's the case here.

The main difference between the certificate provider paradigm and just having a plugin specified directly in the LB policy config is that in the certificate provider case, the decision of which plugin to use is actually made by the deployment, not by the control plane. This makes sense for certs, since there's a lot of deployment-specific machinery that determines how the cert is provided, and the control plane does not want to have to worry about which mechanism is actually used on each deployment, nor does it want to have to generate different variants of the LDS and CDS resources for each deployment. However, in this case, it's not clear to me that there's any benefit in having the decision of what plugin to use determined by the deployment (although I'm open to hearing counter-arguments if there's a reason why there might be), so it seems like it would be much simpler to just have the plugin be specified directly in the LB policy config.

The subsetting code will be shared. Other companies that need subsetting will reuse the subsetting balancer instead of re-implementing it in their own control planes. We can also benefit from the contributions made by other people.

I agree that avoiding the need for wheel reinvention is a good goal, but I'm not sure this proposal actually accomplishes it. Ultimately, as this discussion illustrates, I think the hard part of this is actually determining the client_id, and it's not at all clear to me that it's easier for a control plane to do that than it is to just compute the subsets itself. The main difference seems to be whether the subsetting algorithm itself is implemented on the client side or in the control plane, and that particular piece of code seems much less complex than the additional machinery needed to move it to the client side.

The amount of different EDS responses will increase a lot. Consider the following case: we have 1k clients that call 1k servers. Currently we have a single EDS object with all 1k servers. We use https://github.com/envoyproxy/go-control-plane and store it as a single object in the LinearCache, which is reused by all 1k clients. After this change we will have to store 1k different EDS objects in the LinearCache, and every such object will be consumed by a single client. This puts quite a lot of additional pressure on our control plane. Alternatively, we can get rid of go-control-plane with all its abstractions (including LinearCache) and try to generate EDS responses on the fly, which is doable, but requires a lot of refactoring, and would force us to use a custom xDS control-plane server instead of reusing the open-source solution written by the Envoy maintainers.

Won't you have exactly this same problem by needing to encode the client_id in the CDS resource? You'll still need to send a different client_id to each client, right?

I'll also point out that this issue of resource fan-out is exactly the issue of xDS resource cacheability, which you were previously arguing that we don't need. It seems like you're actually relying on this property for EDS, but you're arguing to break this property for CDS, which seems very counter-intuitive to me.

Taking all of this into account, here's where I'm currently landing:

  • I am honestly not convinced that having the control plane provide the client_id is actually any better than the control plane simply doing the subsetting itself.
  • I do see the value of providing a subsetting LB policy for cases where there is no control plane. However, in order to make the deterministic subsetting policy generally useful, we need to provide a set of mechanism(s) for computing the client_id that will actually work in a wide variety of environments.
  • Your suggestion for doing a DNS lookup of the client's name to determine the client_id is probably the most broadly applicable of the suggestions I've heard so far, but it's still a little too niche for my comfort. I would feel a lot better about this proposal if we could come up with another option that would work in more cases, but I don't know what that would look like. I won't necessarily make that a strict precondition to approving this gRFC, especially if you are contributing the implementation work, but I will say that without this, I have serious reservations about how useful this will actually be.
  • I don't want this policy to have dependencies on specific control planes like DNS or xDS. If we are going to support mechanisms to compute client_id that rely on those control planes, then I think we do need a plugin architecture (i.e., the LB policy config would have an Any field that would be used to specify which plugin to use).

Member:

The subsetting code will be shared.

Can't it be shared just as well on the control plane, if not easier? Most control planes are Go, with some Java. Client-side there's Envoy, gRPC C, Java, Go, Node. For you specifically, you'd only need to implement it in a single control plane, and it'd work for all the clients.

An important detail here is that all client implementations must match exactly to get the benefits. Having it in the control plane avoids accidental implementation drift degrading performance.

The amount of different EDS responses will increase a lot.

Sure, but as Mark mentioned that applies to CDS as well, so how is using CDS a better fit than EDS?

Author:

Can't it be shared just as well on the control plane, if not easier? Most control planes are Go, with some Java. Client-side there's Envoy, gRPC C, Java, Go, Node. For you specifically, you'd only need to implement it in a single control plane, and it'd work for all the clients.

An important detail here is that all client implementations must match exactly to get the benefits. Having it in the control plane avoids accidental implementation drift degrading performance.

I am not aware of any open-source control plane that could potentially implement subsetting. Projects like go-control-plane and java-control-plane are mostly just collections of low-level primitives that make it easier to implement the xDS protocol. Those projects don't have any notion of endpoint storage and can't serve EDS, let alone implement subsetting.

Of course we can implement it in our own control plane, but if gRPC and/or Envoy later implement a standard way of doing subsetting, we will be left with our own custom solution and an unclear migration path to the standard one. This proposal is an attempt to define the standard way of doing subsetting with gRPC. If it fails, we'll implement a custom solution either in our own control plane or via a custom gRPC load balancer or resolver.

Another important detail is that we need a solution that works not only for xDS consumers, as most of our users don't use xDS yet.

// before applying the subsetting algorithm. This might be useful
// in cases when the resolver doesn't provide a stable order or
// when it could order addresses differently depending on the client.
// Default is false.
Member:

The default for sorting should probably be true. Most resolvers don't provide a stable ordering of results -- in fact, many of them explicitly randomize the order, because otherwise you don't get proper connection-level load balancing when your clients use pick_first.

Member:

I agree to have it enabled by default. I'd even be tempted to always sort, because it is easy to accidentally mess this up and hard to debug. Is performance the only reason not to sort all the time?

As you can see, the fields in this policy match exactly the fields in the deterministic subsetting LB service config.

#### Integration with xDS LB Policy Registry
As described in [gRFC A52][A52], gRPC has an LB policy registry, which maintains a list of converters. Every converter translates xDS LB policy to the corresponding service config. In order to allow using the Deterministic subsetting LB policy via xDS, the only thing that needs to be done is providing a corresponding converter function. The function implementation will be trivial as the fields in the xDS LB policy will match exactly the fields in the service config.
Member:

Doesn't look like you're actually adding a link to gRFC A52 anywhere, so this isn't being formatted correctly.

* Both algorithms are sufficient in reducing connection count, which is the primary drawback of `round_robin`.
* Both algorithms provide better load distribution guarantees than `pick_first`.
* When the client to backend ratio requires overlapping subsets, deterministic subsetting provides the additional guarantee of unique subsets. Because the aperture algorithm is built around a fixed ring, when clients \* aperture exceeds the number of servers, aperture takes "laps" over the ring, producing similar subsets for several clients. Deterministic subsetting strives to reduce the risk of misbehaving endpoints by distributing them more randomly across client subsets.
* In order to take advantage of the partial weighting of backends, Twitter's algorithm would require passing weight information from parent balancer to child balancer. The appropriate balancer to handle this sort of weighting would be weighted\_round\_robin, but weighted\_round\_robin currently calculates weights based on server load alone, and it is not designed to consider this additional weighting information. We opted for the solution that allowed us to keep this existing balancer as is. Though deterministic subsetting does not guarantee even load distribution across subsets, its diverse subset paired with weighted\_round\_robin as the child policy within subsets should be sufficient for most use cases.
Member:

It's true that the gRPC WRR policy described in gRFC A58 uses weight information provided by the endpoints. However, there is a separate community-contributed proposal open (although admittedly hasn't been active for a while) to provide an Envoy-style WRR policy that uses weights specified by the control plane in pending gRFC A34. The two policies are not mutually exclusive -- we can support both if someone wants to contribute the latter.

gRPC C-core does pass endpoint weights down from the xDS control plane to the LB policy, but it's up to the individual LB policy as to whether it uses that information. The only policy we currently have that does so is the ring_hash policy. I believe gRPC Go passes the weight through as well, but I don't know about Java. And even if they didn't, I don't think it would be hard to add this plumbing for any gRPC implementation that supports xDS.

My reason for mentioning all of this is just to say that the need for endpoint weights from the control plane is not necessarily a reason not to support deterministic aperture subsetting in the general case. But it would make sense to say that deterministic aperture would not work with a control plane that does not send weight information or with an LB policy like gRPC's WRR that uses out-of-band weight information.

@larry-safran commented Aug 28, 2023 via email


// subset_count indicates how many clients we can have so that every client is connected to exactly
// subset_size distinct backends and no 2 clients connect to the same backend.
subset_count = backend_count / subset_size
Member:

Seems likely this is integer math, so best to include explicit floor() or make it clear in some other way that it is integer division.

The LB policy will implement the algorithm described in [Site Reliability Engineering: How Google Runs Production Systems, Chapter 20](https://sre.google/sre-book/load-balancing-datacenter/#a-subset-selection-algorithm-deterministic-subsetting-eKsdcaUm) with the additional modification mentioned in the "Deterministic subsetting" section of the [Reinventing Backend Subsetting at Google](https://queue.acm.org/detail.cfm?id=3570937) paper. Here is the relevant quote:

```
This is the algorithm as previously described in Site Reliability Engineering: How Google Runs Production Systems, Chapter 20 but one improvement remains that can be made by balancing the leftover tasks in each group. The simplest way to achieve this is by choosing (before shuffling) which tasks will be leftovers in a round-robin fashion. For example, the first group of frontend tasks would choose {0, 1} to be leftovers and then shuffle the remaining tasks to get subsets {8, 3, 9, 2} and {4, 6, 5, 7}, and then the second group of frontend tasks would choose {2, 3} to be leftovers and shuffle the remaining tasks to get subsets {9, 7, 1, 6} and {0, 5, 4, 8}. This additional balancing ensures that all backend tasks are evenly excluded from consideration, producing a better distribution.
```
Member:

Use the blockquote > instead of the code block ```.

// subset_size distinct backends and no 2 clients connect to the same backend.
subset_count = backend_count / subset_size

// Given subset_count we now can divide clients by rounds. Every round have exactly subset_count clients.
Member:

nit: s/have/has/

// excluded_count indicates how many leftover backends we have on every round.
excluded_count = backend_count % subset_size

// We want to choose what backends are excluded in a round robin fashion before shufling the backends.
Member:

nit: s/shufling/shuffling/

// excluded_end wraps around the end of the addresses list, exclude intervals [0:excluded_start] and [excluded_end, end_of_the_array)
}

// randomly shuffle the addresses to increase subset diversity. Use round as seed to make sure
Member:

This shuffling is unlikely to produce the same results cross-language. How much do we care about that?
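If cross-language consistency turns out to matter, one option would be to pin both the shuffle and the PRNG in the gRFC text so every implementation produces the same permutation for a given seed. A Go sketch of that idea (Fisher-Yates driven by SplitMix64; not something the current draft specifies):

```go
// splitMix64 is a tiny, fully specified PRNG; any implementation that follows
// these exact steps produces the same sequence for the same seed.
func splitMix64(state *uint64) uint64 {
	*state += 0x9E3779B97F4A7C15
	z := *state
	z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9
	z = (z ^ (z >> 27)) * 0x94D049BB133111EB
	return z ^ (z >> 31)
}

// deterministicShuffle is a Fisher-Yates shuffle driven by splitMix64, so the
// resulting permutation depends only on the seed (the modulo bias is
// negligible for realistic list sizes).
func deterministicShuffle(addrs []string, seed uint64) {
	state := seed
	for i := len(addrs) - 1; i > 0; i-- {
		j := int(splitMix64(&state) % uint64(i+1))
		addrs[i], addrs[j] = addrs[j], addrs[i]
	}
}
```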

google.protobuf.UInt32Value client_index = 1;

// subset_size indicates how many backends every client will be connected to.
// Default is 10.
Member:

I'm tempted to not have a default here (or make it MAX_INT). I know for WRR we believe the minimum subset size should be 20 for many workloads. But there's so much that goes into this that is usage-specific.


// calculate start and end of the resulting subset
start = subsetId * subset_size
end = start + subset_size
return addresses[start: end]
Member:

After we select which addresses, we could map the addresses back to their original order. Is that worthwhile?
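For reference, the pseudocode fragments quoted in the threads above assemble into roughly the following Go sketch (illustrative only, not the normative algorithm text; it assumes the address list is already sorted consistently across clients):

```go
package main

import (
	"fmt"
	"math/rand"
)

// deterministicSubset returns the subset of addresses this client should
// connect to, following the structure of the quoted pseudocode.
func deterministicSubset(addresses []string, clientIndex, subsetSize int) []string {
	backendCount := len(addresses)
	if backendCount <= subsetSize {
		return addresses
	}
	// Integer (floor) division, made explicit as suggested above.
	subsetCount := backendCount / subsetSize

	// Clients are grouped into rounds; every round has exactly subsetCount clients.
	round := clientIndex / subsetCount

	// Leftover backends that do not fit into a full subset in this round.
	excludedCount := backendCount % subsetSize
	// Choose the excluded backends in a round-robin fashion before shuffling,
	// so that all backends are evenly excluded across rounds.
	excludedStart := (round * excludedCount) % backendCount
	excludedEnd := (excludedStart + excludedCount) % backendCount

	included := make([]string, 0, backendCount-excludedCount)
	for i, addr := range addresses {
		if excludedStart <= excludedEnd {
			if i >= excludedStart && i < excludedEnd {
				continue // inside the excluded interval
			}
		} else if i >= excludedStart || i < excludedEnd {
			continue // the excluded interval wraps around the end of the list
		}
		included = append(included, addr)
	}

	// Shuffle with the round as the seed so that all clients in the same round
	// see the same permutation. Note: math/rand is not reproducible across
	// languages; see the shuffle discussion above.
	rng := rand.New(rand.NewSource(int64(round)))
	rng.Shuffle(len(included), func(i, j int) {
		included[i], included[j] = included[j], included[i]
	})

	subsetID := clientIndex % subsetCount
	start := subsetID * subsetSize
	return included[start : start+subsetSize]
}

func main() {
	backends := []string{"10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051",
		"10.0.0.4:50051", "10.0.0.5:50051", "10.0.0.6:50051", "10.0.0.7:50051"}
	fmt.Println(deterministicSubset(backends, 3, 2)) // clientIndex=3, subsetSize=2
}
```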


`deterministic_subsetting` will be added as a new LB policy.

```textproto
Member:

nit: This should just be proto. Textproto is for formatting a message instance. You can also add this code styling to the "LB Policy Config and Parameters" section.


@markdroth (Member)

@larry-safran

With a small set of clients one should be using some kind of round robin and not doing any subsetting.

I don't think that's what we want in all cases. If the number of servers is very large but the number of clients is small, subsetting is almost certainly the right thing. For example, consider a locality with a relatively small pool of ingress proxies providing access to a large number of backend services, each of which can have a large number of endpoints: in that case, we really want subsetting, not RR, because the connection fan-out could be immense.

@larry-safran

@markdroth that's a good point. If there are a small number of clients, then I would expect no need to scale the control plane, so you could in that situation have a single control plane server handing out individual client ids. If there were 2 ways to get client-id, one from the control plane and the other being to generate it on the client, then the problem becomes much more tractable.

@markdroth (Member)

@larry-safran

If there are a small number of clients, then I would expect no need to scale the control plane

I don't think that's true. A control plane can have many different pools of clients. The overall number of clients can be large enough that the control plane is scaled up, but one individual pool of clients can be small enough that randomly assigning its client_ids won't provide proper distribution.

// client_index is an index within the
// interval [0..N-1], where N is the total number of clients.
// Every client must have a unique index.
google.protobuf.UInt32Value client_index = 1;
Author:

This is really a continuation of the discussion about client_id, but I want to start a separate thread to keep it more focused. What I am trying to do is to take a step back and think about options that don't involve client_id.

One obvious option is fully random subsetting, where every client selects N backends at random. The problem with this option is that the resulting load on the backends ends up being uneven, which is described in more detail here. But what if we can make it better by using WRR as a child balancer? My intuition is that WRR should make clients send more requests to the least utilized backends within the subset, and if we have good subset diversity and relatively large subsets, WRR can, at least partially, compensate for backend load imbalance.
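A minimal sketch of the random-subsetting part, assuming each client draws a seed once at startup and keeps it for its lifetime so that its subset stays stable:

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomSubset picks subsetSize backends for this client by shuffling the
// address list with a per-client seed chosen once at startup.
func randomSubset(addresses []string, subsetSize int, clientSeed int64) []string {
	if len(addresses) <= subsetSize {
		return addresses
	}
	shuffled := append([]string(nil), addresses...)
	rng := rand.New(rand.NewSource(clientSeed))
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return shuffled[:subsetSize]
}

func main() {
	clientSeed := rand.Int63() // drawn once per client process
	backends := []string{"a:50051", "b:50051", "c:50051", "d:50051", "e:50051"}
	fmt.Println(randomSubset(backends, 3, clientSeed))
}
```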

If this is not enough we can extend this idea by creating a "load aware subset balancer". The balancer will do what I described above, but additionally it will periodically remove the most utilized backend from its subset if the load on that backend diverges too far from the mean for the whole subset. This will probably require an additional ORCA metric which reports pure CPU load on the backends. (I think the metric used by WRR is more of a cost per request, which makes it not suitable for comparing load across backends.)

@markdroth what do you think?

Member:

One obvious option is fully random subsetting, where every client selects N backends at random. The problem with this option is that the resulting load on the backends ends up being uneven, which is described in more detail here. But what if we can make it better by using WRR as a child balancer? My intuition is that WRR should make clients send more requests to the least utilized backends within the subset, and if we have good subset diversity and relatively large subsets, WRR can, at least partially, compensate for backend load imbalance.

WRR definitely does help balance load within the subset. However, I think that solves the problem only for workloads where every client generates approximately the same load. If the subsets are not evenly distributed and you happen to have a set of clients that generate more load than others and they all happen to be talking to the same set of backends, then you will still wind up with uneven load across all backends.

If this is not enough we can extend this idea by creating a "load aware subset balancer". The balancer will do what I described above, but additionally it will periodically remove the most utilized backend from its subset if the load on that backend diverges too far from the mean for the whole subset. This will probably require an additional ORCA metric which reports pure CPU load on the backends. (I think the metric used by WRR is more of a cost per request, which makes it not suitable for comparing load across backends.)

If WRR is using the right weight for balancing load across the backends, then it's not clear to me that this kind of outlier detection-style backend ejection is actually necessary. If one particular backend's weight is much lower than the others, WRR should give it much less traffic anyway, so ejecting the backend wouldn't really change anything.

Also, if the problem is that a particular set of clients is generating higher load than other clients, then ejecting some of the backends in the subset seems like it could result in even worse load imbalance, by forcing those heavy-traffic clients onto an even smaller set of backends.

Member:

(To be clear, BTW, I am not necessarily objecting to implementing random subsetting -- there may very well be workloads for which that works fine. I'm just pointing out other factors to consider here when evaluating whether a given algorithm will work for your particular workload.)

Author:

WRR definitely does help balance load within the subset. However, I think that solves the problem only for workloads where every client generates approximately the same load. If the subsets are not evenly distributed and you happen to have a set of clients that generate more load than others and they all happen to be talking to the same set of backends, then you will still wind up with uneven load across all backends.

I think this argument can be applied to any subsetting implementation, including Google's as well as Twitter's Aperture. None of those algorithms take into account the potential differences between clients; they only balance the number of connections per backend, not the requests per backend. I guess the way this works in practice is by grouping clients into sets of identical instances. My intuition is that random subsetting should deal with this particular problem (different types of clients) just fine, because the resulting subsets are fully random, which makes the probability of any group of clients talking to the same subset of backends almost negligible.

> If WRR is using the right weight for balancing load across the backends, then it's not clear to me that this kind of outlier detection-style backend ejection is actually necessary. If one particular backend's weight is much lower than the others, WRR should give it much less traffic anyway, so ejecting the backend wouldn't really change anything.
>
> Also, if the problem is that a particular set of clients is generating higher load than other clients, then ejecting some of the backends in the subset seems like it could result in even worse load imbalance, by forcing those heavy-traffic clients onto an even smaller set of backends.

I think I described it poorly. When I said "eject", what I really meant is to first eject the backend and then replace it with another random one, so we always have exactly N backends per client. The problem with just WRR is that it can balance only requests, not connections per backend, and the proposed solution could also be used to balance the connections. For example, we can implement the load metric to take into account the number of incoming connections instead of CPU load. Then the top-level subsetting LB will balance just the number of connections per backend, and the child LB (wrr) will balance load within the subset. I think that by tuning different types of load metrics we can achieve perfect load distribution across all backends, though I fully agree that it might not be necessary and just subsetting + wrr might be enough.

What I am going to do is to test all those options on a synthetic benchmark. I'll get back to this PR when I have some results.

Author

I got some results testing random subsetting with WRR

Test setup

  • 87 backends, each with 1 CPU, but some backends are more powerful than others.
  • 93 clients. Clients [0-69] are "normal" (they sleep for 15 milliseconds between requests); the rest of the clients are "noisy" (they keep sending requests without any delay).
  • Test scenario:
    • Start the test and wait for 5 min.
    • Do a full backend rollout and wait for 5 min again.

Here are the results:

round robin
Screenshot 2023-09-08 at 9 59 35 AM
Nothing surprising here: all servers get an equal amount of connections and requests; there are 2 visibly distinct buckets for normal and noisy clients, as well as for fast and slow servers.

wrr
Screenshot 2023-09-08 at 10 04 06 AM
As expected, wrr was able to fix the backend CPU imbalance by adjusting the corresponding weights: slow servers now receive fewer requests than fast servers, and the CPU load is well aligned (within ~4%).

random subsetting + wrr (subset size = 10)
Screenshot 2023-09-08 at 11 00 18 AM

First of all, the resulting number of connections looks quite bad. Second, wrr with the default load function doesn't actually do anything to make the resulting CPU load better. WRR only takes into account "cost per request", which is different for slow and fast servers, but is identical for servers that are in the same group yet happen to receive a different number of connections.

To fix this problem I changed the definition of the CPULoad parameter and multiplied it by the number of incoming connections (this is what I call "scaled wrr").
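A minimal sketch of the reporting-side idea, assuming a hypothetical connection counter maintained by the server (the actual benchmark code differs):

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // openConnections would be maintained by the server's connection tracking
    // (for example a stats handler); it is a stand-in for illustration.
    var openConnections atomic.Int64

    // scaledCPUUtilization is the value the server reports as its utilization.
    // Multiplying raw CPU load by the connection count makes backends that hold
    // more connections look "more expensive", so wrr shifts traffic toward
    // backends that received fewer connections.
    func scaledCPUUtilization(cpuUtilization float64) float64 {
        n := openConnections.Load()
        if n == 0 {
            return cpuUtilization
        }
        return cpuUtilization * float64(n)
    }

    func main() {
        openConnections.Store(3)
        fmt.Println(scaledCPUUtilization(0.5)) // this value would be fed into the ORCA report
    }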

random subsetting + scaled wrr (subset size = 10)
Here I must admit that the scaled wrr version works very well if we have a single type of client. If, however, we have both noisy and normal clients, it doesn't work well. If a backend has 2 noisy connections and 1 normal connection, it should ideally get a different weight than a server that gets 1 noisy and 2 normal connections. To deal with this problem I added an interceptor that writes the client type to a metadata header and grouped the connections by client_type on the server. Then I returned different ORCA responses for normal and noisy clients (the scaling factor for a particular client now takes into account only the number of incoming connections of the same type). In order to do that I had to fork gRPC and slightly modify the ORCA producer interface; a sketch of such an interceptor is shown below, after the results. The results are the following:
Screenshot 2023-09-08 at 10 33 16 AM
The results are much better but still not perfect. There are a few clear outliers with lower CPU usage than the rest of the backends. My guess is that those outliers are "lucky" in the sense that they got connections only from normal clients, and wrr can't do anything to put more load on them. If this is true, it is a problem of subset size, so in the next test I bumped it to 30.
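For reference, a rough sketch of the kind of client-side interceptor described above; the header name, the client-type values, and the way the type is obtained are illustrative assumptions rather than the actual benchmark code:

    package main

    import (
        "context"
        "os"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        "google.golang.org/grpc/metadata"
    )

    // clientTypeInterceptor stamps every outgoing RPC with the client type so
    // the server can group connections by type when computing ORCA responses.
    func clientTypeInterceptor(clientType string) grpc.UnaryClientInterceptor {
        return func(ctx context.Context, method string, req, reply interface{},
            cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
            ctx = metadata.AppendToOutgoingContext(ctx, "x-client-type", clientType)
            return invoker(ctx, method, req, reply, cc, opts...)
        }
    }

    func main() {
        // e.g. "normal" or "noisy" in the benchmark setup.
        conn, err := grpc.Dial("localhost:50051",
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            grpc.WithUnaryInterceptor(clientTypeInterceptor(os.Getenv("CLIENT_TYPE"))))
        if err != nil {
            panic(err)
        }
        defer conn.Close()
    }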

random subsetting + scaled wrr (subset size = 30)
Screenshot 2023-09-08 at 11 16 51 AM
Now we get perfect results, visually indistinguishable from plain wrr. However, figuring out the right subset size for a particular workload in this scenario might be challenging, and for some types of workloads we may need relatively large subset sizes, which is not ideal.

Open questions

  • We didn't explore the option to dynamically rebalance the connections in the subsetting load balancer based on the number of incoming connections per server. This would require making the subsetting LB aware of ORCA metrics, as well as some changes to the ORCA setup. The potential benefit is that it could probably help us achieve convergence with much smaller subset sizes. But I would rather not spend time exploring this option if it is something the gRPC maintainers will never accept.
  • Should we make ORCA capable of returning different metrics depending on the client? If yes, this probably should go in a different gRFC, but the change itself is pretty small.

Proposed plan
Given the shortcomings of the random subsetting + wrr approach, we propose to get back to the deterministic subsetting implementation. In order to address the major concern about propagating the client_id to the clients, we propose the following:

  • Get the client ID from an environment variable (something like GRPC_SUBSETTING_CLIENT_ID). This guarantees that the ID is stable (see the sketch after this list).
  • In a multi-cluster k8s environment we will do the following to generate this ID:
    • Treat the same clients from different k8s clusters as different sets of clients and assign IDs to them independently.
    • Use an admission webhook that watches all pods of a deployment and assigns the next available integer ID as soon as it detects a new pod. Then modify the pod spec and inject the ID as an environment variable.
    • The shortcomings of the described approach are that the resulting IDs won't start from 0, in some cases the indexes may have gaps (for example, if a pod gets manually deleted), and we will artificially split the clients across k8s cluster boundaries. All this will result in a worse connections-per-backend distribution, but my intuition is that it will still be way better than with random subsetting. We will use "scaled wrr" to correct the resulting connection imbalance, but given that this imbalance will be much smaller than with random subsetting, we will be able to use much smaller subsets. And this should work for any type of workload.
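A minimal sketch of the client-side part of this plan, assuming the environment variable name from the proposal and an illustrative fallback when the variable is missing:

    package main

    import (
        "fmt"
        "math/rand"
        "os"
        "strconv"
    )

    // subsettingClientID returns the stable client ID injected by the admission
    // webhook. The fallback to a random ID is illustrative: it keeps the client
    // working but degrades the determinism of the subsetting.
    func subsettingClientID() int {
        if v, err := strconv.Atoi(os.Getenv("GRPC_SUBSETTING_CLIENT_ID")); err == nil && v >= 0 {
            return v
        }
        return rand.Int()
    }

    func main() {
        fmt.Println(subsettingClientID()) // would be passed to the subsetting LB policy config
    }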

@markdroth what do you think?

Author

TL;DR of our offline subsetting discussion:

  • Any form of deterministic subsetting requires coordination between clients, and if we need coordination between clients, this should be done via the control plane. But if a control plane is a requirement for subsetting, we can also run the subsetting algorithm in the control plane, which is the recommended way of doing deterministic subsetting. Doing deterministic subsetting inside a client-side LB policy is not an acceptable option.
  • Random subsetting is ok.
  • We should look into how we can improve the imbalance generated by random subsetting. Scaling wrr by the number of connections is not ideal because it is not generic enough; instead we should look into algorithms that balance based on load rather than on server capacity (as wrr does). Fighting oscillations is the major problem here.

To address the last point I want to test how a Proportional–Integral–Derivative (PID) controller works for balancing the load. It was suggested to me by some folks at gRPC conf, and it is actually used in practice by some companies to correct the imbalance generated by random subsetting. It is also specifically designed to prevent oscillations and provides some mathematical guarantees about that.
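To make the control law concrete, here is a minimal PD controller sketch (the integral term is omitted, which matches the experiments described below); the structure, names, and gain values are illustrative, not the actual implementation:

    package main

    import (
        "fmt"
        "time"
    )

    // pdController is a proportional-derivative controller.
    type pdController struct {
        kp, kd    float64 // proportional and derivative gains
        lastError float64
        primed    bool
    }

    // update returns the control signal for the current sample. reference is the
    // target value (e.g. mean utilization across the subset) and actual is the
    // observed value (this backend's utilization).
    func (c *pdController) update(reference, actual float64, dt time.Duration) float64 {
        err := reference - actual
        var derivative float64
        if c.primed && dt > 0 {
            derivative = (err - c.lastError) / dt.Seconds()
        }
        c.lastError = err
        c.primed = true
        return c.kp*err + c.kd*derivative
    }

    func main() {
        c := &pdController{kp: 0.5, kd: 0.1}
        // The backend is above the mean load, so the signal is negative and the
        // client should lower this backend's weight.
        fmt.Println(c.update(0.60, 0.75, time.Second))
    }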

If the PID controller approach doesn't work, or ends up being unacceptable as well, I'll give up on this proposal.

Author

We've got some really interesting results while testing a PID controller in combination with random subsetting. After we test it with real apps in production, I am going to close this proposal and instead open a new one, which will define 2 new LB policies:

  • Random subsetting using rendezvous hashing.
  • PID controller LB policy.

@markdroth @ejona86 does this sound ok to you?

PID controller test results
Before I proceed to the results, I want to highlight a few important details about my PID controller implementation:

  • It is based on wrr; it uses the signal from the PID controller to modify the wrr weights.
  • It stores a PID controller per wrr sub-channel. The controller is configured to use the mean CPU utilization over all connected upstream servers as the target value.
  • I used the following code to convert the signal from the PID controller into a wrr weight:
	// Feed the latest utilization sample into the PID controller.
	w.pidController.Update(pid.ControllerInput{
		ReferenceSignal:  meanUtilization,           // target: mean utilization across the subset
		ActualSignal:     utilization,               // observed: this backend's utilization
		SamplingInterval: time.Since(w.lastUpdated),
	})
	if w.weightVal == 0 {
		w.weightVal = 1
	}
	// Convert the control signal into a multiplicative weight adjustment:
	// a positive signal grows the weight proportionally, a negative signal
	// shrinks it (the multiplier stays in (0, 1) for negative signals).
	signal := w.pidController.State.ControlSignal
	mult := 1.0
	if signal >= 0 {
		mult = 1.0 + signal
	} else {
		mult = -1.0 / (signal - 1.0)
	}
	w.weightVal *= mult

If the signal is positive we slowly increase the weight, proportionally to the value of the signal; the opposite happens if the signal is negative.

  • I used only the proportional and derivative parts of the PID controller (and ignored the integral part). Integrals accumulate over time and need to be discharged somehow; there are techniques to do that, but they didn't show any improvement compared to a simple PD controller. I think the integral part is really important if the target value changes sharply. In our case the target value is meanUtilization, which should only change gradually. (We actually applied a moving average while reporting CPU load to make sure the utilization metric is smooth, which is very important for the PID controller to avoid oscillations. I'll cover this in more detail later; a sketch of this smoothing follows the list below.)
  • When using a PID controller, the weight update period must be synchronized with the reporting period; otherwise we risk updating the internal state of the controller with values that are not actually applied. I achieved this by using out-of-band reporting with the same interval as the weight update period; we can achieve the same result with in-band reporting as well.
  • The resulting PID controller LB exposes only 2 additional parameters in comparison to the wrr LB: ProportionalGain and DerivativeGain. They are easy to tune: increasing ProportionalGain reduces the time to converge but increases the chance of oscillations; increasing DerivativeGain makes the convergence smoother and reduces the chance of overshoot and oscillations, but if set too high it can slow down convergence.
  • We can provide some guarantees about non-oscillation when using a PID controller. The probability of oscillations depends on:
    • Delay between correction action and getting corresponding load results (depends on the frequency of weight updates and load reports)
    • ProportionalGain (the higher the value the higher the probability of oscillations)
    • DerivativeGain (the lower the value the higher the probability of oscillations)
    • How fast utilization changes on the server (can be capped by using moving average either on the server or on the client)
    • How fast meanUtilization changes (can be capped by using moving average either on the server or on the client)
      The probability of oscillations depends only indirectly on the QPS and load profile generated by individual clients (via changes to server utilization and meanUtilization). Using a moving average will cut the noise in utilization. All other parameters are hardcoded and can be tuned in a generic way for any type of workload.
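A minimal sketch of the moving-average smoothing mentioned above, using an exponentially weighted moving average with an illustrative smoothing factor (the actual implementation may differ):

    package main

    import "fmt"

    // ewma smooths a stream of utilization samples with an exponentially
    // weighted moving average; alpha is the weight given to the newest sample.
    type ewma struct {
        alpha float64
        value float64
        ready bool
    }

    func (e *ewma) add(sample float64) float64 {
        if !e.ready {
            e.value, e.ready = sample, true
            return e.value
        }
        e.value = e.alpha*sample + (1-e.alpha)*e.value
        return e.value
    }

    func main() {
        e := &ewma{alpha: 0.2}
        for _, s := range []float64{0.3, 0.9, 0.2, 0.8} {
            fmt.Printf("%.3f\n", e.add(s)) // the reported utilization changes gradually
        }
    }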

Test setup
This time I ran 2 different tests:

  • Test 1
    • 93 clients, 87 servers
    • 2 types of clients (normal and noisy)
    • 2 types of servers (fast and slow)
  • Test 2
    • 93 clients, 87 servers
    • 2 types of clients (normal and spiky)
      • Spiky clients increase load following a sinusoid with a ~1 min period
    • 2 types of servers (fast and slow)
    • 3 types of requests (light, medium, heavy) with random distribution
      The second test is needed to validate that the PID controller can adapt to changes in load quickly enough (as you will see from the graph, the time between spikes was ~1 min).

Test 1: plain wrr with no subsetting
Screenshot 2023-09-27 at 9 50 26 PM
I use this test as a reference point; we can't do better than that.

Test 1: PID controller + subsetting (subset_size = 20)
Screenshot 2023-09-27 at 9 51 45 PM
As you can see, the load converges to almost a perfect straight line after ~1 min.

Test 2: plain wrr with no subsetting
Screenshot 2023-09-27 at 9 53 11 PM

Test 2: PID controller + subsetting (subset_size = 20)
Screenshot 2023-09-27 at 9 54 59 PM
The load distribution during spikes is not as tight as with wrr and no subsetting, but IMO it doesn't look bad at all. The spikes are pretty short (~1 min) and the PID controller can still correct the load with no oscillations or overshoot.

Test 1: PID controller + subsetting (subset_size = 5)
Screenshot 2023-09-27 at 9 57 07 PM
I also decided to test what happens when we use a very small subset size. As you can see, the resulting load still converges, even though the connection distribution is terribly unbalanced in this case. Also, with very small subsets every client has a very limited view of the servers, but given enough subset diversity the load still converges.

Member

I can't fully respond, but I also don't want to leave you hanging without hearing anything back.

Those results look really nice. The PID worked well. Subsetting behaved as expected. You threw a lot of ugly cases at it and it held up. I'm excited.

There have been some separate conversations, which happen to have taken place in parallel, that may impact the random subsetting... In a conversation about very low QPS clients we found ourselves talking about dynamic subset sizing. @markdroth was going to figure out how much overlap there was with this, IIRC.

FYI, I'm on vacation Oct 10th through Oct 18th.

Author

I can finally report some results here: we tested the PID balancer for a few apps in a prod environment and learned a few important lessons about how to tune it. Here is the result of applying the PID balancer in one of our production DCs for an app that uses random subsetting.
Screenshot 2024-04-02 at 3 44 04 PM
We want to contribute the code upstream, so we don't have to support a fork of wrr and the rest of the community can reuse it as well. I am going to write 3 gRFCs:

  • Add simple client-side random subsetting based on consistent hashing (a sketch using rendezvous hashing, as mentioned earlier in this thread, follows this list).
  • Add moving-average support to server-side ORCA metrics (this is critical for PID to work correctly; otherwise weights won't converge for some spiky workloads).
  • Add a PID balancer (my implementation reuses 90% of the WRR code, so I am going to propose some options for reusing the code without duplication).
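A minimal sketch of the client-side subsetting from the first item, assuming rendezvous hashing (as mentioned earlier in this thread) and a stable per-client ID; function names and the hash choice are illustrative:

    package main

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    // score computes a rendezvous (highest-random-weight) score for a
    // (client, backend) pair.
    func score(clientID uint64, addr string) uint64 {
        h := fnv.New64a()
        fmt.Fprintf(h, "%d|%s", clientID, addr)
        return h.Sum64()
    }

    // subset returns the n backends with the highest score for this client.
    // Different clients pick different but overlapping subsets, and adding or
    // removing a backend only shifts a small fraction of the assignments.
    func subset(clientID uint64, backends []string, n int) []string {
        if n >= len(backends) {
            return backends
        }
        sorted := append([]string(nil), backends...)
        sort.Slice(sorted, func(i, j int) bool {
            return score(clientID, sorted[i]) > score(clientID, sorted[j])
        })
        return sorted[:n]
    }

    func main() {
        backends := []string{"10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80", "10.0.0.4:80"}
        fmt.Println(subset(42, backends, 2))
    }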

@markdroth @ejona86 does this sound good?

Member

Those last two could be in one gRFC, since the PID balancer is to use the new ORCA functionality.

@s-matyukevich (Author)

Closed in favor of #423
