New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Audit code path when GrpcLB is not in ClassPath #4602
Comments
Oh, and we should coordinate with C to make sure we have similar logic here. I think Go has gRPCLB unconditionally, so wouldn't be impacted. |
I talked with @markdroth. We both agree that it's a configuration error that the service is configured with grpclb but the client is not loaded with the required library, and that the correct thing to do is to fail all RPCs. |
@zhangkun83, note that means that existing services wouldn't be able to enable gRPCLB, as it would break some existing clients. |
This is now being marked as a bug and blocker because 1.17 enabled SRV lookup by default, thus exposing all users to this behavior. |
Talked with @markdroth, he was proposing failing because of alts/grpclb use case. However, alts users would have alts which depends on grpclb, thus they will never have this failure mode. We have agreed that it'd be fine to log a warning per channel and fallback to round_robin when this happens. |
…ilable. When service owner turns on grpclb through service config, it shouldn't break existing clients that don't have grpclb in their classpath. Resolves grpc#4602
With bazel round_robin is in util package which is not included in core, thus falling back to round_robin may still fail for some users. Re-opening this issue. |
@zhangkun83 @markdroth why was round_robin chosen to be the fallback originally? The default behavior is normally pick_first, so changing that based on whether there are gRPCLB addresses seems unusual. Also, what if the LoadBalancingPolicy field of the service config specifies another policy - should we fall back to it instead? |
The original idea was that if you're using grpclb, then you're really trying to balance the load accross your backends, so round_robin is a much better fallback than pick_first. I believe that all of our grpclb client implementations currently using round_robin as the fallback when they can't talk to a balancer, and the case being described in this bug can be considered a special case of that. That having been said, we do know of at least one situation where it would be better to fall back to pick_first instead of round_robin, so we had intended to come up with a design to allow the grpclb fallback policy to be configurable. However, we never got around to doing that, and I don't think it makes much sense to do so at this point, since we're trying to move away from grpclb. (Note that the design for the new xds LB policy does include a configurable fallback policy.) |
We need to fully document the balancer selection process somewhere. Right now this is what I think it is:
Is that right?
I don't think it's so obvious that round_robin is superior (e.g. the fallbacks could be reverse proxies doing their own load balancing), but given the existing behavior of grpclb automatically doing round robin when the remote balancer cannot be reached, round robin also seems like the obvious choice if grpclb cannot be started. [EDIT: added client API case (# 4).] |
I think the order you specified is right, except that it's missing one element: if the LB policy is specified via the client API, that should go between 3 and 4 (i.e., it trumps the pick_first default but does not override anything that comes from the service config). The case you mentioned of the fallbacks being reverse proxies is the exact case I was referring to earlier that led us to think this should ideally be configurable. We're fixing this in the new design. |
I updated the comment to add the client API case. In the (experimental) Go API, our option actually takes the highest priority right now. I will look into changing that while adding support for the |
With #6637 we no longer pass grpclb addresses as normal addresses. Instead, they are passed in the attributes with the |
We need to guarantee that if GRPCLB addresses are provided by the NameResolver, they won't make it to a LB instance other than GRPCLB, as other LB types will not properly ignore the GRPCLB addresses and try to route traffic to them and never realize all such RPCs will fail.
Today it seems AutoConfiguredLoadBalancerFactory will throw, which isn't probably appropriate because it will cause the channel to go into panic mode. Instead, it could choose a LB instance that fails all RPCs with a clear Status. Or it could consider stripping the LB addresses and continue like normal, passing the non-LB addresses to the PF/RR/etc. If there are no non-LB addresses, maybe it could still fail RPCs with a clear status saying gRPCLB is required for this target.
We should also add a comment to the GRPCLB Attribute Key noting that LB policies won't receive it.
The text was updated successfully, but these errors were encountered: