
Audit code path when GrpcLB is not in ClassPath #4602

Closed
ejona86 opened this issue Jun 29, 2018 · 12 comments

@ejona86 (Member) commented Jun 29, 2018

We need to guarantee that if GRPCLB addresses are provided by the NameResolver, they never reach an LB instance other than GRPCLB. Other LB types will not ignore the GRPCLB addresses; they will try to route traffic to them without realizing that all such RPCs will fail.

Today it seems AutoConfiguredLoadBalancerFactory will throw, which probably isn't appropriate because it will cause the channel to go into panic mode. Instead, it could choose an LB instance that fails all RPCs with a clear Status. Or it could strip the LB addresses and continue as normal, passing the non-LB addresses to the PF/RR/etc. If there are no non-LB addresses, it could still fail RPCs with a clear Status saying gRPCLB is required for this target.
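
One way the stripping option could look, as a minimal sketch: filter out any EquivalentAddressGroup carrying a balancer marker attribute before handing the list to PF/RR. The attribute key below is illustrative, not the real grpc-java constant.

```java
import io.grpc.Attributes;
import io.grpc.EquivalentAddressGroup;
import java.util.ArrayList;
import java.util.List;

final class LbAddressFilter {
  // Hypothetical marker key; the real key lives in grpc-java's grpclb
  // support and may differ by name and version.
  static final Attributes.Key<String> ATTR_LB_ADDR_AUTHORITY =
      Attributes.Key.create("example.lbAddrAuthority");

  /** Returns only the non-balancer addresses, suitable for PF/RR. */
  static List<EquivalentAddressGroup> stripLbAddresses(
      List<EquivalentAddressGroup> groups) {
    List<EquivalentAddressGroup> filtered = new ArrayList<>();
    for (EquivalentAddressGroup group : groups) {
      if (group.getAttributes().get(ATTR_LB_ADDR_AUTHORITY) == null) {
        filtered.add(group);  // keep: not marked as a balancer address
      }
    }
    return filtered;
  }
}
```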

We should also add a comment to the GRPCLB Attribute Key noting that LB policies won't receive it.

@ejona86 ejona86 added this to the Next milestone Jun 29, 2018
@ejona86 (Member, Author) commented Jun 29, 2018

Oh, and we should coordinate with C to make sure we have similar logic there. I think Go includes gRPCLB unconditionally, so it wouldn't be impacted.

@zhangkun83 (Contributor) commented

I talked with @markdroth. We both agree that it is a configuration error for the service to be configured with grpclb while the client doesn't have the required library loaded, and that the correct thing to do is to fail all RPCs.

@ejona86 (Member, Author) commented Jun 29, 2018

@zhangkun83, note that this means existing services wouldn't be able to enable gRPCLB, as it would break some existing clients.

@zpencer zpencer modified the milestones: Next, 1.14 Jul 6, 2018
@zpencer zpencer modified the milestones: 1.14, 1.15, Next Jul 18, 2018
@zhangkun83 zhangkun83 modified the milestones: 1.15, 1.16 Aug 28, 2018
@ejona86 ejona86 modified the milestones: 1.16, 1.17 Oct 11, 2018
@creamsoup creamsoup modified the milestones: 1.17, 1.18 Dec 4, 2018
@ejona86 (Member, Author) commented Dec 6, 2018

This is now being marked as a bug and blocker because 1.17 enabled SRV lookup by default, thus exposing all users to this behavior.

@ejona86 ejona86 added the bug and TODO:release blocker labels Dec 6, 2018
@zhangkun83 (Contributor) commented Dec 6, 2018

Talked with @markdroth; he was proposing failing because of the ALTS/grpclb use case. However, ALTS users would have alts, which depends on grpclb, so they would never hit this failure mode. We have agreed that it'd be fine to log a warning per channel and fall back to round_robin when this happens.
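
A rough sketch of that agreed behavior, assuming a hypothetical selection hook (this is not the actual AutoConfiguredLoadBalancerFactory code): warn once per channel when balancer addresses are present but grpclb isn't on the classpath, then fall back to round_robin.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class GrpclbFallback {
  private static final Logger logger =
      Logger.getLogger(GrpclbFallback.class.getName());
  private boolean warned;  // per-channel: emit the warning only once

  String choosePolicy(boolean hasBalancerAddresses, boolean grpclbOnClasspath) {
    if (hasBalancerAddresses && !grpclbOnClasspath) {
      if (!warned) {
        warned = true;
        logger.log(Level.WARNING,
            "Found balancer addresses but grpclb runtime is not on the "
                + "classpath; falling back to round_robin");
      }
      return "round_robin";
    }
    return hasBalancerAddresses ? "grpclb" : "pick_first";
  }
}
```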

zhangkun83 added a commit to zhangkun83/grpc-java that referenced this issue Dec 6, 2018
…ilable.

When service owner turns on grpclb through service config, it
shouldn't break existing clients that don't have grpclb in their
classpath.

Resolves grpc#4602
@zhangkun83 zhangkun83 reopened this Dec 7, 2018
@zhangkun83 (Contributor) commented

With Bazel, round_robin is in the util package, which is not included in core, so falling back to round_robin may still fail for some users. Re-opening this issue.

@dfawley (Member) commented Dec 10, 2018

@zhangkun83 @markdroth why was round_robin originally chosen as the fallback? The default behavior is normally pick_first, so changing that based on whether there are gRPCLB addresses seems unusual. Also, what if the LoadBalancingPolicy field of the service config specifies another policy - should we fall back to it instead?

@markdroth (Member) commented

The original idea was that if you're using grpclb, then you're really trying to balance the load across your backends, so round_robin is a much better fallback than pick_first. I believe that all of our grpclb client implementations currently use round_robin as the fallback when they can't talk to a balancer, and the case described in this bug can be considered a special case of that.

That having been said, we do know of at least one situation where it would be better to fall back to pick_first instead of round_robin, so we had intended to come up with a design to allow the grpclb fallback policy to be configurable. However, we never got around to doing that, and I don't think it makes much sense to do so at this point, since we're trying to move away from grpclb. (Note that the design for the new xds LB policy does include a configurable fallback policy.)

@dfawley (Member) commented Dec 11, 2018

We need to fully document the balancer selection process somewhere. Right now this is what I think it is:

  1. If loadBalancingConfig is present in the service config, attempt to use it.
  2. If grpclb balancer records are returned by the resolver, use grpclb.
    • If there is an error starting grpclb, or if no addresses can be obtained from the remote balancer, round robin over the fallback addresses.
  3. If loadBalancingPolicy is present in the service config, attempt to use it.
  4. If LB policy is set via the client API, attempt to use it.
  5. Use pick_first

Is that right?
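
A condensed sketch of that precedence, with all parameters as illustrative stand-ins rather than the real grpc-java API:

```java
final class LbPolicySelection {
  static String select(
      String loadBalancingConfig,  // from service config, or null
      boolean hasGrpclbAddresses,  // resolver returned balancer records
      String loadBalancingPolicy,  // from service config, or null
      String clientApiPolicy) {    // set via the client API, or null
    if (loadBalancingConfig != null) {
      return loadBalancingConfig;  // 1. loadBalancingConfig wins
    }
    if (hasGrpclbAddresses) {
      return "grpclb";             // 2. balancer records force grpclb
    }
    if (loadBalancingPolicy != null) {
      return loadBalancingPolicy;  // 3. loadBalancingPolicy field
    }
    if (clientApiPolicy != null) {
      return clientApiPolicy;      // 4. policy set via the client API
    }
    return "pick_first";           // 5. default
  }
}
```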

The original idea was that if you're using grpclb, then you're really trying to balance the load across your backends, so round_robin is a much better fallback than pick_first

I don't think it's so obvious that round_robin is superior (e.g. the fallbacks could be reverse proxies doing their own load balancing), but given the existing behavior of grpclb automatically doing round robin when the remote balancer cannot be reached, round robin also seems like the obvious choice if grpclb cannot be started.

[EDIT: added client API case (# 4).]

@markdroth (Member) commented

I think the order you specified is right, except that it's missing one element: if the LB policy is specified via the client API, that should go between 3 and 4 (i.e., it trumps the pick_first default but does not override anything that comes from the service config).

The case you mentioned of the fallbacks being reverse proxies is the exact case I was referring to earlier that led us to think this should ideally be configurable. We're fixing this in the new design.

@dfawley (Member) commented Dec 11, 2018

I updated the comment to add the client API case. In the (experimental) Go API, our option actually takes the highest priority right now. I will look into changing that while adding support for the loadBalancingConfig field.

@dapengzhang0 dapengzhang0 modified the milestones: 1.18, 1.19 Jan 2, 2019
@carl-mastrangelo carl-mastrangelo modified the milestones: 1.19, 1.20 Feb 13, 2019
@zhangkun83 zhangkun83 modified the milestones: 1.20, 1.21 Apr 1, 2019
@ejona86 ejona86 modified the milestones: 1.21, v1.22 May 7, 2019
@creamsoup creamsoup modified the milestones: v1.22, v1.23 Jun 18, 2019
@dapengzhang0 dapengzhang0 modified the milestones: 1.23, 1.24 Jul 30, 2019
@voidzcy voidzcy modified the milestones: 1.24, 1.25 Sep 24, 2019
@zhangkun83 zhangkun83 modified the milestones: 1.25, 1.26 Oct 23, 2019
@ejona86 ejona86 modified the milestones: 1.26, Next Dec 17, 2019
@ejona86 (Member, Author) commented Mar 27, 2020

With #6637 we no longer pass grpclb addresses as normal addresses. Instead, they are passed in the attributes with the GrpclbConstants.ATTR_LB_ADDRS key. So this issue can no longer occur.
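
A small sketch of the shape after that change: balancer addresses travel in the resolution attributes instead of the ordinary address list, so non-grpclb policies never see them. The key below stands in for GrpclbConstants.ATTR_LB_ADDRS; its exact type and location are assumptions.

```java
import io.grpc.Attributes;
import io.grpc.EquivalentAddressGroup;
import java.util.List;

final class LbAddrsSketch {
  // Stand-in for GrpclbConstants.ATTR_LB_ADDRS (assumed signature).
  static final Attributes.Key<List<EquivalentAddressGroup>> ATTR_LB_ADDRS =
      Attributes.Key.create("io.grpc.grpclb.lbAddrs");

  // A resolver would attach balancer addresses as an attribute rather
  // than mixing them into the normal address list.
  static Attributes withLbAddrs(Attributes attrs,
                                List<EquivalentAddressGroup> lbAddrs) {
    return attrs.toBuilder().set(ATTR_LB_ADDRS, lbAddrs).build();
  }
}
```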

@ejona86 ejona86 closed this as completed Mar 27, 2020
@ejona86 ejona86 removed this from the Next milestone Mar 31, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 5, 2021