
ALTS: GRPCLB LoadBalancer had an error #7643

Closed
veblush opened this issue Nov 19, 2020 · 20 comments · Fixed by #7712
Comments

@veblush
Contributor

veblush commented Nov 19, 2020

When running the GCS benchmark over DirectPath with gRPC 1.34.0-SNAPSHOT, it failed with the error below. It had been working with gRPC 1.33.1.

What version of gRPC-Java are you using?

gRPC 1.34.0-SNAPSHOT

What is your environment?

Linux Debian/10

What did you expect to see?

Connecting to the backend using ALTS.

What did you see instead?

GRPCLB Error

Exception in thread "main" io.grpc.StatusRuntimeException: UNAUTHENTICATED: Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.
Stream to GRPCLB LoadBalancer had an error
        at io.grpc.Status.asRuntimeException(Status.java:533)
        at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:648)
        at io.grpc.gcs.GrpcClient.makeMediaRequest(GrpcClient.java:179)
        at io.grpc.gcs.GrpcClient.startCalls(GrpcClient.java:106)
        at io.grpc.gcs.TestMain.main(TestMain.java:52)

Steps to reproduce the bug

java -jar $HOME/grpc-gcp-java/end2end-test-examples/gcs/libs_1.34.0/gcs-1.0-SNAPSHOT.jar --client=grpc --bkt=$GCS_BUCKET --obj=$GCS_OBJECT --method=read --dp=true --calls=1
@veblush
Contributor Author

veblush commented Nov 19, 2020

1.34.0_error.txt is a debug log of gRPC.

@sanjaypujare
Contributor

There is an earlier error in the log. Is that related, or is that also seen with 1.33.1?

Nov 19, 2020 2:25:10 AM io.grpc.grpclb.GrpclbNameResolver resolveBalancerAddresses
FINE: Balancer resolution failure
javax.naming.OperationNotSupportedException: DNS service refused [response code 5]; remaining name '_grpclb._tcp.metadata.google.internal.'
	at com.sun.jndi.dns.DnsClient.checkResponseCode(DnsClient.java:663)
	at com.sun.jndi.dns.DnsClient.isMatchResponse(DnsClient.java:578)
	at com.sun.jndi.dns.DnsClient.doUdpQuery(DnsClient.java:426)
	at com.sun.jndi.dns.DnsClient.query(DnsClient.java:211)
	at com.sun.jndi.dns.Resolver.query(Resolver.java:81)
	at com.sun.jndi.dns.DnsContext.c_getAttributes(DnsContext.java:434)
	at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_getAttributes(ComponentDirContext.java:235)
	at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.getAttributes(PartialCompositeDirContext.java:141)
	at com.sun.jndi.toolkit.url.GenericURLDirContext.getAttributes(GenericURLDirContext.java:103)
	at javax.naming.directory.InitialDirContext.getAttributes(InitialDirContext.java:142)
	at io.grpc.internal.JndiResourceResolverFactory$JndiRecordFetcher.getAllRecords(JndiResourceResolverFactory.java:216)
	at io.grpc.internal.JndiResourceResolverFactory$JndiResourceResolver.resolveSrv(JndiResourceResolverFactory.java:133)
	at io.grpc.grpclb.GrpclbNameResolver.resolveBalancerAddresses(GrpclbNameResolver.java:79)
	at io.grpc.grpclb.GrpclbNameResolver.doResolve(GrpclbNameResolver.java:62)
	at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@sanjaypujare
Contributor

@dapengzhang0 any idea what changes in grpclb could have caused this?

@jiangtaoli2016
Contributor

@apolcyn
It may be related to the identity-bound token enforcement by gRPC LB.
Does the problem occur only on a particular version of gRPC?

@apolcyn
Contributor

apolcyn commented Nov 19, 2020

I think the '_grpclb._tcp.metadata.google.internal.' query failure is expected; it happens because we have SRV queries enabled (so the inner ALTS channel will also make the SRV query), but metadata.google.internal. isn't a load-balanced name, so that query is expected to fail.

+1 that this seems likely related to ID bound token. Can we check what the target name (https://github.com/grpc/grpc/blob/e6e6be4b0b12562e38083181aee27438d8258b5a/src/proto/grpc/gcp/handshaker.proto#L104) is that is passed to the ALTS handshake service?
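For reference, here is a minimal sketch of the kind of JNDI SRV lookup the grpclb resolver performs (it mirrors the stack trace above; the query name comes from the log, everything else is illustrative):

import java.util.Hashtable;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public final class SrvLookupSketch {
  public static void main(String[] args) throws Exception {
    // The JDK's built-in DNS JNDI provider, the same one visible in the stack trace above.
    Hashtable<String, String> env = new Hashtable<>();
    env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
    DirContext ctx = new InitialDirContext(env);

    // metadata.google.internal. is not a load-balanced name, so a
    // "DNS service refused" failure here is expected, as noted above.
    Attributes attrs = ctx.getAttributes(
        "dns:///_grpclb._tcp.metadata.google.internal.", new String[] {"SRV"});
    System.out.println(attrs);
  }
}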

@veblush
Contributor Author

veblush commented Nov 19, 2020

I haven't seen this issue with gRPC Java prior to 1.34. I just checked that gRPC 1.33.1 doesn't have this issue. So commits after v1.33.1 most likely caused this.

@jiangtaoli2016
Contributor

I don't think the ID-bound token rollout in grpclb touches gRPC code; it is all backend enablement. Let me check with Jianing, who is responsible for ID-bound tokens.

@apolcyn
Contributor

apolcyn commented Nov 19, 2020

Although the ID-bound token rollout itself doesn't touch gRPC code, if I'm correct it does newly require the client to properly pass ALTS target names and RPC authority headers on the LB channel, and that client-side behavior might have changed.

@jiangtaoli2016
Contributor

Confirmed with Jianing: the rollout only changes the ESF config, and the fallback code is still there.

I haven't seen this issue with gRPC Java prior to 1.34. I just checked that gRPC 1.33.1 doesn't have this issue. So commits after v1.33.1 most likely caused this.

Based on this, it looks like some change in the gRPC code caused the failure.

@dapengzhang0
Member

dapengzhang0 commented Nov 23, 2020

There wasn't much change to grpclb/alts between 1.33.1 and 1.34. I suspect #7613 or #7589 (both added in 1.34) is causing the problem.

@ejona86 ejona86 added the TODO:release blocker label Dec 7, 2020
@ejona86
Member

ejona86 commented Dec 7, 2020

This should have been a release blocker for v1.34 since it seems to be a serious regression.

gRPCLB doesn't provide any call credentials to the server. If the server now requires it, that would definitely break gRPC. But why would older versions succeed? That's the important question.

Looking at the logs, it appears that the gRPCLB request includes a JWT. That's surprising. I wonder if the swap to ChannelCredentials somehow added it.
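For illustration, a minimal sketch of how the new ChannelCredentials API attaches CallCredentials to the channel itself (this assumes a plain TLS plus application-default-credentials combination; the GCS benchmark may build its credentials differently):

import com.google.auth.oauth2.GoogleCredentials;
import io.grpc.CallCredentials;
import io.grpc.ChannelCredentials;
import io.grpc.CompositeChannelCredentials;
import io.grpc.Grpc;
import io.grpc.ManagedChannel;
import io.grpc.TlsChannelCredentials;
import io.grpc.auth.MoreCallCredentials;

public final class ChannelCredsSketch {
  public static void main(String[] args) throws Exception {
    // Per-RPC OAuth credentials (the kind of token showing up in the logs).
    CallCredentials callCreds =
        MoreCallCredentials.from(GoogleCredentials.getApplicationDefault());

    // CompositeChannelCredentials carries the CallCredentials on the channel
    // credentials themselves, so the channel applies them to the RPCs it
    // creates. If the same credentials-applying transport factory were also
    // reused for the grpclb OOB channel, the token would ride along on the
    // LB stream as well.
    ChannelCredentials creds =
        CompositeChannelCredentials.create(TlsChannelCredentials.create(), callCreds);
    ManagedChannel channel =
        Grpc.newChannelBuilder("storage.googleapis.com:443", creds).build();
    channel.shutdownNow();
  }
}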

@ejona86 ejona86 added the bug label Dec 7, 2020
@ejona86
Member

ejona86 commented Dec 7, 2020

This is a regression caused by ChannelCredentials. ManagedChannelImpl injects the call credentials into the existing CallCredentialsApplyingTransportFactory, which is used by both normal and OOB connections.
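For context, a rough sketch (not grpclb's actual code; the balancer authority here is made up) of how a LoadBalancer obtains an out-of-band channel through the Helper API; this is the kind of channel that was picking up the channel-level call credentials:

import io.grpc.EquivalentAddressGroup;
import io.grpc.LoadBalancer;
import io.grpc.ManagedChannel;

final class OobChannelSketch {
  static ManagedChannel connectToBalancer(
      LoadBalancer.Helper helper, EquivalentAddressGroup balancerAddresses) {
    // The OOB channel targets the balancer, whose authority differs from the
    // backends', so the parent channel's CallCredentials should not apply here.
    return helper.createOobChannel(balancerAddresses, "hypothetical-lb.example.com");
  }
}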

@ejona86 ejona86 added this to the 1.35 milestone Dec 7, 2020
@ejona86
Member

ejona86 commented Dec 7, 2020

How was this not detected by interop tests?

FYI @apolcyn, 1.34 may be broken for DP.

@veblush, is the service you're using available only via DP? If not, then it isn't clear why this fails; gRPCLB should fall back to GFE when the LB server can't be reached.

@ejona86
Member

ejona86 commented Dec 8, 2020

@veblush, can you try testing with this patch and see if it resolves the issue?

diff --git a/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java b/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java
index 2edb32fa2..97088059a 100644
--- a/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java
+++ b/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java
@@ -153,6 +153,7 @@ final class ManagedChannelImpl extends ManagedChannel implements
   private final NameResolver.Args nameResolverArgs;
   private final AutoConfiguredLoadBalancerFactory loadBalancerFactory;
   private final ClientTransportFactory transportFactory;
+  private final ClientTransportFactory oobTransportFactory;
   private final RestrictedScheduledExecutor scheduledExecutor;
   private final Executor executor;
   private final ObjectPool<? extends Executor> executorPool;
@@ -593,6 +594,8 @@ final class ManagedChannelImpl extends ManagedChannel implements
     this.executor = checkNotNull(executorPool.getObject(), "executor");
     this.transportFactory = new CallCredentialsApplyingTransportFactory(
         clientTransportFactory, builder.callCredentials, this.executor);
+    this.oobTransportFactory = new CallCredentialsApplyingTransportFactory(
+        clientTransportFactory, null, this.executor);
     this.scheduledExecutor =
         new RestrictedScheduledExecutor(transportFactory.getScheduledExecutorService());
     maxTraceEvents = builder.maxTraceEvents;
@@ -1517,7 +1520,7 @@ final class ManagedChannelImpl extends ManagedChannel implements
 
       final InternalSubchannel internalSubchannel = new InternalSubchannel(
           Collections.singletonList(addressGroup),
-          authority, userAgent, backoffPolicyProvider, transportFactory,
+          authority, userAgent, backoffPolicyProvider, oobTransportFactory,
           transportFactory.getScheduledExecutorService(), stopwatchSupplier, syncContext,
           // All callback methods are run from syncContext
           new ManagedOobChannelCallback(),

@apolcyn
Contributor

apolcyn commented Dec 8, 2020

How was this not detected by interop tests?

Created b/175066772 to follow up on this internally.

@ejona86
Member

ejona86 commented Dec 9, 2020

I spoke with @veblush, and the service was not DP-only, so fallback to CFE should have occurred.

I think if RPCs had continued, the fallback may have taken place. The initial RPCs seem to fail because the picker is immediately failed when the LB RPC fails and there are no backends. Changing the logic to try fallback backends first should be possible, but it also seems it will be hard to define/implement all the edge cases.

@apolcyn, does it seem appropriate that some RPCs failed, assuming later ones would have succeeded? It seems like we may need to change this error behavior.

@veblush
Contributor Author

veblush commented Dec 9, 2020

This is interesting: I cannot reproduce this anymore on the same machine. Luckily, I was able to run the exact same 1.34.0-SNAPSHOT binary that was used for the error case above, and it somehow ran successfully over DirectPath.

@veblush
Contributor Author

veblush commented Dec 9, 2020

1.34.0.log: success log from the binary that previously failed.
1.34.1.log: success log from the binary with Eric's patch.

@veblush
Contributor Author

veblush commented Dec 9, 2020

This might be caused by a combination of a server issue and a client issue. Since I cannot reproduce it anymore with the same client, whatever made the client fail to reach the GCS backend via DirectPath appears to have been addressed.

@ejona86
Member

ejona86 commented Dec 9, 2020

@veblush, the v1.34.1 log still includes the authorization header in the gRPCLB request. So either there was a mistake when testing or my patch doesn't address the problem.

ejona86 added a commit to ejona86/grpc-java that referenced this issue Dec 9, 2020
This also fixes a bug where resolving OOB channels would have CallCreds
duplicated; that wasn't noticed or important because we don't use
CallCreds in OOB channels.

Fixes grpc#7643
ejona86 added a commit to ejona86/grpc-java that referenced this issue Dec 9, 2020
The addition of CompositeChannelCredentials allowed CallCredentials to
be passed to the ManagedChannel itself. But the implementation was buggy
and used the call creds for out-of-band channels as well, which is
inappropriate since they have a different authority.

This also fixes a bug where resolving OOB channels would have CallCreds
duplicated; that wasn't noticed or important because we don't use
CallCreds in OOB channels.

Fixes grpc#7643
ejona86 added a commit that referenced this issue Dec 10, 2020
The addition of CompositeChannelCredentials allowed CallCredentials to
be passed to the ManagedChannel itself. But the implementation was buggy
and used the call creds for out-of-band channels as well, which is
inappropriate since they have a different authority.

This also fixes a bug where resolving OOB channels would have CallCreds
duplicated; that wasn't noticed or important because we don't use
CallCreds in OOB channels.

Fixes #7643
@ejona86 ejona86 removed the TODO:release blocker label Dec 10, 2020
ejona86 added a commit to ejona86/grpc-java that referenced this issue Dec 10, 2020
The addition of CompositeChannelCredentials allowed CallCredentials to
be passed to the ManagedChannel itself. But the implementation was buggy
and used the call creds for out-of-band channels as well, which is
inappropriate since they have a different authority.

This also fixes a bug where resolving OOB channels would have CallCreds
duplicated; that wasn't noticed or important because we don't use
CallCreds in OOB channels.

Fixes grpc#7643
ejona86 added a commit that referenced this issue Dec 10, 2020
The addition of CompositeChannelCredentials allowed CallCredentials to
be passed to the ManagedChannel itself. But the implementation was buggy
and used the call creds for out-of-band channels as well, which is
inappropriate since they have a different authority.

This also fixes a bug where resolving OOB channels would have CallCreds
duplicated; that wasn't noticed or important because we don't use
CallCreds in OOB channels.

Fixes #7643
dfawley pushed a commit to dfawley/grpc-java that referenced this issue Jan 15, 2021
The addition of CompositeChannelCredentials allowed CallCredentials to
be passed to the ManagedChannel itself. But the implementation was buggy
and used the call creds for out-of-band channels as well, which is
inappropriate since they have a different authority.

This also fixes a bug where resolving OOB channels would have CallCreds
duplicated; that wasn't noticed or important because we don't use
CallCreds in OOB channels.

Fixes grpc#7643
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 3, 2021