grpcurl fails with "context deadline exceeded" after 10s if using plaintext when server expects TLS #387

Open
ucarion opened this issue Apr 20, 2023 · 7 comments

Comments

@ucarion

ucarion commented Apr 20, 2023

Bottom line up front, here's how you reproduce this issue:

$ grpcurl -version
grpcurl 1.8.6

$ time grpcurl -plaintext grpcb.in:9001 list
Failed to dial target host "grpcb.in:9001": context deadline exceeded
grpcurl -plaintext grpcb.in:9001 list  0.02s user 0.03s system 0% cpu 10.082 total

For context, grpcb.in:9001 expects TLS, so -plaintext is the mistake here. But this issue is about the fact that grpcurl hangs for 10 seconds and then fails without an informative error. I suspect that the use of grpc.WithBlock() prevents an error from bubbling up, but I assume that dial option is there for a good reason.

@jhump
Contributor

jhump commented Apr 21, 2023

I suspect that the use of grpc.WithBlock() prevents an error from bubbling up,

I doubt that. WithBlock is actually how you get any error to bubble up. Without it, you never get any feedback from Dial, because it does the actual TCP connection setup completely asynchronously and only returns an error if there is some other configuration problem with the options.

The issue here is where it fails. In grpcurl.BlockingDial, we try to control both dialing and a potential TLS handshake so that we can intercept any errors (which the underlying gRPC Go runtime library hides from the application), in order to give a decent error to the user.

The actual problem is that the connections are set up just fine: all a plaintext connection needs is the TCP connection. The other direction (a TLS client talking to a server that does not expect TLS) fails more cleanly: the TLS handshake fails, so the connection cannot be established and the error bubbles up from dialing.
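
To illustrate that second direction, here is a minimal Go sketch using only the standard library. It is not grpcurl's code. It assumes a hypothetical local server (startPlainServer) that answers with bytes that are not a TLS record, and a hypothetical helper (dialTLS) that handshakes while dialing. The point is only that the TLS client gets its error straight from the dial call:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

// startPlainServer is a hypothetical stand-in for a server that does not
// speak TLS: it reads whatever the client sends and replies in plaintext.
func startPlainServer() (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				c.SetDeadline(time.Now().Add(time.Second))
				c.Read(make([]byte, 1024)) // swallow the ClientHello
				c.Write([]byte("this is not a TLS record\n"))
			}(c)
		}
	}()
	return ln.Addr().String(), func() { ln.Close() }, nil
}

// dialTLS connects and performs the TLS handshake as part of dialing,
// so a handshake failure comes back as the dial error.
func dialTLS(addr string) error {
	d := &net.Dialer{Timeout: 2 * time.Second}
	conn, err := tls.DialWithDialer(d, "tcp", addr, &tls.Config{})
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	addr, stop, err := startPlainServer()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer stop()
	fmt.Println("TLS dial error:", dialTLS(addr))
}
```

The exact error text depends on what the server sends back, so the sketch only relies on the error being non-nil.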

So the actual error happens inside the gRPC runtime, when it tries to send the HTTP/2 preface to the server. The server is expecting a TLS handshake but doesn't receive one, so it immediately closes the connection. We provide a grpc.FailOnNonTempDialError(true) dial option in the hope that something like this would bubble up from the dial call. But apparently the server suddenly closing the connection (without any known reason) is interpreted as a temporary error. So the runtime keeps retrying, creating a new connection over and over, never getting a healthy one that can be used for sending an RPC.
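
The sequence described above can be reproduced without gRPC. The following is a rough sketch using only the Go standard library. It assumes a hypothetical local server (startHangupServer) that closes every connection as soon as it accepts it, standing in for a TLS-only server that rejects plaintext. The helper names and the timeout are made up for the illustration:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"time"
)

// http2Preface is the fixed client connection preface defined by HTTP/2.
const http2Preface = "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"

// probeTimeout bounds both the dial and the follow-up I/O (arbitrary value).
const probeTimeout = 2 * time.Second

// startHangupServer is a hypothetical stand-in for a TLS-only server facing
// a plaintext client: it accepts each TCP connection and closes it at once.
func startHangupServer() (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			c.Close()
		}
	}()
	return ln.Addr().String(), func() { ln.Close() }, nil
}

// probePlaintext dials addr, sends the HTTP/2 preface, and waits for a reply.
// The dial error and the post-dial I/O error are returned separately.
func probePlaintext(addr string) (dialErr, ioErr error) {
	conn, err := net.DialTimeout("tcp", addr, probeTimeout)
	if err != nil {
		return err, nil
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(probeTimeout))
	if _, err := conn.Write([]byte(http2Preface)); err != nil {
		return nil, err
	}
	_, err = conn.Read(make([]byte, 1))
	return nil, err
}

func main() {
	addr, stop, err := startHangupServer()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer stop()

	dialErr, ioErr := probePlaintext(addr)
	fmt.Println("dial error:", dialErr)
	fmt.Println("error after connecting:", ioErr)
	fmt.Println("clean EOF from peer:", errors.Is(ioErr, io.EOF))
}
```

The dial itself succeeds, and the failure only shows up on the first write or read afterwards. Depending on timing, it may appear as an EOF or as a connection reset.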

A fix is possible, but it isn't simple. The custom dialer in grpcurl.BlockingDial would need to wrap the returned net.Conn so that it has more visibility into connection closures. It could then, for example, fail fast if it sees repeated unexplained hang-ups from the server before the grpc.Dial call completes, and it would have some sort of error to report (likely just "connection closed by peer").
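
One possible shape for such a wrapper, as a sketch only: this is not the grpcurl implementation, and hangupTracker, trackedConn, and runDemo are hypothetical names. The dialer has the func(context.Context, string) (net.Conn, error) shape that grpc.WithContextDialer accepts. Whether the gRPC runtime would surface the returned error promptly is not verified here.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
)

// hangupTracker (hypothetical) counts connections that hit an I/O error.
type hangupTracker struct {
	mu      sync.Mutex
	count   int
	lastErr error
}

func (t *hangupTracker) record(err error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.count++
	t.lastErr = err
}

func (t *hangupTracker) snapshot() (int, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.count, t.lastErr
}

// trackedConn wraps a net.Conn and reports its first Read or Write error.
type trackedConn struct {
	net.Conn
	tracker *hangupTracker
	once    sync.Once
}

func (c *trackedConn) note(err error) {
	if err != nil {
		c.once.Do(func() { c.tracker.record(err) })
	}
}

func (c *trackedConn) Read(p []byte) (int, error) {
	n, err := c.Conn.Read(p)
	c.note(err)
	return n, err
}

func (c *trackedConn) Write(p []byte) (int, error) {
	n, err := c.Conn.Write(p)
	c.note(err)
	return n, err
}

// dialer refuses to dial again once maxHangups connections have failed.
func (t *hangupTracker) dialer(maxHangups int) func(context.Context, string) (net.Conn, error) {
	return func(ctx context.Context, addr string) (net.Conn, error) {
		if n, last := t.snapshot(); n >= maxHangups {
			return nil, fmt.Errorf("%d connections dropped by peer, last error: %w", n, last)
		}
		conn, err := (&net.Dialer{}).DialContext(ctx, "tcp", addr)
		if err != nil {
			return nil, err
		}
		return &trackedConn{Conn: conn, tracker: t}, nil
	}
}

// runDemo dials a local server that always hangs up and returns how many
// connections were opened before the dialer gave up.
func runDemo(maxHangups int) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			c.Close()
		}
	}()
	dial := (&hangupTracker{}).dialer(maxHangups)
	for opened := 0; ; opened++ {
		conn, err := dial(context.Background(), ln.Addr().String())
		if err != nil {
			return opened, err
		}
		conn.Read(make([]byte, 1)) // the peer hangs up; the wrapper records it
		conn.Close()
	}
}

func main() {
	opened, err := runDemo(3)
	fmt.Printf("opened %d connections, then: %v\n", opened, err)
}
```

A real implementation would also need to tell peer hang-ups apart from errors caused by the client closing its own connection, which this sketch counts the same way.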

@ucarion
Author

ucarion commented Apr 26, 2023

The presence of a custom dialer does make things more unusual here. In the past, I've just used the default dialer and matched against the returned error message, but I presume the custom dialer must remain as-is for other reasons.

@jhump
Contributor

jhump commented Apr 26, 2023

I've just used the default dialer and matched against the returned error message

The custom dialer is actually only here to provide decent error messages: it intercepts underlying network errors so that we can report them. The "context deadline exceeded" error comes from the grpc.Dial call itself, so matching against the returned error message wouldn't really help here. The specific issue is that the dialer is not instrumented to intercept all network errors: we miss whatever error occurs after the connection is established, when the server immediately closes it.

@ucarion
Author

ucarion commented Apr 27, 2023

Yeah, sorry, I misspoke -- in the past I've matched against the RPC call error, rather than the dial error, for this situation. Whether an error is from dialing versus calling an RPC has always been confusing to me, and I suspect it's not even something stable across grpc-go versions.

@anitgandhi

we often ran into this problem with grpc-go clients, and the newer WithReturnConnectionError dial option is a nice alternative to WithBlock and FailOnNonTempDialError, because it bubbles up the underlying connection error. combined with some other recent improvements to the grpc-go client (i believe in v1.54.x), TLS handshake errors also show up now.

@kumarniraj01

When attempting to use grpcurl to access a service deployed on an EC2 instance through a load balancer and target, using the following command: grpcurl -plaintext test.dev.xyz:9090 list, I encounter an error. The error message states: "Failed to dial target host 'test.dev.xyz:9090': context deadline exceeded."

Can anyone help me resolve this?

@hayyaun

hayyaun commented Jan 25, 2024

we often ran into this problem with grpc-go clients, and the newer WithReturnConnectionError dial option is a nice alternative to WithBlock and FailOnNonTempDialError, because it bubbles up the underlying connection error. combined with some other recent improvements to the grpc-go client (i believe in v1.54.x), TLS handshake errors also show up now.

This answer saved my day, thank you.
In my case the error was caused by an expired certificate, and I couldn't even retrieve the error properly, because WithBlock simply stops on any error.
