Rare but persistent blocked or missing stream messages on established connections #3980

Closed
ianwilkes opened this issue Oct 22, 2020 · 8 comments · Fixed by #3993

@ianwilkes

What version of gRPC are you using?

1.32.0

What version of Go are you using (go version)?

1.15.2

What operating system (Linux, Windows, …) and version?

linux/ARM

What did you do?

My servers establish long-lived connections to each other, and re-use them for concurrent streaming calls. A typical stream involves several small messages in each direction, and usually only lasts for a fraction of a second. In aggregate we're moving ~10K messages/second between ~100 hosts.

All messages are passed using the generated Send(...) and Recv() methods, finished with a CloseSend() on the client side. Generally this is near-100% reliable, except as described below. The grpc connections themselves rarely report a bad state.
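
For concreteness, the pattern looks roughly like the sketch below. The service, method, and message names (pb.Pipeline, Exchange, pb.Request) are placeholders, not my actual code.

```go
import (
	"context"
	"io"
	"time"

	pb "example.com/internal/pipelinepb" // hypothetical generated package
)

// runStream is an illustrative sketch of one streaming call: a few small
// Sends, a CloseSend, then Recv until EOF or error.
func runStream(ctx context.Context, client pb.PipelineClient) error {
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	stream, err := client.Exchange(ctx) // bidirectional streaming RPC
	if err != nil {
		return err
	}
	for i := 0; i < 3; i++ {
		if err := stream.Send(&pb.Request{}); err != nil {
			return err
		}
	}
	if err := stream.CloseSend(); err != nil {
		return err
	}
	for {
		if _, err := stream.Recv(); err != nil {
			if err == io.EOF {
				return nil // clean end of stream
			}
			return err
		}
	}
}
```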

What did you expect to see?

After Send() returns with no error, Recv() on the other side returns either a valid message or an error.

What did you see instead?

After Send() is called, Recv() sometimes blocks until context expiration, which is about a minute. The pattern of occurrence is fairly specific. Normally it doesn't happen at all and message-passing latency is low. But seemingly at random, a condition develops between a specific pair of hosts where this blocking begins to occur about 0.1% of the time. (0.1% doesn't seem like much, but the long timeouts cascade into larger systemic unreliability.) The problem is always stream-local: other streams on the same connection, concurrent with the broken one, work fine.

Once it starts, this will continue to happen for that pair of hosts until process restart (the only time, currently, when all grpc connections are rebuilt). This happens a few times per day, often after the cluster has been communicating trouble-free for several hours.

If the problem weren't sticky, I would write it off as a glitch. If it happened consistently across all connections, I would blame my own code or usage pattern, or at least have a hope of debugging it. As it stands, it appears that the problem lies internal to grpc, but I can't repro it locally so don't know how to begin debugging.

I'm attempting to work around the problem by detecting anomalous blocking times - hard to do reliably in a system with plenty of legitimate reasons to block - but if there is any debugging information I can extract when that happens, I'll try to do that.
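
A crude sketch of what I mean by detecting anomalous blocking; the threshold, logging, and type names are purely illustrative:

```go
import (
	"log"
	"time"

	pb "example.com/internal/pipelinepb" // same hypothetical package as above
)

// timedRecv wraps Recv and flags calls that block far longer than this
// workload should ever need.
func timedRecv(stream pb.Pipeline_ExchangeClient) (*pb.Reply, error) {
	start := time.Now()
	msg, err := stream.Recv()
	if elapsed := time.Since(start); elapsed > 10*time.Second {
		log.Printf("Recv blocked for %v (err=%v); possible stuck stream", elapsed, err)
	}
	return msg, err
}
```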

@menghanl menghanl self-assigned this Oct 22, 2020
@menghanl
Contributor

Thanks for reporting this issue. This does sound quite tricky to debug.

My first thought is that there could be a connection problem between these two hosts.

  • Do the hosts connect directly to each other? Or do they go through proxies?

One thing to try out is keepalive (doc, example). This should help rule out those connection problems.
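
On the client side, that looks roughly like the sketch below; the values are illustrative, so tune them for your traffic:

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive is a minimal sketch, not a recommended configuration.
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(), // substitute your real transport credentials
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping if the connection is idle this long
			Timeout:             10 * time.Second, // drop the connection if the ping isn't acked in time
			PermitWithoutStream: true,             // send pings even when there are no active streams
		}),
	)
}
```

Note that if clients ping more often than the server's keepalive enforcement policy allows (5 minutes by default), the server will close the connection, so the server side may need a matching policy.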

For further debugging, I would suggest turning on logs for both gRPC and Go's http2 (GODEBUG=http2debug=2).
The concern is that this will generate a lot of logs; given that the issue is hard to reproduce, there might be too many to process. But they could contain valuable information.

@ianwilkes
Author

@menghanl There's no proxy; all connections are direct. I don't think this is a connection-wide issue, in part because it appears that other streams on the same connection at the same time work fine. Also, this happens mid-stream, so the client and server successfully exchange several messages (within a few ms) before one suddenly blocks.

I'll see what I can do with logging; handling the volume is one thing, finding the problem case in all those logs is another. Is there a way to map a stream in my own code to an id used by the logs?

@menghanl
Contributor

because it appears that other streams on the same connection at the same time work fine

By "the same connection" do you mean the same grpc ClientConn? Or the same TCP connection?
The ClientConn will create new connections if it detects that the old connection is broken, so it's possible that streams on the same ClientConn are actually using different underlying connections.

this happens mid-stream

This doesn't rule out the possibility that this is a connection problem, though. If the TCP connection for some reason starts to drop packets (possibly including the FIN), the receiving side will never see the connection problem until the TCP timeout (which can be 30 minutes). The purpose of keepalive is to detect this failure and recover.
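
For reference, the server-side knobs look roughly like the sketch below (values illustrative): server-initiated pings detect dead clients, and the enforcement policy controls how often clients are allowed to ping.

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServerWithKeepalive is a minimal sketch, not a recommended configuration.
func newServerWithKeepalive() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second, // ping the client if the connection is idle this long
			Timeout: 10 * time.Second, // drop the connection if the ping isn't acked in time
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             10 * time.Second, // reject clients that ping more often than this
			PermitWithoutStream: true,             // allow client pings even with no active streams
		}),
	)
}
```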

@menghanl
Contributor

Is there a way to map a stream in my own code to an id used by the logs?

Unfortunately, no. The Go http2 logs should include the HTTP/2 stream ID, but gRPC doesn't assign IDs to each stream.

@ianwilkes
Author

By "the same connection" do you mean the same grpc ClientConn? Or the same TCP connection?

I mean a single ClientConn. Didn't realize it would use multiple TCP connections under the hood.

I'll enable keepalives and report back.

@ianwilkes
Author

@menghanl I haven't seen a recurrence of this problem since enabling KeepAlive, so that seems to have done the trick. Thanks for the tip.

This is an interesting pitfall, since I never saw any errors indicating an underlying connection problem (probably because the context always expired first) and wouldn't have suspected this solution on my own. I wonder if the defaults should include a more aggressive keepalive, or if the docs might benefit from an explanation like what you wrote above. I don't see anywhere that explains why keepalive is a good idea, even on a high-traffic, high-reliability network.

@menghanl
Contributor

menghanl commented Oct 27, 2020

Great to know that it works!

We didn't want to change the default keepalive configuration because it does have a side effect (it sends pings).

Good point that the purpose of keepalive isn't explained well. It's briefly mentioned in the gRFC, but I don't think most people will look at that.

I will update the godoc and keepalive.md to explain a bit more.

@menghanl
Contributor

Sent #3993 to update the doc.
