Rare but persistent blocked or missing stream messages on established connections #3980

Closed
ianwilkes opened this issue Oct 22, 2020 · 8 comments · Fixed by #3993

@ianwilkes

What version of gRPC are you using?

1.32.0

What version of Go are you using (go version)?

1.15.2

What operating system (Linux, Windows, …) and version?

linux/ARM

What did you do?

My servers establish long-lived connections to each other, and re-use them for concurrent streaming calls. A typical stream involves several small messages in each direction, and usually only lasts for a fraction of a second. In aggregate we're moving ~10K messages/second between ~100 hosts.

All messages are passed using the generated Send(...) and Recv() methods, finished with a CloseSend() on the client side. Generally this is near-100% reliable, except as described below. The grpc connections themselves rarely report a bad state.
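
For concreteness, the pattern looks roughly like the sketch below. The service, method, and message names (pb.Pipeline, Exchange, pb.Request) are placeholders, not my actual code.

```go
import (
	"context"
	"io"
	"time"

	pb "example.com/internal/pipelinepb" // hypothetical generated package
)

// runStream is an illustrative sketch of one streaming call: a few small
// Sends, a CloseSend, then Recv until EOF or error.
func runStream(ctx context.Context, client pb.PipelineClient) error {
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	stream, err := client.Exchange(ctx) // bidirectional streaming RPC
	if err != nil {
		return err
	}
	for i := 0; i < 3; i++ {
		if err := stream.Send(&pb.Request{}); err != nil {
			return err
		}
	}
	if err := stream.CloseSend(); err != nil {
		return err
	}
	for {
		if _, err := stream.Recv(); err != nil {
			if err == io.EOF {
				return nil // clean end of stream
			}
			return err
		}
	}
}
```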

What did you expect to see?

After Send() returns with no error, Recv() on the other side returns either a valid message or an error.

What did you see instead?

After Send() is called, Recv() sometimes blocks until context expiration, which is about a minute. The pattern of occurrence is fairly specific. Normally it doesn't happen at all and message-passing latency is low. But seemingly at random, a condition develops between a specific pair of hosts where this blocking begins to occur about 0.1% of the time. (0.1% doesn't seem like much, but the long timeouts cascade into larger systemic unreliability.) The problem is always stream-local: other streams on the same connection, concurrent with the broken one, work fine.

Once it starts, this will continue to happen for that pair of hosts until process restart (the only time, currently, when all grpc connections are rebuilt). This happens a few times per day, often after the cluster has been communicating trouble-free for several hours.

If the problem weren't sticky, I would write it off as a glitch. If it happened consistently across all connections, I would blame my own code or usage pattern, or at least have a hope of debugging it. As it stands, it appears that the problem lies internal to grpc, but I can't repro it locally so don't know how to begin debugging.

I'm attempting to work around the problem by detecting anomalous blocking times - hard to do reliably in a system with plenty of legitimate reasons to block - but if there is any debugging information I can extract when that happens, I'll try to do that.
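
A crude sketch of what I mean by detecting anomalous blocking; the threshold, logging, and type names are purely illustrative:

```go
import (
	"log"
	"time"

	pb "example.com/internal/pipelinepb" // same hypothetical package as above
)

// timedRecv wraps Recv and flags calls that block far longer than this
// workload should ever need.
func timedRecv(stream pb.Pipeline_ExchangeClient) (*pb.Reply, error) {
	start := time.Now()
	msg, err := stream.Recv()
	if elapsed := time.Since(start); elapsed > 10*time.Second {
		log.Printf("Recv blocked for %v (err=%v); possible stuck stream", elapsed, err)
	}
	return msg, err
}
```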

@menghanl menghanl self-assigned this Oct 22, 2020
@menghanl
Contributor

Thanks for reporting this issue. This does sound quite tricky to debug.

My first thought is that there could be a connection problem between these two hosts.

  • Do the hosts connect directly to each other? Or do they go through proxies?

One thing to try out is keepalive (doc, example). This should help rule out those connection problems.
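
On the client side, that looks roughly like the sketch below; the values are illustrative, so tune them for your traffic:

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive is a minimal sketch, not a recommended configuration.
func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(), // substitute your real transport credentials
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping if the connection is idle this long
			Timeout:             10 * time.Second, // drop the connection if the ping isn't acked in time
			PermitWithoutStream: true,             // send pings even when there are no active streams
		}),
	)
}
```

Note that if clients ping more often than the server's keepalive enforcement policy allows (5 minutes by default), the server will close the connection, so the server side may need a matching policy.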

For further debugging, I would suggest turning on logs for both gRPC and Go's http2 (GODEBUG=http2debug=2).
The concern is that this will generate a lot of logs; given that the issue is hard to reproduce, there might be too many to process. But they could contain valuable information.

@ianwilkes
Author

@menghanl There's no proxy; all connections are direct. I don't think this is a connection-wide issue, in part because it appears that other streams on the same connection at the same time work fine. Also, this happens mid-stream, so the client and server successfully exchange several messages (within a few ms) before one suddenly blocks.

I'll see what I can do with logging; handling the volume is one thing, finding the problem case in all those logs is another. Is there a way to map a stream in my own code to an id used by the logs?

@menghanl
Contributor

because it appears that other streams on the same connection at the same time work fine

By "the same connection" do you mean the same grpc ClientConn? Or the same TCP connection?
The ClientConn will create new connections if it detects that the old connection is broken, so it's possible that streams on the same ClientConn are actually using different underlying connections.

this happens mid-stream

This doesn't rule out the possibility that this is a connection problem, though. If the TCP connection for some reason starts to drop packets (possibly including the FIN), the receiving side will never see the connection problem until the TCP timeout (which can be 30 minutes). The purpose of keepalive is to detect this failure and recover.
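
For reference, the server-side knobs look roughly like the sketch below (values illustrative): server-initiated pings detect dead clients, and the enforcement policy controls how often clients are allowed to ping.

```go
import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServerWithKeepalive is a minimal sketch, not a recommended configuration.
func newServerWithKeepalive() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second, // ping the client if the connection is idle this long
			Timeout: 10 * time.Second, // drop the connection if the ping isn't acked in time
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             10 * time.Second, // reject clients that ping more often than this
			PermitWithoutStream: true,             // allow client pings even with no active streams
		}),
	)
}
```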

@menghanl
Contributor

Is there a way to map a stream in my own code to an id used by the logs?

Unfortunately, no. The Go http2 logs should include the HTTP/2 stream ID, but gRPC doesn't assign IDs to each stream.

@ianwilkes
Author

By "the same connection" do you mean the same grpc ClientConn? Or the same TCP connection?

I mean a single ClientConn. Didn't realize it would use multiple TCP connections under the hood.

I'll enable keepalives and report back.

@ianwilkes
Author

@menghanl I haven't seen a recurrence of this problem since enabling KeepAlive, so that seems to have done the trick. Thanks for the tip.

This is an interesting pitfall, since I never saw any errors indicating an underlying connection problem (probably because the context always expired first) and wouldn't have suspected this solution on my own. I wonder if the defaults should include a more aggressive keepalive, or if the docs might benefit from an explanation like what you wrote above. I don't see anywhere that explains why keepalive is a good idea, even on a high-traffic, high-reliability network.

@menghanl
Contributor

menghanl commented Oct 27, 2020

Great to know that it works!

We didn't want to change the default keepalive configuration because it does have a side effect (it sends pings).

Good point that the purpose of keepalive isn't explained well. It's briefly mentioned in the gRFC, but I don't think most people will look at that.

I will update the godoc and keepalive.md to explain a bit more.

@menghanl
Contributor

Sent #3993 to update the doc.
