Keepalive is not activated. GRPC channels do not respond within 15-20 minutes after switching the network #32435
Comments
Also, I wrote another client in Golang for the same server, with the following settings for the client. Golang client go.mod file:
Golang client code example:
With the Golang client implementation, the problem no longer reproduces after switching the network. I also see keepalive using the command -
I don't think
This, however, seems worrying. I don't see any issue with the channel args you pasted above, so I'm going to take some time to see if I can reproduce.
Hi, @gnossen. Thanks for your help. :-) I saw in the gRPC core implementation that the gRPC settings override TCP_USER_TIMEOUT (https://man7.org/linux/man-pages/man7/tcp.7.html) to avoid cases where TCP packets are not ACKed in a reasonable time. That should resolve all the edge cases where channels get blocked indefinitely (and keepalive too). From this doc - https://man7.org/linux/man-pages/man7/tcp.7.html (socket option TCP_USER_TIMEOUT) - I understand that it is a socket option (not part of the Linux kernel). Perhaps the option is not being overridden?
My waits also take 15 to 20 minutes. The connection recovers either after waiting or after an instant restart of the client or server. Therefore, I decided to write the client in another language (Golang) and thereby rule out the server side.
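For readers following the TCP_USER_TIMEOUT discussion, here is a minimal Python sketch of what setting that option on a plain socket looks like. Per the tcp(7) page linked above, it is a per-socket option handled by the Linux TCP stack, with a value in milliseconds. The helper name `set_tcp_user_timeout` and the 20000 ms value are illustrative assumptions; this is not how gRPC core applies the option internally.

```python
import socket


def set_tcp_user_timeout(sock, timeout_ms):
    """Set TCP_USER_TIMEOUT on a socket and return the value read back.

    Hypothetical helper for illustration. Returns None when the platform
    does not expose the option (it is Linux-specific) or rejects it.
    """
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        return None
    try:
        # Maximum time (ms) that transmitted data may stay unacknowledged
        # before the kernel forcibly closes the connection.
        sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
        return sock.getsockopt(socket.IPPROTO_TCP, opt)
    except OSError:
        return None


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # 20000 ms is an arbitrary example value, not a recommendation.
        print(set_tcp_user_timeout(s, 20_000))
```

Without such a timeout, a connection whose peer silently disappears can stay open until the kernel's retransmission limit is reached, which may be one explanation for waits in the range of many minutes.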
@gnossen, we see the network switching between regions in the network metrics. Also, I created a Python service that just sends ping-pong gRPC requests (between client and servers) and stores the latency and error rate of each request in Prometheus metrics. I see losses and Python service errors for a very long time (15-20 minutes) after switching networks. I think there is definitely a problem here. When I do the same thing in Golang with keepalive, everything is fine: I see the network switch in the network metrics, a small number of errors in the Golang service, and very fast connection recovery.
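The ping-pong probe described above could be sketched roughly as follows. This is not the service from the comment: `probe` and the `ping` callable are hypothetical names, and a plain dict stands in for the Prometheus metrics so that the example has no dependencies.

```python
import time


def probe(ping, attempts, stats=None):
    """Call ping() repeatedly, recording successes, errors and latencies.

    `ping` is a hypothetical zero-argument callable that performs one
    request and raises on failure (for example, a unary gRPC call).
    """
    if stats is None:
        stats = {"ok": 0, "errors": 0, "latencies_s": []}
    for _ in range(attempts):
        start = time.monotonic()
        try:
            ping()
        except Exception:
            # Count every failed request; an error-rate metric would be
            # derived from this counter.
            stats["errors"] += 1
        else:
            stats["ok"] += 1
            stats["latencies_s"].append(time.monotonic() - start)
    return stats
```

In a real service, the counters and latencies would be exported (for example as a Prometheus counter and histogram) so that the recovery time after a network switch shows up on a graph.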
Is it possible that this is related to this old issue caused by bad TCP keepalive settings? In particular, note this comment:
The solution that was implemented tied the TCP-timeout to the grpc-keepalive settings: gRFC A18. I'm wondering if possibly the Python gRPC library isn't correctly configuring timeout options; I've observed separately that
What version of gRPC and what language are you using?
Client: grpc-python-asyncio/1.51.1 grpc-c/29.0.0 (linux; chttp2)
What operating system (Linux, Windows,...) and version?
Client: Debian GNU/Linux 11 (bullseye). 5.4.110-1.el7.elrepo.x86_64
Server: Windows Server 2019
What runtime / compiler are you using (e.g. python version or version of GCC)
Client: Python 3.9.7
What did you do?
I have my gRPC client implementation in which I enable gRPC Keepalive
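The client snippet is not preserved in this copy of the issue. As a rough illustration only, gRPC keepalive (HTTP/2 PING based) is typically enabled in Python through channel arguments like the ones below. The helper names and the numeric values are assumptions for the sketch, not the reporter's actual settings.

```python
def keepalive_channel_options(time_ms=10_000, timeout_ms=5_000):
    """Build channel arguments that enable gRPC keepalive pings.

    Hypothetical helper; the default values are illustrative only.
    """
    return [
        # Send a keepalive PING after this much time without activity.
        ("grpc.keepalive_time_ms", time_ms),
        # Consider the connection dead if the PING is not acknowledged in time.
        ("grpc.keepalive_timeout_ms", timeout_ms),
        # Allow keepalive pings even when there are no active calls.
        ("grpc.keepalive_permit_without_calls", 1),
        # Do not cap the number of pings sent without data frames.
        ("grpc.http2.max_pings_without_data", 0),
    ]


def make_channel(target):
    """Create an asyncio channel with keepalive enabled (requires grpcio)."""
    import grpc  # imported here so the options helper works without grpcio

    return grpc.aio.insecure_channel(target, options=keepalive_channel_options())
```

Usage would look like `channel = make_channel("host:port")`, with the target address supplied by the caller. The server must also permit pings at the chosen interval, otherwise it may close the connection for pinging too often.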
What did you expect to see?
I ran netstat -ctown and expected to see the keepalive timer status.
What did you see instead?
With netstat -ctown, I don't see the expected keepalive status. The actual timer status is on / off (but not keepalive).
I see such logs: