
Keepalive is not activated. GRPC channels do not respond within 15-20 minutes after switching the network #32435

Open · evgmoskalenko opened this issue Feb 20, 2023 · 5 comments

evgmoskalenko commented Feb 20, 2023

What version of gRPC and what language are you using?

Client: grpc-python-asyncio/1.51.1 grpc-c/29.0.0 (linux; chttp2)

  • grpcio = "1.51.1"
  • protobuf = "4.22.0"

What operating system (Linux, Windows,...) and version?

Client: Debian GNU/Linux 11 (bullseye), kernel 5.4.110-1.el7.elrepo.x86_64
Server: Windows Server 2019

What runtime / compiler are you using (e.g. python version or version of GCC)?

Client: Python 3.9.7

What did you do?

I have a gRPC client implementation in which I enable gRPC keepalive:

channel_options = [
    ('grpc.keepalive_time_ms', 5_000),           # send a keepalive ping every 5 seconds
    ('grpc.keepalive_timeout_ms', 20_000),       # wait 20 seconds for the ping ack before closing the transport
    ('grpc.http2.max_pings_without_data', 0),    # allow unlimited pings without data
    ('grpc.keepalive_permit_without_calls', 1),  # send pings even without active calls
]

channel = grpc.aio.secure_channel(
    target=f'{srv_conf.HOST}:{srv_conf.GRPC_PORT}',
    credentials=grpc.ssl_channel_credentials(
        settings.CERT.encode()
    ),
    options=channel_options,
)

What did you expect to see?

  1. Fast connection recovery after the network switch.
  2. When I run netstat -ctown, I want to see the keepalive timer status for the sockets:

netstat -c --timer

keepalive - when the keepalive timer is ON for the socket
on - when the retransmission timer is ON for the socket
off - none of the above is ON

What did you see instead?

  1. After switching the network, gRPC channels remain unresponsive for approximately 15-20 minutes.
  2. Also, when I exec into the Docker container and run netstat -ctown, I don't see the expected keepalive timer status. The actual status is on / off (but never keepalive).
  3. After enabling verbose logging:
# TROUBLESHOOTING: https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md
# * gRPC environment variables: https://github.com/grpc/grpc/blob/master/doc/environment_variables.md
GRPC_VERBOSITY: "DEBUG"
GRPC_TRACE: "http_keepalive
#GRPC_TRACE: "all,list_tracers,-timer,-timer_check"

I see logs like these:

I0220 10:48:51.480487718 23 writing.cc:178]           CLIENT: Ping sent [ipv4:<server_host>:<server_port>]: 0/0
I0220 10:48:51.480602944 23 chttp2_transport.cc:2617] ipv4:<server_host>:<server_port>: Keepalive ping cancelled. Resetting timer.
I0220 10:48:51.541533370 23 writing.cc:178]           CLIENT: Ping sent [ipv4:<server_host>:<server_port>]: 0/0
I0220 10:48:51.541648382 23 chttp2_transport.cc:2617] ipv4:<server_host>:<server_port>: Keepalive ping cancelled. Resetting timer.
I0220 10:48:51.550680086 23 writing.cc:178]           CLIENT: Ping sent [ipv4:<server_host>:<server_port>]: 0/0
I0220 10:48:51.550783571 23 chttp2_transport.cc:2617] ipv4:<server_host>:<server_port>: Keepalive ping cancelled. Resetting timer.

evgmoskalenko commented Feb 23, 2023

I also wrote another client in Golang against the same server, with the following settings for the client.

Golang client go.mod file

module ...

go 1.19

require (
	...
	google.golang.org/grpc v1.53.0
	google.golang.org/protobuf v1.28.1
)

require (
	...
	github.com/golang/protobuf v1.5.2 // indirect
	google.golang.org/genproto v0.0.0-20230216225411-c8e22ba71e44 // indirect
	...
)

Golang client code example:

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
	"google.golang.org/grpc/metadata"
	"google.golang.org/protobuf/proto"
)

var keepaliveParams = keepalive.ClientParameters{
	Time:                5 * time.Second,  // send pings every 5 seconds if there is no activity
	Timeout:             20 * time.Second, // wait 20 seconds for a ping ack before considering the connection dead
	PermitWithoutStream: true,             // send pings even without active streams
}

// tlsCredentials and options come from application code elided here.
func dial(ctx context.Context) (*grpc.ClientConn, error) {
	var dialOpts []grpc.DialOption
	dialOpts = append(
		dialOpts,
		grpc.WithTransportCredentials(tlsCredentials),
		grpc.WithKeepaliveParams(keepaliveParams),
	)

	conn, err := grpc.DialContext(ctx, options.Address, dialOpts...)
	if err != nil {
		return nil, fmt.Errorf("error connecting to GRPC server: %v", err)
	}

	...
}

With the Golang client implementation, the problem no longer reproduces after switching the network. I also see the keepalive timer using netstat -ctown:

root@:/# netstat -ctown | grep <my_grpc_port>
tcp        0      0 <ip>:56924      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (12.86/0/0)
tcp        0      0 <ip>:42904      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (13.85/0/0)
tcp        0      0 <ip>:45724      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (9.85/0/0)
tcp        0      0 <ip>:35064      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (8.86/0/0)
tcp        0      0 <ip>:39710      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (8.83/0/0)
tcp        0      0 <ip>:40580      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (3.75/0/0)
tcp        0      0 <ip>:42582      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (2.54/0/0)
tcp        0      0 <ip>:47130      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (6.82/0/0)
tcp        0      0 <ip>:57764      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (1.53/0/0)
tcp        0      0 <ip>:42150      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (11.85/0/0)
tcp        0      0 <ip>:32832      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (12.83/0/0)
tcp        0      0 <ip>:44566      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (4.89/0/0)
tcp        0      0 <ip>:52138      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (7.52/0/0)

veblush assigned gnossen and unassigned veblush on Feb 26, 2023

gnossen commented Mar 1, 2023

I don't think netstat -ctown is a sound check here. The keepalive referred to is not TCP keepalive, but HTTP/2 ping-frame-based keepalive, which the kernel's TCP timer column will never show.

After switching the network, GRPC channels do not respond within approximately 15-20 minutes.

This, however, seems worrying. I don't see any issue with the channel args you pasted above, so I'm going to take some time to see if I can reproduce.
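
To illustrate the distinction, here is a minimal sketch (not gRPC itself; assuming Linux, with a placeholder target): netstat's keepalive timer reflects the kernel-level SO_KEEPALIVE socket option, which HTTP/2 ping keepalive does not use. As an aside worth checking: Go's net.Dialer enables TCP keepalive by default (~15 second period), which would explain why the Go client's sockets above do show keepalive timers while the Python client's do not.

import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # kernel-level TCP keepalive
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # idle seconds before first probe (Linux)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)  # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # failed probes before the socket is closed
s.connect(('example.com', 443))

# While this sleeps, `netstat -ctown` shows this socket with a "keepalive"
# timer. A gRPC channel will not show that unless the transport also sets
# SO_KEEPALIVE, which HTTP/2 ping keepalive does not require.
time.sleep(60)
s.close()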


evgmoskalenko commented Mar 2, 2023

Hi, @gnossen. Thanks for your help. :-)

I saw in the gRPC core implementation that the keepalive settings are supposed to override TCP_USER_TIMEOUT (https://man7.org/linux/man-pages/man7/tcp.7.html) to avoid situations where TCP packets are not ACKed in a reasonable time. That should cover all the edge cases of channels getting blocked indefinitely (keepalive included).

From that man page (socket option TCP_USER_TIMEOUT) I understand this is a per-socket option, not a kernel-wide setting. Perhaps the option is not actually being overridden?

Increasing user timeouts allows a TCP connection to survive extended periods without end-to-end connectivity. Decreasing user timeouts allows applications to "fail fast", if so desired. Otherwise, failure may take up to 20 minutes with the current system defaults in a normal WAN environment.

My stalls also take 15 to 20 minutes, and they are resolved either by waiting or by an immediate restart of the client or server. That is why I decided to write the client in another language (Golang), to rule out the server side.
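
A minimal sketch of how TCP_USER_TIMEOUT can be set and read back on a plain socket (assuming Linux and Python >= 3.6, which exposes socket.TCP_USER_TIMEOUT; gRPC does not expose its transport socket to Python, so this demonstrates the mechanism only, not the channel's actual state):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Per gRFC A18, gRPC core is expected to set TCP_USER_TIMEOUT on its own
# sockets from the keepalive timeout (20_000 ms in the config above).
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 20_000)

print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT))  # -> 20000
s.close()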


evgmoskalenko commented Apr 5, 2023

@gnossen, we see the network switching between regions in our network metrics. I also created a Python service that just sends ping-pong gRPC requests between client and server and records each request (latency and error rate) as Prometheus metrics. After switching networks, I see losses and errors from the Python service for a very long time (15-20 minutes).

I think there is definitely a problem here. When I do the same thing in Golang with keepalive, everything is fine: I see the network switch in the network metrics, only a small number of errors in the Golang service, and very fast connection recovery.
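
For reference, a minimal sketch of such a probe (assuming the server exposes the standard grpc.health.v1 service as a stand-in for the actual ping-pong RPC, and that grpcio-health-checking is installed; the target and interval are placeholders, and the Prometheus instrumentation is omitted):

import asyncio
import time

import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

CHANNEL_OPTIONS = [
    ('grpc.keepalive_time_ms', 5_000),
    ('grpc.keepalive_timeout_ms', 20_000),
    ('grpc.http2.max_pings_without_data', 0),
    ('grpc.keepalive_permit_without_calls', 1),
]

async def probe(target: str) -> None:
    async with grpc.aio.insecure_channel(target, options=CHANNEL_OPTIONS) as channel:
        stub = health_pb2_grpc.HealthStub(channel)
        while True:
            start = time.monotonic()
            try:
                await stub.Check(health_pb2.HealthCheckRequest(), timeout=5)
                print(f'ok  latency={time.monotonic() - start:.3f}s')
            except grpc.aio.AioRpcError as rpc_error:
                print(f'err {rpc_error.code()} after {time.monotonic() - start:.3f}s')
            await asyncio.sleep(1)

asyncio.run(probe('localhost:50051'))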

@BatmanAoD

Is it possible that this is related to this old issue caused by bad TCP keepalive settings? In particular, note this comment:

The client doesn't actually hang indefinitely as originally reported, but it will roughly take 15-20 minutes for it to time out.

The solution that was implemented tied the TCP timeout to the gRPC keepalive settings: gRFC A18. I'm wondering whether the Python gRPC library isn't correctly configuring the timeout options; I've observed separately that keepalive_time_ms seems to be ignored.
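
One way to test that in isolation (a minimal sketch; the target is a placeholder): gRPC core reads GRPC_VERBOSITY and GRPC_TRACE at startup, so enabling the http_keepalive tracer before importing grpc and holding an otherwise idle channel open should print a "Ping sent" trace line roughly every keepalive_time_ms if the option is honored.

import os
os.environ['GRPC_VERBOSITY'] = 'DEBUG'
os.environ['GRPC_TRACE'] = 'http_keepalive'

import time
import grpc

channel = grpc.insecure_channel('localhost:50051', options=[
    ('grpc.keepalive_time_ms', 5_000),
    ('grpc.keepalive_timeout_ms', 20_000),
    ('grpc.keepalive_permit_without_calls', 1),
])
# Force the channel to connect; keepalive pings only run on a live transport.
grpc.channel_ready_future(channel).result(timeout=10)
time.sleep(30)  # expect roughly six "Ping sent" trace lines if keepalive is active
channel.close()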
