
Keepalive is not activated. GRPC channels do not respond within 15-20 minutes after switching the network #32435

Open · evgmoskalenko opened this issue Feb 20, 2023 · 5 comments

evgmoskalenko commented Feb 20, 2023

What version of gRPC and what language are you using?

Client: grpc-python-asyncio/1.51.1 grpc-c/29.0.0 (linux; chttp2)

  • grpcio = "1.51.1"
  • protobuf = "4.22.0"

What operating system (Linux, Windows,...) and version?

Client: Debian GNU/Linux 11 (bullseye), kernel 5.4.110-1.el7.elrepo.x86_64
Server: Windows Server 2019

What runtime / compiler are you using (e.g. python version or version of GCC)?

Client: Python 3.9.7

What did you do?

I have a gRPC client implementation in which I enable gRPC keepalive:

channel_options = [
    ('grpc.keepalive_time_ms', 5_000),           # send a keepalive ping every 5 seconds
    ('grpc.keepalive_timeout_ms', 20_000),       # wait 20 seconds for the ping ack before closing the transport
    ('grpc.http2.max_pings_without_data', 0),    # allow unlimited pings without data
    ('grpc.keepalive_permit_without_calls', 1),  # send pings even without active calls
]

channel = grpc.aio.secure_channel(
    target=f'{srv_conf.HOST}:{srv_conf.GRPC_PORT}',
    credentials=grpc.ssl_channel_credentials(
        settings.CERT.encode()
    ),
    options=channel_options,
)

What did you expect to see?

  1. Fast connection recovery after the network switch.
  2. When I run netstat -ctown, I want to see the keepalive timer status for the sockets:

netstat -c --timer

keepalive - when the keepalive timer is ON for the socket
on - when the retransmission timer is ON for the socket
off - none of the above is ON

What did you see instead?

  1. After switching the network, gRPC channels remain unresponsive for approximately 15-20 minutes.
  2. Also, when I exec into the Docker container and run netstat -ctown, I don't see the expected keepalive timer status. The actual status is on / off (but never keepalive).
  3. After enabling verbose logging:
# TROUBLESHOOTING: https://github.com/grpc/grpc/blob/master/TROUBLESHOOTING.md
# * gRPC environment variables: https://github.com/grpc/grpc/blob/master/doc/environment_variables.md
GRPC_VERBOSITY: "DEBUG"
GRPC_TRACE: "http_keepalive
#GRPC_TRACE: "all,list_tracers,-timer,-timer_check"

I see logs like these:

I0220 10:48:51.480487718 23 writing.cc:178]           CLIENT: Ping sent [ipv4:<server_host>:<server_port>]: 0/0
I0220 10:48:51.480602944 23 chttp2_transport.cc:2617] ipv4:<server_host>:<server_port>: Keepalive ping cancelled. Resetting timer.
I0220 10:48:51.541533370 23 writing.cc:178]           CLIENT: Ping sent [ipv4:<server_host>:<server_port>]: 0/0
I0220 10:48:51.541648382 23 chttp2_transport.cc:2617] ipv4:<server_host>:<server_port>: Keepalive ping cancelled. Resetting timer.
I0220 10:48:51.550680086 23 writing.cc:178]           CLIENT: Ping sent [ipv4:<server_host>:<server_port>]: 0/0
I0220 10:48:51.550783571 23 chttp2_transport.cc:2617] ipv4:<server_host>:<server_port>: Keepalive ping cancelled. Resetting timer.

evgmoskalenko commented Feb 23, 2023

I also wrote another client in Golang against the same server, with the following settings for the client.

Golang client go.mod file

module ...

go 1.19

require (
	...
	google.golang.org/grpc v1.53.0
	google.golang.org/protobuf v1.28.1
)

require (
	...
	github.com/golang/protobuf v1.5.2 // indirect
	google.golang.org/genproto v0.0.0-20230216225411-c8e22ba71e44 // indirect
	...
)

Golang client code example:

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
	"google.golang.org/grpc/metadata"
	"google.golang.org/protobuf/proto"
)

var keepaliveParams = keepalive.ClientParameters{
	Time:                5 * time.Second,  // send pings every 5 seconds if there is no activity
	Timeout:             20 * time.Second, // wait 20 seconds for a ping ack before considering the connection dead
	PermitWithoutStream: true,             // send pings even without active streams
}

// tlsCredentials and options come from application code elided here.
func dial(ctx context.Context) (*grpc.ClientConn, error) {
	var dialOpts []grpc.DialOption
	dialOpts = append(
		dialOpts,
		grpc.WithTransportCredentials(tlsCredentials),
		grpc.WithKeepaliveParams(keepaliveParams),
	)

	conn, err := grpc.DialContext(ctx, options.Address, dialOpts...)
	if err != nil {
		return nil, fmt.Errorf("error connecting to GRPC server: %v", err)
	}

	...
}

With the Golang client implementation, the problem no longer reproduces after switching the network. I also see the keepalive timer using netstat -ctown:

root@:/# netstat -ctown | grep <my_grpc_port>
tcp        0      0 <ip>:56924      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (12.86/0/0)
tcp        0      0 <ip>:42904      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (13.85/0/0)
tcp        0      0 <ip>:45724      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (9.85/0/0)
tcp        0      0 <ip>:35064      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (8.86/0/0)
tcp        0      0 <ip>:39710      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (8.83/0/0)
tcp        0      0 <ip>:40580      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (3.75/0/0)
tcp        0      0 <ip>:42582      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (2.54/0/0)
tcp        0      0 <ip>:47130      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (6.82/0/0)
tcp        0      0 <ip>:57764      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (1.53/0/0)
tcp        0      0 <ip>:42150      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (11.85/0/0)
tcp        0      0 <ip>:32832      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (12.83/0/0)
tcp        0      0 <ip>:44566      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (4.89/0/0)
tcp        0      0 <ip>:52138      <ip2>:<my_grpc_port>      ESTABLISHED keepalive (7.52/0/0)

veblush assigned gnossen and unassigned veblush on Feb 26, 2023

gnossen commented Mar 1, 2023

I don't think netstat -ctown is a sound check here. The keepalive referred to is not TCP keepalive, but HTTP/2 ping-frame-based keepalive, which the kernel's TCP timer column will never show.

After switching the network, GRPC channels do not respond within approximately 15-20 minutes.

This, however, seems worrying. I don't see any issue with the channel args you pasted above, so I'm going to take some time to see if I can reproduce.
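
To illustrate the distinction, here is a minimal sketch (not gRPC itself; assuming Linux, with a placeholder target): netstat's keepalive timer reflects the kernel-level SO_KEEPALIVE socket option, which HTTP/2 ping keepalive does not use. As an aside worth checking: Go's net.Dialer enables TCP keepalive by default (~15 second period), which would explain why the Go client's sockets above do show keepalive timers while the Python client's do not.

import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # kernel-level TCP keepalive
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # idle seconds before first probe (Linux)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)  # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # failed probes before the socket is closed
s.connect(('example.com', 443))

# While this sleeps, `netstat -ctown` shows this socket with a "keepalive"
# timer. A gRPC channel will not show that unless the transport also sets
# SO_KEEPALIVE, which HTTP/2 ping keepalive does not require.
time.sleep(60)
s.close()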


evgmoskalenko commented Mar 2, 2023

Hi, @gnossen. Thanks for your help. :-)

I saw in the gRPC core implementation that the keepalive settings are supposed to override TCP_USER_TIMEOUT (https://man7.org/linux/man-pages/man7/tcp.7.html) to avoid situations where TCP packets are not ACKed in a reasonable time. That should cover all the edge cases of channels getting blocked indefinitely (keepalive included).

From that man page (socket option TCP_USER_TIMEOUT) I understand this is a per-socket option, not a kernel-wide setting. Perhaps the option is not actually being overridden?

Increasing user timeouts allows a TCP connection to survive extended periods without end-to-end connectivity. Decreasing user timeouts allows applications to "fail fast", if so desired. Otherwise, failure may take up to 20 minutes with the current system defaults in a normal WAN environment.

My stalls also take 15 to 20 minutes, and they are resolved either by waiting or by an immediate restart of the client or server. That is why I decided to write the client in another language (Golang), to rule out the server side.
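
A minimal sketch of how TCP_USER_TIMEOUT can be set and read back on a plain socket (assuming Linux and Python >= 3.6, which exposes socket.TCP_USER_TIMEOUT; gRPC does not expose its transport socket to Python, so this demonstrates the mechanism only, not the channel's actual state):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Per gRFC A18, gRPC core is expected to set TCP_USER_TIMEOUT on its own
# sockets from the keepalive timeout (20_000 ms in the config above).
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 20_000)

print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT))  # -> 20000
s.close()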


evgmoskalenko commented Apr 5, 2023

@gnossen, we see the network switching between regions in our network metrics. I also created a Python service that just sends ping-pong gRPC requests between client and server and records each request (latency and error rate) as Prometheus metrics. After switching networks, I see losses and errors from the Python service for a very long time (15-20 minutes).

I think there is definitely a problem here. When I do the same thing in Golang with keepalive, everything is fine: I see the network switch in the network metrics, only a small number of errors in the Golang service, and very fast connection recovery.
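
For reference, a minimal sketch of such a probe (assuming the server exposes the standard grpc.health.v1 service as a stand-in for the actual ping-pong RPC, and that grpcio-health-checking is installed; the target and interval are placeholders, and the Prometheus instrumentation is omitted):

import asyncio
import time

import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

CHANNEL_OPTIONS = [
    ('grpc.keepalive_time_ms', 5_000),
    ('grpc.keepalive_timeout_ms', 20_000),
    ('grpc.http2.max_pings_without_data', 0),
    ('grpc.keepalive_permit_without_calls', 1),
]

async def probe(target: str) -> None:
    async with grpc.aio.insecure_channel(target, options=CHANNEL_OPTIONS) as channel:
        stub = health_pb2_grpc.HealthStub(channel)
        while True:
            start = time.monotonic()
            try:
                await stub.Check(health_pb2.HealthCheckRequest(), timeout=5)
                print(f'ok  latency={time.monotonic() - start:.3f}s')
            except grpc.aio.AioRpcError as rpc_error:
                print(f'err {rpc_error.code()} after {time.monotonic() - start:.3f}s')
            await asyncio.sleep(1)

asyncio.run(probe('localhost:50051'))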

@BatmanAoD

Is it possible that this is related to this old issue caused by bad TCP keepalive settings? In particular, note this comment:

The client doesn't actually hang indefinitely as originally reported, but it will roughly take 15-20 minutes for it to time out.

The solution that was implemented tied the TCP timeout to the gRPC keepalive settings: gRFC A18. I'm wondering whether the Python gRPC library isn't correctly configuring the timeout options; I've observed separately that keepalive_time_ms seems to be ignored.
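
One way to test that in isolation (a minimal sketch; the target is a placeholder): gRPC core reads GRPC_VERBOSITY and GRPC_TRACE at startup, so enabling the http_keepalive tracer before importing grpc and holding an otherwise idle channel open should print a "Ping sent" trace line roughly every keepalive_time_ms if the option is honored.

import os
os.environ['GRPC_VERBOSITY'] = 'DEBUG'
os.environ['GRPC_TRACE'] = 'http_keepalive'

import time
import grpc

channel = grpc.insecure_channel('localhost:50051', options=[
    ('grpc.keepalive_time_ms', 5_000),
    ('grpc.keepalive_timeout_ms', 20_000),
    ('grpc.keepalive_permit_without_calls', 1),
])
# Force the channel to connect; keepalive pings only run on a live transport.
grpc.channel_ready_future(channel).result(timeout=10)
time.sleep(30)  # expect roughly six "Ping sent" trace lines if keepalive is active
channel.close()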
