tcp_keep_alive_timeout seems not set for MySQL connection #63659

Open
baolinhuang opened this issue May 11, 2024 · 8 comments
Labels: comp-mysql, st-need-info (We need extra data to continue), st-need-repro (We were not able to reproduce the problem, please help us), unexpected behaviour

baolinhuang commented May 11, 2024

Hi guys:

Our production environment hit a case where many half-open connections were found on the ClickHouse server.
These connections are created by our load-balancing component when it performs health checks.
The connections have been closed on the client side, but they remain open on the ClickHouse side.

I looked through the 23.8 open-source code and found that keep-alive is not set on the socket for either the TCP or the MySQL protocol, even though I configured tcp_keep_alive_timeout.


baolinhuang commented May 20, 2024

In fact, this problem is not limited to MySQL connections; the native TCP protocol is affected as well.

If a packet is lost between client and server while the connection is being closed, a half-open connection is left on the server side. Because tcp_keep_alive_timeout does not take effect, that connection persists forever.

CheSema self-assigned this May 21, 2024

CheSema commented May 21, 2024

It is a good guess that a packet is lost when the client closes the connection, but I doubt it happens often.

FIN packets are retransmitted:
https://stackoverflow.com/questions/35387206/does-tcp-treats-fin-retransmission-like-a-normal-segement-retransmission

More likely it is an abrupt client shutdown, like a segfault.
Or some intermediate hop, like a NAT, which may drop connections according to its own logic, inconsistent with the client/server TCP settings.

I would appreciate logs from both client and server, and some diagnostics on whether there is a NAT between them.


baolinhuang commented May 21, 2024

> It is a good guess that a packet is lost when the client closes the connection, but I doubt it happens often.
>
> FIN packets are retransmitted: https://stackoverflow.com/questions/35387206/does-tcp-treats-fin-retransmission-like-a-normal-segement-retransmission
>
> More likely it is an abrupt client shutdown, like a segfault. Or some intermediate hop, like a NAT, which may drop connections according to its own logic, inconsistent with the client/server TCP settings.
>
> I would appreciate logs from both client and server, and some diagnostics on whether there is a NAT between them.

1. The situation we encountered is that the half-open connections appear while the load balancer (LB) is doing health checks.
2. A health check first creates a TCP connection and then sends a reset (RST) packet to close it (see the sketch after this list).
3. We think the RST packet from the LB to ClickHouse is lost, resulting in a half-open connection.
4. We tried to capture evidence of the packet loss with tcpdump, but the problem triggers randomly and rarely (although the impact is severe), so I cannot give you more logs.
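
For context, an RST close is typically produced by setting SO_LINGER with a zero timeout, so close() aborts the connection with RST instead of the normal FIN handshake. A minimal sketch with the raw POSIX API; the address handling is an example only, not our LB's actual code:

```cpp
// Hypothetical health-check probe: connect, then abort with RST.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

void probeWithRstClose(const char * ip, uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0)
    {
        // Zero linger time makes close() send RST instead of FIN.
        linger lin{};
        lin.l_onoff = 1;
        lin.l_linger = 0;
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));
    }
    close(fd);
}
```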

Regardless, half-open connections can occur for a variety of reasons.
What needs to be confirmed now is whether the implementation of TCP keep-alive in ClickHouse has a bug.

It seems that tcp_keep_alive_timeout is not applied to the TCP socket:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Server/TCPHandler.cpp#L252-L254
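
For illustration, this is what I expected tcp_keep_alive_timeout to do there. A minimal sketch with Poco's socket wrapper, assuming the setting is passed in as seconds; this is not the actual ClickHouse code:

```cpp
// Hypothetical sketch: enable OS-level keep-alive on the accepted socket,
// using tcp_keep_alive_timeout as the idle time before the first probe.
#include <Poco/Net/StreamSocket.h>
#include <netinet/in.h>   // IPPROTO_TCP
#include <netinet/tcp.h>  // TCP_KEEPIDLE (Linux)

void enableTcpKeepAlive(Poco::Net::StreamSocket & socket, int tcp_keep_alive_timeout_sec)
{
    if (tcp_keep_alive_timeout_sec <= 0)
        return;

    socket.setKeepAlive(true);  // sets SO_KEEPALIVE on the underlying fd

#if defined(TCP_KEEPIDLE)
    // Linux-specific: seconds of idleness before the kernel sends the
    // first keep-alive probe.
    socket.setOption(IPPROTO_TCP, TCP_KEEPIDLE, tcp_keep_alive_timeout_sec);
#endif
}
```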


CheSema commented May 21, 2024

> It seems that tcp_keep_alive_timeout is not applied to the TCP socket:
> https://github.com/ClickHouse/ClickHouse/blob/master/src/Server/TCPHandler.cpp#L252-L254

We handle the timeout for TCP connections here:

```cpp
UInt64 timeout_ms = std::min(poll_interval, idle_connection_timeout) * 1000000;
while (tcp_server.isOpen() && !server.isCancelled() && !static_cast<ReadBufferFromPocoSocket &>(*in).poll(timeout_ms))
{
    if (idle_time.elapsedSeconds() > idle_connection_timeout)
    {
        LOG_TRACE(log, "Closing idle connection");
        return;
    }
}
```

tcp_keep_alive_timeout does not add any value for tracking timeouts if idle_connection_timeout is set up correctly.

Could you provide the actual value of idle_connection_timeout from your server? (A SELECT from system.settings is the best way.)


baolinhuang commented May 22, 2024

Thanks for the quick reply

We configured idle_connection_timeout = 3600, but it did not take effect either.

Regarding why idle_connection_timeout did not take effect: in this case the LB (the load-balancing component mentioned earlier) only ① creates a TCP connection and ② sends a RST.

It never goes through the full ClickHouse connection setup (user/password authentication, etc.).
Therefore, the idle_connection_timeout mechanism in ClickHouse never kicks in.

For this situation we can only rely on TCP keep-alive and let the operating system clean up the stale connections for us.


CheSema commented May 23, 2024

Could you please tell me the actual value of receive_timeout on your cluster?

After the accept call, the server calls receiveBytes on the socket. That operation can hang for up to receive_timeout.


CheSema commented May 23, 2024

I have made a simple experiment.

cat > /dev/tcp/localhost/9000 -- that command connects to the server and remains silent. The connection is timed out after receive_timeout when the server tries to read the hello packet.

So I do not understand your case. Are you sure that idle_connection_timeout and receive_timeout are not overridden?


CheSema commented May 23, 2024

> What needs to be confirmed now is whether the implementation of TCP keep-alive in ClickHouse has a bug.

We simply do not use TCP keep-alive on the server. We rely on idle_connection_timeout and receive_timeout. I did not find a bug here.

TCP keep-alive could be useful when the server wants to detect lost connections within idle_connection_timeout.
However, it is not free: some packets are sent and received for each connection, and the probes only check that the host and its OS are alive; they do not check that the client application is responsive.
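
To make the trade-off concrete, here is a sketch of the three Linux knobs behind SO_KEEPALIVE on a connected socket fd; the values are examples, not recommended settings:

```cpp
// Example keep-alive tuning with the raw POSIX API (Linux).
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int configureKeepAlive(int fd)
{
    int enable = 1;        // turn keep-alive on
    int idle_sec = 60;     // idle time before the first probe
    int interval_sec = 5;  // interval between unanswered probes
    int probe_count = 3;   // unanswered probes before the connection is dropped

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable)) != 0)
        return -1;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_sec, sizeof(idle_sec));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_sec, sizeof(interval_sec));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probe_count, sizeof(probe_count));
    return 0;
}
```

With these example values a dead peer is detected after roughly 60 + 3 * 5 = 75 seconds of silence, at the cost of periodic probe packets on every idle connection. And since the probes are answered by the peer's kernel, a hung client process still looks alive.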

CheSema added the st-need-info (We need extra data to continue) and st-need-repro (We were not able to reproduce the problem, please help us) labels May 23, 2024