tcp_keep_alive_timeout seems not set for MySQL connection #63659

Open
baolinhuang opened this issue May 11, 2024 · 8 comments
Labels: comp-mysql, st-need-info (We need extra data to continue), st-need-repro (We were not able to reproduce the problem, please help us), unexpected behaviour

baolinhuang commented May 11, 2024

Hi guys:

Our production environment hit a case where many half-open connections were found on the ClickHouse server.
These connections are created by our load-balancing component when it performs health checks.
The connections have been closed on the client side, but they remain open on the ClickHouse side.

I looked through the 23.8 open-source code and found that keep-alive is not set on the socket for either the TCP or the MySQL protocol, even though I configured tcp_keep_alive_timeout.


baolinhuang commented May 20, 2024

In fact, this problem is not limited to MySQL connections; the native TCP protocol is affected as well.

If a packet is lost between client and server while the connection is being closed, a half-open connection is left on the server side. Because tcp_keep_alive_timeout does not take effect, that connection persists forever.

CheSema self-assigned this May 21, 2024

CheSema commented May 21, 2024

It is a good guess that a packet is lost when the client closes the connection, but I doubt it happens often.

FIN packets are retransmitted:
https://stackoverflow.com/questions/35387206/does-tcp-treats-fin-retransmission-like-a-normal-segement-retransmission

More likely it is an abrupt client shutdown, like a segfault.
Or some intermediate hop, like a NAT, which may drop connections according to its own logic, inconsistent with the client/server TCP settings.

I would appreciate logs from both client and server, and some diagnostics on whether there is a NAT between them.


baolinhuang commented May 21, 2024

> It is a good guess that a packet is lost when the client closes the connection, but I doubt it happens often.
>
> FIN packets are retransmitted: https://stackoverflow.com/questions/35387206/does-tcp-treats-fin-retransmission-like-a-normal-segement-retransmission
>
> More likely it is an abrupt client shutdown, like a segfault. Or some intermediate hop, like a NAT, which may drop connections according to its own logic, inconsistent with the client/server TCP settings.
>
> I would appreciate logs from both client and server, and some diagnostics on whether there is a NAT between them.

1. The situation we encountered is that the half-open connections appear while the load balancer (LB) is doing health checks.
2. A health check first creates a TCP connection and then sends a reset (RST) packet to close it (see the sketch after this list).
3. We think the RST packet from the LB to ClickHouse is lost, resulting in a half-open connection.
4. We tried to capture evidence of the packet loss with tcpdump, but the problem triggers randomly and rarely (although the impact is severe), so I cannot give you more logs.
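
For context, an RST close is typically produced by setting SO_LINGER with a zero timeout, so close() aborts the connection with RST instead of the normal FIN handshake. A minimal sketch with the raw POSIX API; the address handling is an example only, not our LB's actual code:

```cpp
// Hypothetical health-check probe: connect, then abort with RST.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

void probeWithRstClose(const char * ip, uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0)
    {
        // Zero linger time makes close() send RST instead of FIN.
        linger lin{};
        lin.l_onoff = 1;
        lin.l_linger = 0;
        setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin));
    }
    close(fd);
}
```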

Regardless, half-open connections can occur for a variety of reasons.
What needs to be confirmed now is whether the implementation of TCP keep-alive in ClickHouse has a bug.

It seems that tcp_keep_alive_timeout is not applied to the TCP socket:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Server/TCPHandler.cpp#L252-L254
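
For illustration, this is what I expected tcp_keep_alive_timeout to do there. A minimal sketch with Poco's socket wrapper, assuming the setting is passed in as seconds; this is not the actual ClickHouse code:

```cpp
// Hypothetical sketch: enable OS-level keep-alive on the accepted socket,
// using tcp_keep_alive_timeout as the idle time before the first probe.
#include <Poco/Net/StreamSocket.h>
#include <netinet/in.h>   // IPPROTO_TCP
#include <netinet/tcp.h>  // TCP_KEEPIDLE (Linux)

void enableTcpKeepAlive(Poco::Net::StreamSocket & socket, int tcp_keep_alive_timeout_sec)
{
    if (tcp_keep_alive_timeout_sec <= 0)
        return;

    socket.setKeepAlive(true);  // sets SO_KEEPALIVE on the underlying fd

#if defined(TCP_KEEPIDLE)
    // Linux-specific: seconds of idleness before the kernel sends the
    // first keep-alive probe.
    socket.setOption(IPPROTO_TCP, TCP_KEEPIDLE, tcp_keep_alive_timeout_sec);
#endif
}
```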


CheSema commented May 21, 2024

> It seems that tcp_keep_alive_timeout is not applied to the TCP socket:
> https://github.com/ClickHouse/ClickHouse/blob/master/src/Server/TCPHandler.cpp#L252-L254

We handle the timeout for TCP connections here:

```cpp
UInt64 timeout_ms = std::min(poll_interval, idle_connection_timeout) * 1000000;
while (tcp_server.isOpen() && !server.isCancelled() && !static_cast<ReadBufferFromPocoSocket &>(*in).poll(timeout_ms))
{
    if (idle_time.elapsedSeconds() > idle_connection_timeout)
    {
        LOG_TRACE(log, "Closing idle connection");
        return;
    }
}
```

tcp_keep_alive_timeout does not add any value for tracking timeouts if idle_connection_timeout is set up correctly.

Could you provide the actual value of idle_connection_timeout from your server? (A SELECT from system.settings is the best way.)


baolinhuang commented May 22, 2024

Thanks for the quick reply

We configured idle_connection_timeout = 3600, but it did not take effect either.

Regarding why idle_connection_timeout did not take effect: in this case the LB (the load-balancing component mentioned earlier) only ① creates a TCP connection and ② sends a RST.

It never goes through the full ClickHouse connection setup (user/password authentication, etc.).
Therefore, the idle_connection_timeout mechanism in ClickHouse never kicks in.

For this situation we can only rely on TCP keep-alive and let the operating system clean up the stale connections for us.


CheSema commented May 23, 2024

Could you please tell me the actual value of receive_timeout on your cluster?

After the accept call, the server calls receiveBytes on the socket. That operation can hang for up to receive_timeout.


CheSema commented May 23, 2024

I have made a simple experiment.

cat > /dev/tcp/localhost/9000 -- that command connects to the server and remains silent. The connection is timed out after receive_timeout when the server tries to read the hello packet.

So I do not understand your case. Are you sure that idle_connection_timeout and receive_timeout are not overridden?


CheSema commented May 23, 2024

> What needs to be confirmed now is whether the implementation of TCP keep-alive in ClickHouse has a bug.

We simply do not use TCP keep-alive on the server. We rely on idle_connection_timeout and receive_timeout. I did not find a bug here.

TCP keep-alive could be useful when the server wants to detect lost connections within idle_connection_timeout.
However, it is not free: some packets are sent and received for each connection, and the probes only check that the host and its OS are alive; they do not check that the client application is responsive.
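
To make the trade-off concrete, here is a sketch of the three Linux knobs behind SO_KEEPALIVE on a connected socket fd; the values are examples, not recommended settings:

```cpp
// Example keep-alive tuning with the raw POSIX API (Linux).
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int configureKeepAlive(int fd)
{
    int enable = 1;        // turn keep-alive on
    int idle_sec = 60;     // idle time before the first probe
    int interval_sec = 5;  // interval between unanswered probes
    int probe_count = 3;   // unanswered probes before the connection is dropped

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable)) != 0)
        return -1;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_sec, sizeof(idle_sec));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_sec, sizeof(interval_sec));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probe_count, sizeof(probe_count));
    return 0;
}
```

With these example values a dead peer is detected after roughly 60 + 3 * 5 = 75 seconds of silence, at the cost of periodic probe packets on every idle connection. And since the probes are answered by the peer's kernel, a hung client process still looks alive.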

CheSema added the st-need-info (We need extra data to continue) and st-need-repro (We were not able to reproduce the problem, please help us) labels May 23, 2024