
[BUG] Concurrent RPC operations might lead to TCP connection leaks. #15983

Open
2 of 3 tasks
Musknine opened this issue May 11, 2024 · 9 comments
Labels
bug (Something isn't working), priority:high

Comments

@Musknine

Musknine commented May 11, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

There is detailed information in #15954.
There are too many 127.0.0.1:50052 connections, which leads to a TCP connection leak. I think 3.1.x has the same problem as well, but I didn't test it.

What you expected to happen

no

How to reproduce

no

Anything else

No response

Version

3.1.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Musknine added the bug (Something isn't working) and Waiting for reply labels May 11, 2024
Musknine changed the title from "there is coccurency problem that lead tcp connections leak." to "there is a coccurency problem that lead tcp connections leak." May 12, 2024
@wangxj3
Contributor

wangxj3 commented May 13, 2024

How many connections are built on the alert-server? Is that equal to the number of tasks running after restarting the service? How many worker-servers do you have?

@Musknine
Author

Musknine commented May 14, 2024

1. There are three worker servers.
2. The number of connections varies; it primarily depends on how long the server has been running.
3. The connection count doesn't always equal the number of tasks.

I saw scenarios 2 and 3, I think, because the connection leaks don't occur consistently. I customized the log output: a leak has occurred only when channels.put(host, channel) returns a non-null result.
(screenshots of the custom log output)
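A minimal, self-contained illustration of that check (using String in place of a Netty Channel; the class and host names are just for the demo, not DolphinScheduler code): when put returns a non-null previous value, the old channel's reference has just been dropped from the map, so nothing can ever look it up and close it.

```java
import java.util.concurrent.ConcurrentHashMap;

// Demo of the leak condition: a non-null return from channels.put(host, channel)
// means a previous channel for the same host was silently replaced and its
// reference lost, so it can never be closed through the map again.
public class PutOverwriteDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, String> channels = new ConcurrentHashMap<>();

        String previous = channels.put("worker-1:50052", "channelA");
        System.out.println("first put, previous = " + previous);   // null -> no leak

        previous = channels.put("worker-1:50052", "channelB");
        System.out.println("second put, previous = " + previous);  // channelA -> leaked
    }
}
```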

@wangxj3
Contributor

wangxj3 commented May 14, 2024

Maybe we need to check the problem further: print the connection port when closing the channel, and check whether the ports in the system include the closed ones. Then analyze whether the closing method is called but the channel is not actually closed.
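A minimal sketch of that kind of diagnostic, assuming access to the Netty Channel right after it is created (the helper class and method name are hypothetical, not existing DolphinScheduler code): log the channel's addresses when its close future completes, so the logged ports can be cross-checked against ss/netstat output to see whether a supposedly closed channel is still open in the OS.

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelFutureListener;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical diagnostic helper: register a listener that logs the channel's
// local and remote address once the channel is really closed.
public final class ChannelCloseDiagnostics {

    private static final Logger log = LoggerFactory.getLogger(ChannelCloseDiagnostics.class);

    private ChannelCloseDiagnostics() {
    }

    public static void logCloseForDiagnostics(Channel channel) {
        channel.closeFuture().addListener((ChannelFutureListener) future ->
                log.info("channel closed, local={}, remote={}",
                        channel.localAddress(), channel.remoteAddress()));
    }
}
```

Comparing these log lines with the output of `ss -tanp` would show whether ports that were logged as closed are still held by the process.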

@Musknine
Author

(screenshot)
Look here: this connection has been running for many days (I observed many connections just like it). Because channels (the map that stores all the channels) loses the reference to the channel forever, it can never be closed: we can't get the channel any more once it was overwritten by channels.put(host, channel).

@wangxj3
Contributor

wangxj3 commented May 14, 2024


When tasks are concurrent (tasks running on the same worker-server), say tasks A and B, the concurrency happens while creating channels: NettyRemotingClient.createChannel is called, A creates channelA, B creates channelB, A sends messages over channelA and B sends messages over channelB. But when A closes its channel, it looks the channel up by host, so it gets channelB and closes that instead; channelA is left dangling.

If convenient, you could verify this: run the tasks serially and check whether there are still unclosed connections. If the problem doesn't reproduce, this is probably the cause. The fix would need to change the connection-establishing code to close the channel directly, instead of looking it up again in the calling function before closing it. Or simply reuse the channel and don't close it at all (the risk of that would need to be evaluated).
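A minimal, runnable sketch of the race described above, with a check-then-put channel cache (all names are illustrative, not the actual NettyRemotingClient code): two concurrent tasks can both create a channel for the same host, and the second put silently overwrites the first entry, leaving a channel that can never be looked up and closed by host.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative race: tasks A and B ask for a channel to the same host at the
// same time. The naive check-then-put may create two channels; the second put
// overwrites the first entry, so only one of them can ever be found by host.
public class ChannelRaceDemo {

    static final Map<String, String> channels = new ConcurrentHashMap<>();
    static final AtomicInteger created = new AtomicInteger();

    static String getChannelNaive(String host) {
        String channel = channels.get(host);
        if (channel == null) {                       // both threads can see null here
            channel = "channel-" + created.incrementAndGet();
            channels.put(host, channel);             // second put overwrites the first
        }
        return channel;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch start = new CountDownLatch(1);
        Runnable task = () -> {
            try {
                start.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            getChannelNaive("worker-1:50052");
        };
        Thread a = new Thread(task, "taskA");
        Thread b = new Thread(task, "taskB");
        a.start();
        b.start();
        start.countDown();
        a.join();
        b.join();

        System.out.println("channels created: " + created.get());   // may be 2
        System.out.println("channels in map : " + channels.size()); // always 1
    }
}
```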

@Musknine
Author

Musknine commented May 14, 2024

I have reproduced this issue. In version 3.2.x, it appears that the channel is reused and simply not closed. Keeping the connection in a map seems intended for reusing it, but closing it after every call defeats the purpose of reuse and only adds complexity. We could add logic to a handler that closes the connection if no data has been transferred for a while, with a generous threshold, for example over 24 hours. However, on a preliminary review of the code, versions 3.0 and 3.1 also seem to have this issue.
(screenshot)

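A minimal sketch of the idle-close idea, assuming a Netty pipeline and the 24-hour threshold suggested above (the initializer class is illustrative, not existing DolphinScheduler code): keep reusing the channel and let an IdleStateHandler close it only after it has carried no traffic for a long time.

```java
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;
import java.util.concurrent.TimeUnit;

// Sketch: reuse channels instead of closing them per call, and let an
// IdleStateHandler close any channel that has carried no traffic for 24 hours.
public class IdleCloseInitializer extends ChannelInitializer<SocketChannel> {

    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
                // fires an ALL_IDLE event if nothing is read or written for 24 hours
                .addLast(new IdleStateHandler(0, 0, 24, TimeUnit.HOURS))
                .addLast(new ChannelDuplexHandler() {
                    @Override
                    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
                        if (evt instanceof IdleStateEvent
                                && ((IdleStateEvent) evt).state() == IdleState.ALL_IDLE) {
                            ctx.close();   // drop the long-idle connection
                        } else {
                            super.userEventTriggered(ctx, evt);
                        }
                    }
                });
        // ... business handlers would follow here
    }
}
```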

SbloodyS removed the Waiting for reply label May 14, 2024
@Musknine
Author

Musknine commented May 14, 2024

I will test and observe this on version 3.1.9; if the problem exists there too, I'll discuss it here and then submit a PR to fix it.

@wangxj3
Contributor

wangxj3 commented May 14, 2024

+1

@ruanwenjun
Member

Good catch, this is due to a concurrency problem in NettyRemotingClient: if multiple operations for the same host arrive at the same time, it might create multiple channels. I submitted #16021 to fix this.
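This is not the actual change in #16021, just a minimal sketch of one way to avoid the race (types simplified to String): ConcurrentHashMap.computeIfAbsent creates at most one channel per host even under concurrent requests, so no channel reference is ever silently overwritten and leaked.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a race-free channel cache: computeIfAbsent guarantees only one
// channel is created per host, and closeChannel removes the entry before
// closing so the map never keeps a reference to a closed channel.
// (Types simplified to String; this is not the actual fix in #16021.)
public class ChannelRegistry {

    private final ConcurrentHashMap<String, String> channels = new ConcurrentHashMap<>();

    public String getOrCreateChannel(String host) {
        return channels.computeIfAbsent(host, this::createChannel);
    }

    private String createChannel(String host) {
        // in the real client this would connect with a Netty Bootstrap
        return "channel-to-" + host;
    }

    public void closeChannel(String host) {
        String channel = channels.remove(host);
        if (channel != null) {
            // in the real client: channel.close()
            System.out.println("closing " + channel);
        }
    }
}
```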

ruanwenjun changed the title from "there is a coccurency problem that lead tcp connections leak." to "[BUG] Concurrent RPC operations might lead to TCP connection leaks." May 17, 2024