
Netty takes 5 mins 30 seconds to close the channel in case of timeout #13888

Open

gouthamp174 opened this issue Mar 6, 2024 · 3 comments

@gouthamp174
Expected behavior

In our company we use the ODL Netconf library, which in turn uses the Netty library, to connect to NETCONF-based network elements (NEs).

Issue-1: NioSocketChannel.closeFuture() takes 5 mins 30 seconds to complete.
The ODL Netconf library calls io.netty.bootstrap.Bootstrap.connect() to create a new NioSocketChannel for each NE. After connecting to 400 NEs we lost connectivity to all of them and periodically reattempted to connect; the new attempts timed out because the NEs were still unreachable.
When an attempt to connect to an NE times out, we wait up to 60 seconds for its Channel.closeFuture() to complete, indicating that all resources were released and the channel was closed, before reattempting the connection. We expect Channel.closeFuture() to complete within 60 seconds.

Issue-2: With 400 NEs in comm-loss, the channel for a new NE takes 2 mins 30 seconds to get created.
Further, while the 400 NEs are in communication loss, a new NE can be discovered by our software. The discovery process also calls io.netty.bootstrap.Bootstrap.connect() to create a NioSocketChannel for the new NE. We expect that channel to be created within 30 seconds.

Actual behavior

Issue-1: NioSocketChannel.closeFuture() takes 5 mins 30 seconds to complete.
With 400 NEs in communication loss, we observed that the closeFuture took around 5 mins 30 seconds to complete, which delayed our next reconnection attempt. Moreover, when network connectivity to a comm-loss NE was restored, we still had to wait those 5 mins 30 seconds before reconnecting, delaying the communication-recovery process for that NE.

Issue-2: With 400 NEs in comm-loss, the channel for a new NE takes 2 mins 30 seconds to get created.
While the 400 NEs were in communication loss and a new NE was discovered by our software, we observed that its NioSocketChannel was created only after 2 mins 30 seconds. Since we wait only 30 seconds, our code timed out and we could not connect to the new NE.

We raised both of these issues with the ODL Netconf library team. They replied that their code is a wrapper around Netty, and that it is the Netty library that is taking a long time to close timed-out NioSocketChannels or to create new ones.

Since both issues occur when 400 or more NEs are in communication loss, we believe they may be related, so I have included both of them here. Could you please look into them?

Steps to reproduce

  1. Successfully create NioSocketChannels to 400 network elements (NEs).
  2. Simulate loss of communication by blocking the IP addresses of the 400 NEs with iptables on the server.
  3. Periodically reconnect to the 400 NEs using io.netty.bootstrap.Bootstrap.connect().
  4. While the periodic reconnects are running, attempt to add a new NE by calling io.netty.bootstrap.Bootstrap.connect() for it.
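Lacking the ODL wrapper code, a minimal standalone sketch of the client-side calls in the steps above might look like the following. The host addresses, NETCONF port 830, timeout values, and the sequential (rather than periodic/concurrent) reconnect loop are all assumptions for illustration, not the actual ODL configuration:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;
import java.util.concurrent.TimeUnit;

public class ReconnectReproducer {
    public static void main(String[] args) throws Exception {
        EventLoopGroup group = new NioEventLoopGroup();
        Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                // Bound the TCP connect attempt itself; without this the
                // OS-level connect timeout (often minutes) applies.
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 30_000)
                .handler(new ChannelInitializer<Channel>() {
                    @Override
                    protected void initChannel(Channel ch) {
                        // NETCONF pipeline would be installed here.
                    }
                });

        for (int i = 1; i <= 400; i++) {
            String host = "10.0.0." + (i % 250 + 1); // placeholder NE addresses
            ChannelFuture connect = bootstrap.connect(host, 830);
            if (!connect.awaitUninterruptibly(30, TimeUnit.SECONDS) || !connect.isSuccess()) {
                // On timeout, wait up to 60 s for full teardown, as our code does.
                connect.channel().closeFuture().awaitUninterruptibly(60, TimeUnit.SECONDS);
            }
        }
        group.shutdownGracefully();
    }
}
```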

Minimal yet complete reproducer code (or URL to code)

Our company's code calls the ODL Netconf library, which in turn calls Netty. Currently I don't have access to the ODL Netconf code that calls io.netty.bootstrap.Bootstrap.connect(). If a reproducer is a mandatory requirement, please let me know and I will ask the ODL Netconf team to share their code snippets.

Netty version

4.1.104.Final

JVM version (e.g. java -version)

openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)

OS version (e.g. uname -a)

CentOS Linux 7 (Core)
Linux rtxvdvlp405.net.local 3.10.0-1160.102.1.el7.x86_64 #1 SMP Tue Oct 17 15:42:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@normanmaurer
Member

If the closeFuture takes 5 minutes to complete it means that closure was not detected for 5 minutes. I suspect you will need to implement some sort of "keep alive" messages in your protocol to detect "dead" connections fast. There is nothing that we can do here.
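For reference, one common way to implement such keep-alive detection at the Netty level is IdleStateHandler, which fires an event when no data has been read for a configured interval. This is only an illustrative sketch; the 30-second idle threshold and the handler name are assumptions:

```java
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;
import java.util.concurrent.TimeUnit;

// In the ChannelInitializer:
//   pipeline.addLast(new IdleStateHandler(30, 0, 0, TimeUnit.SECONDS));
//   pipeline.addLast(new KeepAliveHandler());

class KeepAliveHandler extends ChannelDuplexHandler {
    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent
                && ((IdleStateEvent) evt).state() == IdleState.READER_IDLE) {
            // Nothing read for 30 s: treat the peer as dead and close now,
            // so closeFuture completes promptly instead of waiting on TCP.
            ctx.close();
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }
}
```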

@gouthamp174
Author

gouthamp174 commented Mar 13, 2024

Hi @normanmaurer,
Thanks for your input. The time taken by closeFuture to complete varies around 5 mins 30 seconds: sometimes it takes 4 mins 30 seconds, at other times 6 mins. So I suspect it is related to the number of concurrent client channels Netty is handling. Since we have sessions for 400 NEs, channels for some sessions are still waiting to time out while channels for others have already timed out and we are attempting to establish new channels for them.

Could it be related to the event-loop thread pool and the number of concurrent channels it can handle simultaneously? If so, is there a configuration parameter I can tune so that the event-loop pool handles these concurrent channels more quickly?
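For context, the event-loop thread count is fixed when the NioEventLoopGroup is constructed (by default roughly twice the available processors, overridable via the io.netty.eventLoopThreads system property). Whether the ODL wrapper exposes this knob is an assumption to verify; a sketch of what the knob looks like in plain Netty:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class GroupSizing {
    public static void main(String[] args) {
        // Default size is max(1, 2 * availableProcessors); pass an explicit
        // count to override it.
        NioEventLoopGroup group = new NioEventLoopGroup(16); // 16 event-loop threads
        Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class);
        // Each channel is pinned to one loop for its lifetime, so one slow
        // handler can stall every channel registered on the same loop.
        group.shutdownGracefully();
    }
}
```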

Also, we do have "keep-alive" messages in our protocol to detect dead connections. Our scenario is:

  1. First we discover all 400 NEs. Netty establishes channels for all 400 NEs, so we have active sessions to every NE.
  2. The network goes down and all 400 NEs become unreachable. Thanks to the keep-alive messages we detect the communication loss, and the Netty channels are closed.
  3. We attempt to reconnect to these NEs every 60 seconds, i.e. we request Netty to create new channels for them. Since the network is down, channel creation times out; this is where we observe that Netty takes around 5 mins 30 seconds to report the timeout.
    1. At this moment, if we try to discover a new NE, Netty takes around 2 mins 30 seconds to create a session for it.
  4. Later, when the network comes back up and all 400 NEs become reachable again, we still have to wait the roughly 5 mins 30 seconds it takes the pending attempts to time out before we can retry. That is a long time for our module to recover connectivity to the NEs; I would like to reduce it to 30 seconds.

@normanmaurer
Member

Maybe check if you block the event loop.
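One way to check this is to measure how long a no-op task queued on the suspect executor waits before it runs; a Netty EventLoop is a ScheduledExecutorService, so the same probe applies to it. This standalone JDK-only sketch simulates a loop blocked for 2 seconds (all names here are illustrative, not Netty API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class LoopProbe {
    static long measured; // last probe delay in ms, kept for inspection

    // Submit a no-op task and measure how long it sits in the queue.
    // On a healthy loop this is near zero; on a loop blocked by
    // long-running work it grows to the blocked duration.
    static long probeDelayMillis(ExecutorService loop) throws Exception {
        long submitted = System.nanoTime();
        Future<Long> started = loop.submit(System::nanoTime);
        return TimeUnit.NANOSECONDS.toMillis(started.get() - submitted);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService loop = Executors.newSingleThreadExecutor();
        // Simulate a handler that blocks the loop for 2 seconds.
        loop.submit(() -> {
            try { Thread.sleep(2000); } catch (InterruptedException ignored) { }
        });
        measured = probeDelayMillis(loop);
        System.out.println("probe waited ~" + measured + " ms"); // roughly 2000 ms here
        loop.shutdown();
    }
}
```

A probe delay that regularly reaches minutes would point at something blocking the event loop, which matches the observed 5 min 30 s closeFuture delays better than the connect timer itself.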
