
Observing frequent connection drops and reconnection to Redis from windows worker. #2297

Open
khushiluharuwalla opened this issue Jan 17, 2023 · 6 comments

khushiluharuwalla commented Jan 17, 2023

Bug Report

We are observing io.lettuce.core.RedisCommandTimeoutException: Command timed out after 10 second(s) intermittently while connecting to Redis from a Windows Docker container. Lettuce tries to reconnect to Redis; in some cases it reconnects successfully, while in others it fails. The initial connection to Redis takes around 30 seconds after a connection drop.

i.l.core.protocol.ConnectionWatchdog    - Cannot reconnect to redis: connection timed out
i.l.core.protocol.ConnectionWatchdog    - Reconnecting, last destination

Why are the connections breaking and reconnections happening?

We tried setting the KeepAliveOptions; the following is the code snippet we used:

SocketOptions socketOptions = SocketOptions.builder()
                .keepAlive(SocketOptions.KeepAliveOptions.builder()
                        .enable(true)
                        .idle(Duration.ofSeconds(90))
                        .count(3)
                        .interval(Duration.ofSeconds(90))
                        .build())
                .build();
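
For completeness, this is roughly how the SocketOptions are applied to the client; the ClusterClientOptions wiring and variable names below are illustrative rather than an exact copy of our code:

    // Illustrative wiring, assuming the socketOptions built above and an existing RedisClusterClient.
    ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
            .socketOptions(socketOptions)
            .build();

    redisClusterClient.setOptions(clusterClientOptions);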

This does not seem to work for us, and we are getting the following errors:

After restarting the container:

- Redis connection created successfully. 
io.netty.bootstrap.Bootstrap             - Unknown channel option 'TCP_KEEPCOUNT' for channel
io.netty.bootstrap.Bootstrap             - Unknown channel option 'TCP_KEEPIDLE' for channel
io.netty.bootstrap.Bootstrap             - Unknown channel option 'TCP_KEEPINTERVAL' for channel
Unknown channel option 'TCP_KEEPCOUNT' for channel 
Unknown channel option 'TCP_KEEPIDLE' for channel 
Unknown channel option 'TCP_KEEPINTERVAL' for channel 
Unknown channel option 'TCP_KEEPCOUNT' for channel 
Unknown channel option 'TCP_KEEPIDLE' for channel 
Unknown channel option 'TCP_KEEPINTERVAL' for channel  
io.netty.bootstrap.Bootstrap             - Unknown channel option 'TCP_KEEPCOUNT' 

And requests are failing with the following error:

io.lettuce.core.RedisCommandTimeoutException: Command timed out after 10 second(s)
i.l.core.protocol.ConnectionWatchdog     - Cannot reconnect to redis: connection timed out
i.l.core.protocol.ConnectionWatchdog     - Reconnecting, last destination was 
io.netty.bootstrap.Bootstrap             - Unknown channel option 'TCP_KEEPCOUNT' for channel
io.netty.bootstrap.Bootstrap             - Unknown channel option 'TCP_KEEPIDLE' for channel
io.netty.bootstrap.Bootstrap             - Unknown channel option 'TCP_KEEPINTERVAL' for channel
i.l.core.protocol.ReconnectionHandler    - Reconnected to 
io.lettuce.core.protocol.CommandHandler  - null Unexpected exception during request: java.io.IOException: An existing connection was forcibly closed by the remote host 
java.io.IOException: An existing connection was forcibly closed by the remote host 
 at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method) 
 at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43) 
 at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) 
 at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:233) 
 at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:223) 
 at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:356) 
 at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253) 
 at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132) 
 at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350) 
 at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) 
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722) 
 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) 
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) 
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) 
 at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) 
 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
 at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) 
 at java.base/java.lang.Thread.run(Thread.java:829) 

There are no ConnectionWatchdog reconnecting/cannot-reconnect logs for the Linux Docker container, while they appear persistently for the Windows one. For Linux, the TCP/IP parameters are:

tcp_keepalive_time = 90
tcp_keepalive_intvl = 90
tcp_keepalive_probes = 3

On Windows it picks up the default values:

tcp_keepalive_time = 7200
tcp_keepalive_intvl = 75
tcp_keepalive_probes = 9

Expected behaviour/code

The expectation is not to see any reconnection logs for the Windows Docker container, and for Lettuce to override the default values of tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes so that a stable connection can be maintained.

Environment

  • Lettuce version(s): 6.1.0.RELEASE

Additional context

We also tried to set these options explicitly using Bootstrap options as mentioned in #1428. We are still getting this issue.

@khushiluharuwalla khushiluharuwalla changed the title Redis Connectivity issue from windows engine-Unknown channel option 'TCP_KEEPCOUNT' Observing frequent connection drops and reconnection to Redis from windows worker. Jan 18, 2023

umang92 commented Jan 19, 2023

@mp911de I looked at #861. For the current issue, the following are the settings in our Redis cluster:

timeout 0
maxclients 65000

Based on the above, even if the client is idle, the Redis server should not drop the connection, right? And this behavior is only seen when the client runs in a Windows worker.
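
As a sanity check, the effective server-side value can also be read back through Lettuce itself; a small sketch, assuming an already connected StatefulRedisClusterConnection named connection (illustrative, not taken from our code):

    // Read the server-side idle timeout via an existing cluster connection.
    Map<String, String> timeoutConfig = connection.sync().configGet("timeout");
    System.out.println("server timeout = " + timeoutConfig.get("timeout"));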

Here is some more of the code showing how we create the RedisClusterClient:

    ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
            .enableAdaptiveRefreshTrigger(ClusterTopologyRefreshOptions.RefreshTrigger.MOVED_REDIRECT, ClusterTopologyRefreshOptions.RefreshTrigger.PERSISTENT_RECONNECTS)
            .adaptiveRefreshTriggersTimeout(60, TimeUnit.SECONDS)
            .build();

    redisClusterClient.setOptions(clientOptionsBuilder
            .topologyRefreshOptions(topologyRefreshOptions)
            .build());
    redisClusterClient.setDefaultTimeout(Duration.ofMillis(properties.getRedisConnectionTimeout()));


mp911de commented Jan 23, 2023

"IOException: An existing connection was forcibly closed by the remote host" and "connection timed out" failures are infrastructure issues. Lettuce requires a properly working TCP/IP infrastructure. I'm not sure how keep-alive plays into it; also, any delays in connecting (you mentioned 30 seconds) are subject to your infrastructure. 30 seconds sounds as if it were related to DNS resolution. You might want to configure the JVM DNS resolver via ClientResources.
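
A rough sketch of what that could look like, assuming Lettuce 6.1+ (where ClientResources exposes addressResolverGroup) and netty-resolver-dns on the classpath; the variable names and URI below are placeholders:

    // Illustrative only: use an explicit netty DNS resolver instead of the default JVM resolver.
    ClientResources resources = DefaultClientResources.builder()
            .addressResolverGroup(new DnsAddressResolverGroup(
                    NioDatagramChannel.class,
                    DnsServerAddressStreamProviders.platformDefault()))
            .build();

    RedisClusterClient client = RedisClusterClient.create(resources, RedisURI.create("redis://your-cluster-endpoint:6379"));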


umang92 commented Jan 23, 2023

@mp911de By infrastructure, do you mean the Redis cluster or the infra on the client side? The same Redis cluster is being used for both Linux- and Windows-based clients, and we are not seeing any issues with Linux.

Our Redis cluster is an AWS ElastiCache cluster with multiple shards, and the client side is a service running inside Linux/Windows Docker containers on a Kubernetes cluster.


umang92 commented Jan 23, 2023

Also, another question: why do we see the following errors/warnings when we try to set the TCP flags using socket options?

io.netty.bootstrap.Bootstrap - Unknown channel option 'TCP_KEEPCOUNT' for channel
io.netty.bootstrap.Bootstrap - Unknown channel option 'TCP_KEEPIDLE' for channel
io.netty.bootstrap.Bootstrap - Unknown channel option 'TCP_KEEPINTERVAL' for channel
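
Our current reading is that Netty logs this warning when the channel does not report the option as supported and then skips it, so the keepalive values never reach the socket. A quick, illustrative way to check what the running JDK reports on this platform:

    // Diagnostic sketch: list the socket options the running JVM reports as supported.
    // If TCP_KEEPIDLE / TCP_KEEPINTERVAL / TCP_KEEPCOUNT are missing here, Netty cannot
    // apply them and logs the 'Unknown channel option' warning instead.
    try (java.nio.channels.SocketChannel ch = java.nio.channels.SocketChannel.open()) {
        ch.supportedOptions().forEach(option -> System.out.println(option.name()));
    } catch (java.io.IOException e) {
        e.printStackTrace();
    }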


umang92 commented Feb 6, 2023

@mp911de Please help us figure out the reason behind the above-mentioned warnings when setting TCP options through ExtendedSocketOptions. As per the Microsoft article linked below, we believe setting TCP keepalive options at the application level can help with this issue. It mentions:

Even if the KeepAliveTime and KeepAliveInterval registry keys are set to a specific value (the TCPIP driver uses the default values even if we don't set these registry keys), the TCPIP driver won't start sending TCP keepalives until keepalives are enabled via various methods at upper layers (layers above the TCPIP driver).

Also attaching debug logs from 5 minutes of one of the load tests we did. You can see connections being dropped by searching for "An existing connection was forcibly closed by the remote host":
lettuce_redis_export.txt

We also looked at netstat -an output during one of the load tests. We observe patterns like the following intermittently:

at Tn:
TCP 10.152.37.207:49606 10.217.22.102:9999 ESTABLISHED
TCP 10.152.37.207:49593 10.217.22.102:9999 ESTABLISHED

at Tn+5s:
TCP 10.152.37.207:49606 10.217.22.102:9999 ESTABLISHED
TCP 10.152.37.207:49633 10.217.22.102:9999 SYN_SENT


khushiluharuwalla commented Feb 10, 2023

We have set the TCP keepAlive options at both the client level and the OS level, but we are still observing "Unknown channel option" warnings when we try to connect to Redis.

Client-level settings (only the last keepAlive(...) call on the builder takes effect, so this is the effective configuration):

    SocketOptions socketOptions = SocketOptions.builder()
            .keepAlive(SocketOptions.KeepAliveOptions.builder()
                    .enable(true)
                    .idle(Duration.ofSeconds(90))
                    .interval(Duration.ofSeconds(90))
                    .count(3)
                    .build())
            .build();

OS-level settings:
RUN Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" -Name "KeepAliveTime" -Value 90000 -Type DWORD -Force
RUN Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" -Name "KeepAliveInterval" -Value 90000 -Type DWORD -Force
RUN Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" -Name "TcpMaxDataRetransmissions" -Value 3 -Type DWORD -Force

Warnings:
Unknown channel option 'TCP_KEEPCOUNT' for channel
Unknown channel option 'TCP_KEEPIDLE' for channel
Unknown channel option 'TCP_KEEPINTERVAL' for channel
Unknown channel option 'TCP_KEEPCOUNT' for channel
Unknown channel option 'TCP_KEEPIDLE' for channel
Unknown channel option 'TCP_KEEPINTERVAL' for channel

We also tried setting the TCP options using ExtendedSocketOptions and EpollChannelOption, and simultaneously set them at the OS level. We are still observing the warnings.

    .nettyCustomizer(new NettyCustomizer() {
        @Override
        public void afterBootstrapInitialized(Bootstrap bootstrap) {
            bootstrap.option(ChannelOption.SO_KEEPALIVE, true);
            bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPIDLE), 90);
            bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPINTERVAL), 90);
            bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPCOUNT), 3);
        }
    })
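
To see what actually lands on each connection, one illustrative check (not part of our original code) is to dump the channel options from the other NettyCustomizer callback:

    .nettyCustomizer(new NettyCustomizer() {
        @Override
        public void afterChannelInitialized(Channel channel) {
            // Log the options Netty reports for this channel after initialization.
            channel.config().getOptions().forEach((option, value) ->
                    System.out.println(option + " = " + value));
        }
    })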

In the Microsoft documentation it is mentioned that for the TCP options to take effect, they should be set in the registry as well as at the upper layers (layers above the TCPIP driver): https://learn.microsoft.com/en-us/archive/blogs/nettracer/things-that-you-may-want-to-know-about-tcp-keepalives

Further, we were observing many TIME_WAIT entries when we ran "netstat -an". To address that, we made the following code changes and also changed the value of TcpTimedWaitDelay to 60 seconds at the OS level:
    .nettyCustomizer(new NettyCustomizer() {
        @Override
        public void afterBootstrapInitialized(Bootstrap bootstrap) {
            bootstrap.option(ChannelOption.SO_REUSEADDR, true);
            bootstrap.option(ChannelOption.SO_KEEPALIVE, true);
            bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPIDLE), 90);
            bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPINTERVAL), 90);
            bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPCOUNT), 3);
        }
    });

RUN Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" -Name "TcpTimedWaitDelay" -Value 60 -Type DWORD -Force

We are getting "io.lettuce.core.protocol.CommandHandler - null Unexpected exception during request: java.io.IOException: An existing connection was forcibly closed by the remote host" exceptions, which are causing many request failures.

How do we set the TCP keepAlive options and make sure Netty recognises them? What could be a possible reason for the reconnections happening and for a stable connection to Redis not being maintained?
