After intentional shutdown of a node of the cluster, the other nodes are still attempting to reconnect to the shutdown node #26319

federicasuriano · 2024-04-24T12:02:54Z

Bug
I encountered unexpected behavior in a Hazelcast cluster setup. After intentionally shutting down one of the cluster nodes, I noticed that the remaining nodes logged the following warnings:

WARN […] com.hazelcast.internal.server.tcp.TcpServerConnectionErrorHandler - Removing connection to endpoint […] Cause => java.io.IOException {Connection refused to address /[…]}, Error-Count: 5
WARN […] com.hazelcast.internal.cluster.impl.MembershipManager - […] Member […] is suspected to be dead for reason: No connection

The remaining nodes detect the failure of the shut-down node. However, even though the shutdown was intentional, they still attempt to reconnect to it.
I tested this on Hazelcast versions 5.3.2, 5.0.2 and 4.0.3, and it produces the same logs every time.
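
To illustrate where the "Connection refused" in the first warning comes from, here is a minimal JDK-only sketch. It is not Hazelcast code, and the class and method names (`RefusedConnectionDemo`, `closedPort`, `countFailedConnects`) are hypothetical. It binds a local port, closes the listener to mimic a peer that has shut down cleanly, and then counts how many repeated connection attempts fail.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class RefusedConnectionDemo {

    // Binds an ephemeral port and closes the listener again, so that the
    // returned port number most likely has nothing listening on it.
    static int closedPort() {
        try (ServerSocket server = new ServerSocket(0)) {
            return server.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Tries to connect to 127.0.0.1:port up to `attempts` times and returns
    // how many of those attempts failed with an IOException.
    static int countFailedConnects(int port, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try (Socket socket = new Socket()) {
                // 1000 ms connect timeout, chosen arbitrarily for this sketch
                socket.connect(new InetSocketAddress("127.0.0.1", port), 1000);
            } catch (IOException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int port = closedPort();
        System.out.println("failed attempts: " + countFailedConnects(port, 3));
    }
}
```

Every attempt against the closed port is expected to fail, which is why retrying a connection to a member that has already announced its shutdown cannot succeed until that member is restarted.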

Expected behavior
I expect that when a node is intentionally shut down, the other nodes do not attempt to reconnect to it.

How to reproduce
I created a test to reproduce the error.

package com.nm.test.hazelcast.shutdown;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import com.hazelcast.config.Config;
import com.hazelcast.config.TcpIpConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.internal.cluster.impl.MembershipManager;
import com.hazelcast.spi.properties.ClusterProperty;
import com.nm.test.hazelcast.utils.StoreLoggedEventsAppender;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.apache.log4j.Logger;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

// Test handling intentional node shutdown.
public class TestShutDown6 {

	private List<HazelcastInstance> instances;

	private final String[] targetLoggers = { "com.hazelcast.internal.server.tcp.TcpServerConnectionErrorHandler", MembershipManager.class.getName() };

	private StoreLoggedEventsAppender tcpServerConnectionErrorHandlerAppender;
	private StoreLoggedEventsAppender membershipManagerAppender;

	private Logger tcpServerConnectionErrorHandlerLogger;
	private Logger membershipManagerLogger;

	@BeforeEach
	public void setUp() throws Exception {

		// create individual appenders for target loggers
		tcpServerConnectionErrorHandlerAppender = new StoreLoggedEventsAppender();
		membershipManagerAppender = new StoreLoggedEventsAppender();

		// add appenders to the respective loggers
		tcpServerConnectionErrorHandlerLogger = Logger.getLogger(targetLoggers[0]);
		tcpServerConnectionErrorHandlerLogger.setAdditivity(true);
		tcpServerConnectionErrorHandlerLogger.addAppender(tcpServerConnectionErrorHandlerAppender);

		membershipManagerLogger = Logger.getLogger(targetLoggers[1]);
		membershipManagerLogger.setAdditivity(true);
		membershipManagerLogger.addAppender(membershipManagerAppender);

		instances = new ArrayList<>();
	}

	@AfterEach
	public void tearDown() {

		// remove the appenders
		tcpServerConnectionErrorHandlerLogger.removeAppender(tcpServerConnectionErrorHandlerAppender);
		membershipManagerLogger.removeAppender(membershipManagerAppender);

		// terminate all Hazelcast instances that are still running
		for (HazelcastInstance instance : instances) {
			instance.getLifecycleService().terminate();
		}
	}

	@Test
	public void testNoReconnectAfterNode1Shutdown() throws InterruptedException {

		// create config and start a 2 node cluster
		configAndStart2NodeCluster();

		HazelcastInstance hcInstance1 = instances.get(0);
		HazelcastInstance hcInstance2 = instances.get(1);

		// shut down hcInstance1 intentionally
		hcInstance1.getLifecycleService().shutdown();

		// wait for some time to ensure any reconnection attempts would have happened
		TimeUnit.SECONDS.sleep(10);

		// ensure only one member remains in the cluster after shutting down hcInstance1
		assertEquals(1, hcInstance2.getCluster().getMembers().size());

		// ensure no reconnection attempts are made by hcInstance2:
		// assert that there were no WARN messages from the target loggers
		assertTrue(tcpServerConnectionErrorHandlerAppender.getWarnLogs().isEmpty());
		assertTrue(membershipManagerAppender.getWarnLogs().isEmpty());
	}

	private void configAndStart2NodeCluster() {

		// create config
		Config config = new Config();

		// use Log4j 2 as Hazelcast's logging type
		config.setProperty(ClusterProperty.LOGGING_TYPE.getName(), "log4j2");

		// enable TCP-IP config
		TcpIpConfig tcpIpConfig = config.getNetworkConfig().getJoin().getTcpIpConfig();
		tcpIpConfig.setEnabled(true);
		tcpIpConfig.setMembers(List.of("127.0.0.1"));

		HazelcastInstance hcInstance1 = Hazelcast.newHazelcastInstance(config);
		HazelcastInstance hcInstance2 = Hazelcast.newHazelcastInstance(config);

		instances.add(hcInstance1);
		instances.add(hcInstance2);
	}
}

package com.nm.test.hazelcast.utils;

import java.util.ArrayList;
import java.util.List;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.spi.LoggingEvent;

// Log4j appender that stores rendered log messages in memory, grouped by level.
public class StoreLoggedEventsAppender extends AppenderSkeleton {

	private final List<String> debugLogs = new ArrayList<>();

	private final List<String> infoLogs = new ArrayList<>();

	private final List<String> warnLogs = new ArrayList<>();

	private final List<String> errorLogs = new ArrayList<>();

	@Override
	protected void append(LoggingEvent loggingEvent) {

		if (Level.DEBUG.equals(loggingEvent.getLevel())) {
			debugLogs.add(loggingEvent.getRenderedMessage());
		} else if (Level.INFO.equals(loggingEvent.getLevel())) {
			infoLogs.add(loggingEvent.getRenderedMessage());
		} else if (Level.WARN.equals(loggingEvent.getLevel())) {
			warnLogs.add(loggingEvent.getRenderedMessage());
		} else if (Level.ERROR.equals(loggingEvent.getLevel())) {
			errorLogs.add(loggingEvent.getRenderedMessage());
		}
	}

	@Override
	public void close() {
	}

	@Override
	public boolean requiresLayout() {
		return false;
	}

	public List<String> getDebugLogs() {
		return debugLogs;
	}

	public List<String> getInfoLogs() {
		return infoLogs;
	}

	public List<String> getWarnLogs() {
		return warnLogs;
	}

	public List<String> getErrorLogs() {
		return errorLogs;
	}
}

mhevolit · 2024-05-28T09:03:02Z

This is also reproducible with 5.1.5.

The problem with this behavior is that while the reconnection attempts are in progress, the cluster members basically become unresponsive, e.g. any calls to IMap.get() hang.

Let's say we have cluster members A and B, as in the reproducer above. When member A shuts down, member B receives the shutdown request of A and does repartitioning, migrations, etc.:
DEV 2024-05-28T10:51:51.695 [hz.20240528-105058-REvOR.priority-generic-operation.thread-0] INFO c.h.internal.partition.impl.MigrationManager - [169.254.19.191]:8220 [testCluster] [5.1.5] Shutdown request of Member [169.254.19.191]:8221 - a530f505-b13e-4394-869b-9718f1fa36fa is handled
DEV 2024-05-28T10:51:51.698 [hz.20240528-105058-REvOR.migration] INFO c.h.internal.partition.impl.MigrationManager - [169.254.19.191]:8220 [testCluster] [5.1.5] Repartitioning cluster data. Migration tasks count: 135
DEV 2024-05-28T10:51:51.932 [hz.20240528-105058-REvOR.migration] INFO c.h.internal.partition.impl.MigrationManager - [169.254.19.191]:8220 [testCluster] [5.1.5] All migration tasks have been completed. (repartitionTime=Tue May 28 10:51:51 CEST 2024, plannedMigrations=135, completedMigrations=135, remainingMigrations=0, totalCompletedMigrations=406)

So far, this is as expected. However, right after that, this happens:
DEV 2024-05-28T10:51:51.941 [hz.20240528-105058-REvOR.IO.thread-in-0] INFO c.h.internal.server.tcp.TcpServerConnection - [169.254.19.191]:8220 [testCluster] [5.1.5] Connection[id=1, /169.254.19.191:8220->/169.254.19.191:59494, qualifier=null, endpoint=[169.254.19.191]:8221, remoteUuid=a530f505-b13e-4394-869b-9718f1fa36fa, alive=false, connectionType=MEMBER, planeIndex=0] closed. Reason: Connection closed by the other side
DEV 2024-05-28T10:51:51.948 [hz.20240528-105058-REvOR.cached.thread-11] INFO c.h.internal.server.tcp.TcpServerConnector - [169.254.19.191]:8220 [testCluster] [5.1.5] Connecting to /169.254.19.191:8221, timeout: 10000, bind-any: true
DEV 2024-05-28T10:51:53.972 [hz.20240528-105058-REvOR.cached.thread-11] INFO c.h.internal.server.tcp.TcpServerConnector - [169.254.19.191]:8220 [testCluster] [5.1.5] Could not connect to: /169.254.19.191:8221. Reason: IOException[Connection refused: no further information to address /169.254.19.191:8221]
DEV 2024-05-28T10:51:54.073 [hz.20240528-105058-REvOR.cached.thread-11] INFO c.h.internal.server.tcp.TcpServerConnector - [169.254.19.191]:8220 [testCluster] [5.1.5] Connecting to /169.254.19.191:8221, timeout: 10000, bind-any: true
DEV 2024-05-28T10:51:56.106 [hz.20240528-105058-REvOR.cached.thread-11] INFO c.h.internal.server.tcp.TcpServerConnector - [169.254.19.191]:8220 [testCluster] [5.1.5] Could not connect to: /169.254.19.191:8221. Reason: IOException[Connection refused: no further information to address /169.254.19.191:8221]
DEV 2024-05-28T10:51:56.208 [hz.20240528-105058-REvOR.cached.thread-14] INFO c.h.internal.server.tcp.TcpServerConnector - [169.254.19.191]:8220 [testCluster] [5.1.5] Connecting to /169.254.19.191:8221, timeout: 10000, bind-any: true
DEV 2024-05-28T10:51:58.241 [hz.20240528-105058-REvOR.cached.thread-14] INFO c.h.internal.server.tcp.TcpServerConnector - [169.254.19.191]:8220 [testCluster] [5.1.5] Could not connect to: /169.254.19.191:8221. Reason: IOException[Connection refused: no further information to address /169.254.19.191:8221]
DEV 2024-05-28T10:51:58.241 [hz.20240528-105058-REvOR.cached.thread-14] WARN c.h.i.s.tcp.TcpServerConnectionErrorHandler - [169.254.19.191]:8220 [testCluster] [5.1.5] Removing connection to endpoint [169.254.19.191]:8221 Cause => java.io.IOException {Connection refused: no further information to address /169.254.19.191:8221}, Error-Count: 5
DEV 2024-05-28T10:51:58.242 [hz.20240528-105058-REvOR.cached.thread-14] INFO c.h.internal.cluster.impl.MembershipManager - [169.254.19.191]:8220 [testCluster] [5.1.5] Removing Member [169.254.19.191]:8221 - a530f505-b13e-4394-869b-9718f1fa36fa
In other words, member B tries to reconnect to member A, and during this time frame (about 6.3 seconds in the logs above) IMap operations hang.
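
The 6.3-second figure can be checked against the timestamps in the logs above: the window runs from the "Connection ... closed" entry (10:51:51.941) to the "Removing Member" entry (10:51:58.242). A minimal JDK-only sketch of that arithmetic follows; the class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;

class ReconnectWindow {

    // Returns the milliseconds elapsed between two ISO-8601 local timestamps,
    // which is the timestamp format used in the log lines above.
    static long windowMillis(String start, String end) {
        return Duration.between(LocalDateTime.parse(start), LocalDateTime.parse(end)).toMillis();
    }

    public static void main(String[] args) {
        // timestamp of the "Connection ... closed" entry on member B
        String connectionClosed = "2024-05-28T10:51:51.941";
        // timestamp of the "Removing Member" entry on member B
        String memberRemoved = "2024-05-28T10:51:58.242";
        System.out.println(windowMillis(connectionClosed, memberRemoved) + " ms"); // prints "6301 ms"
    }
}
```

The same helper can be pointed at any pair of log timestamps, for example to compare the window across Hazelcast versions.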

This especially hurts in our production environment, where we use a rolling update mechanism and cannot serve requests during this time window.

federicasuriano added the Type: Defect label Apr 24, 2024