
fix: always allow selecting a new node for cluster mode subscriptions when the current one fails #1589

Merged
merged 4 commits into redis:main on Jul 16, 2022

Conversation

headlessme
Contributor

We ran into an issue where some of our Redis clients would miss messages from the pub/sub bus following a network issue / crash in the Redis cluster. This PR changes the reconnect logic in the ClusterSubscriber so that it can pick a different node to connect to when the current connection closes. The idea is that if a single node is having issues, the subscriber can immediately connect to another member of the cluster and miss fewer messages.
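
A minimal sketch of the idea, with illustrative stand-ins for the ClusterSubscriber internals (`subscriber` and `selectSubscriber` are placeholders, not the actual source):

```ts
import type { Redis } from "ioredis";

// Illustrative stand-ins for ClusterSubscriber internals, not the real code.
declare const subscriber: Redis;
declare function selectSubscriber(): void;

// The gist of the change: react to the subscriber connection's own "close"
// event, so a new node can be chosen even when the connection pool never
// reports the old node as removed (i.e. when no "-node" event ever fires).
subscriber.once("close", () => {
  selectSubscriber(); // may pick any live cluster node, not just the failed one
});
```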

@headlessme
Contributor Author

@luin any thoughts on this fix?

@luin
Collaborator

luin commented Jun 12, 2022

Hey @headlessme ,

Thanks for the PR. I think the current code already handles subscriber re-selecting: when a node is lost (either because it is down or because of network issues) and it happens to be the subscriber, we perform re-selecting: https://github.com/luin/ioredis/blob/main/lib/cluster/ClusterSubscriber.ts#L22. Is there anything that isn't working, so that we need to catch it as this PR does?

@headlessme
Contributor Author

headlessme commented Jun 12, 2022

@luin Thanks for looking into this. Yes, I believe there is a case that this doesn't catch, which we have seen in production. It happens when a node dies without a clean shutdown (e.g. via docker kill or a hardware failure) and then fails to come back up for whatever reason. The connection never gets an end event because of the default retry function on the connection, so no -node event happens in the connection pool. The ClusterSubscriber therefore keeps trying to reconnect to only the failed node (which isn't coming back up). During this time all published messages are missed, even though the cluster is still operational, just missing one node. This PR handles this case by always allowing selection of a new node on any connection failure, not just end (-node) events.

The existing logic works in the case of a clean shutdown, where the connection is closed manually: the end event fires on the connection in that case, which triggers the -node event as expected. The case we're seeing only involves unexpected connection failures.
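
For context, ioredis's documented default retryStrategy looks like the sketch below. Because it always returns a delay, the connection never gives up, so no end event fires and the pool never sees a -node event; the growing 50 ms / 100 ms / 150 ms delays are visible in the logs further down.

```ts
// ioredis's documented default retry strategy: retry forever, backing off by
// 50 ms per attempt, capped at 2 seconds. A numeric return value means
// "reconnect after this many milliseconds"; only a non-number return stops
// retrying and lets the connection emit "end".
function defaultRetryStrategy(times: number): number {
  return Math.min(times * 50, 2000);
}
```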

We managed to recreate it consistently as follows:

  1. Run a 6-node Redis cluster using the Bitnami cluster compose file (3 masters).
  2. Set up one client to publish messages periodically and another to subscribe to those messages (see the sketch after this list).
  3. Kill the node that the ClusterSubscriber picked in a non-clean way (docker compose kill worked for us).
  4. The subscriber stops receiving messages and gets stuck in a loop trying to reconnect to the dead node.
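
A sketch of the two clients from step 2 (the hostname and channel match the setup described later in this thread; everything else is illustrative):

```ts
import { Cluster } from "ioredis";

// Publisher: sends an incrementing counter on the "hello" channel each second.
const publisher = new Cluster([{ host: "redis-node-0", port: 6379 }]);
let counter = 0;
setInterval(() => {
  counter += 1;
  console.log("publish", counter);
  publisher.publish("hello", String(counter));
}, 1000);

// Subscriber: logs every message it receives. Once the subscriber's node is
// killed uncleanly, these logs stop while the "publish" logs keep going.
const subscriber = new Cluster([{ host: "redis-node-0", port: 6379 }]);
subscriber.subscribe("hello");
subscriber.on("message", (_channel, message) => console.log(message));
```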

Another solution might be to define a more appropriate retry strategy for the subscriber connection, one that fails quickly so the end event triggers; however, this would still introduce an extra delay to re-subscription in this particular case, if it retries at all.
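
That alternative might look roughly like this (a sketch, not the shipped fix; host and port are illustrative):

```ts
import { Redis } from "ioredis";

// A subscriber connection whose retry strategy gives up immediately:
// returning null tells ioredis not to reconnect, so "end" fires right after
// "close" and re-selection can start without waiting out a back-off schedule.
const subscriber = new Redis({
  host: "redis-node-0",
  port: 6379,
  retryStrategy: () => null,
});
```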

@luin
Collaborator

luin commented Jun 12, 2022

@headlessme Thanks for the details!

I just tested locally with Redis's built-in create-cluster script and everything works for me. I killed the subscriber node with kill {PID}, so I'm not sure whether that counts as the non-clean way.

Reconnection is disabled for all cluster nodes: https://github.com/luin/ioredis/blob/main/lib/cluster/ConnectionPool.ts#L75, so although the subscriber may try to reconnect, the pooled connection for the node will emit a close event and then immediately an end event.

So it seems to me that if this.subscriber.once("close", ...) gets called, the corresponding node in the cluster pool should always emit an end event, which triggers subscriber re-selecting.

@headlessme
Contributor Author

Reconnection is disabled on the cluster connections, but the ClusterSubscriber uses its own connection; it does not reuse the pooled connection:

https://github.com/luin/ioredis/blob/975336df7b92304533dd7804c382a5e25b8bbd17/lib/cluster/ClusterSubscriber.ts#L91-L100

@headlessme
Contributor Author

@luin
Copy link
Collaborator

luin commented Jun 12, 2022

Yeah, I realized that, but we have a listener for the -node event: https://github.com/luin/ioredis/blob/main/lib/cluster/ClusterSubscriber.ts#L23-L24.

So here is how it works:

  1. Say we have a 6-node cluster, 3 masters (M1, M2, M3), and 3 slaves.
  2. M1 is selected as the subscriber. So there are 7 connections at the moment, 6 for nodes in the pool, 1 for the subscriber.
  3. If we kill M1, both the subscriber and the connection to M1 in the pool emit a "close" event. The subscriber will try to reconnect, whereas the connection to M1 in the pool emits an "end" event immediately. The pool listens for the "end" event and emits a "-node" event, which triggers the subscriber manager to re-select a new subscriber.
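
A compact sketch of that intended flow (the event names are ioredis's; the wiring below is illustrative, not the actual ConnectionPool/ClusterSubscriber source):

```ts
import { EventEmitter } from "events";

// Stand-in for the connection pool.
const pool = new EventEmitter();

// With reconnection disabled, a pool connection that loses its socket emits
// "close" and then immediately "end" (no retry in between). The pool turns
// that "end" into a "-node" event:
function watchPoolConnection(conn: EventEmitter, key: string): void {
  conn.once("end", () => pool.emit("-node", key));
}

// ClusterSubscriber listens for "-node" and re-selects a subscriber node:
pool.on("-node", () => {
  // selectSubscriber(): pick a new node for the dedicated subscriber connection
});
```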

@headlessme
Contributor Author

Here are some logs with DEBUG=ioredis:* from when the node hosting the subscriber is killed. I'm publishing a counter on the hello channel from one client; the subscriber client just prints it out. You can see that at the start all is well: each publish is received and logged by the subscriber. Then I kill the node, and from then onwards only the publishes are logged.

publish 553
2022-06-12T12:19:01.114Z ioredis:dataHandler receive reply "message" in subscriber mode
553
publish 554
554
2022-06-12T12:19:02.102Z ioredis:dataHandler receive reply "message" in subscriber mode
publish 555
2022-06-12T12:19:03.102Z ioredis:dataHandler receive reply "message" in subscriber mode
555
publish 556
2022-06-12T12:19:04.103Z ioredis:dataHandler receive reply "message" in subscriber mode
556
1:S 12 Jun 2022 12:19:04.869 # Connection with master lost.
1:S 12 Jun 2022 12:19:04.869 * Caching the disconnected master state.
1:S 12 Jun 2022 12:19:04.869 * Reconnecting to MASTER 172.19.0.6:6379
1:S 12 Jun 2022 12:19:04.869 * MASTER <-> REPLICA sync started
1:S 12 Jun 2022 12:19:04.870 # Error condition on socket for SYNC: Connection refused
redis node removed undefined
2022-06-12T12:19:04.872Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: ready -> close
2022-06-12T12:19:04.872Z ioredis:connection reconnect in 50ms
2022-06-12T12:19:04.873Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: close -> reconnecting
2022-06-12T12:19:04.924Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: reconnecting -> connecting
2022-06-12T12:19:04.928Z ioredis:connection error: Error: connect ECONNREFUSED 172.19.0.6:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1187:16) {
  errno: -111,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '172.19.0.6',
  port: 6379
}
2022-06-12T12:19:04.931Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: connecting -> close
2022-06-12T12:19:04.931Z ioredis:connection reconnect in 100ms
2022-06-12T12:19:04.931Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: close -> reconnecting
2022-06-12T12:19:05.031Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: reconnecting -> connecting
publish 557
redis-node-0_1 exited with code 137
1:S 12 Jun 2022 12:19:05.645 * Connecting to MASTER 172.19.0.6:6379
1:S 12 Jun 2022 12:19:05.645 * MASTER <-> REPLICA sync started
publish 558
publish 559
publish 560
publish 561
publish 562
publish 563
publish 564
publish 565
publish 566
2022-06-12T12:19:15.033Z ioredis:connection error: Error: connect ETIMEDOUT
    at Socket.<anonymous> (/app/node_modules/ioredis/built/Redis.js:168:41)
    at Object.onceWrapper (node:events:641:28)
    at Socket.emit (node:events:527:28)
    at Socket._onTimeout (node:net:516:8)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}
2022-06-12T12:19:15.034Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: connecting -> close
2022-06-12T12:19:15.034Z ioredis:connection reconnect in 150ms
2022-06-12T12:19:15.034Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: close -> reconnecting
publish 567
2022-06-12T12:19:15.185Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: reconnecting -> connecting
2022-06-12T12:19:15.245Z ioredis:connection error: Error: connect EHOSTUNREACH 172.19.0.6:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1187:16) {
  errno: -113,
  code: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '172.19.0.6',
  port: 6379
}
2022-06-12T12:19:15.246Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: connecting -> close
2022-06-12T12:19:15.246Z ioredis:connection reconnect in 200ms
2022-06-12T12:19:15.246Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: close -> reconnecting
2022-06-12T12:19:15.447Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: reconnecting -> connecting
publish 568
publish 569
publish 570

No matter how long I leave it, the subscriber never recovers, so it seems there's definitely a bug here somewhere, despite 5 of the 6 nodes still being alive.

@headlessme
Contributor Author

headlessme commented Jun 12, 2022

Could this be caused by the use of lazyConnect on the cluster connections? That would mean the -node event would never trigger if no command was ever issued on that connection, since it would still be in wait status.

https://github.com/luin/ioredis/blob/975336df7b92304533dd7804c382a5e25b8bbd17/lib/cluster/ConnectionPool.ts#L84
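
A small illustration of that hypothesis (the hostname is illustrative):

```ts
import { Redis } from "ioredis";

// With lazyConnect, no socket is opened until the first command is sent.
const pooled = new Redis({ host: "redis-node-1", port: 6379, lazyConnect: true });
console.log(pooled.status); // "wait": no underlying socket yet

// Because there is no socket, the remote node dying can never surface as a
// "close"/"end" event on this connection, so the pool never emits "-node"
// for it, and the subscriber is never told to re-select.
```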

@luin
Collaborator

luin commented Jun 12, 2022

> Could this be caused by the use of lazyConnect on the cluster connections?

Hmm yeah, that could be the reason. Nice catch!

I don't recall why we rely on the -node event. It seems we can just disable reconnection for the subscriber as well and perform re-selecting whenever it disconnects. That way we don't need to listen to -node at all. Does that make sense?
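
A hedged sketch of that proposal, not the merged implementation (`node` and `selectSubscriber` are illustrative stand-ins):

```ts
import { Redis } from "ioredis";

declare function selectSubscriber(): void; // stand-in for the re-selection step

function connectSubscriber(node: { host: string; port: number }): Redis {
  const subscriber = new Redis({
    host: node.host,
    port: node.port,
    retryStrategy: () => null, // never auto-reconnect: "end" fires right away
  });
  subscriber.once("end", () => {
    // The connection is gone for good, so pick a fresh node from the pool.
    selectSubscriber();
  });
  return subscriber;
}
```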

@headlessme
Contributor Author

Yeah, sounds like that should work. I'll try it out!

@headlessme
Contributor Author

Seems to work. One thing to note when removing the -node listener: previously the cluster subscriber would also reconnect when the startup nodes resolve to IP addresses (e.g. in my test setup I use the hostname redis-node-0 as the only startup node, which then gets replaced by the IP addresses of the other nodes after discovery). I'm not sure whether this matters in any context; if so, we should retain the -node listener.

@luin
Collaborator

luin commented Jun 12, 2022

Yeah, I think we should keep the -node listener then.

@headlessme
Contributor Author

@luin I've updated the PR. Does this match what you were expecting from our discussion?

@luin
Collaborator

luin commented Jun 13, 2022

It does! I'll double-check it later this week.

@luin
Collaborator

luin commented Jun 15, 2022

Tests failed. Pretty strange. Haven't found out the cause.

@headlessme
Contributor Author

Any idea if the failures are specifically related to these changes?

@luin
Collaborator

luin commented Jun 21, 2022

> Any idea if the failures are specifically related to these changes?

I'm not sure, but I don't think the changes break the functionality covered by the failing tests. However, we have to find the cause to make sure.

@luin
Collaborator

luin commented Jun 25, 2022

After some debugging, it turns out the new code creates an infinite loop, which caused the tests to fail.

The breakdown:

  1. this.selectSubscriber() is called.
  2. this.selectSubscriber() calls this.subscriber.disconnect().
  3. this.subscriber.disconnect() triggers the this.subscriber.once("end", () => { ... }) listener that was subscribed in this.selectSubscriber().
  4. The if (currentSub === this.subscriber) { check passes, so this.selectSubscriber() is called again.

I think a quick fix is to unset this.subscriber before calling this.subscriber.disconnect(), but I'm not sure whether other cases can cause the loop.
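
A hedged sketch of that quick fix (not the final code):

```ts
import type { Redis } from "ioredis";

class SubscriberManager {
  private subscriber: Redis | null = null;

  selectSubscriber(): void {
    const lastSubscriber = this.subscriber;
    // Unset the reference *before* disconnecting: the old connection's "end"
    // handler compares its captured currentSub against this.subscriber, so
    // the check from step 4 now fails and selectSubscriber() is not re-entered.
    this.subscriber = null;
    lastSubscriber?.disconnect();
    // ...then pick a new node and assign the fresh connection here...
  }
}
```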

@headlessme
Contributor Author

@luin I've changed the logic to remove the end listener on intentional disconnects, which should be more robust.

Looks like the tests were failing because of some missing disconnect() calls in the test implementation, which were leaving zombie Cluster instances around that, with these changes, kept trying to reconnect on end. Adding the appropriate disconnect() calls has fixed the tests locally for me. Could you re-run them through CI to verify?
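
A hedged sketch of both parts of that change (all names here are illustrative stand-ins, not the actual source):

```ts
import type { Cluster, Redis } from "ioredis";

declare const subscriber: Redis;
declare function onSubscriberEnd(): void;
declare const cluster: Cluster;
declare function afterEach(fn: () => void): void;

// Detach the "end" handler before an *intentional* disconnect, so node
// re-selection only runs when the connection is lost unexpectedly.
subscriber.removeListener("end", onSubscriberEnd);
subscriber.disconnect();

// Test-side fix: disconnect every Cluster instance so no zombie clients keep
// reconnecting (and re-selecting) after a test finishes.
afterEach(() => cluster.disconnect());
```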

Collaborator

@luin luin left a comment

LGTM. Thanks!

@luin luin merged commit 1c8cb85 into redis:main Jul 16, 2022
github-actions bot pushed a commit that referenced this pull request Jul 16, 2022
## [5.2.1](v5.2.0...v5.2.1) (2022-07-16)

### Bug Fixes

* always allow selecting a new node for cluster mode subscriptions when the current one fails ([#1589](#1589)) ([1c8cb85](1c8cb85))
@github-actions

🎉 This PR is included in version 5.2.1 🎉

The release is available on:

Your semantic-release bot 📦🚀
