
fix: always allow selecting a new node for cluster mode subscriptions when the current one fails #1589

Merged
merged 4 commits into redis:main on Jul 16, 2022

Conversation

headlessme
Contributor

We ran into an issue where some of our Redis clients would miss messages from the pub/sub bus following a network issue / crash in the Redis cluster. This PR changes the reconnect logic in the ClusterSubscriber so that it can pick a different node to connect to when the current connection closes. The idea is that if a single node is having issues, the subscriber can immediately connect to another member of the cluster and miss fewer messages.
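
A minimal sketch of the idea, with illustrative stand-ins for the ClusterSubscriber internals (`subscriber` and `selectSubscriber` are placeholders, not the actual source):

```ts
import type { Redis } from "ioredis";

// Illustrative stand-ins for ClusterSubscriber internals, not the real code.
declare const subscriber: Redis;
declare function selectSubscriber(): void;

// The gist of the change: react to the subscriber connection's own "close"
// event, so a new node can be chosen even when the connection pool never
// reports the old node as removed (i.e. when no "-node" event ever fires).
subscriber.once("close", () => {
  selectSubscriber(); // may pick any live cluster node, not just the failed one
});
```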

@headlessme
Contributor Author

@luin any thoughts on this fix?

@luin
Collaborator

luin commented Jun 12, 2022

Hey @headlessme ,

Thanks for the PR. I think the current code already handles subscriber re-selecting: when a node is lost (either because it is down or because of network issues) and it happens to be the subscriber, we perform re-selecting: https://github.com/luin/ioredis/blob/main/lib/cluster/ClusterSubscriber.ts#L22. Is there anything that isn't working, so that we need to catch it as this PR does?

@headlessme
Contributor Author

headlessme commented Jun 12, 2022

@luin Thanks for looking into this. Yes, I believe there is a case that this doesn't catch, which we have seen in production. It happens when a node dies without a clean shutdown (e.g. via docker kill or a hardware failure) and then fails to come back up for whatever reason. The connection never gets an end event because of the default retry function on the connection, so no -node event happens in the connection pool. The ClusterSubscriber therefore keeps trying to reconnect to only the failed node (which isn't coming back up). During this time all published messages are missed, even though the cluster is still operational, just missing one node. This PR handles this case by always allowing selection of a new node on any connection failure, not just end (-node) events.

The existing logic works in the case of a clean shutdown, where the connection is closed manually: the end event fires on the connection in that case, which triggers the -node event as expected. The case we're seeing only involves unexpected connection failures.
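
For context, ioredis's documented default retryStrategy looks like the sketch below. Because it always returns a delay, the connection never gives up, so no end event fires and the pool never sees a -node event; the growing 50 ms / 100 ms / 150 ms delays are visible in the logs further down.

```ts
// ioredis's documented default retry strategy: retry forever, backing off by
// 50 ms per attempt, capped at 2 seconds. A numeric return value means
// "reconnect after this many milliseconds"; only a non-number return stops
// retrying and lets the connection emit "end".
function defaultRetryStrategy(times: number): number {
  return Math.min(times * 50, 2000);
}
```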

We managed to recreate it consistently as follows:

  1. Run a 6-node Redis cluster using the Bitnami cluster compose file (3 masters).
  2. Set up one client to publish messages periodically and another to subscribe to those messages (see the sketch after this list).
  3. Kill the node that the ClusterSubscriber picked in a non-clean way (docker compose kill worked for us).
  4. The subscriber stops receiving messages and gets stuck in a loop trying to reconnect to the dead node.
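
A sketch of the two clients from step 2 (the hostname and channel match the setup described later in this thread; everything else is illustrative):

```ts
import { Cluster } from "ioredis";

// Publisher: sends an incrementing counter on the "hello" channel each second.
const publisher = new Cluster([{ host: "redis-node-0", port: 6379 }]);
let counter = 0;
setInterval(() => {
  counter += 1;
  console.log("publish", counter);
  publisher.publish("hello", String(counter));
}, 1000);

// Subscriber: logs every message it receives. Once the subscriber's node is
// killed uncleanly, these logs stop while the "publish" logs keep going.
const subscriber = new Cluster([{ host: "redis-node-0", port: 6379 }]);
subscriber.subscribe("hello");
subscriber.on("message", (_channel, message) => console.log(message));
```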

Another solution might be to define a more appropriate retry strategy for the subscriber connection, one that fails quickly so the end event triggers; however, this would still introduce an extra delay to re-subscription in this particular case, if it retries at all.
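
That alternative might look roughly like this (a sketch, not the shipped fix; host and port are illustrative):

```ts
import { Redis } from "ioredis";

// A subscriber connection whose retry strategy gives up immediately:
// returning null tells ioredis not to reconnect, so "end" fires right after
// "close" and re-selection can start without waiting out a back-off schedule.
const subscriber = new Redis({
  host: "redis-node-0",
  port: 6379,
  retryStrategy: () => null,
});
```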

@luin
Collaborator

luin commented Jun 12, 2022

@headlessme Thanks for the details!

I just tested locally with Redis's built-in create-cluster script and everything works for me. I killed the subscriber node with kill {PID}, so I'm not sure whether that counts as the non-clean way.

Reconnection is disabled for all cluster nodes: https://github.com/luin/ioredis/blob/main/lib/cluster/ConnectionPool.ts#L75, so although the subscriber may try to reconnect, the pooled connection for the node will emit a close event and then immediately an end event.

So it seems to me that if this.subscriber.once("close", ...) gets called, the corresponding node in the cluster pool should always emit an end event, which triggers subscriber re-selecting.

@headlessme
Contributor Author

Reconnection is disabled on the cluster connections, but the ClusterSubscriber uses its own connection; it does not reuse the pooled connection:

https://github.com/luin/ioredis/blob/975336df7b92304533dd7804c382a5e25b8bbd17/lib/cluster/ClusterSubscriber.ts#L91-L100

@headlessme
Contributor Author

@luin
Copy link
Collaborator

luin commented Jun 12, 2022

Yeah, I realized that, but we have a listener for the -node event: https://github.com/luin/ioredis/blob/main/lib/cluster/ClusterSubscriber.ts#L23-L24.

So here is how it works:

  1. Say we have a 6-node cluster, 3 masters (M1, M2, M3), and 3 slaves.
  2. M1 is selected as the subscriber. So there are 7 connections at the moment, 6 for nodes in the pool, 1 for the subscriber.
  3. If we kill M1, both the subscriber and the connection to M1 in the pool emit a "close" event. The subscriber will try to reconnect, whereas the connection to M1 in the pool emits an "end" event immediately. The pool listens for the "end" event and emits a "-node" event, which triggers the subscriber manager to re-select a new subscriber.
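
A compact sketch of that intended flow (the event names are ioredis's; the wiring below is illustrative, not the actual ConnectionPool/ClusterSubscriber source):

```ts
import { EventEmitter } from "events";

// Stand-in for the connection pool.
const pool = new EventEmitter();

// With reconnection disabled, a pool connection that loses its socket emits
// "close" and then immediately "end" (no retry in between). The pool turns
// that "end" into a "-node" event:
function watchPoolConnection(conn: EventEmitter, key: string): void {
  conn.once("end", () => pool.emit("-node", key));
}

// ClusterSubscriber listens for "-node" and re-selects a subscriber node:
pool.on("-node", () => {
  // selectSubscriber(): pick a new node for the dedicated subscriber connection
});
```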

@headlessme
Contributor Author

Here are some logs with DEBUG=ioredis:* from when the node hosting the subscriber is killed. I'm publishing a counter on the hello channel from one client; the subscriber client just prints it out. You can see that at the start all is well: each publish is received and logged by the subscriber. Then I kill the node, and from then onwards only the publishes are logged.

publish 553
2022-06-12T12:19:01.114Z ioredis:dataHandler receive reply "message" in subscriber mode
553
publish 554
554
2022-06-12T12:19:02.102Z ioredis:dataHandler receive reply "message" in subscriber mode
publish 555
2022-06-12T12:19:03.102Z ioredis:dataHandler receive reply "message" in subscriber mode
555
publish 556
2022-06-12T12:19:04.103Z ioredis:dataHandler receive reply "message" in subscriber mode
556
1:S 12 Jun 2022 12:19:04.869 # Connection with master lost.
1:S 12 Jun 2022 12:19:04.869 * Caching the disconnected master state.
1:S 12 Jun 2022 12:19:04.869 * Reconnecting to MASTER 172.19.0.6:6379
1:S 12 Jun 2022 12:19:04.869 * MASTER <-> REPLICA sync started
1:S 12 Jun 2022 12:19:04.870 # Error condition on socket for SYNC: Connection refused
redis node removed undefined
2022-06-12T12:19:04.872Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: ready -> close
2022-06-12T12:19:04.872Z ioredis:connection reconnect in 50ms
2022-06-12T12:19:04.873Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: close -> reconnecting
2022-06-12T12:19:04.924Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: reconnecting -> connecting
2022-06-12T12:19:04.928Z ioredis:connection error: Error: connect ECONNREFUSED 172.19.0.6:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1187:16) {
  errno: -111,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '172.19.0.6',
  port: 6379
}
2022-06-12T12:19:04.931Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: connecting -> close
2022-06-12T12:19:04.931Z ioredis:connection reconnect in 100ms
2022-06-12T12:19:04.931Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: close -> reconnecting
2022-06-12T12:19:05.031Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: reconnecting -> connecting
publish 557
redis-node-0_1 exited with code 137
1:S 12 Jun 2022 12:19:05.645 * Connecting to MASTER 172.19.0.6:6379
1:S 12 Jun 2022 12:19:05.645 * MASTER <-> REPLICA sync started
publish 558
publish 559
publish 560
publish 561
publish 562
publish 563
publish 564
publish 565
publish 566
2022-06-12T12:19:15.033Z ioredis:connection error: Error: connect ETIMEDOUT
    at Socket.<anonymous> (/app/node_modules/ioredis/built/Redis.js:168:41)
    at Object.onceWrapper (node:events:641:28)
    at Socket.emit (node:events:527:28)
    at Socket._onTimeout (node:net:516:8)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}
2022-06-12T12:19:15.034Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: connecting -> close
2022-06-12T12:19:15.034Z ioredis:connection reconnect in 150ms
2022-06-12T12:19:15.034Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: close -> reconnecting
publish 567
2022-06-12T12:19:15.185Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: reconnecting -> connecting
2022-06-12T12:19:15.245Z ioredis:connection error: Error: connect EHOSTUNREACH 172.19.0.6:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1187:16) {
  errno: -113,
  code: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '172.19.0.6',
  port: 6379
}
2022-06-12T12:19:15.246Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: connecting -> close
2022-06-12T12:19:15.246Z ioredis:connection reconnect in 200ms
2022-06-12T12:19:15.246Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: close -> reconnecting
2022-06-12T12:19:15.447Z ioredis:redis status[172.19.0.6:6379 (ioredis-cluster(subscriber))]: reconnecting -> connecting
publish 568
publish 569
publish 570

No matter how long I leave it, the subscriber never recovers, so it seems there's definitely a bug here somewhere, despite 5 of the 6 nodes still being alive.

@headlessme
Contributor Author

headlessme commented Jun 12, 2022

Could this be caused by the use of lazyConnect on the cluster connections? That would mean the -node event would never trigger if no command was ever issued on that connection, since it would still be in wait status.

https://github.com/luin/ioredis/blob/975336df7b92304533dd7804c382a5e25b8bbd17/lib/cluster/ConnectionPool.ts#L84
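
A small illustration of that hypothesis (the hostname is illustrative):

```ts
import { Redis } from "ioredis";

// With lazyConnect, no socket is opened until the first command is sent.
const pooled = new Redis({ host: "redis-node-1", port: 6379, lazyConnect: true });
console.log(pooled.status); // "wait": no underlying socket yet

// Because there is no socket, the remote node dying can never surface as a
// "close"/"end" event on this connection, so the pool never emits "-node"
// for it, and the subscriber is never told to re-select.
```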

@luin
Collaborator

luin commented Jun 12, 2022

> Could this be caused by the use of lazyConnect on the cluster connections?

Hmm yeah, that could be the reason. Nice catch!

I don't recall why we rely on the -node event. It seems we can just disable reconnection for the subscriber as well and perform re-selecting whenever it disconnects. That way we don't need to listen to -node at all. Does that make sense?
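
A hedged sketch of that proposal, not the merged implementation (`node` and `selectSubscriber` are illustrative stand-ins):

```ts
import { Redis } from "ioredis";

declare function selectSubscriber(): void; // stand-in for the re-selection step

function connectSubscriber(node: { host: string; port: number }): Redis {
  const subscriber = new Redis({
    host: node.host,
    port: node.port,
    retryStrategy: () => null, // never auto-reconnect: "end" fires right away
  });
  subscriber.once("end", () => {
    // The connection is gone for good, so pick a fresh node from the pool.
    selectSubscriber();
  });
  return subscriber;
}
```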

@headlessme
Contributor Author

Yeah, sounds like that should work. I'll try it out!

@headlessme
Contributor Author

Seems to work. One thing to note when removing the -node listener: previously the cluster subscriber would also reconnect when the startup nodes resolve to IP addresses (e.g. in my test setup I use the hostname redis-node-0 as the only startup node, which then gets replaced by the IP addresses of the other nodes after discovery). I'm not sure whether this matters in any context; if so, we should retain the -node listener.

@luin
Collaborator

luin commented Jun 12, 2022

Yeah, I think we should keep the -node listener then.

@headlessme
Contributor Author

@luin I've updated the PR. Does this match what you were expecting from our discussion?

@luin
Collaborator

luin commented Jun 13, 2022

It does! I'll double-check it later this week.

@luin
Collaborator

luin commented Jun 15, 2022

Tests failed. Pretty strange. Haven't found out the cause.

@headlessme
Contributor Author

Any idea if the failures are specifically related to these changes?

@luin
Collaborator

luin commented Jun 21, 2022

> Any idea if the failures are specifically related to these changes?

I'm not sure, but I don't think the changes break the functionality covered by the failing tests. However, we have to find the cause to make sure.

@luin
Collaborator

luin commented Jun 25, 2022

After some debugging, it turns out the new code creates an infinite loop, which caused the tests to fail.

The breakdown:

  1. this.selectSubscriber() is called.
  2. this.selectSubscriber() calls this.subscriber.disconnect().
  3. this.subscriber.disconnect() triggers the this.subscriber.once("end", () => { ... }) listener that was subscribed in this.selectSubscriber().
  4. The if (currentSub === this.subscriber) { check passes, so this.selectSubscriber() is called again.

I think a quick fix is to unset this.subscriber before calling this.subscriber.disconnect(), but I'm not sure whether other cases can cause the loop.
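
A hedged sketch of that quick fix (not the final code):

```ts
import type { Redis } from "ioredis";

class SubscriberManager {
  private subscriber: Redis | null = null;

  selectSubscriber(): void {
    const lastSubscriber = this.subscriber;
    // Unset the reference *before* disconnecting: the old connection's "end"
    // handler compares its captured currentSub against this.subscriber, so
    // the check from step 4 now fails and selectSubscriber() is not re-entered.
    this.subscriber = null;
    lastSubscriber?.disconnect();
    // ...then pick a new node and assign the fresh connection here...
  }
}
```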

@headlessme
Contributor Author

@luin I've changed the logic to remove the end listener on intentional disconnects, which should be more robust.

Looks like the tests were failing because of some missing disconnect() calls in the test implementation, which were leaving zombie Cluster instances around that, with these changes, kept trying to reconnect on end. Adding the appropriate disconnect() calls has fixed the tests locally for me. Could you re-run them through CI to verify?
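
A hedged sketch of both parts of that change (all names here are illustrative stand-ins, not the actual source):

```ts
import type { Cluster, Redis } from "ioredis";

declare const subscriber: Redis;
declare function onSubscriberEnd(): void;
declare const cluster: Cluster;
declare function afterEach(fn: () => void): void;

// Detach the "end" handler before an *intentional* disconnect, so node
// re-selection only runs when the connection is lost unexpectedly.
subscriber.removeListener("end", onSubscriberEnd);
subscriber.disconnect();

// Test-side fix: disconnect every Cluster instance so no zombie clients keep
// reconnecting (and re-selecting) after a test finishes.
afterEach(() => cluster.disconnect());
```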

Collaborator

@luin luin left a comment

LGTM. Thanks!

@luin luin merged commit 1c8cb85 into redis:main Jul 16, 2022
github-actions bot pushed a commit that referenced this pull request Jul 16, 2022
## [5.2.1](v5.2.0...v5.2.1) (2022-07-16)

### Bug Fixes

* always allow selecting a new node for cluster mode subscriptions when the current one fails ([#1589](#1589)) ([1c8cb85](1c8cb85))
@github-actions

🎉 This PR is included in version 5.2.1 🎉

The release is available on:

Your semantic-release bot 📦🚀
