-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: always allow selecting a new node for cluster mode subscriptions when the current one fails #1589
Conversation
… when the current one fails
@luin any thoughts on this fix? |
Hey @headlessme , Thanks for the PR. I think the current code already handles subscriber reselecting: When a node is lost (either it is down or there are network issues), and it happens to be the subscriber, we will perform reselecting: https://github.com/luin/ioredis/blob/main/lib/cluster/ClusterSubscriber.ts#L22. Is there anything not working so we need to catch it as this PR does? |
@luin Thanks for looking into this. Yes I believe there is a case that this doesn't catch that we have seen in production. This happens when a node dies without a clean shutdown (e.g. via docker kill / hardware failure) and then the node fails to come back up for whatever reason. The connection won't get an The existing logic works in the case of a clean shutdown, where the connection is manually closed as the We managed to recreate it consistently as follows:
Another solution might be to define a more appropriate retry strategy for the subscriber connection, which fails quickly so the |
@headlessme Thanks for the details! I just tested locally with Redis's built-in create-cluster script and everything works for me. I kill the subscriber node with Reconnection is disabled for all cluster nodes: https://github.com/luin/ioredis/blob/main/lib/cluster/ConnectionPool.ts#L75, so although the subscriber may try to reconnect, the node will emit a close event and then immediately an end event. So seems to me, that if |
The reconnection is disabled on the cluster connections, but the |
Also note the comment above the connection constructor: |
Yeah, I realized that, but we have a listener for the So here is how it works:
|
Here's some logs with
No matter how long I leave it, the subscriber never recovers, so there's definitely a bug here somewhere it seems, despite 5 of the 6 nodes still being alive. |
Seems it could be caused by the use of |
Hmm yeah, that could be the reason. Nice catch! I don't recall why we rely on |
Yea, sounds like that should work. I'll try it out! |
Seems to work, although one thing to note when removing the |
Yeah I think we should keep the |
@luin I've updated the PR – Does this match what you're expecting from our discussion? |
It does! I'll double check it later this week. |
Tests failed. Pretty strange. Haven't found out the cause. |
Any idea if the failures are specifically related to these changes? |
I'm not sure but I don't think the changes break the functionalities covered by the failing tests. However we have to find out the cause to make sure. |
After some debuggings it turns out the new code creates a dead loop, which caused test failed. The breakdown:
I think a quick fix is to unset |
71adbb8
to
2baccc8
Compare
…r to prevent dead loop
2baccc8
to
ca54b37
Compare
@luin I've changed the logic to remove the Looks like the tests were failing because of some missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
## [5.2.1](v5.2.0...v5.2.1) (2022-07-16) ### Bug Fixes * always allow selecting a new node for cluster mode subscriptions when the current one fails ([#1589](#1589)) ([1c8cb85](1c8cb85))
🎉 This PR is included in version 5.2.1 🎉 The release is available on: Your semantic-release bot 📦🚀 |
## [5.2.1](redis/ioredis@v5.2.0...v5.2.1) (2022-07-16) ### Bug Fixes * always allow selecting a new node for cluster mode subscriptions when the current one fails ([#1589](redis/ioredis#1589)) ([1c8cb85](redis/ioredis@1c8cb85))
We ran into an issue where some of our redis clients would miss messages from the pubsub bus following a network issue / crash in the redis cluster. This PR changes the reconnect logic in the ClusterSubscriber such that it can pick to connect to a different node when the current connection closes. The idea being that if a single node is having issues, the subscriber can connect immediately to another member of the cluster so it doesn't miss as many messages.