Event subscriber clients won't reconnect / Jobs and events won't get triggered #2187

Closed

xairoo opened this issue Oct 28, 2021 · 8 comments

xairoo commented Oct 28, 2021

Event subscriber clients won't reconnect after a timeout when Redis removes the client based on the tcp-keepalive setting (the default is 300 seconds).

This only affects remote Redis connections; locally it works.

Bull uses 3 redis connections:

  • client
  • subscriber
  • bclient
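
For reference, these three connections correspond to the type values Bull passes to its createClient option; a minimal sketch of supplying them yourself (the host/port values are placeholders):

const Queue = require('bull');
const Redis = require('ioredis');

// Placeholder connection options for this sketch.
const redisOpts = { host: '127.0.0.1', port: 6379 };

const client = new Redis(redisOpts);
const subscriber = new Redis(redisOpts);

const yourQueue = new Queue('your-queue', {
  createClient(type) {
    switch (type) {
      case 'client':
        return client; // normal commands: adding jobs, reading state
      case 'subscriber':
        return subscriber; // pub/sub, used for queue events
      case 'bclient':
        return new Redis(redisOpts); // blocking commands need a fresh connection
      default:
        throw new Error(`Unexpected connection type: ${type}`);
    }
  },
});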

When you run into this problem, 1 or 2 clients (typically 2) will be missing. The missing clients are required for yourQueue.on() and yourQueue.process(), so you won't receive:

  • new jobs via yourQueue.process()
  • progress updates via yourQueue.on()

Reproduce

I'm still testing this and don't have much time right now; I'll update this if there is an easier/better way.

  • Use a remote Redis server
  • Cut your uplink for at least 300 seconds (the tcp-keepalive value) or put your system to sleep/hibernate
    • You can shorten this wait by lowering the tcp-keepalive value on your Redis server (see the sketch below)

Not sure, still testing: it looks like pulling the network cable, disabling the network card, or blocking the port triggers the disconnect event, so the client reconnects automatically. In other words: to reproduce this problem, the disconnect event must not fire.
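
To avoid waiting the full 300 seconds while testing, the keepalive value can be lowered temporarily; a sketch using ioredis (the host and the value 60 are placeholders):

const Redis = require('ioredis');

const redis = new Redis({ host: 'your-remote-redis', port: 6379 });

// Lower tcp-keepalive so idle client connections are dropped sooner.
// CONFIG SET only changes the running server; it is not persisted.
redis
  .call('CONFIG', 'SET', 'tcp-keepalive', '60')
  .then(() => redis.call('CONFIG', 'GET', 'tcp-keepalive'))
  .then((reply) => {
    console.log(reply); // e.g. [ 'tcp-keepalive', '60' ]
    return redis.quit();
  });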

The Redis command CLIENT LIST (or a GUI like RedisInsight) will show you all connected clients.

After start: [screenshot: RedisInsight client list, 2021-10-25 08:37:22]

After close/reconnect: [screenshot: RedisInsight client list, 2021-10-25 08:37:29]
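
To check the same thing from Node instead of a GUI, something like this should work (a sketch; the connection options are placeholders):

const Redis = require('ioredis');

const redis = new Redis({ host: 'your-remote-redis', port: 6379 });

// CLIENT LIST returns one line per connected client
// (addr, age, idle time, last command, ...).
redis.call('CLIENT', 'LIST').then((list) => {
  console.log(list);
  return redis.quit();
});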

This could be related to #1873.

I am still testing it, but it looks like pinging the server from each client re-establishes the connection:

setInterval(function () {
  // Ping every connection Bull holds (client, subscriber, bclient) so an
  // idle connection that was silently dropped gets re-established.
  yourQueue.clients.forEach((client) => {
    client.ping();
  });
}, 10000); // every 10 seconds

manast commented Oct 29, 2021

Can you run the same tests using ioredis directly instead of Bull? For example, create a subscriber connection and check whether it is still alive after the reconnection. If not, then this issue should be reported to the ioredis team.
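
Something along these lines could serve as a minimal test (a sketch; the host and channel name are placeholders):

const Redis = require('ioredis');

const sub = new Redis({ host: 'your-remote-redis', port: 6379 });
const pub = new Redis({ host: 'your-remote-redis', port: 6379 });

sub.subscribe('test-channel');
sub.on('message', (channel, message) => {
  console.log('received', channel, message);
});

// Publish periodically; after cutting and restoring the uplink,
// check whether these messages still reach the subscriber.
setInterval(() => {
  pub.publish('test-channel', String(Date.now()));
}, 5000);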


xairoo commented Oct 29, 2021

This has nothing to do with Bull; ioredis just doesn't reconnect the subscriber.
After such a disconnect, the normal clients also won't reconnect as long as you don't send any commands, but as soon as you send one, they reconnect.

Created a ticket: redis/ioredis#1451

But the ping trick works ;-) Maybe that should be used in Bull; no idea whether ioredis (or even node-redis) will implement this.

Hope this will fix #1873 too.

I don't understand why these options should help:

maxRetriesPerRequest: null,
enableReadyCheck: false,

I think (haven't checked) that maxRetriesPerRequest only affects sending commands like SET and so on (retrying them if they fail).

And enableReadyCheck only waits until the server is ready (i.e. when it has finished loading all data from disk). That shouldn't fix the issue in #1873.
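
(For context, these are ioredis options, which Bull forwards when you pass them under the redis settings; a sketch with placeholder connection values:)

const Queue = require('bull');

const yourQueue = new Queue('your-queue', {
  redis: {
    host: 'your-remote-redis',
    port: 6379,
    maxRetriesPerRequest: null, // null = don't reject queued commands after N reconnect attempts
    enableReadyCheck: false,
  },
});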

In a large production scenario it would be really difficult to check all the connected, or rather the missing, clients (subscribers!) to find out why jobs are stuck.
One step would be checking the jobs in Redis: if they are getting added, then the subscriber must be disconnected. What else could it be?


manast commented Oct 30, 2021

I agree that enableReadyCheck shouldn't matter; however, without it the reconnection will not work. I have tested it extensively.


manast commented Oct 30, 2021

In a large production scenario it would be really difficult to check all the connected, or rather the missing, clients (subscribers!) to find out why jobs are stuck.
One step would be checking the jobs in Redis: if they are getting added, then the subscriber must be disconnected. What else could it be?

In general you should not use the events for anything other than non-critical notifications, since in the case of a disconnect you will lose events, so you cannot rely on them for bookkeeping.


xairoo commented Oct 30, 2021

In general you should not use the events for anything other than non-critical notifications, since in the case of a disconnect you will lose events, so you cannot rely on them for bookkeeping.

Totally. In my (bad) case the Bull worker does its work, but the events (on complete and so on) are received and handled by another instance (socket.io) that sends some data to the user and, most importantly, stores data from the received job in MongoDB.

I have done this to save connections. That wasn't a good idea.

The worker should handle the events directly and store the data in the DB itself, ideally within the processor before the job is marked completed, so DB problems are handled as well.
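
A sketch of that approach, with the actual work and the MongoDB write reduced to placeholder doTheWork()/saveResult() calls:

// Persist the result inside the processor itself, so a failed DB write
// fails the job (and can be retried) instead of relying on a 'completed'
// event that might be missed after a disconnect.
yourQueue.process(async (job) => {
  const result = await doTheWork(job.data); // placeholder for the actual work
  await saveResult(job.id, result);         // placeholder for the MongoDB write
  return result; // only now is the job marked as completed
});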

Sometimes we cannot see the forest for the trees.

Thanks! =)


stale bot commented Dec 29, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Dec 29, 2021
stale bot closed this as completed Jan 5, 2022

alolis commented Apr 27, 2022

In a large production scenario it would be really difficult to check all the connected, or rather the missing, clients (subscribers!) to find out why jobs are stuck.
One step would be checking the jobs in Redis: if they are getting added, then the subscriber must be disconnected. What else could it be?

In general you should not use the events for anything other than non-critical notifications, since in the case of a disconnect you will lose events, so you cannot rely on them for bookkeeping.

I use the events for bookkeeping as well. Much cleaner that way :)

After the change in #1873, is it safe to assume this is not a problem and the events WILL fire after a reconnect, or am I wrong?


manast commented Apr 30, 2022

After the change in #1873, is it safe to assume this is not a problem and the events WILL fire after a reconnect, or am I wrong?

They should. But you can also verify it to be 100% sure :)
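
A quick check could look like this (a sketch; force the disconnect/reconnect manually between setting up the listener and adding the job):

// If the handler still fires for a job added after the reconnect,
// the event subscriber connection survived.
yourQueue.on('global:completed', (jobId) => {
  console.log('completed event received for job', jobId);
});

// Run after Redis has been disconnected and reconnected:
yourQueue.add({ check: 'after-reconnect' });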
