Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Master lock getting out of sync in case the LUA script gets interrupted #553

Open
valo opened this issue Jun 7, 2016 · 6 comments · May be fixed by #734
Open

Master lock getting out of sync in case the LUA script gets interrupted #553

valo opened this issue Jun 7, 2016 · 6 comments · May be fixed by #734

Comments

@valo
Copy link

valo commented Jun 7, 2016

We had this production problem yesterday where we discovered that the scheduled tasks are not executed even though the scheduler is running. After investigation it turned out that the master lock key in redis is set to some value, but has no TTL set, essentially leading us to this function: https://github.com/resque/resque-scheduler/blob/master/lib/resque/scheduler/lock/resilient.rb#L54

The above inconsistency caused no master node to be elected (although we don't use multiple schedulers) and all the scheduled jobs got blocked.

I really believe the way this lock is set with 2 separate operations SETNX and EXPIRE is not atomic, even though it is executed in a LUA script. These 2 operations need to be atomic and this can be achieved using the SET NX PX. Even a better solution will be to use a lock implementation which is reviewed by the community, for example using the Redis guidelines for distributed locks: http://redis.io/topics/distlock

@carsonreinke
Copy link
Contributor

@valo
Copy link
Author

valo commented Jun 8, 2016

I really doubt that Redis can ensure that the script is not terminated in the middle in case something bad happens to the Redis server. And I guess something like this happen on our production Redis (we use ElasticCache from AWS).

I was thinking about this in the recent days and I think an easy fix could be to use:

def acquire!
  Resque.redis.set(key, value, nx: true, px: timeout)
end

instead of using a script. This should atomically set the key and the timeout and return true if the key was set.

@carsonreinke, what do you think?

@carsonreinke
Copy link
Contributor

Strange that even happened. Seriously, no TTL?

The return value from NX is slightly different from the script (true or nil). Also, Redis support would change from 2.6.0 to 2.6.12. Otherwise, it makes sense.

@valo
Copy link
Author

valo commented Jun 8, 2016

Yes. No TTL. The scheduled jobs were stuck for over a month before we noticed that something is not quite right. And while debugging the live system I found that the TTL of the master lock is -1 and it's value is of a host:pid which does not exist any more.

This Redis version thing sucks :(

It should be easy to reproduce the bug by putting a sleep in the Lua script and killing the server while the script is running. I am not sure that is possible for the Lua supported by Redis, though... (probably putting a long loop in there could simulate a delay too, so that you have time to kill the Redis process).

Let me know if you need a PR for this. I can try to prepare one.

@carsonreinke
Copy link
Contributor

That sucks, sorry to hear that happened.

Looks like EXPIRE can fail "if ... the timeout could not be set", the script should of accounted for that. Your approach is better though.

A PR would be great.

@iloveitaly
Copy link
Contributor

@carsonreinke Did you end up patching this in your installation? Could you submit a PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants