-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates Resque::Job.destroy with a safer version #1788
base: master
Are you sure you want to change the base?
Conversation
Oh, I'd completely forgotten about Other caveats of this strategy are:
IMO, these caveats are still minor, but they might need to be documented. The current version can create a much much worse scenario: 1) noticing you have queue troubles, 2) attempt to deal with the queue troubles, 3) now you have OOM troubles on your ruby console and multiple second latency on your redis server. 4) Now you have two problems (or three, or more because killing your redis response time probably cascades into other areas of your system). |
If you want, I can update the PR to
|
Time to try out a completely different framework then. I noticed that a lot of the things we require in our project (like retrying, scheduling and a web interface) were present as stand-alone plugins, but a lot of these plugins haven't seen updates for 4-7 years, so I wasn't too hopeful from the get go. I ended up installing the base system just to see if I could get things up and running and I don't know, it almost works but: - Jobs are lost on app down, there is some work going on trying to implement RPOPLPUSH with Redis, but it relies on jobs timing out rather than connections being dropped for a longer time like RabbitMQ did with its unacked system. The entire functionality is still in development too resque/resque#1788. - There's a lot of own development required to actually make the retry logic respect waiting in some kind of Backoff system. That said, the project does come with - Scheduling - CRON - web interfaces and failure inspection - Airbrake & Newrelic support - some rudimentary form of retry But given it doesn't actually fix the base issue we have with Sidekiq right now this entire development path is going to be a no-go.
1101aed
to
d4bf94e
Compare
Retains the original implementation as `Resque::Job.destroy!`, in case any one actually depended on the unsafe behavior... (I hope not!)
My personal vote would be for renamenx strategy since it seems wasteful to spend all the time moving items 1 at a time from One other thing I can think of is that one of my past employers had utilized redis embedded lua scripts to do large batch processing of this sort. I'm not sure what sort of limitations there are on using lua scripts, but it should eliminate the round trip latency time between ruby server and redis. That said, my opinion on how this is developed or used doesn't really matter since my team has never run into these sorts of massive queue problems. So feel free to ignore me. |
The renamenx trick works in either direction. I.e. The versions that I've used the most all share a common
That might be very good idea. I'd definitely prefer my more complex queue sweepers in ruby, but the Job.destroy API is simple enough that it probably wouldn't be too hard. I use redis lua scripts in a few other places as well. One thing to be careful with: although it might be a lot faster, scripts are atomic. It would need to be batched, or it will destroy your server latency, maybe even worse than the current version! Because the redis server only runs one at a time (like a GVL for redis!), it also won't provide any benefits for concurrent processing (ruby's GVL is easily worked around). I'm remembering one particular scenario, when I ran at least 16 of these at once! (well, not this exact code... one of the other versions I mentioned.) I'd want to have some benchmarks that showed 1) the lua version didn't adversely impact latency, and 2) the lua version significantly outperforms a concurrent version. I suspect the optimal batch size would be very contextually sensitive. Another, simpler, approach that I'd forgotten. You can count the items in the queue and simply loop that many times, popping, inspecting, then tossing or requeueing. Very simple. But if anything else is also processing the queue, instead of finishing faster you'll process some jobs more than once and then leave some older jobs at the back of the queue behind newer jobs. But it's simple. |
I've used a version of this since ~ten years ago and I just pushed it to github so I can replace our monkeypatch with a fork. And I've had the following comment above it for half that time:
# TODO: send a patch back to resque
So here is that patch. 馃檪
It is, in my opinion, superior to the original in every way. Copying from the documentation I wrote:
Although this isn't fast, it isn't necessarily slower than the original version. The original will bring a redis server to its knees if the queue is too large; and the only time I ever need to run this command is when a queue is too large!. Each LREM is
O(n)
so the overall operation isO(n*m)
(m is number of deleted jobs) although (if I remember how this math works) it has an effective upper bound ofO(n log n)
if the queue gets smaller as you delete from it. 馃槈However I have a few caveats (which should probably be considered before merging):
This also retains the original implementation as
Resque::Job.destroy!
, in case any one actually depended on the unsafe behavior... (oh god, I hope not!) I'd be more than happy to just delete that version: It can use up all of your ruby process's available memory and create long pauses on the redis server and is thus completely unsuitable for a production environment, IMO.