
Updates Resque::Job.destroy with a safer version #1788

Open · nevans wants to merge 1 commit into master from safer_delete
Conversation

Contributor

@nevans nevans commented Jan 27, 2022

I've used a version of this for about ten years now, and I just pushed it to GitHub so I can replace our monkeypatch with a fork. I've had the following comment above it for half that time:
# TODO: send a patch back to resque
So here is that patch. 🙂

It is, in my opinion, superior to the original in every way. Copying from the documentation I wrote:

# This looks at only one job at a time, so it isn't fast.  Also, it will
# temporarily drain the queue into a temporary queue, in order to check
# every job safely.  However, it never runs any slow redis commands, and its
# memory usage is bounded to a single job, and it uses an atomic rpoplpush
# to ensure that jobs which should not be removed are never removed from
# redis entirely.  If the method is interrupted or crashes, one or both of
# the following queues will need to be moved back into the actual queue:
#
#  * Holds each individual job while it is being checked:
#    "#{queue}:tmp:destroy:#{Time.now.to_i}"
#
#  * Holds all of the jobs which do not match and will be requeued:
#    "#{queue}:tmp:requeue"
#
# This is safe to run multiple times concurrently, so long as it isn't
# started more than once during a single second.

Although this isn't fast, it isn't necessarily slower than the original version. The original will bring a redis server to its knees if the queue is too large, and the only time I ever need to run this command is when a queue is too large! Each LREM is O(n), so the overall operation is O(n*m) (where m is the number of deleted jobs), although (if I remember how this math works) it has an effective upper bound of O(n log n) if the queue gets smaller as you delete from it. 😉
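Roughly paraphrased (not the verbatim source), the original does something like the following, which is where both the memory blow-up and the O(n*m) cost come from:

```ruby
# Rough paraphrase of the historical Resque::Job.destroy (not verbatim):
queue_key = "queue:#{queue}"
Resque.redis.lrange(queue_key, 0, -1).each do |string|  # whole queue into ruby memory
  job = Resque.decode(string)
  next unless job["class"] == klass.to_s
  Resque.redis.lrem(queue_key, 0, string)               # each LREM is an O(n) scan
end
```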

However I have a few caveats (which should probably be considered before merging):

  • I made some minor tweaks just now, to tidy it up for this PR. I made sure it passed all of my local tests, but I haven't run this exact version in production yet.
  • I actually haven't called this method in production in years. I've been using a slightly more elaborate version several times a year, usually in semi-urgent maintenance situations (... or in emergencies). I've used it on 8GB+ queues with no problems. That version provides progress feedback but it has a slightly different API (that can be changed). And it doesn't use "rpoplpush" so it's less safe than the version I've posted here. I've posted a gist of that one here: https://gist.github.com/nevans/30025e2538762b9318d47960ef0e77c5. If there's interest, I can update this PR with the better parts of that implementation.
  • I have a far more elaborate version which also operates on the failure queue, has a DSL for automated rulesets, etc. I run that version on my failure queue from a cronjob, several times a day. But I need to get permission before I contribute that or convert it into an open source gem. :)
  • That "started multiple times per second" caveat mentioned in the rdoc for the method can be easily fixed by adding microseconds, pid, thread id, and a random string (sketched just below). Unless people are calling this automatically, which they shouldn't be, none of that is really necessary. Neither the sweeper I posted in the gist above nor my cron-job queue sweeper has this problem.
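For illustration, a unique-enough inspection key could be built along these lines (the naming is hypothetical):

```ruby
require "securerandom"

# Hypothetical unique-enough tmp key: microseconds, pid, thread id, and a
# random suffix make same-second collisions effectively impossible.
tmp_key = "#{queue}:tmp:destroy:#{Time.now.to_f}:#{Process.pid}:" \
          "#{Thread.current.object_id}:#{SecureRandom.hex(4)}"
```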

This also retains the original implementation as Resque::Job.destroy!, in case anyone actually depended on the unsafe behavior... (oh god, I hope not!) I'd be more than happy to just delete that version: it can use up all of your ruby process's available memory and create long pauses on the redis server, and is thus completely unsuitable for a production environment, IMO.

@nevans nevans marked this pull request as ready for review January 28, 2022 19:49
Contributor Author

@nevans nevans commented Jan 28, 2022

Oh, I'd completely forgotten about Resque.dequeue. Pretty useful for running those hooks, if you've got them. That should also work fine with this strategy.

Other caveats of this strategy are:

  1. Other jobs in the queue are delayed until they can be moved back from the tmp queue.

    My other two strategies optimistically renamenx to reduce that delay. The gist version would be better for this if it went the other direction by default: renamenx into tmp and stream back to queue, rather than stream into tmp and renamenx back to queue. When renamenx from queue to tmp succeeds, jobs are immediately able to continue running, albeit bottlenecked by the streaming process. (A rough sketch of that direction follows this list.)

  2. This can generate a lot more redis AOF.

    In my experience, disk space is cheap and memory is expensive; redis nightmares involve running out of memory, not disk space. 😜 A big AOF also means slow restarts. Even so, needing to restart before a bgrewriteaof can complete should be rare, so this is generally not an issue.
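A rough sketch of the renamenx-into-tmp direction from point 1 (matches? is a hypothetical predicate, and this assumes redis-rb's renamenx and exists?):

```ruby
redis = Resque.redis
src   = "queue:#{queue}"
tmp   = "#{src}:tmp:destroy:#{Time.now.to_i}"

# Claim the entire queue in one atomic step; workers see an empty queue and
# new pushes repopulate it immediately, so other jobs are only briefly delayed.
if redis.exists?(src) && redis.renamenx(src, tmp)
  while (payload = redis.rpoplpush(tmp, src))
    # The payload just moved sits at the head of the real queue; remove that
    # one occurrence again if it matches. (A concurrent producer pushing an
    # identical payload could have its copy removed instead -- an edge case.)
    redis.lrem(src, 1, payload) if matches?(Resque.decode(payload))
  end
end
```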

IMO, these caveats are still minor, but they might need to be documented. The current version can create a much, much worse scenario: 1) you notice you have queue trouble; 2) you attempt to deal with the queue trouble; 3) now you have OOM trouble on your ruby console and multiple-second latency on your redis server; 4) now you have two problems (or three, or more, because killing your redis response time probably cascades into other areas of your system).

Contributor Author

@nevans nevans commented Jan 28, 2022

If you want, I can update the PR to

  1. incorporate the renamenx strategy,
  2. provide feedback on stderr (SUPER USEFUL). It could be optional.
  3. acquire a setnx lock for the tmp inspection key, retrying with an incremented key until one is found. This enables multiple "sweepers" to safely run at the same time (sketched after this list).
    • updated to add: There are several ways to deal with this. Just to name a couple:
      • use a "unique enough" key name and lock it Just To Be Sure, as mentioned above
      • instead of unlinking or deleting a key per worker after each job is processed, a single processing key can be shared and LREM can be used on that. Because the number of workers isn't expected to be in the thousands, this should be fine.
    • In case you're wondering: this seems like overkill to ensure such strong reliability when deleting jobs when resque doesn't even have strong guarantees for normal job processing, I agree! But that PR won't be ready any time soon, at least, not from me. :)
  4. kwarg option to skip the tmp inspection key rpoplpush step. This trades speed for the (minuscule) risk that a crashed or killed process (either ruby or redis) might lose a job.
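For point 3, the lock acquisition could plausibly look like this (key names hypothetical):

```ruby
# Hypothetical SETNX lock for the tmp inspection key: bump a counter until an
# unclaimed slot is found, so concurrent sweepers never share an inspection key.
i = 0
i += 1 until Resque.redis.setnx("#{queue}:tmp:destroy:lock:#{i}", Process.pid)
tmp_key = "#{queue}:tmp:destroy:#{i}"
# ... sweep through tmp_key, then delete tmp_key and the lock when finished ...
```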

HendrikPetertje added a commit to HendrikPetertje/workers_test_app that referenced this pull request Feb 15, 2022
Time to try out a completely different framework, then.
I noticed that a lot of the things we require in our project (like retrying,
scheduling, and a web interface) were present as stand-alone plugins, but a lot
of these plugins haven't seen updates for 4-7 years, so I wasn't too hopeful from
the get-go. I ended up installing the base system just to see if I could get
things up and running, and I don't know, it almost works, but:

- Jobs are lost when the app goes down. There is some work going on trying to
  implement RPOPLPUSH with Redis, but it relies on jobs timing out rather than
  on connections being dropped for a longer time, like RabbitMQ did with its
  unacked system. The entire functionality is still in development too:
  resque/resque#1788.
- There's a lot of custom development required to actually make the retry logic
  respect waiting in some kind of backoff system.

That said, the project does come with

- Scheduling
- CRON
- web interfaces and failure inspection
- Airbrake & Newrelic support
- some rudimentary form of retry

But given it doesn't actually fix the base issue we have with Sidekiq right now,
this entire development path is going to be a no-go.
@nevans nevans force-pushed the safer_delete branch 3 times, most recently from 1101aed to d4bf94e on June 21, 2022 17:14
Retains the original implementation as `Resque::Job.destroy!`, in case
anyone actually depended on the unsafe behavior... (I hope not!)
Contributor

@matthewhively matthewhively commented

My personal vote would be for the renamenx strategy, since it seems wasteful to spend all that time moving items one at a time from q => tmp; better to rename q into tmp and then move one at a time from tmp => q as the filter runs.

One other thing I can think of: one of my past employers used embedded redis lua scripts to do large batch processing of this sort. I'm not sure what sort of limitations there are on using lua scripts, but it should eliminate the round-trip latency between the ruby server and redis.

That said, my opinion on how this is developed or used doesn't really matter since my team has never run into these sorts of massive queue problems. So feel free to ignore me.

Contributor Author

@nevans nevans commented Jul 29, 2022

> My personal vote would be for the renamenx strategy, since it seems wasteful to spend all that time moving items one at a time from q => tmp; better to rename q into tmp and then move one at a time from tmp => q as the filter runs.

The renamenx trick works in either direction, i.e. "q =renamenx=> tmp, then tmp =process=> q" and "q =process=> tmp, then tmp =renamenx=> q" behave very similarly. In this code, where each worker gets its own tmpq, you need to sweep out and rename back, or else you lose the concurrency benefits.

The versions that I've used the most all share a common tmpq key because the crash/failure mode is simpler. In that case it makes more sense to rename then sweep, to simplify the concurrent worker coordination. What we want to avoid is workers feeding jobs to each other in an endless loop. 🙂

> One other thing I can think of: one of my past employers used embedded redis lua scripts to do large batch processing of this sort. I'm not sure what sort of limitations there are on using lua scripts, but it should eliminate the round-trip latency between the ruby server and redis.

That might be a very good idea. I'd definitely prefer to keep my more complex queue sweepers in ruby, but the Job.destroy API is simple enough that it probably wouldn't be too hard. I use redis lua scripts in a few other places as well.

One thing to be careful with: although it might be a lot faster, scripts are atomic, so it would need to be batched or it will destroy your server latency, maybe even worse than the current version! Because the redis server only runs one script at a time (like a GVL for redis!), it also won't provide any benefit for concurrent processing (ruby's GVL is easily worked around). I'm remembering one particular scenario when I ran at least 16 of these at once! (Well, not this exact code... one of the other versions I mentioned.) I'd want to have some benchmarks that showed 1) the lua version doesn't adversely impact latency, and 2) the lua version significantly outperforms a concurrent version. I suspect the optimal batch size would be very contextually sensitive. (A rough batched sketch is below.)
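To make the batching idea concrete, here's a rough sketch (the script, key handling, and batch size are all assumptions, and redis-namespace users may need to namespace KEYS themselves):

```ruby
# Hypothetical batched sweep via a Lua script: each EVAL atomically inspects
# at most ARGV[1] jobs, so no single script run monopolizes the server.
DESTROY_BATCH = <<~LUA
  local removed = 0
  for i = 1, tonumber(ARGV[1]) do
    local payload = redis.call('RPOP', KEYS[1])
    if not payload then break end
    if cjson.decode(payload)['class'] == ARGV[2] then
      removed = removed + 1
    else
      redis.call('LPUSH', KEYS[1], payload)  -- rotate non-matches to the head
    end
  end
  return removed
LUA

key   = "queue:#{queue}"
batch = 100  # needs benchmarking, per the latency concern above
runs  = (Resque.redis.llen(key) / batch) + 1
removed = runs.times.sum do
  Resque.redis.eval(DESTROY_BATCH, keys: [key], argv: [batch, klass.to_s])
end
```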


Another, simpler approach that I'd forgotten: you can count the items in the queue and simply loop that many times, popping, inspecting, then tossing or requeueing (sketched below). But if anything else is also processing the queue, instead of finishing faster you'll process some jobs more than once and leave some older jobs at the back of the queue behind newer jobs. Still, it's simple.
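Sketched, with matches? again standing in for the real predicate:

```ruby
# Count once, then rotate through the queue that many times: RPOP from the
# tail, LPUSH survivors back onto the head. Concurrent consumers make this
# inexact, as noted above.
key = "queue:#{queue}"
Resque.redis.llen(key).times do
  payload = Resque.redis.rpop(key) or break
  Resque.redis.lpush(key, payload) unless matches?(Resque.decode(payload))
end
```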
