Add iteration support for long-running jobs #6286

fatkodima · 2024-05-12T18:22:49Z

This is a native implementation of the https://github.com/fatkodima/sidekiq-iteration pattern.
Closes fatkodima/sidekiq-iteration#6.

At a high level, the API will look something like:

class NotifyUsersJob
  include Sidekiq::Job
  include Sidekiq::Job::Iterable

  def build_enumerator(cursor:)
    active_record_records_enumerator(User.all, cursor: cursor)
  end

  def each_iteration(user)
    user.notify_about_something
  end
end

fatkodima · 2024-05-12T18:30:07Z

lib/sidekiq/job/iterable/active_record_enumerator.rb

+  module Job
+    module Iterable
+      # @private
+      class ActiveRecordEnumerator


I have a custom implementation of the ActiveRecord enumerator in the gem - https://github.com/fatkodima/sidekiq-iteration/blob/master/lib/sidekiq_iteration/active_record_enumerator.rb

As you can see, the gem's implementation is pretty complex compared to the implementation in this file, and basically mimics the implementation from Rails itself. The only difference between the gem's implementation and Rails' implementation is support for multiple columns. But that use case is, I believe, pretty rare, and most of the time people iterate just by the primary key.

The Rails implementation is pretty good - it quite recently got a perf improvement (rails/rails#45414), with a further improvement coming in (rails/rails#51243). I also plan to add support for multiple columns to it. So, we can just use the default from Rails.

Wdyt about this?

Ideally you'll have a skeleton interface and people can develop their own Enumerators which work with a specific datastore. ActiveRecord is the primary datastore for 90% of Sidekiq apps though; it would make sense to support it well. I'm ok with a single column if it makes the code far simpler.

fatkodima · 2024-05-12T18:34:29Z

lib/sidekiq/job/iterable.rb

+          # TODO: determine a better way to detect if the server/capsule is stopping.
+          Sidekiq.server? && Sidekiq::CLI.instance.launcher.stopping?


When the long-running job runs, after each iteration it checks if it is time to stop the execution and reschedule itself. One of the reasons to stop and reschedule is when sidekiq is stopping. In the gem, I use :quiet hook for this - https://github.com/fatkodima/sidekiq-iteration/blob/ffdc784b7400092a25d6528ff883c1fb22384c42/lib/sidekiq_iteration.rb#L55-L58

Is there a better way, one that also works for both standalone Sidekiq and capsules?

Nope, this is the best way to do it. I would use the global you set, rather than calling launcher.stopping?.

fatkodima · 2024-05-12T19:03:52Z

lib/sidekiq/job/iterable.rb

+        def extract_previous_runs_metadata(arguments)
+          options =
+            if arguments.last.is_a?(Hash) && arguments.last.key?("sidekiq_iteration")
+              arguments.pop["sidekiq_iteration"]
+            else
+              {}
+            end
+
+          @executions = options["executions"] || 0
+          @cursor_position = options["cursor_position"]
+          @times_interrupted = options["times_interrupted"] || 0
+          @total_time = options["total_time"] || 0
+        end


The iterable job keeps its state inside the job hash metadata itself (as the last item of the args, to be concrete). So, if the interrupted job is run again, for example, we restore its attributes from the previously stored metadata. Same when the job raises an error - we save the current state into the job and push it to Redis.

Another approach is to keep the iteration state as a separate structure in Redis (as a hash, e.g.). But then we need to decide on the expiration time of that hash, decide if we should update it after each iteration (+1 Redis call per iteration) or once in a while, and write a server middleware to set the job's attributes based on that hash.

Which one would you suggest/prefer?

That's a really tough question. It's pretty fundamental to the design and I'm not sure I like the idea of mutating the job payload itself, as it's stored in Redis as a String which can't easily be modified without a JSON parse/dump round trip. I'm leaning toward defining a job iteration Hash, with a key like "it-{jid}". Redis has a number of useful H* commands to update/increment Hash values very quickly.

I think the default should be to update the iteration record in Redis only if the iteration is interrupted or 5 seconds have passed. It should not be updated after every record; if we are processing a million records we don't want a million Redis updates.
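The "only if interrupted or 5 seconds have passed" rule above can be sketched as a small throttle object. This is a minimal illustration, not the PR's code: the class name CheckpointThrottle, its flush? method and the injectable clock are all hypothetical.

```ruby
# Hypothetical helper (not from the PR): decides when iteration state
# should be written to Redis.
class CheckpointThrottle
  def initialize(interval: 5, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @interval = interval
    @clock = clock
    @last_flush = @clock.call
  end

  # Returns true when state should be written: always on interruption,
  # otherwise only once `interval` seconds have passed since the last write.
  def flush?(interrupted: false)
    now = @clock.call
    return false unless interrupted || now - @last_flush >= @interval

    @last_flush = now
    true
  end
end
```

The iteration loop would call flush? after each record and only touch Redis when it returns true, so a million records processed in a few minutes cost a few dozen writes rather than a million.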

fatkodima · 2024-05-12T19:07:08Z

lib/sidekiq/job/iterable.rb

+
+          retry_backoff = self.class.get_sidekiq_options.dig("iteration", "retry_backoff") ||
+                          Sidekiq.default_configuration["retry_backoff"]
+          self.class.perform_in(retry_backoff, *arguments)


Is there a better way to reschedule itself than keeping arguments in the job and then explicitly calling perform_in?

For example, there is a helper in the ActiveJob for the similar purpose - https://api.rubyonrails.org/classes/ActiveJob/Exceptions.html#method-i-retry_job. Maybe we need something similar?

Good question, it might be time to consider an API for this purpose. Generally I don't like the idea of a job performing its own "meta-logic" around scheduling because it makes testing difficult. It's a massive violation of the Single Responsibility Principle. But this API wouldn't be used by the job itself, but rather your iteration logic.

fatkodima · 2024-05-13T15:14:32Z

@mperham Can you please answer questions, when you have time? 🙏

mperham · 2024-05-13T19:11:13Z

This is a totally new feature to me and not something I've used before so please give me some time to study the code, understand what it is doing and understand how Sidekiq might need to change to make it work well. I apologize for the delay. Test cases would also be helpful to understand how the APIs will be used.

fatkodima · 2024-05-19T13:10:53Z

lib/sidekiq/job/iterable.rb

+          Sidekiq.default_configuration.dig("iteration", "retry_backoff")
+
+        # Preserve original jid.
+        self.class.set(jid: jid).perform_in(retry_backoff, *@arguments)


This will run the client middleware again. We need to decide if this is a problem or not.

What do you think about treating it like a retry? Raise a Sidekiq::Job::Interrupted exception or similar and let the retry subsystem reschedule it automatically?

Yeah, I don't like the current approach either.
This is exactly how I implemented it initially, but in the tests added for this PR I now need to rescue this exception manually and do all the rescheduling logic for the tests to pass. I will look into whether I can solve this better in the tests.

fatkodima · 2024-05-19T13:11:47Z

@mperham This is ready now for review.

fatkodima · 2024-05-20T13:57:33Z

lib/sidekiq/job/iterable.rb

+        state = Sidekiq.redis { |conn| conn.get("it-#{jid}") }
+
+        if state
+          state_hash = Sidekiq.load_json(state)


Should we handle errors here?

No, exceptional cases should generate exceptions. I do think you should use native Redis hashes instead of JSON, so you'd use hgetall rather than get.

We currently store the iteration cursor as part of this metadata. When using JSON, we get the correct data type when loading. But with Redis hashes we will always get strings as values, and the user will need to remember to manually cast the string value to the correct data type. When people use custom enumerators, or when we start supporting custom columns for the ActiveRecord enumerator in the future, we will get cursor types other than integers, and this will be a problem.

Wdyt?

You can make the cursor JSON if it is user-supplied but Sidekiq's internal iteration data can be stored element-by-element in the hash and coerced in Ruby. JSON is meant to be used for interoperability but for our internal data, I'd rather the data be directly addressable in Redis, rather than requiring deserialization.
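A rough sketch of that split, assuming the field names from the metadata shown earlier in the thread: internal counters become plain string fields that are coerced in Ruby, and only the user-supplied cursor is JSON-encoded. The IterationState module and its method names are hypothetical, and the Redis calls themselves are left out.

```ruby
require "json"

# Hypothetical helpers (not from the PR) converting iteration state to and
# from the flat String => String mapping that a Redis hash stores.
module IterationState
  module_function

  # Internal values are stored element by element as strings;
  # only the cursor, whose type is up to the user, is JSON-encoded.
  def to_fields(state)
    {
      "executions" => state[:executions].to_s,
      "times_interrupted" => state[:times_interrupted].to_s,
      "total_time" => state[:total_time].to_s,
      "cursor_position" => JSON.generate(state[:cursor_position])
    }
  end

  # Coerces each field back to its Ruby type. An empty hash (no stored
  # state yet) yields the defaults.
  def from_fields(fields)
    {
      executions: fields.fetch("executions", "0").to_i,
      times_interrupted: fields.fetch("times_interrupted", "0").to_i,
      total_time: fields.fetch("total_time", "0").to_f,
      cursor_position: fields.key?("cursor_position") ? JSON.parse(fields["cursor_position"]) : nil
    }
  end
end
```

With this shape the counters stay directly addressable in Redis, while a cursor such as an integer id or a multi-column array round-trips with its type intact.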

mperham · 2024-05-20T19:00:17Z

lib/sidekiq/job/iterable.rb

+        }
+
+        Sidekiq.redis do |conn|
+          conn.set("it-#{jid}", Sidekiq.dump_json(state), ex: STATE_TTL)


See hmset in Redis. https://redis.io/docs/latest/commands/hmset/

fatkodima · 2024-05-20T22:14:18Z

Updated with #6286 (comment) and converted to use redis hashes.

mperham · 2024-05-21T22:25:16Z

lib/sidekiq/job/iterable.rb

+module Sidekiq
+  module Job
+    module Iterable
+      class Interrupted < ::RuntimeError; end


I think Sidekiq::Job::Interrupted would read better, wdyt?

mperham · 2024-05-21T22:33:00Z

lib/sidekiq/job/iterable.rb

+      end
+
+      # @api private
+      def initialize


This module should include a stopping? method which checks @done on Sidekiq::Processor. Here we use a :lifecycle accessor to store the Processor on the Job.

# In Sidekiq::Processor...
job = SomeJob.new
job.lifecycle = self if job.is_a?(Sidekiq::Job::Iterable) # Processor
job.perform(*args)
job.stopping? # => job.lifecycle.stopping?

It's a little more complex but I'm trying to avoid any more global APIs like Sidekiq.stopping? to be more Ractor-friendly.
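To make the lifecycle idea concrete, here is a self-contained sketch that uses stand-in classes rather than Sidekiq's own. FakeProcessor, IterableSketch and CountingJob are hypothetical names; the only part taken from the comment above is that the job asks its lifecycle object, not a global, whether it is stopping.

```ruby
# Stand-in for Sidekiq::Processor (hypothetical): exposes its @done flag.
class FakeProcessor
  def initialize
    @done = false
  end

  def terminate
    @done = true
  end

  def stopping?
    @done
  end
end

# Stand-in for the iterable module: delegates stopping? to whatever
# object was assigned as the job's lifecycle.
module IterableSketch
  attr_accessor :lifecycle

  def stopping?
    lifecycle.stopping?
  end
end

class CountingJob
  include IterableSketch

  attr_reader :processed

  # Processes items one by one, checking for shutdown before each item.
  def perform(items)
    @processed = []
    items.each do |item|
      break if stopping?

      @processed << item
      yield item if block_given?
    end
  end
end
```

Because the job only holds a reference to its own processor, no process-wide state is consulted, which is the property the comment is after.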

mperham · 2024-05-21T22:38:42Z

lib/sidekiq/job/iterable/csv_enumerator.rb

+          filepath = @csv.path
+          return unless filepath
+
+          count = `wc -l < #{filepath}`.strip.to_i


This could be a security issue under the right circumstances. I think you can use Shellwords.shellescape for this filepath and I think you can use a simpler wc -l filepath without the redirection.
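A minimal sketch of the suggested fix using Ruby's standard Shellwords library. The helper names wc_command and count_lines are made up for illustration; the second helper shows an alternative that avoids the shell altogether by passing an argument array to IO.popen.

```ruby
require "shellwords"

# Builds the `wc -l` command with the path escaped for the shell,
# so characters like spaces or `;` cannot start a second command.
def wc_command(filepath)
  "wc -l #{Shellwords.shellescape(filepath)}"
end

# Alternative: an argv array is executed without a shell, so the path
# is passed to wc verbatim. `wc -l path` prints "<count> <path>",
# and to_i picks up the leading count.
def count_lines(filepath)
  IO.popen(["wc", "-l", filepath], &:read).to_i
end
```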

mperham · 2024-05-21T22:41:27Z

lib/sidekiq/job_retry.rb

@@ -134,6 +137,15 @@ def local(jobinst, jobstr, queue)

    private

+    def process_requeue(jobinst, jobstr)
+      retry_backoff = jobinst.class.get_sidekiq_options.dig("iteration", "retry_backoff") ||


Remember that sidekiq_options get persisted into the job payload when the job is created; you don't need to dig the value out of get_sidekiq_options, you'd use job.dig("iteration", "retry_backoff").

Oh, I see; you didn't add these to the Job default options. That makes sense if overriding these elements is rare or you want to keep the job payload size as small as possible. If the elements are overridden, the job payload will contain the values so you can do:

job.dig(...) || lifecycle.config.dig(...) # get from specific job or capsule's config
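Spelled out with plain Hashes standing in for the job payload and the capsule config, the fallback might look like the sketch below. The retry_backoff_for helper is hypothetical; the "iteration" and "retry_backoff" keys are the ones from the diff above.

```ruby
# Hypothetical helper: prefer the value persisted in the job payload,
# otherwise fall back to the capsule-level configuration.
# Both arguments are plain Hashes in this sketch.
def retry_backoff_for(job, config)
  job.dig("iteration", "retry_backoff") || config.dig("iteration", "retry_backoff")
end

config = { "iteration" => { "retry_backoff" => 0 } }

overridden = { "class" => "NotifyUsersJob", "iteration" => { "retry_backoff" => 30 } }
default    = { "class" => "NotifyUsersJob" }

retry_backoff_for(overridden, config) # value from the job payload
retry_backoff_for(default, config)    # value from the config
```

Hash#dig returns nil when an intermediate key is missing, so a payload without an "iteration" entry falls through to the config without extra nil checks.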

fatkodima · 2024-05-22T11:34:23Z

Addressed feedback.

mperham · 2024-05-22T15:41:32Z

lib/sidekiq/processor.rb

@@ -79,7 +83,10 @@ def run

    def process_one(&block)
      @job = fetch
-      process(@job) if @job
+      if @job
+        @job.lifecycle = self if @job.is_a?(Job::Iterable)


Here, job will always be a UnitOfWork so this will never execute. You want to set this right before middleware is called, right after JobClass.new.

Right 🤦 The unit tests are too "unit" (with mocking) and weren't able to detect this.

mperham · 2024-05-22T16:43:37Z

Great work. I'll merge this and spend a little more time polishing it soon.

mperham · 2024-05-22T16:45:20Z

We'll need a new Iteration wiki page which explains this major new feature. If you have any docs to use as a starting point, please feel free to create it.

fatkodima · 2024-05-22T16:48:37Z

I have some existing docs at https://github.com/fatkodima/sidekiq-iteration/tree/master/guides, but I am bad at writing them.
I would appreciate it if you could work on that.

mperham · 2024-05-22T16:53:05Z

No problem, I will take care of it.

mperham · 2024-05-22T17:40:55Z

Initial page here: https://github.com/sidekiq/sidekiq/wiki/Iteration

mperham · 2024-05-22T19:15:13Z

I'm refactoring the API a bit and wondering if you considered using to_enum or enum_for instead of build_enumerator. I don't understand Enumerators very well so I'm trying to understand what would be the most idiomatic API.

https://www.rubydoc.info/stdlib/core/Enumerator

fatkodima · 2024-05-22T19:21:58Z

build_enumerator is a method that the user should define inside their job, which specifies what we should iterate over (and which should return an instance of an Enumerator).
The user can use one of the convenient helpers we provide or specify their own (using to_enum/enum_for, for example).
Some examples in https://github.com/fatkodima/sidekiq-iteration/blob/master/guides/custom-enumerator.md.

Let me know if I did not fully get the question.
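As an illustration of a user-defined enumerator built with to_enum, here is a small sketch. It assumes, as a convention for this example only, that the enumerator yields [item, cursor] pairs and that the cursor handed back on resume is the position of the last processed item. ArrayCursor and each_after are hypothetical names, not helpers from the PR.

```ruby
# Hypothetical resumable enumerator over an in-memory array.
class ArrayCursor
  def initialize(items)
    @items = items
  end

  # Yields [item, index] pairs, skipping everything up to and including
  # `cursor`. Without a block it returns an Enumerator via to_enum.
  def each_after(cursor)
    return to_enum(:each_after, cursor) unless block_given?

    start = cursor.nil? ? 0 : cursor + 1
    (start...@items.size).each { |index| yield [@items[index], index] }
  end
end

# Inside a job, build_enumerator could then return, for example:
#   ArrayCursor.new(%w[a b c d]).each_after(cursor)
```

A first run passes a nil cursor and sees every element; a resumed run passes the last saved index and continues with the element after it.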

mperham · 2024-05-22T19:39:02Z

Why not Job#to_enum, instead of Job#build_enumerator? Do the methods have different meanings?

fatkodima · 2024-05-22T19:45:39Z

If you mean why we use the build_enumerator method name and not to_enum, I don't know; with my English level, Job#to_enum sounds to me like we are converting this job to an enum, which makes little sense. But Job#build_enumerator sounds like a method that builds an enumerator for this job. to_enum is a method already defined on Object, so we probably should not define our own.

You may provide some code example about how you imagine it to work, if that helps.

freemanoid · 2024-05-23T11:42:35Z

I'm surprised this kind of functionality was integrated into sidekiq. I thought you should always break down long jobs into smaller ones if they require sidekiq-iteration. If it was built for ActiveRecord, that’s even more surprising to me.

mperham · 2024-05-23T15:38:24Z

@freemanoid This provides an API for breaking down a long-running job into discrete chunks without having to fill Redis with thousands of small jobs. Some jobs have to remain serial and cannot be broken down into a set of concurrent jobs.

The ActiveRecord portion is totally optional and does not force Rails upon the user.

fatkodima force-pushed the iteration branch from 40f9bbe to 86269a0 Compare May 19, 2024 13:09

fatkodima marked this pull request as ready for review May 19, 2024 13:09

Add iteration support for long-running jobs

85d1ecb

fatkodima force-pushed the iteration branch from 86269a0 to 85d1ecb Compare May 20, 2024 22:13

Address feedback

ca23eea

Assign lifecycle in the correct place

a55fbc2

mperham merged commit 21953dd into sidekiq:main May 22, 2024
16 checks passed

fatkodima deleted the iteration branch May 22, 2024 16:46
