Make gush handle workflows with 1000s of jobs #55

Open

Saicheg opened this issue Jul 10, 2018 · 13 comments

Saicheg commented Jul 10, 2018

Hello!

First of all, thank you for this great library. We've been using it for a couple of our projects and it's been really great.

The issue I am facing right now is that my workflow with 1000s of jobs is dramatically slow because of Gush. Here is an example:

class LinkFlow < Gush::Workflow
  def configure(batch_id, max_index)
    map_jobs = []

    100_000.times do
      map_jobs << run Map
    end

    run Reduce, after: map_jobs
  end
end

Playing around with Gush, I found out that after each job completes it has to visit all of its dependent jobs (Reduce in our case) and try to enqueue them. But in order to decide whether such a job can be enqueued, Gush needs to check that all of its dependencies (the Map jobs in our case) are finished.

https://github.com/chaps-io/gush/blob/master/lib/gush/job.rb#L90

The problem with this code is that every time one of the Map jobs finishes, this check calls Gush::Client#find_job once for each of the Map jobs.

This produces a massive number of SCAN operations and dramatically decreases performance because of this line of code:

https://github.com/chaps-io/gush/blob/master/lib/gush/client.rb#L119
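
Putting the two linked pieces together, the hot path boils down to something like this (a simplified paraphrase for illustration, not the actual Gush source):

# Simplified sketch of the hot path (paraphrased, not copied from lib/gush/job.rb).
# Every finished Map job triggers this check on Reduce, and each check issues
# one find_job call (and therefore one Redis SCAN) per dependency.
def ready_to_enqueue?(client, workflow_id, job)
  job.incoming.all? do |dependency_name|
    client.find_job(workflow_id, dependency_name).succeeded?
  end
end
# With 100_000 Map jobs this runs 100_000 find_job calls, and it runs once per
# finished Map job, so the total work is on the order of n^2 SCAN operations.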

I am not sure what the best solution is in this case. I've tried to address it by changing the way Gush stores serialized jobs: instead of storing jobs individually, store them in a hash per workflow/job type. I already have my own implementation, but I still have to play around with benchmarks:

rubyroidlabs@4ee1b15
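
The commit above has the actual implementation; purely as an illustration of the idea (the key names below are made up for the example, not the ones Gush really uses):

require 'json'
require 'redis'

redis       = Redis.new
workflow_id = 'wf-123'            # placeholder workflow id for the example
map_names   = ['Map-1', 'Map-2']  # placeholder dependency names

# Instead of one Redis key per job (located via SCAN), keep all serialized jobs
# of a workflow in a single hash:
redis.hset("example.jobs.#{workflow_id}", 'Map-1', { finished_at: Time.now.to_i }.to_json)

# Checking whether Reduce can start then becomes one HMGET instead of N SCANs:
states = redis.hmget("example.jobs.#{workflow_id}", *map_names)
ready  = states.all? { |json| json && JSON.parse(json)['finished_at'] }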

@pokonski what do you think here?

pokonski commented:

You are right that it's slow. This is a known issue because (tl;dr) Gush was never intended for such big workflows.

A possible solution would be to allow multiple backends instead of just Redis. Ideally a graph database, since a workflow is a graph; then knowing whether a node can be executed would be a single query.

Saicheg commented Jul 10, 2018

@pokonski yeah, I agree that some kind of graph solution would be ideal.

At the moment I am seeing a 10x performance boost with my current solution, just by changing the way the data is stored. Would you have a chance to review my changes? I can submit a PR if it makes sense to you.

pokonski commented Jul 10, 2018

10x boost is great! Can you create a PR with those changes?

The downside is that it will break compatibility, so it will require a major version release.

xtagon commented Aug 24, 2018

@Saicheg I am very interested in your 10x boost PR as well 👍

xtagon commented Aug 24, 2018

Scratch that, I didn't see that the PR was already merged! So instead let me say thank you for the PR :)

schorsch commented Nov 6, 2018

I tested the merged version and the job times stayed the same for ~1500 jobs, instead of the runtimes increasing with each job. I am creating Highcharts images for ~300,000 charts, each taking 1-8 seconds (depending on DB calls and remote chart server request times). Before, with v1.1.2, the times went up to 90 seconds per job, making Gush pretty useless.

By the way, I know that Gush was not made to handle 1000s of jobs. But what is the point of using such a lib just for small things that could be handled in a single dedicated Sidekiq job class? Obviously there are reasons, but you really do need batch/workflow handling for big jobs. I don't see myself paying for Sidekiq for the rest of my app's lifetime (10+ years) just because of one or two batch jobs. So yes, I can live with some drawbacks and code fiddling.
Of all the alternative batch/workflow libs, Gush looks the most promising, so thanks for your hard work!

pokonski commented Nov 7, 2018

I do have an idea about making it way faster for huge workflows, but it would require running a separate process; I'll definitely experiment with it first.

Ideally having a graph database instead of Redis (or something like https://oss.redislabs.com/redisgraph/ ) would make it trivial to query for jobs which have fulfilled dependencies (this is the slowest part), but that is a big requirement for most users who don't need 1000s of jobs.
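
Purely as an illustration of that idea (invented node labels and relationship names, not anything an actual implementation uses), a single graph query could list every pending job that is still blocked; any pending job not in the result has all of its dependencies fulfilled and is ready to run:

# Hypothetical Cypher for a workflow stored as a graph; it would be sent to
# RedisGraph via the GRAPH.QUERY command (e.g. through the redis gem's generic
# command interface -- an assumption, adjust to whatever client you use).
BLOCKED_JOBS_QUERY = <<~CYPHER
  MATCH (dep:Job)-[:PRECEDES]->(job:Job)
  WHERE job.state = 'pending' AND dep.state <> 'succeeded'
  RETURN DISTINCT job.name
CYPHER
# redis.call('GRAPH.QUERY', 'example-workflow-graph', BLOCKED_JOBS_QUERY)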

"But what is the point of using such a lib just for small things that could be handled in a single dedicated Sidekiq job class?"

I developed it for quite big, but still static, workflows where parts could fail often (fetching from various APIs) without blocking other jobs that don't depend on the failed one.

pokonski commented Nov 25, 2018

@Saicheg @xtagon @schorsch I have a WIP branch with an experimental feature that uses RedisGraph for dependency resolution. Its downside is that it requires a custom Redis module, but if you have time, please do check it out:

  1. Install the RedisGraph module
  2. Use Gush from the redis-graph branch

I am super curious whether it will help with your huge workflows. I'll prepare benchmarks on my side, too.
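
For anyone wanting to try those two steps, a minimal setup could look roughly like this (assumptions: Bundler for pulling the branch and Docker for the module; adjust to your environment):

# Gemfile -- point Gush at the experimental branch mentioned above
gem 'gush', github: 'chaps-io/gush', branch: 'redis-graph'

# The RedisGraph module has to be loaded into the Redis server itself; one way
# (an assumption, not the only option) is the official Docker image:
#   docker run -p 6379:6379 redislabs/redisgraph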

pokonski commented Nov 25, 2018

I did a benchmark based on @Saicheg's example workflow and the results are in:

redis-graph:

➜  gush git:(redis-graph) ✗ be ruby benchmarks/workflows.rb
Warming up --------------------------------------
         BigWorkflow     1.000  i/100ms
Calculating -------------------------------------
         BigWorkflow      0.208  (± 0.0%) i/s -      7.000  in  33.680191s

master:

➜  gush git:(master) ✗ be ruby benchmarks/workflows.rb
Warming up --------------------------------------
         BigWorkflow     1.000  i/100ms
Calculating -------------------------------------
         BigWorkflow      0.031  (± 0.0%) i/s -      1.000  in  32.736042s

So, it seems using RedisGraph makes the whole thing ~7 times faster 💥

This is just a first step, because there are more places that could benefit from it, like reading payloads from incoming jobs (which also calls find_job in a loop).
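
For reference, the payload-reading loop mentioned above has roughly this shape (a paraphrased sketch, not the actual source):

# Gathering the outputs of all incoming jobs before running a job calls
# find_job once per dependency, so a Reduce with 100_000 Map parents pays the
# same per-dependency SCAN cost again.
def incoming_payloads(client, workflow_id, job)
  job.incoming.map do |job_name|
    dependency = client.find_job(workflow_id, job_name)   # one lookup each
    { class: dependency.class.to_s, output: dependency.output_payload }
  end
end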

schorsch commented Dec 7, 2018

GREAT!
I ran a test with a couple of thousand items and chained two other jobs after the main payload (create 5k images => run png_optimizer => clean upload cache folder).
No problems with the job runtimes, which stay the same for the whole run.

Two things to mention:

  1. When reloading a running flow and getting its status (flow.reload & flow.status), I get an error.
  2. When creating a new flow, its status is running.

At the end of the run the optimize / cache-clean jobs seem to have been executed multiple times, BUT I don't know for sure since I did not log / puts anything. I only see the job times in the Sidekiq console with their job IDs. See those 30-second jobs:

2018-12-07T09:37:19.956Z 17810 TID-ih53e Gush::Worker JID-989cb5b487e521ffdfce8b04 INFO: start
2018-12-07T09:37:19.957Z 17810 TID-j664e Gush::Worker JID-b3b946228ab0011614b3c5b8 INFO: done: 2.706 sec
2018-12-07T09:37:20.225Z 17810 TID-ih6de Gush::Worker JID-a39e917406fca1be2d317b04 INFO: start
2018-12-07T09:37:20.226Z 17810 TID-ih4gm Gush::Worker JID-a1876a4b106d772e68a71f9c INFO: done: 1.821 sec
2018-12-07T09:37:21.099Z 17810 TID-owcfa Gush::Worker JID-34ef67b11d4e6efbb57ea3b1 INFO: start
2018-12-07T09:37:21.194Z 17810 TID-ih6xm Gush::Worker JID-2274fdd8053896b914547655 INFO: done: 2.479 sec
2018-12-07T09:37:21.284Z 17810 TID-nydsi Gush::Worker JID-dc2002bb6d981a62f471751a INFO: done: 2.505 sec
2018-12-07T09:37:21.285Z 17810 TID-owa2e Gush::Worker JID-1956ca527824e235148cf1d5 INFO: start
2018-12-07T09:37:21.394Z 17810 TID-ih4ri Gush::Worker JID-301cc0983eaf2d0c7c9931fe INFO: start
2018-12-07T09:37:21.394Z 17810 TID-ow9ke Gush::Worker JID-81a664ebcfdf0c3a4ef28b34 INFO: done: 2.719 sec
2018-12-07T09:37:37.588Z 17810 TID-nydsi Gush::Worker JID-5248e04f4d5734ca541f1007 INFO: start
2018-12-07T09:37:37.589Z 17810 TID-ih6de Gush::Worker JID-a39e917406fca1be2d317b04 INFO: done: 17.364 sec
2018-12-07T09:37:37.668Z 17810 TID-ih53e Gush::Worker JID-989cb5b487e521ffdfce8b04 INFO: done: 17.711 sec
2018-12-07T09:37:37.669Z 17810 TID-kxtvy Gush::Worker JID-78735afa0bda9cda084ad145 INFO: start
2018-12-07T09:37:51.924Z 17810 TID-ih6xm Gush::Worker JID-4041a705315cbad40860730b INFO: start
2018-12-07T09:37:51.924Z 17810 TID-owbru Gush::Worker JID-0692a8fef15a48c5769c3519 INFO: done: 33.285 sec
2018-12-07T09:37:52.012Z 17810 TID-ih4gm Gush::Worker JID-eaa05e252692e40581b1ff6f INFO: start
2018-12-07T09:37:52.014Z 17810 TID-owa2e Gush::Worker JID-1956ca527824e235148cf1d5 INFO: done: 30.729 sec
2018-12-07T09:37:52.023Z 17810 TID-ih6de Gush::Worker JID-7230f63e5efb4bb6ad2b4047 INFO: start
2018-12-07T09:37:52.023Z 17810 TID-owcfa Gush::Worker JID-34ef67b11d4e6efbb57ea3b1 INFO: done: 30.924 sec
2018-12-07T09:37:52.315Z 17810 TID-ih6r6 Gush::Worker JID-61f41e756a4fe3ce60b3942e INFO: done: 33.509 sec
2018-12-07T09:37:52.316Z 17810 TID-ih4y6 Gush::Worker JID-584fd4625651c66bbb51af82 INFO: start
2018-12-07T09:37:52.342Z 17810 TID-ih4oq Gush::Worker JID-25d16ec4c95be53cebb617b0 INFO: start
2018-12-07T09:37:52.343Z 17810 TID-ih4ri Gush::Worker JID-301cc0983eaf2d0c7c9931fe INFO: done: 30.949 sec
2018-12-07T09:37:52.550Z 17810 TID-kxtvy Gush::Worker JID-78735afa0bda9cda084ad145 INFO: done: 14.881 sec
2018-12-07T09:37:52.552Z 17810 TID-nydsi Gush::Worker JID-5248e04f4d5734ca541f1007 INFO: done: 14.963 sec

Is there any (easy) way of finding out which job a given Gush::Worker JID executed? I looked at the source but did not find anything directly.

Update: I added the good old puts into the jobs, and in fact the jobs scheduled after the image creation are executed multiple times.

pokonski commented:

Thanks for testing this @schorsch! I'll have a look at why they are executed multiple times. Can you provide the workflow you used? Just the workflow definition will do :)

schorsch commented:

class CreateImagesWorkflow < Gush::Workflow
  def configure(klass_name)
    klass = "RI::Chart::Cmd::#{klass_name}".constantize

    # find all codes and schedule individual image jobs, ~3,000-30,000 per kind
    img_jobs = klass.codes.map do |code|
      run RI::Chart::Job::CreateImage, params: { code: code, klass_name: klass_name }
    end

    # schedule the optimizer, which figures out the files from the given class
    run RI::Chart::Job::OptimizeImages, after: img_jobs, params: { klass_name: klass_name }
    # finally clean the upload cache once optimization is done
    run RI::Chart::Job::CleanUploadCache, after: RI::Chart::Job::OptimizeImages
  end
end

The called jobs all descend from Gush::Job, expect their params as strings, and are pretty small in terms of LOC.

By the way, I am visualizing German health-care data; here is an example of a public view where you can see some of the generated chart images: https://app.reimbursement.info/icds/F10

pokonski commented:

Please see the discussion I started about potential ways to improve the bottlenecks mentioned here: #95
