Add in-order scheduling policy #10902

Merged: 9 commits merged into jestjs:master on Jan 11, 2021

Conversation

@MichaReiser (Contributor) commented Dec 2, 2020

Summary

The jest-worker implementation today uses round robin to load-balance work across workers if sticky workers aren't used.

This is problematic if the result from the worker is in the critical path to creating more work that can be scheduled on the thread pool. Metro, for example, uses the workers to transform JavaScript files and extract their dependencies, and then schedules the transformation of those dependencies. It's therefore critical that the first few files are transformed as quickly as possible to build up a queue with enough work to use all workers.

The problem with the use of round robin is that jest assigns each task to a new worker, each of which pays the cost of requiring modules (if inline-requires are used) and JITing the hot paths, which, in the case of Metro, is a serious overhead of a couple of seconds.

This diff adds the in-order scheduling policy and makes it configurable for clients (keeping round robin as the default). The in-order policy assigns a task to the first free worker (starting with worker 0) or, if all workers are busy, puts it in each worker's queue to be picked up by the first worker that becomes available.
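To make the difference concrete, here is a simplified sketch of how the two policies pick a worker for the next task. This is an illustration only, not the actual Farm.ts implementation (which also handles sticky workers and per-worker queues):

```ts
type Policy = 'round-robin' | 'in-order';

function selectWorker(
  idle: boolean[], // busy/idle flag per worker
  lastUsed: number, // worker index that received the previous task
  policy: Policy,
): number | null {
  const n = idle.length;
  for (let offset = 0; offset < n; offset++) {
    // in-order always probes from worker 0, so early tasks keep hitting
    // the same, already warm worker; round robin probes from the worker
    // after the one used last, spreading tasks across the pool.
    const i = policy === 'in-order' ? offset : (lastUsed + 1 + offset) % n;
    if (idle[i]) {
      return i;
    }
  }
  return null; // all workers are busy: the task stays queued
}
```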

Why call it workerSchedulingPolicy and not just schedulingPolicy? My motivation is to make it clear that it does not affect the order in which tasks are scheduled, e.g. FIFO vs. LIFO or a priority queue. A taskSchedulingPolicy might be something we want to add in the future.
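For illustration, a minimal sketch of how a client might opt in, assuming the option takes the string values 'round-robin' (the default) and 'in-order' as described in this PR; the './transform-worker' module and its transform method are hypothetical:

```ts
import {Worker} from 'jest-worker';

// Hypothetical worker module exposing a `transform(file)` method.
const worker = new Worker(require.resolve('./transform-worker'), {
  numWorkers: 4,
  // Prefer the lowest-numbered idle worker over rotating through all
  // workers (round robin stays the default).
  workerSchedulingPolicy: 'in-order',
}) as Worker & {transform: (file: string) => Promise<string>};

async function run() {
  const out = await worker.transform('./App.js');
  console.log(out);
  await worker.end(); // shut the farm down when done
}

run();
```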

I tested this new scheduling with Metro:

  • transforming 2 changed files (B is a dependency of A):
    Performance improved from 12.7s to 10.8s, which is almost 2 seconds.
    Main reason: only one worker had to pay the initialization cost.
  • Full rebuild: No significant difference either way

Outdated
The round robin implementation was added in #4921 by @mjesun, but unfortunately without a detailed explanation of why it's superior. My understanding is that jest-worker is different from e.g. cluster in that jest-worker knows how busy each worker is: a worker is either busy or idle. It then doesn't make a difference whether all the work is performed by one worker or by different workers (assuming all are hot), as long as the highest possible concurrency is used. This isn't true if sticky workers are used, because they can then make use of caches, but sticky workers aren't affected by this change.

Test plan

Empty

Round Robin

---------------------------------------------------------------------------
total worker-farm: { wFGT: 83718, wFPT: 83229 }
total jest-worker: { jWGT: 29023, jWPT: 27983 }
---------------------------------------------------------------------------
% improvement over 100000 calls (global time): 65.33242552378222
% improvement over 100000 calls (processing time): 66.3783056386596

In Order

total worker-farm: { wFGT: 83830, wFPT: 83278 }
total jest-worker: { jWGT: 29008, jWPT: 27962 }
---------------------------------------------------------------------------
% improvement over 100000 calls (global time): 65.39663604914708
% improvement over 100000 calls (processing time): 66.42330507456951

Load

Round Robin


---------------------------------------------------------------------------
total worker-farm: { wFGT: 97859, wFPT: 97345 }
total jest-worker: { jWGT: 57297, jWPT: 56187 }
---------------------------------------------------------------------------
% improvement over 100000 calls (global time): 41.44943234653941
% improvement over 100000 calls (processing time): 42.280548564384404

In Order

total worker-farm: { wFGT: 97806, wFPT: 97286 }
total jest-worker: { jWGT: 56403, jWPT: 55332 }
---------------------------------------------------------------------------
% improvement over 100000 calls (global time): 42.33175878780443
% improvement over 100000 calls (processing time): 43.12439611043727

@mjesun (Contributor) commented Dec 2, 2020

O hai! Initial implementor here :)

While I'm not opposed to the change, I think it should be configurable and definitely not the default. There are a few things to consider here that may shed some light on why this was made this way:

  • The usual case for a farm of workers is that you expect to be pushing heavy tasks to them. The case you describe is that your farm is so unloaded that your second task is pushed after the first one has finished, and it's going to a new worker rather than to the previous one. I wouldn't say this is the default use case, but if it is in Metro it may be that you don't need the farm at all.

  • The cost of initialization is, as you say, improved only when using inline-requires. But the truth is that most projects built for Node from scratch don't do custom transformations, and even if they do, they don't configure Babel's CommonJS plugin or FB's inline-require one to modify the default behavior. I agree this brings important improvements, but unfortunately it doesn't seem to be what's in use globally. What that means is that as soon as a worker is brought to life, the whole cost is paid - i.e. an all-or-nothing cost. That said, it's surprising to me that on an 8-core machine (7 processes) you're getting a 2-second improvement by avoiding parsing code on startup; it's definitely possible, but I think it's not the default use case either.

  • Finally, there's also a shared resources thing here. If processes are caching information, and one of them dies, you want the cache to be shared across all of them. If you keep pushing the load to a single process, and this one turns out to die, you'll need to restart building your caches from scratch. Something that in the particular case of Babel and its huge caches may pose a problem. This is in part another reason why traffic load balancers are often configured with RR even if they don't handle load at all.

Hope that explains the reasoning that the initial PR was lacking 🙂

@MichaReiser (Contributor, author) commented:

Thank you @mjesun for these explanations. I'm open to making this configurable.

The usual case for a farm of workers is that you expect to be pushing heavy tasks to them. The case you describe is that your farm is so unloaded that your second task is pushed after the first one has finished, and it's going to a new worker rather than to the previous one. I wouldn't say this is the default use case, but if it is in Metro it may be that you don't need the farm at all.

Metro does use workers because it doesn't know ahead of time whether it has to transpile 1 or e.g. 5000 files, and there might be more use cases where the workers are both the consumers and the producers of tasks, in which case RR can be problematic initially.

The cost of initialization is, as you say, improved only when using inline-requires

It's not limited to inline requires. Every worker has to pay for the JIT overhead. A "hot" worker has the code loaded and JITed; the latter only happens after the worker function has been executed at least once.

Finally, there's also a shared resources thing here. If processes are caching information, and one of them dies, you want the cache to be shared across all of them. If you keep pushing the load to a single process, and this one turns out to die, you'll need to restart building your caches from scratch. Something that in the particular case of Babel and its huge caches may pose a problem. This is in part another reason why traffic load balancers are often configured with RR even if they don't handle load at all.

That makes a lot of sense.

I'll go ahead and introduce a new configuration option that allows choosing the scheduling mode.

@MichaReiser (Contributor, author) commented:

Finally, there's also a shared resources thing here. If processes are caching information, and one of them dies, you want the cache to be shared across all of them. If you keep pushing the load to a single process, and this one turns out to die, you'll need to restart building your caches from scratch. Something that in the particular case of Babel and its huge caches may pose a problem. This is in part another reason why traffic load balancers are often configured with RR even if they don't handle load at all.

OK, it still doesn't make 100% sense to me.

  • Pushing everything to the first couple of workers only happens if there are fewer tasks than workers.
  • Having the cache warm in another worker means we had to pay for the cost of building it once. That cost should have been the same as when it needs to be rebuilt after a worker crashes (which is less likely because all processes are running on the same machine).
  • The caches we're talking about here aren't file dependent. If a cache needs to be present, then it's probably best to use sticky workers and implement work-stealing functionality to distribute the load across the workers (and maybe change the internal map of task key -> worker to a simple hash function); sticky workers are sketched right after this list.
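For reference, a minimal sketch of what sticky workers look like from the client side, using jest-worker's computeWorkerKey option; the './transform-worker' module and the transformFile method name are made up for the example:

```ts
import {Worker} from 'jest-worker';

// Pin every task for a given file to the same worker via computeWorkerKey,
// so that worker's in-memory cache stays useful across tasks.
const worker = new Worker(require.resolve('./transform-worker'), {
  computeWorkerKey: (method, ...args) =>
    // Same key => same worker; returning null falls back to the regular
    // (non-sticky) scheduling policy.
    method === 'transformFile' ? String(args[0]) : null,
});
```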

I'm not arguing against making this configurable, and our different viewpoints are probably the best reason why this should be configurable indeed :)

@mjesun (Contributor) commented Dec 2, 2020

Having the cache warm in another worker means we had to pay for the cost of building it once. That cost should have been the same as when it needs to be rebuilt after a worker crashes (which is less likely because all processes are running on the same machine).

That is correct as long as you have equivalent caches in all workers. If that is the case, then the building cost is the same regardless of the worker; and in this case I agree it is more beneficial to keep sending work to the same process, as the cache is already warm in that one.

But if the cache isn't globally equivalent (which I personally think is the common case here, based on how it's used in Jest and Metro), then losing the "fat" worker that contains all the warmed cache leaves you having to rebuild the entire cache from scratch elsewhere. If the cache is distributed, losing one worker only kills 1/n of the whole cache, so things can keep working (with degraded performance), but it's not 100 to 0%.

Either way, I get that we're going to make it configurable; I'm just making the point to provide more information in case that helps 🙂

@cpojer cpojer self-assigned this Dec 3, 2020
@MichaReiser MichaReiser changed the title Use first available worker instead of round robin Add first-come scheduling policy Dec 3, 2020
@MichaReiser MichaReiser changed the title Add first-come scheduling policy Add in-order scheduling policy Dec 3, 2020
@MichaReiser (Contributor, author) commented:

@cpojer would you mind having a look at the updated PR in the new year? Thanks!

Three resolved review threads on packages/jest-worker/src/Farm.ts (outdated).
@MichaReiser (Contributor, author) commented:

It's unlikely that @cpojer will be able to look at this any time soon. @SimenB, do you feel comfortable merging this change without approval from @cpojer?

@SimenB (Member) commented Jan 11, 2021

Since it's opt-in I think this should be fine to merge 👍

@SimenB SimenB merged commit f0dc993 into jestjs:master Jan 11, 2021
@github-actions (bot) commented:

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Please note this issue tracker is not a help forum. We recommend using StackOverflow or our discord channel for questions.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 10, 2021