Improve test runner parallelism #172

lightmare · 2021-08-24T08:42:41Z

This is an attempt to more evenly distribute testing work among child processes.

The first thing I did was get rid of the longRunningWorker. These tests take ages to complete, so I don't understand what's the point of running them all in one process.

Then I limited the number of tasks assigned to each worker at a time. Most of the tests were loaded and then buffered in the supervisor process anyway, once pipe buffers were full.

devsnek · 2021-08-24T12:34:46Z

so i gave this a few runs, and it seems to take an average of about seven seconds off my test run time. i also notice the ci time went up from 13 to 19 minutes. unless this is significantly faster for you i'd rather not merge it.

lightmare · 2021-08-24T14:17:53Z

When you run it without slow tests, it sometimes finishes a few seconds sooner — in particular when the original distribution gives some workers less work (time-wise) they finish early and you lose parallelism near the end of a run.

With slow tests, for me the original takes ~58 mins (3+1 workers), whereas this PR takes ~40 mins (3 workers).

I'll need to investigate what I messed up, because something's not right about the timings.
First, the CI seems to have stalled for twice as long near the end than it did previously.
Second, this PR didn't even print correct final counters (72323 + 3412 < 75737).
Third, summing up total CPU time spent gives me <100 min for the original, and ~120 min for this PR.

Instead, run "slow" tests on all workers, after they're done with all "fast" tests.

This has two effects: 1) Workers don't get assigned the same number of tests each, which could take different amounts of time, and so some workers ended too early, while others still had many tests queued up. 2) The supervisor process doesn't load almost all test sources at once, gobbling up 300MB of memory. Now it's less than 100MB.

lightmare · 2021-08-26T11:30:31Z

Update: the issue with CI taking longer was probably caused by running 2 workers (i.e. 3 processes including the supervisor) on a VM with 2 processors. Trying to keep a short, but always nearly-full queue, introduced a lot of expensive context switches. I've fixed this by decreasing the queue length below which scheduleTests() loop in the supervisor wakes up.

The second issue (incorrect final counters) was a bug in this PR, fixed.

The third issue (more CPU time used) is in the engine or node itself, not in this PR. It happens on the main branch as well, which you can confirm by commenting this line in slowlist:

#built-ins/{encode,decode}URI?(Component)/**/*.js

and running:

time NUM_WORKERS=1 node --predictable test/test262/test262.js -- test/built-ins/encodeURI/

#######################
 engine262 Test Runner
 Detected 4 CPUs
 Not running on CI
#######################

[01:29|:   62|+   62|-    0|»    0] (0.00/s)

real	1m30.051s
user	1m29.123s
sys	0m1.185s

versus:

time NUM_WORKERS=3 node --predictable test/test262/test262.js -- test/built-ins/encodeURI/

#######################
 engine262 Test Runner
 Detected 4 CPUs
 Not running on CI
#######################

[00:50|:   62|+   62|-    0|»    0] (0.00/s)

real	0m51.201s
user	1m50.333s
sys	0m1.664s

When running the above I also measured time spent in some parts of the runner, and found that the 20-second increase in user time matches a 20-second increase in total process.hrtime spent in this call:

engine262/test/test262/test262.js

Line 308 in 8993f98

completion = realm.evaluateScript(test.contents, { specifier });

lightmare marked this pull request as draft August 24, 2021 08:56

lightmare force-pushed the test-more-parallelism branch from 70f7497 to 20b2e2d Compare August 24, 2021 09:54

lightmare marked this pull request as ready for review August 24, 2021 10:20

lightmare added 3 commits August 26, 2021 12:22

test: remove longRunningWorker

3722ddf

Instead, run "slow" tests on all workers, after they're done with all "fast" tests.

test: don't accumulate slow test results in workers

a07b37f

lightmare force-pushed the test-more-parallelism branch from 5d64d94 to 950f338 Compare August 26, 2021 11:08

devsnek force-pushed the main branch from f14cb2c to 13d9d5a Compare July 23, 2022 02:51

bakkot mentioned this pull request Jul 28, 2022

suggestion: failing tests should be xfail'd, not skipped #195

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve test runner parallelism #172

Improve test runner parallelism #172

lightmare commented Aug 24, 2021

devsnek commented Aug 24, 2021

lightmare commented Aug 24, 2021 •

edited

lightmare commented Aug 26, 2021

Improve test runner parallelism #172

Are you sure you want to change the base?

Improve test runner parallelism #172

Conversation

lightmare commented Aug 24, 2021

devsnek commented Aug 24, 2021

lightmare commented Aug 24, 2021 • edited

lightmare commented Aug 26, 2021

lightmare commented Aug 24, 2021 •

edited