
test(perf): add benchmark for jest runner #2618

Merged

merged 15 commits into master from test/add-jest-runner-perf-test on Nov 26, 2020

Conversation

@nicojs (Member) commented Nov 16, 2020

This adds a performance test for a Jest project: a big one, named lighthouse. I want to focus on the jest runner in the near future, so I thought it made sense to add a benchmark first.

@nicojs (Member Author) commented Nov 16, 2020

@Lakitna Still keeping high a.t.m. 😅

> This adds a performance test for a Jest project: a big one, named lighthouse. I want to focus on the jest runner in the near future, so I thought it made sense to add a benchmark first.

I still agree with the benefits of "dropping down" as specified in #2434. I might add some lower-level benchmark tests when I implement the improvements.

@Lakitna (Contributor) commented Nov 17, 2020

Sounds good :)

Let me know when you're ready for a comprehensive baseline on multiple concurrencies.

@nicojs (Member Author) commented Nov 17, 2020

It runs on GH Actions now (took some time, because I had to use yarn to install this project; the joy of Node.js development).

It only mutates "lighthouse-core/audits/**/*.js", because the run already takes 2h 18min on GH Actions (that's with --concurrency 2). I did a test yesterday on my dev laptop with --concurrency 4; that took ~1h. This is a big one.
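For reference, a minimal sketch of how that glob might look in stryker.conf.json. The actual config file isn't shown in this thread, so the surrounding values are assumptions; `mutate`, `testRunner`, and `concurrency` are standard Stryker options:

```json
{
  "testRunner": "jest",
  "mutate": ["lighthouse-core/audits/**/*.js"],
  "concurrency": 2
}
```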

```
Running performance tests on lighthouse (matched with glob pattern "lighthouse")
(lighthouse) exec "/home/runner/work/stryker/stryker/packages/core/bin/stryker run"
lighthouse: 609.987ms last log message: 09:56:12 (3366) INFO ConfigReader Using stryker.conf.json
lighthouse: 1:00.878 (m:ss.mmm) last log message: Mutation testing 0% (elapsed: <1m, remaining: ~28m) 53/8988 tested (48 survived, 0 timed out)
lighthouse: 2:00.889 (m:ss.mmm) last log message: Mutation testing 1% (elapsed: ~1m, remaining: ~1h 8m) 150/8988 tested (108 survived, 0 timed out)
[...]
lighthouse: 2:15:43.324 (h:mm:ss.mmm) last log message: Mutation testing 96% (elapsed: ~2h 14m, remaining: ~4m) 8694/8988 tested (3547 survived, 13 timed out)
lighthouse: 2:16:53.325 (h:mm:ss.mmm) last log message: Mutation testing 98% (elapsed: ~2h 16m, remaining: ~2m) 8854/8988 tested (3597 survived, 13 timed out)
lighthouse: 2:18:01.703 (h:mm:ss.mmm) last log message:
lighthouse: 2:18:02.116 (h:mm:ss.mmm)
all tests: 2:18:02.116 (h:mm:ss.mmm)
```

This is what the report looks like:

[screenshot: mutation testing report]

@nicojs (Member Author) commented Nov 17, 2020

> Sounds good :)
>
> Let me know when you're ready for a comprehensive baseline on multiple concurrencies.

Yes, will do.

@nicojs (Member Author) commented Nov 18, 2020

@Lakitna it took me some work, but I think it's ready for benchmarking. You can run it locally with:

```
cross-env PERF_TEST_GLOB_PATTERN=lighthouse npm run perf
```

Here is a job that runs all performance tests: https://github.com/stryker-mutator/stryker/runs/1420244888?check_suite_focus=true

@nicojs (Member Author) commented Nov 19, 2020

Results with the GH workflow:

- angular-cli: 1:04.958
- express: 20:49.812
- lighthouse: 2:38:14.446

"noImplicitAny": true,
"noImplicitReturns": true,
"noImplicitThis": true,
"noImplicitAny": false,
A Member commented on this diff:

is it necessary to switch off these options?

A Member commented:

Otherwise I think it is good :)

@nicojs (Member Author) replied Nov 19, 2020:

Yeah, not really important, since Stryker disables type checking. I tried to run the tests locally and then the compilation failed, probably because it is really old TypeScript code (it stems from Angular 4 days).

@Lakitna (Contributor) commented Nov 19, 2020

I tried to run, but I got a short runtime and a mutation score of 0%...

[screenshot: short run ending with a 0% mutation score]

Something is going wrong here. Any ideas what's happening?

@ commit: c7d1ea2

Edit: Could it maybe be the plugin stuff? I'm on Windows after all.

```
Running performance tests on lighthouse (matched with glob pattern "lighthouse")
(lighthouse) exec "C:\Data\stryker\packages\core\bin\stryker run --plugins C:\Data\stryker\packages\mocha-runner\src\index.js,C:\Data\stryker\packages\karma-runner\src\index.js,C:\Data\stryker\packages\jest-runner\src\index.js,C:\Data\stryker\packages\jasmine-runner\src\index.js,C:\Data\stryker\packages\mocha-runner\src\index.js,C:\Data\stryker\packages\typescript-checker\dist\src\index.js"
```

Update: The initial test run does not fail when I introduce a deliberate error. I guess there is an error reporting issue.

Update: I keep updating, it seems. Turns out I introduced an error in the wrong test o.0 The initial test run does fail when it should.

@nicojs (Member Author) commented Nov 19, 2020

> Edit: Could it maybe be the plugin stuff? I'm on Windows after all.
>
> Running performance tests on lighthouse (matched with glob pattern "lighthouse")
> (lighthouse) exec "C:\Data\stryker\packages\core\bin\stryker run --plugins C:\Data\stryker\packages\mocha-runner\src\index.js,C:\Data\stryker\packages\karma-runner\src\index.js,C:\Data\stryker\packages\jest-runner\src\index.js,C:\Data\stryker\packages\jasmine-runner\src\index.js,C:\Data\stryker\packages\mocha-runner\src\index.js,C:\Data\stryker\packages\typescript-checker\dist\src\index.js"

Might be an issue. I can try on Windows as well in a few hours.

@nicojs (Member Author) commented Nov 19, 2020

Hmm, I've got the same result as you did, @Lakitna. I'm pretty sure this is related to the way Jest works on Windows, in combination with running the jest-runner from a different directory. I will look into it more; it would be great if we could run the perf tests on Windows as well.

@nicojs (Member Author) commented Nov 19, 2020

[screenshot: successful run on Windows]

Works now since #2623

Hope that didn't break anything for others 🤷‍♂️

@Lakitna (Contributor) commented Nov 19, 2020

> Hope that didn't break anything for others 🤷‍♂️

What could possibly go wrong 🤷‍♂️

@Lakitna (Contributor) commented Nov 19, 2020

I'm running concurrency 15, 12, 8, and 4 right now. I think that should do it.

I am doing it on my desktop this time, though; with the long runtimes, that's easier. It's still an 8-core/16-thread CPU, this time a Ryzen 3700X. Just note that it might cause slight differences compared with the previous Express bench results.

Update: Only slightly related result. The memory thing I mentioned before is very apparent with this test suite. I still think it's not a performance issue, but it is notable.

[screenshot: memory usage during the run]

Whelp, I spoke too soon. This most definitely is a performance issue. You're looking at CPU slowdowns because of a lack of memory. Probably an issue with the test suite, not Stryker.

[screenshot: CPU slowdown caused by lack of memory]

@nicojs (Member Author) commented Nov 20, 2020

> Whelp, I spoke too soon. This most definitely is a performance issue. You're looking at CPU slowdowns because of a lack of memory. Probably an issue with the test suite, not Stryker.

Maybe adding --maxTestRunnerReuse 20 would help here? If you see a big difference, then we know it has to do with something in the test suite.
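For context: maxTestRunnerReuse makes Stryker dispose and recreate a test runner process after the given number of runs, which caps memory build-up in a long-lived worker. A minimal sketch of where it would live in stryker.conf.json (the other values here are illustrative, taken from this thread's runs):

```json
{
  "testRunner": "jest",
  "concurrency": 15,
  "maxTestRunnerReuse": 20
}
```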

@Lakitna (Contributor) commented Nov 20, 2020

Here are the results of this night's run. Ran at 1416611.

I made a mistake causing all runs to be at concurrency 15... Almost as if people are not very perceptive at night 🤭 On the bright side, we can see how stable the default concurrency is.

| Concurrency | % score | # killed | # timeout | # survived | # no cov | # error | Avg tests/mutant | Duration |
|---|---|---|---|---|---|---|---|---|
| 15 (default) | 59.62 | 5185 | 172 | 3628 | 0 | 3 | 9.87 | 00:44:27 |
| 15 (default) | 59.73 | 5180 | 187 | 3618 | 0 | 3 | 9.86 | 00:43:57 |
| 15 (default) | 59.63 | 5194 | 164 | 3627 | 0 | 3 | 9.99 | 00:42:41 |
| 15 (default) | 59.71 | 5179 | 186 | 3620 | 0 | 3 | 9.88 | 06:59:20 (machine went to sleep) |

All in all, it looks to be pretty stable. There are some differences, but they are within 0.1 percentage point.

I'm running different concurrencies now. This time for real, I double-checked.

@Lakitna (Contributor) commented Nov 20, 2020

Here are the results we actually need:

| Concurrency | % score | # killed | # timeout | # survived | # no cov | # error | Avg tests/mutant | Duration |
|---|---|---|---|---|---|---|---|---|
| 15 (default) | 59.63 | 5195 | 163 | 3627 | 0 | 3 | 10.01 | 00:43:29 |
| 12 | 59.60 | 5199 | 156 | 3630 | 0 | 3 | 10.11 | 00:45:57 |
| 8 | 59.59 | 5212 | 142 | 3631 | 0 | 3 | 10.58 | 00:54:41 |
| 7 | 59.57 | 5215 | 137 | 3633 | 0 | 3 | 10.98 | 00:59:20 |
| 6 | 59.21 | 5310 | 10 | 3665 | 0 | 3 | 21.17 | 01:07:24 |
| 5 | 59.21 | 5312 | 8 | 3665 | 0 | 3 | 21.33 | 01:18:03 |
| 4 | 59.22 | 5309 | 12 | 3664 | 0 | 3 | 21.32 | 02:24:17 |
| 4 | 59.21 | 5312 | 8 | 3665 | 0 | 3 | 21.33 | 01:30:19 |
| 3 | 59.21 | 5312 | 8 | 3665 | 0 | 3 | 21.33 | 02:04:57 |

The durations and tests/mutant are interesting here. I'll run some missing concurrencies so I can make a duration graph like I did for Express. First impressions suggest a tipping point somewhere between concurrency 8 and 4; to that end, I'll run 7, 6, and 5 to fill in the gaps.

Scores seem very stable. Timeouts are manageable. All in all, it seems to be a lot more stable compared to the Express bench.

@Lakitna (Contributor) commented Nov 22, 2020

The results are in (previous comment), and stability is great! :) On all metrics but runtime...

[graph: duration per concurrency]

The graph shows the duration per concurrency. The concurrency 4 run is just such a weird outlier; it makes me think there was an issue during that run or something. I'll run 4 and 3 to get some more data points.

@Lakitna (Contributor) commented Nov 23, 2020

I've updated the comment above once more. The 02:24:17 appears to have been a fluke; I imagine something like the Windows antimalware thingy was running during the run.

I've updated the duration graph:

[graph: updated duration per concurrency]

That looks a lot better :) The trend line actually fits this time!

@nicojs (Member Author) commented Nov 24, 2020

Wow! This is amazing. Thanks so much for taking the time to run this. Do you want me to put this graph in the readme? Then we should update it once I've implemented some improvements 😅

@Lakitna (Contributor) commented Nov 25, 2020

We can definitely use the data here to find out how much of a performance delta there is between changes :)

It would also be neat to show the basic relationship between concurrency and runtime.

If you're interested, I have a similar graph for the Mocha runner in the Express bench. It has more data points but shows the same relation between the two metrics.

Edit: It would also be neat to make a similar graph for the relation between mutant count and duration. However, that would require a pretty specialized test setup. Currently, I do not have the time to create such a setup.

@nicojs merged commit 5964d55 into master on Nov 26, 2020
@nicojs deleted the test/add-jest-runner-perf-test branch on Nov 26, 2020, 12:24
@nicojs (Member Author) commented Nov 27, 2020

> However, that would require a pretty specialized test setup. Currently, I do not have the time to create such a setup.

Do you mean that it would require a lot of scripting to automate it? That's true, and it would take a dedicated server to run. Right now we're using the free GH Actions hardware, and that won't do.

@Lakitna (Contributor) commented Nov 27, 2020

Hmm, not necessarily specialized hardware. We can run it on my machine as far as I'm concerned. Unless you want to automatically test for performance regression.

What I would like for this is a setup where we can test the same code with a variable number of mutations. To do that, however, we need a variable-sized codebase with a variable number of tests tied to it, all so we can simulate a project growing. The metric for growth would be mutation count.

It's basically to find out how Stryker scales with the codebase it tests. If we want to find out how Stryker handles growth, all other variables must scale linearly.

Imagine results like:

| Mutation count | Duration |
|---|---|
| 100 | xx:xx:xx |
| 200 | xx:xx:xx |
| 300 | xx:xx:xx |
| 400 | xx:xx:xx |
| 500 | xx:xx:xx |
| 600 | xx:xx:xx |
| ... | ... |

It's not a simple implementation o.0

And we might even want to make it more complex by adding the mutation score variable:

| Mutation count | Mutation score | Duration |
|---|---|---|
| 100 | 50-ish% | xx:xx:xx |
| 100 | 70-ish% | xx:xx:xx |
| 100 | 90-ish% | xx:xx:xx |
| 200 | 50-ish% | xx:xx:xx |
| 200 | 70-ish% | xx:xx:xx |
| 200 | 90-ish% | xx:xx:xx |
| ... | ... | ... |

@bartekleon (Member) commented Nov 27, 2020

I started "funny, small experiment" about these tables. With simple code:

const vals = []; for(let i = 0; i < 10000; i++) {
    vals.push(`export const test${i} = (a: number, b: number) => {
  return a + b;
};
`)
}

I am generating 20000 mutants, so I can make 100/1000/10000/100000 mutants and check the speed :)
But I also don't know how we should store these functions/tests. In 1 file? In multiple files? 10k per file? 100 per file? All of these could have varying test lengths :/
(also, the test cases should be a little harder than what I have done, but for a simple test this should be enough :D)
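For illustration, a hypothetical generator that parameterizes that choice: it writes the same kind of functions in chunks of `perFile` per source file, with a matching Jest test file per chunk. The file names and directory layout are made up, not what bartekleon actually used:

```ts
import * as fs from 'fs';
import * as path from 'path';

// Writes `total` generated functions, `perFile` per source file, plus a
// matching Jest test file per chunk, so single-file vs. multi-file layouts
// can be benchmarked by changing one parameter.
function generate(total: number, perFile: number, outDir = 'generated'): void {
  fs.mkdirSync(path.join(outDir, 'src'), { recursive: true });
  fs.mkdirSync(path.join(outDir, 'test'), { recursive: true });
  for (let file = 0; file * perFile < total; file++) {
    const from = file * perFile;
    const to = Math.min(from + perFile, total);
    let src = '';
    let test = `import * as fns from '../src/funcs${file}';\n`;
    for (let i = from; i < to; i++) {
      src += `export const test${i} = (a: number, b: number) => {\n  return a + b;\n};\n`;
      test += `it('test${i}', () => expect(fns.test${i}(1, 2)).toBe(3));\n`;
    }
    fs.writeFileSync(path.join(outDir, 'src', `funcs${file}.ts`), src);
    fs.writeFileSync(path.join(outDir, 'test', `funcs${file}.test.ts`), test);
  }
}

generate(10000, 100); // e.g. 10000 functions, 100 per file
```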

EDIT: actually with 20000 mutants in 1 file (10000 test functions and tests), I managed to crash VSC 🗡️

@Lakitna (Contributor) commented Nov 30, 2020

That's a great start, but it also shows how many variables there are :) In an ideal world, you would isolate a single variable for these kinds of tests. That will take quite a lot of effort, I'm afraid.

That being said. I'm interested in running this to see what kind of results we get. Can you share it in such a way that I can set the number of mutations on the command line? (e.g. set MUTATIONS=100 or --mutations=100) That way I can easily queue the runs and make them output to a file for later processing.
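For what it's worth, a tiny sketch of the flag handling being asked for here; both the MUTATIONS env var and the --mutations flag are just the names from the example above, not an existing interface:

```ts
// Hypothetical CLI handling: prefer --mutations=100, fall back to the
// MUTATIONS env var, default to 100.
const flag = process.argv.find((arg) => arg.startsWith('--mutations='));
const mutations = Number(flag?.split('=')[1] ?? process.env.MUTATIONS ?? 100);
console.log(`Generating ${mutations} mutations`);
```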

> actually with 20000 mutants in 1 file (10000 test functions and tests), I managed to crash VSC 🗡️

Awesome, I'd also be interested to see what makes it crash! Lack of RAM, I assume. It'd be interesting to find out what that means for large codebases, and for (TypeScript) codebases that bundle during transpilation.

@bartekleon (Member) commented

> That's a great start, but it also shows how many variables there are :) In an ideal world, you would isolate a single variable for these kinds of tests. That will take quite a lot of effort, I'm afraid.

Yea, I am going to try a single source/test file (100-500-1000-2500-5000-10000-20000), multi-file (100 mutants/tests per file), and random sizes: basically Math.random() with some tweaks :P

> That being said. I'm interested in running this to see what kind of results we get. Can you share it in such a way that I can set the number of mutations on the command line? (e.g. set MUTATIONS=100 or --mutations=100) That way I can easily queue the runs and make them output to a file for later processing.

Sure; actually, I'm thinking of making a repository for it so I could run everything at once HEHEHEHEHEHE

> Awesome, I'd also be interested to see what makes it crash! Lack of RAM, I assume. It'd be interesting to find out what that means for large codebases, and for (TypeScript) codebases that bundle during transpilation.

Yea, it seems that if you run it from VSC, the VSC process gets more and more RAM usage (an actual leak, I think XD); I got over 8GB at the crash point. But from Git I managed to get a normal run without any significant RAM overflow: 11 runs, up to 400MB each.

@Lakitna (Contributor) commented Nov 30, 2020

> Yea, I am going to try a single source/test file (100-500-1000-2500-5000-10000-20000), multi-file (100 mutants/tests per file), and random sizes: basically Math.random() with some tweaks :P

It'd be great if we could make a scatter plot with a trend line for those results, like the one above.
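For anyone reproducing those graphs outside a spreadsheet: the trend line is just a least-squares fit. A minimal sketch, with made-up sample points:

```ts
// Ordinary least-squares fit of y = slope * x + intercept over (x, y) points.
function linearFit(points: Array<[number, number]>): { slope: number; intercept: number } {
  const n = points.length;
  const sx = points.reduce((s, [x]) => s + x, 0);
  const sy = points.reduce((s, [, y]) => s + y, 0);
  const sxy = points.reduce((s, [x, y]) => s + x * y, 0);
  const sxx = points.reduce((s, [x]) => s + x * x, 0);
  const slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
  return { slope, intercept: (sy - slope * sx) / n };
}

// e.g. (mutant count, duration in minutes); these numbers are illustrative only:
console.log(linearFit([[100, 5], [200, 9], [300, 14]]));
```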
