
Performance benchmark: Express #2417

Closed
Lakitna opened this issue Aug 25, 2020 · 98 comments

Labels
☠ stale Marked as stale by the stale bot, will be removed after a certain time.

Comments

@Lakitna
Contributor

Lakitna commented Aug 25, 2020

As discussed here #1514 (comment)

To get a better feel for the performance impact of changes in Stryker we should add some more benchmarks.

For this benchmark, we'll use https://github.com/expressjs/express.

This is a simple one covering a straightforward Node.js case:

  • CommonJS with no instrumentation
  • Well tested (1147 tests)
  • Decent size (1800 lines of source code, 4064 lines total)
  • Mocha test framework
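
For reference, a Stryker configuration for a benchmark like this could look roughly as follows. This is a hedged sketch using documented option names; the actual config checked into the perf suite may differ.

    // stryker.conf.js (illustrative only, not the exact benchmark configuration)
    module.exports = {
      packageManager: 'npm',
      testRunner: 'mocha',
      coverageAnalysis: 'perTest',
      mutate: ['lib/**/*.js'], // Express keeps its source under lib/
      reporters: ['progress', 'clear-text', 'html'],
      // concurrency (maxConcurrentTestRunners in Stryker 3) and timeoutMS are tuned later in this thread
    };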

Baseline results

Stryker@3.3.1

| % score | # killed | # timeout | # survived | # no cov | # error | duration |
| --- | --- | --- | --- | --- | --- | --- |
| 98.21 | 1432 | 492 | 35 | 0 | 155 | 00:08:15 |

Stryker@4.Beta.3

| % score | # killed | # timeout | # survived | # no cov | # error | duration |
| --- | --- | --- | --- | --- | --- | --- |
| 89.01 | 1541 | 201 | 207 | 8 | 68 | 00:04:17 |

With "maxConcurrentTestRunners": 8:

| % score | # killed | # timeout | # survived | # no cov | # error | duration |
| --- | --- | --- | --- | --- | --- | --- |
| 88.50 | 1623 | 109 | 217 | 8 | 68 | 00:05:55 |
@Lakitna
Contributor Author

Lakitna commented Aug 25, 2020

I've tried to get a basic idea of how long this benchmark would take (on my pc anyway) and I actually ran into a very obscure bug.

It's so obscure that I'm not even sure if it should be fixed. I've detailed it in #2418

@nicojs
Member

nicojs commented Aug 25, 2020

Thanks a lot for wanting to add the benchmark! And the issue is also a juicy one we should fix.

Beers for everyone! 🍻

@nicojs nicojs added this to the 4.0 milestone Aug 25, 2020
@Lakitna
Contributor Author

Lakitna commented Aug 25, 2020

Beers for everyone! 🍻

But not yet in file paths! 🍻

@nicojs
Member

nicojs commented Aug 27, 2020

The issue should be fixed @Lakitna so feel free to proceed 😉

Do you want me to release a new beta version?

@Lakitna
Contributor Author

Lakitna commented Aug 27, 2020

Awesome!

I'll make it work without a new release. Though it is a shame that we can't make this benchmark work for Stryker@3 as a comparison.

@nicojs
Member

nicojs commented Aug 27, 2020

Hmm good point. Maybe we can remove the tests that use the fixture?

@nicojs
Member

nicojs commented Aug 27, 2020

Or we can backport this fix, your choice.

@Lakitna
Contributor Author

Lakitna commented Aug 27, 2020

Maybe we can remove the tests that use the fixture?

Yeah... I had a bit of a brainfart there... I'm so used to never skipping tests that I didn't even consider it. But for a benchmark that's not an issue.

@Lakitna
Contributor Author

Lakitna commented Aug 27, 2020

I'm running into an issue here. It feels like I'm missing a symlink or something.

I've created my branch from master so we can run this benchmark to compare Stryker@3 with Stryker@4. But following the Contributing docs, I can't run npm run perf or npm run e2e. Both fail on a require:

    "Error: Cannot find module '../src/StrykerCli'\r\n" +
    'Require stack:\r\n' +
    '- C:\\Data\\stryker\\e2e\\test\\angular-project\\node_modules\\@stryker-mutator\\core\\bin\\stryker\r\n' +

See also full logs:

e2e.log

perf.log

Am I wrong in assuming that the master branch contains correct code?

@nicojs
Member

nicojs commented Aug 27, 2020

I see you've figured it out already (PR).

Question: what are your thoughts on running these tests? Just from time to time on our own (dev) pcs? Or on Github actions hardware? Want to use tools for it?

Any guidance here would be excellent; I don't have a lot of experience in this field.

@Lakitna
Contributor Author

Lakitna commented Aug 27, 2020

I see you've figured it out already (PR).

I sadly didn't. The PR is WIP because it currently pulls Stryker from NPM.

I also don't have a lot of experience with performance testing, especially in an open-source setting. We can't enforce a periodic performance test here, not the way you can when you have a dedicated team anyway. However, this is a topic I'm interested in.

It might be good to create a quick perf test at some point that can be used to ensure that a new PR does not introduce a major slowdown. After that, I think we should test at least before every release. But that would either take a long time (run perf tests on the current version (from NPM) and on the new version (release candidate)) or require identical hardware every time. In any case, I'll read up on the subject and come back to you.

@nicojs
Member

nicojs commented Aug 28, 2020

I'll be working on this today @Lakitna.

I'll be investigating if it makes sense to use git submodules for our performance test. I was pointed in this direction by @hugo-vrijswijk . It saves disk space, but also makes it easier to keep them up-to-date.

This will be my first endeavour with git submodules 🤞.

@Lakitna
Contributor Author

Lakitna commented Aug 28, 2020

Git submodules would not work in this case, because we have to skip that one test that fails in Stryker@3. Other than that, it would be a good improvement.

but also makes it easier to keep them up-to-date

I'm not sure we should want to update benchmarks. Benchmarks are most useful if they're always the same. This is why I mention the specific commit on which the benchmark is based https://github.com/stryker-mutator/stryker/pull/2431/files?file-filters%5B%5D=.js#diff-c10b5c0440a9a6b469ebef610ebc860aR4

@nicojs
Member

nicojs commented Aug 28, 2020

Git submodules would not work in this case, because we have to skip that one test that fails in Stryker@3.

Way ahead of you #2433 😎

(Either that, or the initial test run failed locally, it took me a while to figure out why, and I had a personal facepalm moment 🤦‍♂️)

@nicojs
Member

nicojs commented Aug 28, 2020

My results:

Baseline results

Hardware

DELL Latitude 5500, Windows 10
Intel Core™ i7-8665U (4 physical cores, 8 logical)

Stryker@3.3.1

| % score | # killed | # timeout | # survived | # no cov | # error | duration |
| --- | --- | --- | --- | --- | --- | --- |
| 82.40 | 1597 | 18 | 345 | 0 | 154 | 00:09:32 |

Stryker@4.Beta.3

| % score | # killed | # timeout | # survived | # no cov | # error | duration |
| --- | --- | --- | --- | --- | --- | --- |
| 82.52 | 1592 | 23 | 334 | 8 | 68 | 00:07:42 |

So a performance improvement for me 😎

@Lakitna
Contributor Author

Lakitna commented Aug 28, 2020

Wait, 82.40%? I got 98.2%... That's a big difference.

Your result

| Stryker | % score | # killed | # timeout | # survived | # no cov | # error | duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 3.3.1 | 82.40 | 1597 | 18 | 345 | 0 | 154 | 00:09:32 |

My result

| Stryker | % score | # killed | # timeout | # survived | # no cov | # error | duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 3.3.1 | 98.21 | 1432 | 492 | 35 | 0 | 155 | 00:08:15 |

Looks like I had a lot of timeouts where you did not. I'll take a look later today to see if I can pinpoint this. I think it might be maxConcurrency. I was running 16 instances on 16 cores.

@nicojs
Member

nicojs commented Aug 28, 2020

I think it might be maxConcurrency. I was running 16 instances on 16 cores.

It might have to do with that, indeed. Old-fashioned process starvation. What did you do during testing? For example, browsing the internet is usually enough to screw up the run when running at max concurrency. I brought an additional laptop to work today 😎

@nicojs
Member

nicojs commented Aug 28, 2020

I've updated my comment with the results. So 00:07:42 with Stryker@4.beta.3, a significant performance improvement. This is unfortunate in a way, I was hoping to reproduce your experience @Lakitna, but alas.

@Lakitna, could you maybe also try it out with the new instructions in #2402?

@Lakitna
Contributor Author

Lakitna commented Aug 31, 2020

I tried to run it, but ran into an issue with the submodule. By now you might have noticed that git submodules is a feature that has not been given a lot of love. In this case, I ran into the fact that it defaults to using SSH over HTTPS. I don't have SSH set up for GitHub, so it failed on git submodule update.

I think you're working on the branch fix/test-perf-tests at the moment, but I can't commit to that. All you have to do is change line 3 in .gitmodules:

[submodule "perf/test/express"]
	path = perf/test/express
	url = https://github.com/expressjs/express.git

then:

git submodule sync
git submodule init
git submodule update

Now the submodule is pulled over HTTPS, making it easier for everyone to clone Stryker.

Also: I got build errors on master, and fix/test-perf-tests is beta (and fails on the Angular project). Which branch should I be looking at?

Edit: I found the global PERF_TEST_GLOB_PATTERN and am now testing Express with Stryker@4.beta on fix/test-perf-tests :)

@Lakitna
Contributor Author

Lakitna commented Aug 31, 2020

Hardware

CPU Intel Core i9-9880H
Physical cores 8
Logical cores 16
OS Windows 10 2004

I also have a pretty powerful GPU, but Stryker does not make use of it. Maybe an interesting one for the far future. :) Though I'm not sure how well it would work.

Stryker@4.Beta.3

| % score | # killed | # timeout | # survived | # no cov | # error | duration |
| --- | --- | --- | --- | --- | --- | --- |
| 89.01 | 1541 | 201 | 207 | 8 | 68 | 00:04:17 |

That's some speed improvement! There are still timeouts, though I'm definitely not running in a clean environment. When running with "maxConcurrentTestRunners": 8 I sit between 50-90% CPU utilisation (and over the machine's cooling capacity):

| % score | # killed | # timeout | # survived | # no cov | # error | duration |
| --- | --- | --- | --- | --- | --- | --- |
| 88.50 | 1623 | 109 | 217 | 8 | 68 | 00:05:55 |

Edit: I can get rid of the timeouts when I increase the timeout to "timeoutMS": 30000 (without maxConcurrentTestRunners):

| % score | # killed | # timeout | # survived | # no cov | # error | duration |
| --- | --- | --- | --- | --- | --- | --- |
| 86.65 | 1680 | 14 | 253 | 8 | 70 | 00:06:33 |

@nicojs
Member

nicojs commented Aug 31, 2020

Intel Core i9-9880H

Have to admit, a bit jealous 😉

fix/test-perf-tests

It has since been merged into epic/mutation-switching (which is currently our "develop branch"). I will delete the branch. I see there are a lot of open branches 😳 will do a cleanup 😅

Great to see you're having the same performance improvement. I'm curious to know how your specific use case differs from express 🤷‍♂️. Maybe very large source files? Which get even bigger with mutation switching?

I can't really explain the timeouts in your settings. Setting timeoutMS to 30 seconds is a kind of "shotgun approach". It will work, but it worsens the performance more than needed. There are 14 timeouts. I assume they are valid timeouts, as in mutants that result in an infinite loop for example. That means that 14 times, the test runner will hang for 30 seconds before it is killed.

Stryker's own calculation for the timeout per mutant should work: timeoutForTestRun = netTime * timeoutFactor + timeoutMS + overhead. Could you try to add a higher timeoutFactor? That might be a better fit for your use case.
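
To make that formula concrete, here is a small worked example. The timeoutFactor default of 1.5 is taken from the table below, and 5000 ms is Stryker's documented timeoutMS default; netTime and overhead are made-up numbers for illustration.

    // Illustrative numbers only
    const netTime = 40;        // ms the covering tests took during the initial test run
    const timeoutFactor = 1.5; // default
    const timeoutMS = 5000;    // default
    const overhead = 200;      // per-process overhead measured by Stryker (made up here)

    const timeoutForTestRun = netTime * timeoutFactor + timeoutMS + overhead;
    console.log(timeoutForTestRun); // 5260 ms before the mutant is flagged as a timeout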

@Lakitna
Contributor Author

Lakitna commented Aug 31, 2020

Have to admit, a bit jealous 😉

I know right! I had to endure a few years of maddening slowness before I got it though 😒

The timeoutMS did feel like an imperfect solution. I tried it to see if we need a larger default timeout. I've taken a look at the remaining timeouts and they are legit. while loops and for loops mainly.

Compared to my (closed-source) project, Express has much larger files.

| Project | Lines of source code | # Source files | Avg source lines per file | # mutants (Stryker@3) | # tests | Test run duration as reported by Mocha |
| --- | --- | --- | --- | --- | --- | --- |
| Express | 1824 | 11 (11 mutated) | 165.81 | 2114 | 1148 (0 skipped) | 2581 ms |
| My project | 6514 | 110 (87 mutated) | 59.22 | 3286 | 1070 (18 skipped) | 1220 ms |

Are any of these numbers triggering anything for you?

@Lakitna
Contributor Author

Lakitna commented Aug 31, 2020

Could you try to add a higher timeoutFactor?

| timeoutFactor | # timeouts | runtime |
| --- | --- | --- |
| 1.5 (default) | 205 | 00:04:14 |
| 3 | 205 | 00:05:03 |
| 6 | 96 | 00:06:06 |
| 12 | 18 | 00:06:50 |
| 24 | 17 | 00:06:53 |

As expected, the total runtime gradually increases. I am surprised how high I had to crank timeoutFactor.

I've never encountered anything remotely like this in my project. Has anyone?

edit: I've re-run beta on my project, and there too I got more timeouts. Though I only got 2 extras. They appeared in this code:

    const component = _.omit(someObject, [
        'a',
        'b',
        'c',
        'd',
        'e',
        'f',
        'g',
        'h',
        'i',
        'j',
        'k',
        'l',
        'm',
        'n',
    ]);

There are two timeouts reported here: one on the mutation [] (line 1) and one on '' (line 4).
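
For readers less familiar with the mutators involved, the two mutants in question would look roughly like this (my reconstruction, not Stryker's exact output):

    // Assuming lodash and a someObject comparable to the snippet above
    const _ = require('lodash');
    const someObject = { a: 1, b: 2, c: 3 /* ... */ };

    // Array mutant (reported on line 1): the key list becomes empty,
    // so nothing is omitted at all
    const mutantOne = _.omit(someObject, []);

    // String-literal mutant (reported on line 4): one of the keys becomes an
    // empty string, so that single property is no longer omitted
    const mutantTwo = _.omit(someObject, ['a', 'b', '' /* was 'c' */, 'd' /* ... */]);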

@nicojs
Member

nicojs commented Aug 31, 2020

Do you have a hard disk by any chance (instead of SSD)? Maybe the disk is busy with I/O and that is causing the timeouts?

The @stryker-mutator/mocha-runner is clearing node's require cache between each run, so depending on how the OS handles these I/O calls, it might get bottlenecked there. That would explain why your CPU is not at max while running with --concurrency equal to your logical core count.
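
As a sketch of the general idea (not the mocha-runner's exact code), clearing the require cache between runs amounts to something like this:

    // Wipe node's module cache so the next test run re-loads (and re-reads from
    // disk) every source and test file; this is the repeated I/O referred to above.
    function clearRequireCache() {
      for (const moduleId of Object.keys(require.cache)) {
        delete require.cache[moduleId];
      }
    }

    // before each mutant's test run:
    clearRequireCache();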

Maybe some other reasons why timeouts appear? Encoding 4k video in the background? Mining bitcoin? There can be many reasons for these timeouts. Maybe there is tooling that can monitor your PC better?

Running with higher timeout settings is fine for Stryker; you're just waiting a bit longer than strictly necessary. We will be implementing #1472 soon after Stryker 4, so you will be able to exclude specific mutations with a comment. Not ideal, but better than nothing.

@Lakitna
Contributor Author

Lakitna commented Sep 1, 2020

I'm sporting an SSD, and it's barely in use during a run (max 4%). So that would be unlikely.

I did observe the following in my project though. It is not as pronounced in the Express bench.

[screenshot: available memory (RAM) during the run]

You're looking at available memory (RAM) during the run. It looks like something is causing a memory leak. You can also see that something panicked when it couldn't get more memory and quickly cleaned up at about 90% into my run. The vertical line at the end marks the end of the run.

However, I do think this is unrelated to the timeouts issue.

@Lakitna
Contributor Author

Lakitna commented Sep 1, 2020

Maybe some other reasons why timeouts appear? Encoding 4k video in the background? Mining bitcoin? There can be many reasons for these timeouts. Maybe there is tooling that can monitor your PC better?

I'm not running on a dedicated machine. I definitely have other programs open. But hardware usage is minimal. Below is the almost idle baseline in which I run Stryker.

[screenshot: near-idle hardware utilisation baseline]

With this kind of hardware, and this much of it available, I shouldn't have to worry about timeouts.

@Lakitna
Contributor Author

Lakitna commented Sep 10, 2020

I didn't expect a lot of difference between beta.4 and beta.5. Just to be sure, here are some runs:

In 4.0.0-beta.5:

| Concurrency | % score | # killed | # timeout | # survived | # no cov | # error | Avg tests/mutants | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15 (default) | 90.05 | 1090 | 213 | 136 | 8 | 578 | 16.63 | 00:04:02 |
| 15 (default) | 89.17 | 1451 | 294 | 204 | 8 | 68 | 23.63 | 00:05:22 |
| 15 (default) | 88.71 | 1432 | 304 | 213 | 8 | 68 | 19.86 | 00:05:27 |
| 1 | 85.79 | 1550 | 129 | 270 | 8 | 68 | 90.78 | 00:32:04 |
| 1 | 86.82 | 1572 | 127 | 250 | 8 | 68 | 81.77 | 00:30:33 |
| 1 | 84.72 | 1524 | 134 | 291 | 8 | 68 | 100.08 | 00:33:58 |

Not sure what happened in the first run... But overall it seems to be pretty much the same result.

Currently, I'm running 15 runs with increasing concurrency. Might help with the default concurrency setting.

@Lakitna
Contributor Author

Lakitna commented Sep 10, 2020

I've run the Express bench in 4.0.0-beta.5 for every concurrency from 1 to default for my machine.

| Concurrency | % score | # killed | # timeout | # survived | # no cov | # error | Avg tests/mutants | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 85.74 | 1549 | 129 | 271 | 8 | 68 | 90.27 | 00:32:18 |
| 2 | 86.25 | 1555 | 133 | 261 | 8 | 68 | 79.01 | 00:18:06 |
| 3 | 89.06 | 1617 | 126 | 206 | 8 | 68 | 55.68 | 00:11:43 |
| 4 | 87.28 | 1577 | 131 | 241 | 8 | 68 | 61.21 | 00:10:06 |
| 5 | 88.20 | 1590 | 136 | 223 | 8 | 68 | 48.00 | 00:08:13 |
| 6 | 87.99 | 1579 | 143 | 227 | 8 | 68 | 54.59 | 00:07:52 |
| 7 | 88.50 | 1552 | 180 | 217 | 8 | 68 | 50.92 | 00:07:46 |
| 8 | 89.93 | 1592 | 168 | 189 | 8 | 68 | 43.11 | 00:06:27 |
| 9 | 89.42 | 1541 | 209 | 199 | 8 | 68 | 34.37 | 00:06:22 |
| 10 | 89.95 | 1088 | 210 | 137 | 8 | 582 | 22.65 | 00:05:16 |
| 11 | 90.24 | 1530 | 236 | 183 | 8 | 68 | 24.35 | 00:05:29 |
| 12 | 89.17 | 1494 | 251 | 204 | 8 | 68 | 28.48 | 00:05:45 |
| 13 | 89.61 | 1079 | 223 | 143 | 8 | 572 | 16.44 | 00:04:18 |
| 14 | 89.63 | 1487 | 267 | 195 | 8 | 68 | 21.43 | 00:05:06 |
| 15 | 89.17 | 1433 | 312 | 204 | 8 | 68 | 18.84 | 00:05:15 |

I created a Google Sheet to be able to draw charts out of this.

https://docs.google.com/spreadsheets/d/11dqDoxqbXVCQiBVtMq_eZpgMljTMPL-voI1MQdA-gDA/edit?usp=sharing

There is some interesting stuff in there I think.

@nicojs
Member

nicojs commented Sep 10, 2020

Yeah, we should try to figure out why there are differences. I think previous runs are influencing new runs in the same process: Mocha attaches global handlers to the process and reports the current run as failed.

https://github.com/mochajs/mocha/blob/b3eb2a68d345aa9ce5791dddfea41a13be743b78/lib/runner.js#L1063-L1066

For example, the done of middleware might not be called because of a mutant, but the express timeout for it might occur in the next run (just guessing here).
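
A contrived sketch of that guess (hypothetical code, not actual Express or Stryker internals): a mutant swallows a callback, a timer scheduled by the code under test outlives the run, and Mocha's global handler blames whatever test happens to be running when it fires.

    // Code under test: the active mutant removes the call to next()...
    function middleware(req, res, next) {
      // original: next();
      // mutant:   (call removed)
      // ...while the library also schedules a fallback timer:
      setTimeout(() => {
        throw new Error('response timed out');
      }, 5000);
    }

    // The test for this mutant finishes (killed, survived or timed out) long before
    // that timer fires. If the same worker process then starts the next mutant's
    // run, the stray timer throws during that run, and Mocha's uncaughtException
    // handler (linked above) fails whatever unrelated test is currently executing.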

@Lakitna I'm curious what happens if you use the command test runner. That one creates clean processes per mutant. Should be very consistent each run, right?

@Lakitna
Contributor Author

Lakitna commented Sep 10, 2020

I'm curious what happens if you use the command test runner.

Let's find out 😄 I'll run a bunch of times with a few different concurrencies. I'll probably do that tonight after work.

The quest for stable runs continues!

Edit: I have a command lined up for 16 runs. 2 each for the following concurrencies: 1, 3, 5, 7, 9, 11, 13, 15.

@bartekleon
Member

@Lakitna you have 8 cores, right? Looking at it, 8 seems to be the threshold for "stable" tests. So I guess instead of logical cores - 1 we should use the number of physical cores.

@Lakitna
Contributor Author

Lakitna commented Sep 10, 2020

I think the previous runs are influencing new runs in the same process.

This triggered me. I wonder if Mocha's --bail can be the cause of this. Express uses it.

Looking at it 8 is a threshold for "stable" tests.

I still wouldn't call it fully stable, but it is a lot better.

@bartekleon
Member

bartekleon commented Sep 10, 2020

"I still wouldn't call it fully stable, but it is a lot better."
yea.. But timeouts are not a big deal. The bigger problem is that sometimes mutants are marked as killed and sometimes as survived... that bothers me the most... :/
unless its that some timeouts are just these missing ones, and there are new timeouts 🤔

@Lakitna
Contributor Author

Lakitna commented Sep 10, 2020

Exactly. If the score is stable I can live with some instabilities at this point; we've been working on this for quite a while now. But there is a 4.5 percentage-point difference between the highest-scoring run and the lowest-scoring run. That's a significant difference.

As a user I would find it annoying that the same line of code can be a timeout in one run, but killed in the next. It's not good for faith in Stryker.

@Lakitna
Contributor Author

Lakitna commented Sep 11, 2020

Oh man, runs with the command runner take a very long time.

Here are the results:

{
  "$schema": "../../../packages/core/schema/stryker-schema.json",
  "testRunner": "command"
}

| Concurrency | % score | # killed | # timeout | # survived | # no cov | # error | Avg tests/mutants | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15 | 99.70 | 1338 | 681 | 6 | 0 | 0 | 0.66 | 00:15:11 |
| 15 | 100.00 | 1394 | 631 | 0 | 0 | 0 | 0.69 | 00:14:13 |
| 13 | 100.00 | 1384 | 641 | 0 | 0 | 0 | 0.68 | 00:16:04 |
| 13 | 100.00 | 1319 | 706 | 0 | 0 | 0 | 0.65 | 00:16:39 |
| 11 | 99.95 | 1367 | 657 | 1 | 0 | 0 | 0.68 | 00:18:40 |
| 11 | 100.00 | 1334 | 691 | 0 | 0 | 0 | 0.66 | 00:18:59 |
| 9 | 100.00 | 1470 | 555 | 0 | 0 | 0 | 0.73 | 00:20:15 |
| 9 | 99.90 | 1510 | 513 | 2 | 0 | 0 | 0.75 | 00:20:10 |
| 7 | 99.65 | 1778 | 240 | 7 | 0 | 0 | 0.88 | 00:20:28 |
| 7 | 99.56 | 1760 | 256 | 9 | 0 | 0 | 0.87 | 00:21:09 |
| 5 | 99.11 | 1907 | 100 | 18 | 0 | 0 | 0.95 | 00:22:14 |
| 5 | 99.01 | 1912 | 93 | 20 | 0 | 0 | 0.95 | 00:21:26 |
| 4 | 98.02 | 1890 | 95 | 40 | 0 | 0 | 0.95 | 00:26:56 |
| 4 | 97.68 | 1894 | 84 | 47 | 0 | 0 | 0.96 | 00:26:19 |
| 3 | 97.09 | 1867 | 99 | 59 | 0 | 0 | 0.95 | 00:32:29 |
| 3 | 97.23 | 1872 | 97 | 56 | 0 | 0 | 0.95 | 00:32:23 |
| 1 | 85.53 | 1587 | 145 | 293 | 0 | 0 | 0.93 | 01:54:31 |
| 1 | 85.98 | 1594 | 147 | 284 | 0 | 0 | 0.93 | 01:52:45 |

I've updated the Google sheet accordingly. See the second tab https://docs.google.com/spreadsheets/d/11dqDoxqbXVCQiBVtMq_eZpgMljTMPL-voI1MQdA-gDA/edit?usp=sharing

In this setup, the run becomes useless with >= 9 concurrency. After that, it becomes slightly more useful with every step down in concurrency. Timeouts stabilize with <= 5 concurrency, but the survived count doesn't. At concurrency 1 things became less stable, but the system was under more load during those runs.

Execution times are also interesting here. There is a huge gap between concurrency 5 and anything below that. I'll queue up some concurrency 4 later to see how long those take. This one can have a major impact on CI. Update: Concurrency 4 actually clocks in smack in the middle of 3 and 5. The trendline seems to be pretty accurate.

I think it's weird we're not seeing the expected stability here. According to these numbers, the Mocha runner is actually more stable across concurrencies. It is notable that running multiple times with the same concurrency is very stable.

Finally a fun fact: These numbers correspond to a total runtime of 09:30:51. My poor CPU 😄

@nicojs nicojs removed this from the 4.0 milestone Sep 25, 2020
@nicojs
Member

nicojs commented Sep 25, 2020

I've removed the milestone. We still want to find the root cause of the discrepancies in timeouts. Pretty sure it's something specific to the Express tests, but we don't want to block the 4.0 milestone for this. Hope you agree @Lakitna

@bartekleon
Member

@Lakitna
Contributor Author

Lakitna commented Sep 28, 2020

Hmm I get where you're coming from. I don't like it though. The simple fact that this is possible feels iffy to me. I at least want to verify that we can make things work properly with another project.

How about https://github.com/tj/commander.js? It's CommonJS with Jest. 358 tests running in 7.369 seconds.

@bartekleon
Member

bartekleon commented Oct 3, 2020

@Lakitna I took a look at our serialization. Could you please run the flamegraph again on branch https://github.com/kmdrGroch/stryker/tree/test? I would like to compare the two.
(If interested, see PR #2525.)

@Lakitna
Contributor Author

Lakitna commented Oct 6, 2020

I had some busy times last week, so it's a bit later than usual. Here we go:

| Branch | Concurrency | % score | # killed | # timeout | # survived | # no cov | # error | Avg tests/mutants | Duration | Flamegraph |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| kmdrGroch/test | 15 | 88.71 | 1406 | 330 | 213 | 8 | 68 | 22.67 | 00:05:11 | flamegraph.zip |
| Tag: v4.0.0-beta.10 | 15 | 88.15 | 1401 | 324 | 224 | 8 | 68 | 24.24 | 00:04:59 | beta10.zip |
| Tag: v4.0.0-beta.9 | 15 | 88.50 | 1417 | 315 | 217 | 8 | 68 | 24.33 | 00:05:02 | beta9.zip |

I'm not familiar enough with your changes to make any judgements here ;)

@bartekleon
Member

bartekleon commented Oct 6, 2020

Ok so it seems it reduces the time on childProcessProxies significantly (from 35% time to 19%):
[flamegraph screenshots: before and after]
But the core package is not the one holding the performance back... So most likely optimisations are only to be found in the instrumenter or runner packages.

@bartekleon
Member

And I see you ran the code on beta.10; I think beta.9 is more relevant as a base test :)

@Lakitna
Contributor Author

Lakitna commented Oct 6, 2020

Yeah, I feel like it would be worth it to get profiling working on child processes. However, it would involve spawning the child processes in a different way. That stuff is finicky. I can get the child processes to show up in the Chrome inspector, but it won't actually create a profiling report due to the short life of the processes.

I've also added beta.9 to the results above ;)

@Lakitna
Contributor Author

Lakitna commented Oct 6, 2020

I've actually managed to get some CPU profiles from child processes! 😃

When running node with --cpu-prof we get a file that can be interpreted by the Chrome inspector. However, I can't seem to find the actual tests in the profile. On top of that, I only seem to get one per full run. Edit: I just finished a run without any profiles. It could have something to do with some processes exiting differently in some edge cases.
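
For anyone trying to reproduce this: passing the flag on to a forked worker generally looks like the sketch below (generic Node.js, not Stryker's actual child-process code; './worker' is a placeholder module).

    const { fork } = require('child_process');

    // Forward V8's CPU profiler flags to the child. Node only writes the
    // .cpuprofile file when the child exits gracefully, which is why a process
    // that crashes or is killed with SIGKILL leaves no profile behind.
    const child = fork(require.resolve('./worker'), [], {
      execArgv: ['--cpu-prof', '--cpu-prof-dir=./profiles'],
    });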

Do the runner children exit with SIGKILL? A process ending in SIGKILL will not generate a CPU profile, and there might be other signals that cause it to bail as well.

Download, extract, and open with Chrome inspector: child-process-289.zip
Preview:
[screenshot: CPU profile opened in the Chrome inspector]

@bartekleon
Member

I think we had some issue where child processes don't exit on Linux (or at least in some runner cases). I'm not sure about exiting, though. From what I saw previously, unless there is a "timeout" in the process, the child process doesn't exit! It restarts when there is a timeout (the ones you can see after running mutation tests).

@bartekleon
Member

And I'm not exactly sure how this inspector works. It's just messy for me :/ Are you maybe free in a few days to give me a tour of these reports?

@nicojs
Member

nicojs commented Oct 6, 2020

I think we had some issue that childprocesses don't exit in Linux (or at least in some runner cases).

Stryker itself will exit child processes when it runs to the end (with or without an error). However, I noticed hanging chrome browsers after running the karma integration tests on Linux, causing me to open #2519, but Gareth actually mentioned to me that he hasn't noticed any dangling processes, so that needs further investigation.

Do the runner children exit with SIGKILL?

Yes. Stryker uses tree-kill here: https://github.com/stryker-mutator/stryker/blob/d330af98141613cebd34a95b9fe85583e9af3b2b/packages/core/src/utils/objectUtils.ts#L51. However, the exit strategy hasn't been thought out carefully; if you know of a better way I'm totally open to suggestions.
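
For context, that tree-kill call boils down to something like this (a generic sketch, not the exact code in objectUtils.ts):

    const treeKill = require('tree-kill');

    function killWorkerTree(pid) {
      // SIGKILL terminates the whole process tree immediately; the children never
      // get a chance to run exit hooks, which is relevant for the CPU-profile
      // discussion above.
      treeKill(pid, 'SIGKILL', (err) => {
        if (err) {
          console.error(`Failed to kill process tree of pid ${pid}`, err);
        }
      });
    }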

@Lakitna
Contributor Author

Lakitna commented Oct 7, 2020

I wonder if it's as simple as using SIGTERM instead of SIGKILL. If we keep treeKill in place, it should be fine in theory. I'll try some different signals.

I don't have access to a Linux machine at the moment, though. Is this a thing that can be tested in CI?

@bartekleon
Member

If you have spare time you could always try WSL; it's faster to set up than pure Linux (unless you work on a Mac).

@Lakitna
Contributor Author

Lakitna commented Oct 7, 2020

I've got it working! 🎉

It turns out SIGKILL wasn't the issue here. I had to call process.exit(0) when the child process gets its dispose message. This causes the child process to go through its exit procedures, which include generating the CPU profile. I'll make a PR in a bit ;)
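
In essence the fix boils down to something like this on the child-process side (a minimal sketch; the real message shape in Stryker differs):

    // Exit explicitly when the parent asks the worker to dispose, so node runs its
    // normal exit procedure and flushes the --cpu-prof profile to disk.
    process.on('message', (message) => {
      if (message && message.kind === 'dispose') { // 'kind'/'dispose' are illustrative
        // ...perform any cleanup, then:
        process.exit(0); // graceful exit, so the .cpuprofile file gets written
      }
    });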

These are all the CPU profiles I managed to generate with the Express bench. Notice how not every process generated a profile, probably due to crashing processes. Whoops, that's a big file! It's 38mb! Github doesn't like that, so here is a WeTransfer link: https://we.tl/t-V0M8XJ9WEF.

Since I didn't have to change SIGKILL it should all still work on Linux.

@Lakitna
Contributor Author

Lakitna commented Oct 7, 2020

It's just messy for me :/ are you maybe free in several days and could give me an me tour in these reports?

I'm also not an expert in any way. You might be better off looking for something on Youtube. However, the new profiles are a lot easier to read. There is actually stuff in there that I recognize ;)

Edit: It should be possible to make a 0x-style flame graph out of these cpu profiles. It'll require some D3 fiddling.

@stale

stale bot commented Oct 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the ☠ stale Marked as stale by the stale bot, will be removed after a certain time. label Oct 8, 2021
@stale stale bot closed this as completed Nov 7, 2021