
Geomean #16

Open
jfbastien opened this issue Nov 3, 2017 · 11 comments

@jfbastien

It would be great for geomean to have a confidence interval.

Also, the calculation of geomean seems off?

@bmeurer
Member

bmeurer commented Nov 3, 2017

The value shown is the geometric mean of the average runs/sec scores of the individual tests. An earlier version computed the geometric mean of the times of the individual tests, which was obviously off; that is probably what you saw.
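
For clarity, here is a minimal sketch of that calculation (not the benchmark's actual code), assuming each test reports an average runs/sec score:

```js
// Geometric mean of the per-test average runs/sec scores.
function geomean(runsPerSec) {
  // Sum the logs to avoid overflow/underflow when multiplying many scores.
  const logSum = runsPerSec.reduce((sum, x) => sum + Math.log(x), 0);
  return Math.exp(logSum / runsPerSec.length);
}

// e.g. geomean([120.5, 98.3, 14.7]) -> single headline score
```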

We'll be looking into adding a confidence interval for the geometric mean.

@bmeurer bmeurer self-assigned this Nov 3, 2017
@tebbi
Contributor

tebbi commented Nov 3, 2017

The confidence intervals we compute for the individual tests are actually the relative standard error of the mean of the individual runtimes. This would make perfect sense if the individual samples were statistically independent. However, since the samples are iterations running in the same isolate, they share feedback, optimized code, and GC state, so they are not independent. In particular, the first iterations are always slower than the later ones because feedback and optimized code are not yet available, so the reported error is systematically too big.
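
For reference, a rough sketch of that statistic (not the actual implementation), assuming an array of per-iteration runtimes in ms; as noted above, the independence assumption it relies on does not hold here:

```js
// Relative standard error of the mean: stddev of the mean divided by the mean.
function relativeStandardError(samples) {
  const n = samples.length;
  const mean = samples.reduce((a, b) => a + b, 0) / n;
  const variance =
    samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1);
  const standardError = Math.sqrt(variance / n); // stddev of the sample mean
  return standardError / mean; // e.g. 0.03 for roughly ±3%
}
```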

@tebbi
Contributor

tebbi commented Nov 3, 2017

One way to fix this is to separate out the warm-up phase and only measure the steady state of the system. However, this assumes that the steady state of the VM is always the same, which is probably not the case.

Another way is to re-run the complete benchmark as independently as possible (in a new iframe/realm, or even better: a new process) and only compute confidence intervals based on results of complete runs. This is what the ARES-6 runner does.
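
As a rough illustration of that approach (this is not the ARES-6 runner, and the postMessage protocol below is a made-up assumption), one could run the whole suite once per fresh iframe and aggregate only the per-run scores:

```js
// Run the complete suite once per fresh iframe so each run gets its own realm,
// then compute confidence intervals over the per-run scores.
async function runIsolated(suiteUrl, iterations) {
  const scores = [];
  for (let i = 0; i < iterations; i++) {
    const iframe = document.createElement("iframe");
    document.body.appendChild(iframe);
    const score = await new Promise((resolve) => {
      // Assumes the suite page posts a single numeric score when it finishes;
      // the message shape here is hypothetical.
      window.addEventListener("message", function handler(e) {
        if (e.source === iframe.contentWindow) {
          window.removeEventListener("message", handler);
          resolve(e.data.score);
        }
      });
      iframe.src = suiteUrl;
    });
    scores.push(score);
    iframe.remove();
  }
  return scores; // confidence intervals are computed over these complete runs
}
```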

@jfbastien
Author

My team is obviously biased towards liking the ARES-6 approach, in part for the reasons that you state.

@bmeurer
Member

bmeurer commented Nov 3, 2017

I guess we could remove the first N samples and do the average/stddev solely on the remaining samples.
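
A minimal sketch of that idea (the warm-up count is a hypothetical parameter, not something the harness currently exposes):

```js
// Drop the first N iterations, then compute mean/stddev on the rest.
function steadyStateStats(samples, warmupCount) {
  const steady = samples.slice(warmupCount);
  const n = steady.length;
  const mean = steady.reduce((a, b) => a + b, 0) / n;
  const stddev = Math.sqrt(
    steady.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1)
  );
  return { mean, stddev };
}
```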

iframe/realm is not really an option here, since Node doesn't support realms and this benchmark is primarily interesting for Node. There's only the vm module, but that does a lot of magic with the global object, which makes it unsuitable for representative performance measurement.

@bterlson

bterlson commented Nov 3, 2017

A couple of general comments, but no strong opinions:

  • Agree that testing in the vm module is likely more trouble than it's worth (much work went into eshost to make these realms work-ish, but it's not 100% representative).
  • Disagree with tossing out "outliers" unless it's just a single warm-up iteration or something. Removing samples should only be done with very strong motivation and a justification for why that data point is not statistically relevant.
  • High variance seems acceptable. Addressing high variance with more iterations takes time but likely gives richer data. Unfortunately, variance depends on many factors, and getting acceptably low variance may require wildly different numbers of samples on different implementations/environments (see the sketch after this list). Not sure how to solve this beyond making sure the number of samples is high enough for the worst-case variance observed in practice?
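
As a rough illustration of the last point (the numbers below are assumptions): for independent samples the relative standard error shrinks with the square root of the sample count, so the required number of samples grows quadratically with the observed coefficient of variation:

```js
// Samples needed to reach a target relative standard error, given the
// observed coefficient of variation (stddev / mean) of a single run.
function samplesNeeded(coefficientOfVariation, targetRelativeError) {
  return Math.ceil((coefficientOfVariation / targetRelativeError) ** 2);
}

samplesNeeded(0.20, 0.02); // 20% variation, target ±2% -> 100 samples
```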

@jfbastien
Author

If anything, I'd think this benchmark would only really care about the first run, because that's what the tools hit, no? We do multiple runs to get statistical significance, so maybe in a Node context you actually want to run the tests from the command line and do the aggregation there, whereas in a web or browser-shell context the ARES-6 approach of creating a new iframe/realm is appropriate?

@bmeurer
Member

bmeurer commented Nov 3, 2017

The first run is almost completely irrelevant in practice. The common case is that you have webpack driving the build via loaders (e.g. Babel, TypeScript, PostCSS, Prepack, etc.), and those are run on hundreds to thousands of files. Oftentimes the build takes tens of seconds up to several minutes for a realistic web application. The same goes for tools like prettier, which usually run on several files at once. All of this usually runs in a single Node process, in the same isolate.

@jfbastien
Author

@bmeurer ah, that's interesting! Shouldn't the benchmark try to capture this type of call graph? I'm guessing you think what you've come up with is highly correlated with that call graph? If so, then your setup makes sense.

@bmeurer
Member

bmeurer commented Nov 3, 2017

I've experimented with different setups, including benchmarking an actual webpack build with all the loaders needed for that project. But this usually didn't work out well, partly because of I/O and the dependency on Node, which made it impossible to run in the browser/shell. The current setup seems to match reality quite well, while mocking out any kind of I/O and providing representative versions for browser/shell.

I'm not saying this is the only possible solution, and I'm certainly open for potential changes here in the future.

@tebbi
Contributor

tebbi commented Nov 7, 2017

After investigating further, it turned out that the confidence interval numbers we produced were pretty much useless (details and reasoning here), so we disabled the feature for now.
