
Noise in Sandmark #198

Open · kayceesrk opened this issue Dec 3, 2020 · 4 comments

Following the discussion in ocaml/ocaml#9934, I set out to quantify the noise in Sandmark macrobenchmark runs. Before asking complex questions about loop alignments and microarchitectural optimisations as was done in ocaml/ocaml#10039, I wanted to measure the noise between multiple runs of the same code. It is important to note that currently, we only run a single iteration of each variant.

The benchmarking was done on the IITM "turing" machine, which has an Intel Xeon Gold 5120 CPU, with isolated cores, the CPU governor set to performance, hyper-threading disabled, turbo boost disabled, and interrupts and rcu_callbacks directed to non-isolated cores, but with ASLR on [1]. The results of two runs of the latest commit from https://github.com/stedolan/ocaml/tree/sweep-optimisation are below:

[image: per-benchmark comparison of the two runs]

The outlier is worrisome, but even setting it aside there are differences of up to 2% in both directions. Moving forward, we should consider the following:

  1. Arrive at a measure of statistical significance on a given machine: what is the minimum difference beyond which a result is statistically significant? This will vary based on the benchmark and the metric (running time, maxRSS).
  2. Run multiple iterations. Sandmark already has an ITER variable which runs the experiments multiple times. The notebooks need to be updated so that the mean (and standard deviation) are computed first, and the graphs updated to include error bars (a sketch follows after the footnote below). The downside is that benchmarking will take significantly longer. We should choose a representative set of macro benchmarks for quick studies and reserve the full macro benchmark run for final results. Can we run the sequential macro benchmarks in parallel on different isolated cores? What would be the impact of this on individual benchmark runs?

[1] https://github.com/ocaml-bench/ocaml_bench_scripts#notes-on-hardware-and-os-settings-for-linux-benchmarking
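
As a rough illustration of points 1 and 2, here is a minimal sketch (not the actual Sandmark notebook code) of collapsing multi-iteration results into mean and standard deviation, plotting them with error bars, and applying a crude per-machine significance threshold. The DataFrame columns `name` and `time_secs` are assumptions for the illustration and may not match Sandmark's actual result schema.

```python
# Minimal sketch: aggregate ITER runs per benchmark and plot error bars.
# Column names `name` and `time_secs` are assumed, not Sandmark's actual schema.
import pandas as pd
import matplotlib.pyplot as plt

def summarise(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse multiple iterations of each benchmark into mean/std/count."""
    return (df.groupby("name")["time_secs"]
              .agg(mean="mean", std="std", runs="count")
              .reset_index())

def plot_with_error_bars(summary: pd.DataFrame) -> None:
    """Bar chart of mean running times with one-standard-deviation error bars."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(summary["name"], summary["mean"], yerr=summary["std"], capsize=3)
    ax.set_xlabel("benchmark")
    ax.set_ylabel("running time (s)")
    ax.tick_params(axis="x", labelrotation=90)
    fig.tight_layout()
    plt.show()

def is_significant(baseline_mean, baseline_std, candidate_mean, k=2.0):
    """Crude per-machine significance check: a difference only counts if it
    exceeds k standard deviations of the baseline's own observed noise."""
    return abs(candidate_mean - baseline_mean) > k * baseline_std
```

A representative subset for quick studies would then just be a filtered list of benchmark names fed through the same functions.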

kayceesrk (Contributor, Author) commented Dec 4, 2020

Now with ASLR turned off:

[image: per-benchmark comparison of two runs with ASLR turned off]

The noise is still around 2%.

gasche commented Jan 5, 2021

It looks like most of the benchmarks are not actually very noisy (the observed noise is well below 1%), while a smaller group of benchmarks is noisier. This suggests that you could track, per benchmark, how many iterations to run and what the expected noise level is, running more iterations for the unstable benchmarks and fewer for the stable ones to keep the total running time in check. In particular, most of the long-running benchmarks are not noisy in this test, so maybe they could be run a single time; knucleotide is the only noisy benchmark that runs for >= 10s.

Regarding error bars: instead of error bars specific to a run (which requires several iterations in that run), you could store for each benchmark its typical "noise range" (the largest difference observed in past noise-detecting runs) and display it as error bars on all future runs. This gives good visual feedback when looking at benchmark graphs without requiring several iterations every time.
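
A minimal sketch of how this could be wired up, assuming a small JSON file that tracks each benchmark's worst observed relative spread; the file, its format, and the function names are hypothetical, not something Sandmark has today:

```python
# Hypothetical sketch: persist each benchmark's worst observed relative spread
# from multi-iteration calibration runs, then reuse it as error bars on
# ordinary single-iteration runs.
import json
import matplotlib.pyplot as plt

def update_noise_ranges(path, run_times):
    """run_times: dict mapping benchmark name -> list of times from one
    multi-iteration calibration run. Keeps the largest relative spread seen."""
    try:
        with open(path) as f:
            noise = json.load(f)
    except FileNotFoundError:
        noise = {}
    for name, times in run_times.items():
        spread = (max(times) - min(times)) / min(times)
        noise[name] = max(noise.get(name, 0.0), spread)
    with open(path, "w") as f:
        json.dump(noise, f, indent=2)
    return noise

def plot_single_run(results, noise):
    """results: dict mapping benchmark name -> time from a single iteration.
    Error bars come from the stored noise ranges, not from this run."""
    names = sorted(results)
    times = [results[n] for n in names]
    errs = [results[n] * noise.get(n, 0.0) for n in names]
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(names, times, yerr=errs, capsize=3)
    ax.set_ylabel("running time (s)")
    ax.tick_params(axis="x", labelrotation=90)
    fig.tight_layout()
    plt.show()
```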

shakthimaan (Contributor)

Noise has been reported for the soli benchmark for 5.1.0+trunk.
Reference: ocaml/ocaml#11102 (comment)

kayceesrk (Contributor, Author) commented Aug 29, 2022

As mentioned in the linked comment, soli's running time is too small. Either we make it run longer or we remove the macro benchmark tag. It was a mistake to have tagged it as a macro benchmark in the first place.

See #348. I believe I've fixed a number of these to run longer.
