Benchmark order affects results #57

Open
mindplay-dk opened this issue Aug 14, 2023 · 11 comments

@mindplay-dk

I have a benchmark here, in which the order of the tests seems to completely change the results.

That is, if the order is like this, sigma:defer wins by about 10-15% ...

  add('sigma:defer', () => parseSigmaDefer(SAMPLE)),
  add('sigma:grammar', () => parseSigmaGrammar(SAMPLE)),
  add('parjs', () => parseParjs(SAMPLE)),

Whereas, if the order is like this, sigma:grammar wins by the same 10-15% ...

  add('sigma:grammar', () => parseSigmaGrammar(SAMPLE)),
  add('sigma:defer', () => parseSigmaDefer(SAMPLE)),
  add('parjs', () => parseParjs(SAMPLE)),

So it would appear whatever runs first just wins.

I tried tweaking all the options as well, minimums, delay, etc. - nothing changes.

I wonder if benchmark is still reliable after 6 years with no updates? Its benchmarking method was first described 13 years ago - a lot of water under the bridge since then, I'm sure?

To start with, I'd expect benchmarks to run in dedicated Workers, which I don't think existed back then?

Even then, they probably shouldn't run one after the other (111122223333) but rather round-robin (123123123123) or perhaps even randomly, to make sure they all get equally affected by the garbage collector, run-time optimizations, and other side-effects? Ideally, they probably shouldn't even run in the same process though.
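For illustration, a round-robin runner (independent of the benchmark package) could look roughly like the sketch below - all of the names here (Task, roundRobin, rounds, batch) are made up for the example:

type Task = { name: string; fn: () => void }

// Run each task in short interleaved batches (1-2-3, 1-2-3, ...) instead of
// running each task to completion before starting the next one.
function roundRobin(tasks: Task[], rounds = 200, batch = 25): void {
  const nanos = new Map<string, bigint>()
  const calls = new Map<string, number>()

  for (let r = 0; r < rounds; r++) {
    for (const task of tasks) {
      const start = process.hrtime.bigint()
      for (let i = 0; i < batch; i++) task.fn()
      nanos.set(task.name, (nanos.get(task.name) ?? 0n) + (process.hrtime.bigint() - start))
      calls.set(task.name, (calls.get(task.name) ?? 0) + batch)
    }
  }

  for (const { name } of tasks) {
    const opsPerSec = calls.get(name)! / (Number(nanos.get(name)!) / 1e9)
    console.log(name, Math.round(opsPerSec), 'ops/s')
  }
}

// e.g. roundRobin([
//   { name: 'sigma:defer', fn: () => parseSigmaDefer(SAMPLE) },
//   { name: 'sigma:grammar', fn: () => parseSigmaGrammar(SAMPLE) },
// ])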

@StreetStrider

I've had that in my experience.
Try adding a dummy test as the first test. This is what mine looks like:

var n = 1
var emit = (m) => { n = (n * m) }
return () =>
{
  emit(-1)
}

It seems to heat things up, and then the order becomes irrelevant.
https://github.com/StreetStrider/perf/blob/2a8f49534a2fc8629a4a8494432de776c0acc15d/perf.js#L30-L54

You can also try to run the tests in a 1-2-1-2 scheme.

@mindplay-dk

mindplay-dk commented Aug 14, 2023

It seems to heat things up, and then the order becomes irrelevant.

Not in my case.

  add('sigma:defer:1', () => parseSigmaDefer(SAMPLE)),
  add('sigma:grammar:1', () => parseSigmaGrammar(SAMPLE)),
  add('sigma:defer:2', () => parseSigmaDefer(SAMPLE)),
  add('sigma:grammar:2', () => parseSigmaGrammar(SAMPLE)),

  sigma:defer:1:
    1 363 ops/s, ±0.51%   | fastest

  sigma:grammar:1:
    1 198 ops/s, ±0.85%   | 12.11% slower

  sigma:defer:2:
    1 351 ops/s, ±0.36%   | 0.88% slower

  sigma:grammar:2:
    1 237 ops/s, ±0.79%   | 9.24% slower

  parjs:
    260 ops/s, ±0.67%     | slowest, 80.92% slower

  add('sigma:grammar:1', () => parseSigmaGrammar(SAMPLE)),
  add('sigma:defer:1', () => parseSigmaDefer(SAMPLE)),
  add('sigma:grammar:2', () => parseSigmaGrammar(SAMPLE)),
  add('sigma:defer:2', () => parseSigmaDefer(SAMPLE)),

  sigma:grammar:1:
    1 389 ops/s, ±0.34%   | fastest

  sigma:defer:1:
    1 152 ops/s, ±0.86%   | 17.06% slower

  sigma:grammar:2:
    1 381 ops/s, ±0.35%   | 0.58% slower

  sigma:defer:2:
    1 168 ops/s, ±0.75%   | 15.91% slower

  parjs:
    261 ops/s, ±0.47%     | slowest, 81.21% slower

As you can see, the first and second run of the same function give about the same result in both cases - however, whichever one gets to go first is faster in both cases.

But wait, there's more.

parseSigmaGrammar(SAMPLE) // 👈 grammar first
parseSigmaDefer(SAMPLE)

suite(
  'JSON :: sigma vs parjs',

  add('sigma:defer', () => parseSigmaDefer(SAMPLE)), // 👈 defer first
  add('sigma:grammar', () => parseSigmaGrammar(SAMPLE)),
  add('parjs', () => parseParjs(SAMPLE)),

  ...handlers
)

  sigma:defer:
    1 162 ops/s, ±0.92%   | 14.81% slower

  sigma:grammar:
    1 364 ops/s, ±0.67%   | fastest

  parjs:
    265 ops/s, ±0.27%     | slowest, 80.57% slower

parseSigmaDefer(SAMPLE)  // 👈 defer first
parseSigmaGrammar(SAMPLE)

suite(
  'JSON :: sigma vs parjs',

  add('sigma:defer', () => parseSigmaDefer(SAMPLE)),  // 👈 defer first
  add('sigma:grammar', () => parseSigmaGrammar(SAMPLE)),
  add('parjs', () => parseParjs(SAMPLE)),

  ...handlers
)

  sigma:defer:
    1 321 ops/s, ±1.00%   | fastest

  sigma:grammar:
    1 238 ops/s, ±0.91%   | 6.28% slower

  parjs:
    261 ops/s, ±0.41%     | slowest, 80.24% slower

So it's really only a matter of which function gets called first - even if it gets called once outside of the benchmark, this somehow determines the winner.

I have no explanation for this. 😅

You can also try to run the tests in a 1-2-1-2 scheme.

I guess, how would you do that?

Although, based on this, there is no reason to think the results will be any different.

@StreetStrider

I think you already did 1-2-1-2 with:

  add('sigma:grammar:1', () => parseSigmaGrammar(SAMPLE)),
  add('sigma:defer:1', () => parseSigmaDefer(SAMPLE)),
  add('sigma:grammar:2', () => parseSigmaGrammar(SAMPLE)),
  add('sigma:defer:2', () => parseSigmaDefer(SAMPLE)),

The picture looks very similar to mine, and for me it was solved with the zero test.
I can't see a zero test anywhere, but if it is there, OK - I leave it to you.

Another thing I can't see, but which may be important, is IO. If the parse routine involves reading actual files, the real deviation may be much larger than the displayed 1%.

🤔 Benchmarking is hard; hopefully someone will join us with better insights.

This is how things look with the zero test for me:
Running "zero" suite...
Progress: 100%

  zero:
    439 080 696 ops/s, ±0.43%   | fastest

Finished 1 case!
Running "dict" suite...
Progress: 100%

  Map:
    1 664 867 ops/s, ±0.84%   | fastest

  Map try/catch:
    1 644 731 ops/s, ±0.95%   | 1.21% slower

  object:
    862 811 ops/s, ±28.62%     | 48.18% slower

  object null:
    811 373 ops/s, ±49.17%     | slowest, 51.26% slower

Finished 4 cases!
  Fastest: Map
  Slowest: object null
Running "vector" suite...
Progress: 100%

  [] index (no len):
    2 116 ops/s, ±0.30%   | fastest

  [] push:
    2 109 ops/s, ±0.27%   | 0.33% slower

  fixed size fill-map:
    2 114 ops/s, ±0.23%   | 0.09% slower

  fixed size:
    2 060 ops/s, ±0.27%   | slowest, 2.65% slower

Finished 4 cases!
  Fastest: [] index (no len)
  Slowest: fixed size
Running "set" suite...
Progress: 100%

  for-of:
    9 072 ops/s, ±0.24%   | fastest

  forEach:
    5 618 ops/s, ±0.20%   | 38.07% slower

  [] forEach:
    987 ops/s, ±0.18%     | slowest, 89.12% slower

  [] for:
    2 111 ops/s, ±0.18%   | 76.73% slower

Finished 4 cases!
  Fastest: for-of
  Slowest: [] forEach

@mindplay-dk

Yeah, I tried the "zero test" - I just didn't commit it (or include it above) because it didn't help.

suite(
  'zero',
  add('hello', () => {
    let n = 1

    const emit = (m: number) => {
      n = n * m
    }

    emit(-1)
    emit(-1)
    emit(-1)
  })
)

I tried it with and without cycle(), and with one or multiple calls to emit - it doesn't seem to change anything, or at least not reliably... at one point it looked as though it was helping, but there's clearly a pretty big "random" factor here as well...

@mindplay-dk

I tried adding a "warmup" suite as well.

suite(
  'warmup',
  add('woosh', () => {
    parseSigmaGrammar(SAMPLE)
    parseSigmaDefer(SAMPLE)
  })
)

The order of the two calls in this still somehow determines the outcome of the following suite.

It really does look like the function that runs first gets the most favorable conditions somehow.

I really don't think mixing tests in a single runtime is reliable with V8 these days - the optimizations it makes are incredibly complex, and it's entirely plausible that one function could affect the performance of another, since any code appears to be able to affect the performance of the engine overall, at least temporarily.
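As one illustration of how that can happen (not necessarily what is happening in this benchmark): a call site that has seen many object shapes goes "megamorphic" in V8 and stays slower afterwards, so code that ran earlier can change how later code performs. A contrived sketch:

function readX(o: { x: number }) {
  return o.x
}

// Feed several different object shapes through the same call site first ...
const shapes = [{ x: 1 }, { x: 1, a: 1 }, { x: 1, b: 1 }, { x: 1, c: 1 }, { x: 1, d: 1 }]
for (const o of shapes) readX(o)

// ... and a later, otherwise monomorphic loop through the same function can now
// run slower than it would have if it had gone first.
let sum = 0
for (let i = 0; i < 1e6; i++) sum += readX({ x: i })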

My guess is the only reliable approach these days would be to fork the process before running each test. Or better still, just run the individual benchmarks one at a time under node. I'm going to try that next...
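A minimal sketch of the fork-per-test idea under Node could look like this - the benchmark names and the ./bench.js entry point are assumptions for the example, not something that exists in this repo:

import { fork } from 'node:child_process'

const names = ['sigma:grammar', 'sigma:defer', 'parjs']

async function run() {
  for (const name of names) {
    // Each benchmark gets its own node process; ./bench.js is assumed to read
    // process.argv[2], run only that benchmark and print its own results.
    await new Promise<void>((resolve, reject) => {
      const child = fork('./bench.js', [name])
      child.on('exit', code => (code === 0 ? resolve() : reject(new Error(`${name} exited with code ${code}`))))
    })
  }
}

run()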

@mindplay-dk

Just to prove the point I'm trying to make, I decided to run each benchmark in isolation.

Just an ugly quick hack, but...

const benchmarks = {
  'sigma:grammar': () => parseSigmaGrammar(SAMPLE),
  'sigma:defer': () => parseSigmaDefer(SAMPLE),
  parjs: () => parseParjs(SAMPLE)
} as any

function selectBenchmark() {
  for (const name in benchmarks) {
    for (const arg of process.argv) {
      if (arg === name) {
        return add(name, benchmarks[name])
      }
    }
  }

  throw new Error('no benchmark selected')
}

suite(
  'JSON :: sigma vs parjs',

  selectBenchmark(),

  ...handlers
)

And then in package.json, my script:

tsx src/json/index.ts -- sigma:defer && tsx src/json/index.ts -- sigma:grammar && tsx src/json/index.ts -- parjs

And of course, I tried changing my script to:

tsx src/json/index.ts -- sigma:grammar && tsx src/json/index.ts -- sigma:defer && tsx src/json/index.ts -- parjs

The result is now what I expected: grammar comes out slightly on top (most ops/sec) every time, and the order in which I run these benchmarks has no effect.

Also, the numbers between individual runs are now much more consistent.

I'm afraid the benchmark package is very outdated and probably not reliable anymore.

For the record, I'm on Node v18.17.1.

@StreetStrider

StreetStrider commented Aug 15, 2023

My guess is the only reliable approach these days would be to fork the process before running each test.

I think the same, more and more. Somehow the zero test fixed a lot of problems for me in the past (my most recent runs were on Node v16). I think you're right, and isolated tests are the destination point here.

The idea of the zero test is: since the first test gets optimized, create some fake load to absorb that optimization, so that the following tests all run under the same conditions. The problem with isolated tests is that the numbers can change between runs. When I run grouped tests the absolute numbers do change, but the ratio remains relatively stable and I can compare.

@mindplay-dk

Unfortunately, forking the process (or launching a dedicated process) would break compatibility with the browser. Isolating tests in a Worker is another option, but that breaks compatibility with the DOM and other facilities unavailable to Workers.
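Under Node, a single-file sketch of the Worker variant could look like this (the names, the ~1 second measuring loop, and the assumption that this file can be loaded directly as a worker module are all made up for the example; parseSigmaDefer, parseSigmaGrammar and SAMPLE are the ones from the suites above and assumed to be in scope - and as noted further down, Workers may still not isolate the optimizer state):

import { Worker, isMainThread, parentPort, workerData } from 'node:worker_threads'
import { fileURLToPath } from 'node:url'

const work: Record<string, () => void> = {
  'sigma:defer': () => parseSigmaDefer(SAMPLE),
  'sigma:grammar': () => parseSigmaGrammar(SAMPLE),
}

if (isMainThread) {
  // Run each benchmark in a fresh Worker, one at a time.
  (async () => {
    for (const name of Object.keys(work)) {
      const ops = await new Promise((resolve, reject) => {
        const worker = new Worker(fileURLToPath(import.meta.url), { workerData: name })
        worker.once('message', resolve)
        worker.once('error', reject)
      })
      console.log(name, ops, 'ops/s')
    }
  })()
} else {
  // Worker side: run only the selected benchmark for ~1 second and report ops/s.
  const fn = work[workerData as string]
  const start = process.hrtime.bigint()
  let calls = 0
  while (process.hrtime.bigint() - start < 1_000_000_000n) {
    fn()
    calls++
  }
  parentPort!.postMessage(calls)
}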

It's hard to think of a reliable way to do this both in browsers and under Node. 🤔

@mindplay-dk

Just to eliminate the funky maths in benchmark as a possible source of error, I did a totally vanilla for loop benchmark - for the record, even with no framework at all, whatever function runs first really does win. So this most likely does have something to do with the V8/JS runtime.
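A vanilla comparison along those lines might look something like this (a sketch, not the exact code from that experiment; parseSigmaGrammar, parseSigmaDefer and SAMPLE are the same ones used in the suites above and assumed to be in scope):

// Time a function with nothing but a for loop and report ops/s.
function time(label: string, fn: () => void, iterations = 1000): void {
  const start = process.hrtime.bigint()
  for (let i = 0; i < iterations; i++) fn()
  const seconds = Number(process.hrtime.bigint() - start) / 1e9
  console.log(label, Math.round(iterations / seconds), 'ops/s')
}

// Swapping these two lines swaps the "winner", just like in the suites above.
time('sigma:grammar', () => parseSigmaGrammar(SAMPLE))
time('sigma:defer', () => parseSigmaDefer(SAMPLE))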

@mindplay-dk

Oh, hello.

https://github.com/Llorx/iso-bench

🤔

@Llorx

Llorx commented Aug 16, 2023

Unfortunately, forking the process (or launching a dedicated process) would break compatibility with the browser. Isolating tests in a Worker is another option, but that breaks compatibility with DOM, and other facilities unavailable to workers.

It's hard to think of a reliable way to do this both in browsers and under Node. 🤔

In browsers it's difficult, if not impossible. I tried running iso-bench in Workers instead of separate processes and still had optimization pollution, so I had to convert it to a fork. There MAY be a way in browsers, but it requires automatically reloading the website and such, which may not be a high price to pay just to benefit from isolated benchmarking. That's something on my TODO list.
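A rough sketch of that reload idea (purely hypothetical: the bench query parameter, the localStorage keys, the placeholder benchmark bodies and the fixed ~1 second loop are all made up, and a reload only gives a fresh JS context, not a fresh process):

const benchmarks: Record<string, () => void> = {
  a: () => { /* first benchmark body */ },
  b: () => { /* second benchmark body */ },
}

const names = Object.keys(benchmarks)
const current = new URLSearchParams(location.search).get('bench')

if (current && benchmarks[current]) {
  // Run only the selected benchmark for ~1 second and stash the result.
  const start = performance.now()
  let calls = 0
  while (performance.now() - start < 1000) {
    benchmarks[current]()
    calls++
  }
  localStorage.setItem(`result:${current}`, String(calls))
}

const next = current ? names[names.indexOf(current) + 1] : names[0]

if (next) {
  location.search = `?bench=${encodeURIComponent(next)}` // reload the page for the next benchmark
} else {
  console.table(names.map(name => ({ name, opsPerSec: localStorage.getItem(`result:${name}`) })))
}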
