Excessive data retention on fast benchmarks #61

Open · jdmarshall opened this issue Oct 20, 2023 · 14 comments

@jdmarshall commented Oct 20, 2023

I don't think the warmup time for tinybench is anywhere near adequate for getting V8 to settle the code. 'Solving' that problem by deoptimizing tinybench is fighting the wrong fight.

In an attempt to get more accurate data out of tinybench, I've been using `time: 2000`, or in some cases `time: 6000`, for some test suites.

The problem is that some leaf-node tests run 2, 3, even 4 orders of magnitude faster than the more complex tests, so some of my tests generate thousands of samples while others generate several million. This extra memory pressure is causing some weird race conditions in the code under test.

Now, while switching the telemetry calculation to an algorithm that uses O(k) or O(log n) memory for running totals is probably a lot to ask, retaining the samples for task A while running task C is a waste of resources.
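To be concrete about what I mean by running totals, here is an illustration-only sketch (not tinybench code) of constant-memory summary statistics along the lines of Welford's online algorithm; percentiles would still need a digest-style structure:

```ts
// Constant-memory running statistics (Welford's online algorithm):
// each sample updates a handful of scalars, nothing is retained.
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0; // sum of squared deviations from the running mean
  min = Infinity;
  max = -Infinity;

  push(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
    if (x < this.min) this.min = x;
    if (x > this.max) this.max = x;
  }

  get count(): number { return this.n; }
  get average(): number { return this.mean; }
  get variance(): number { return this.n > 1 ? this.m2 / (this.n - 1) : 0; }
  get sd(): number { return Math.sqrt(this.variance); }
}

// Usage: feed per-iteration timings as they are measured.
const stats = new RunningStats();
for (const sample of [1.2, 0.9, 1.1, 1.05]) stats.push(sample);
console.log(stats.average, stats.sd, stats.min, stats.max);
```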

Expected behavior: Disabling all but one task should result in the same success or failure modes for the code under test. The summaries should be calculated at the end of a task and the working data discarded.

Actual behavior: all statistics are kept for the entire run.

Workaround: use the `iterations` option instead of `time`.
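A rough sketch of the workaround, assuming the `iterations` option behaves as it does in the tinybench versions I've used (task names and bodies are placeholders):

```ts
import { Bench } from 'tinybench';

// Cap the sample count instead of running each task for a fixed wall-clock
// time, so a fast leaf-node task cannot accumulate millions of samples.
// Note: depending on the tinybench version, `time` may act as a floor, so
// lower it as well if fast tasks still run too long.
const bench = new Bench({ iterations: 10_000, time: 100 });

bench.add('fast leaf-node task', () => {
  JSON.stringify({ a: 1 }); // placeholder for the cheap code under test
});

bench.add('complex task', () => {
  // placeholder for the expensive code under test
  JSON.parse(JSON.stringify({ data: Array.from({ length: 1_000 }, (_, i) => i) }));
});

await bench.run(); // top-level await: run as an ES module
console.table(bench.table());
```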

@jdmarshall (Author)

Because the summary is calculated after teardown, I don't even have the option to force this behavior in my own tests.

@Aslemammad (Member)

I see, thank you so much for submitting an issue.
Doesn't the `iterations` option help?

@jdmarshall (Author)

That's what I'm using as the workaround. I'm definitely still having some sort of issue with ensuring V8 is completely warm, but then that's a very old benchmarking problem.

@jdmarshall (Author) commented Oct 26, 2023

Do you suppose it would break anyone if you calculated the performance numbers prior to teardown?

I have no way of knowing if anyone but you is depending on the existence of the telemetry data at the end of a run. I could expunge the samples during teardown. It's not awesome, but as a documented workaround I've encountered worse.

But right now teardown seems to be first, which scuttles that idea.

My most problematic library has a couple dozen benchmarks, so clearing the data between tasks is more than an order of magnitude difference in memory utilization.

@Aslemammad (Member)

> Do you suppose it would break anyone if you calculated the performance numbers prior to teardown?

I don't think so, we can try and see.

@Aslemammad (Member)

> expunge the samples during teardown

The samples are the problem here?

@jdmarshall (Author)

That's where all the data goes, yes. But teardown is the last event I can see between tasks, and it is called immediately before the percentile (p75/p99) calculations are done. I could cheese something with the setup event, but that involves figuring out which task we are on and looking up the previous task, which seems problematic.

For me, the difference in order between the calculations and the event doesn't matter, except for whether I can manually prune the data prior to the `table()` call. But I don't know enough people using tinybench in enough ways to offer any suggestion on whether changing the order would break their use cases.

I hate adding flags to fix user-specific problems, so a programmatic solution seems slightly more palatable IMO. But bumping the minor version number might be called for if you decide on that option.

@jdmarshall (Author)

I've gone down a rabbit hole. Send help.

https://github.com/tdunning/t-digest

Unfortunately the whole tinybench library is about the size of this algorithm (although it's in Java so maybe divide that by 2?).

So while this is academically cool I expect this is A Bridge Too Far territory.

@Aslemammad (Member)

Haha, I see, it's getting a bit complicated.
But when you've gathered all your findings, summarize them so we can work on them soon.

@jdmarshall (Author) commented Oct 29, 2023

This one doesn't seem too bad. The actual logic part of it is about 2 computer screens, and it's based on a refinement of t-digest. It's also the only one that seems to be moderately actively maintained.

https://github.com/hville/sample-distribution

The algorithm does have fairly enticing amortized overheads for space and computation, less than the `sort()` call.

But really, just inverting the calc and event-emit phases is probably going to solve most people's problems. If your test can't manage 50 MB of telemetry per app, you've got bigger fish to fry. If you tried to scale this up to some sort of replacement for Gatling or BlazeMeter, then you would really want to look at this.

@polak-jan commented Nov 4, 2023

I have had vitest crash while running benchmarks, and I am pretty sure this was the reason why: the crash was caused by heap memory filling up. On subsequent runs with lower benchmark times I am easily seeing over 5 GB used by the process, which seems to be mostly the samples.

Had to add a loop inside the test cases to artificially slow them down to get the run time I wanted without crashing.

@jdmarshall (Author)

Workaround: It looks like you could clear the extra statistics in the 'cycle' event (though to me 'teardown' sounds like the spot where one would expect to manipulate system state).
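Something like the following, assuming the `cycle` event carries the finished task and the raw timings live on `task.result.samples` (property names may differ between tinybench versions):

```ts
import { Bench } from 'tinybench';

const bench = new Bench({ time: 2000 });

// By the time 'cycle' fires for a task, its summary statistics (mean,
// p75, p99, ...) have already been computed, so the raw sample array
// can be dropped before the next task starts accumulating its own.
bench.addEventListener('cycle', (evt: any) => {
  const task = evt.task;
  if (task?.result?.samples) {
    task.result.samples.length = 0; // release the per-iteration timings
  }
});
```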

@psirenny

> Had to add a loop inside the test cases to artificially slow them down to get the run time I wanted without crashing.

@polak-jan could you explain your workaround a bit more?

@polak-jan

@psirenny I just added a loop around the entire benchmark, so each run of the benchmark is actually X runs. This isn't really ideal, as it can interfere with things like cache and JIT state, but I don't think there is any other reasonable way to get around the limitation right now. And assuming all your benchmarks use the same workaround, they should all be offset by a similar amount, so comparisons should still be mostly accurate.
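For anyone else hitting this, a sketch of that workaround; the inner iteration count is arbitrary and, as noted above, it does change cache/JIT behaviour:

```ts
import { Bench } from 'tinybench';

const bench = new Bench({ time: 2000 });

const INNER_RUNS = 1_000; // arbitrary batch size, tune per suite

// Each recorded sample now covers 1,000 invocations, so a 2-second run
// retains roughly 1,000x fewer samples, at the cost of also measuring
// the loop overhead and running against a warmer cache/JIT state.
bench.add('fast leaf-node task (batched)', () => {
  for (let i = 0; i < INNER_RUNS; i++) {
    JSON.stringify({ a: i }); // placeholder for the real code under test
  }
});

await bench.run(); // top-level await: run as an ES module
console.table(bench.table());
```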
