Excessive data retention on fast benchmarks #61

Open · jdmarshall opened this issue Oct 20, 2023 · 14 comments

@jdmarshall commented Oct 20, 2023

I don't think the warmup time for tinybench is anywhere near adequate for getting V8 to settle the code. 'Solving' that problem by deoptimizing tinybench is fighting the wrong fight.

In an attempt to get more accurate data out of tinybench, I've been using `time: 2000`, or in some cases `time: 6000`, for some test suites.

The problem is that some leaf-node tests run 2, 3, even 4 orders of magnitude faster than the more complex tests, so some of my tests generate thousands of samples while others generate several million. This extra memory pressure is causing some weird race conditions in the code under test.

Now, while switching the telemetry calculation to an algorithm that uses O(k) or O(log n) memory for running totals is probably a lot to ask, retaining the samples for task A while running task C is a waste of resources.
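To be concrete about what I mean by running totals, here is an illustration-only sketch (not tinybench code) of constant-memory summary statistics along the lines of Welford's online algorithm; percentiles would still need a digest-style structure:

```ts
// Constant-memory running statistics (Welford's online algorithm):
// each sample updates a handful of scalars, nothing is retained.
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0; // sum of squared deviations from the running mean
  min = Infinity;
  max = -Infinity;

  push(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
    if (x < this.min) this.min = x;
    if (x > this.max) this.max = x;
  }

  get count(): number { return this.n; }
  get average(): number { return this.mean; }
  get variance(): number { return this.n > 1 ? this.m2 / (this.n - 1) : 0; }
  get sd(): number { return Math.sqrt(this.variance); }
}

// Usage: feed per-iteration timings as they are measured.
const stats = new RunningStats();
for (const sample of [1.2, 0.9, 1.1, 1.05]) stats.push(sample);
console.log(stats.average, stats.sd, stats.min, stats.max);
```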

Expected behavior: Disabling all but one task should result in the same success or failure modes for the code under test. The summaries should be calculated at the end of a task and the working data discarded.

Actual behavior: all statistics are kept for the entire run.

Workaround: use the `iterations` option instead of `time`.
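A rough sketch of the workaround, assuming the `iterations` option behaves as it does in the tinybench versions I've used (task names and bodies are placeholders):

```ts
import { Bench } from 'tinybench';

// Cap the sample count instead of running each task for a fixed wall-clock
// time, so a fast leaf-node task cannot accumulate millions of samples.
// Note: depending on the tinybench version, `time` may act as a floor, so
// lower it as well if fast tasks still run too long.
const bench = new Bench({ iterations: 10_000, time: 100 });

bench.add('fast leaf-node task', () => {
  JSON.stringify({ a: 1 }); // placeholder for the cheap code under test
});

bench.add('complex task', () => {
  // placeholder for the expensive code under test
  JSON.parse(JSON.stringify({ data: Array.from({ length: 1_000 }, (_, i) => i) }));
});

await bench.run(); // top-level await: run as an ES module
console.table(bench.table());
```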

@jdmarshall (Author)

Because the summary is calculated after teardown, I don't even have the option to force this behavior in my own tests.

@Aslemammad (Member)

I see, thank you so much for submitting an issue.
Doesn't the `iterations` option help?

@jdmarshall (Author)

That's what I'm using as the workaround. I'm definitely still having some sort of issue with ensuring V8 is completely warm, but then that's a very old benchmarking problem.

@jdmarshall (Author) commented Oct 26, 2023

Do you suppose it would break anyone if you calculated the performance numbers prior to teardown?

I have no way of knowing if anyone but you is depending on the existence of the telemetry data at the end of a run. I could expunge the samples during teardown. It's not awesome, but as a documented workaround I've encountered worse.

But right now teardown seems to be first, which scuttles that idea.

My most problematic library has a couple dozen benchmarks, so clearing the data between tasks is more than an order of magnitude difference in memory utilization.

@Aslemammad (Member)

> Do you suppose it would break anyone if you calculated the performance numbers prior to teardown?

I don't think so, we can try and see.

@Aslemammad (Member)

> expunge the samples during teardown

The samples are the problem here?

@jdmarshall (Author)

That's where all the data goes, yes. But teardown is the last event I can see between tasks, and it is called immediately before the percentile (p75/p99) calculations are done. I could cheese something with the setup event, but that involves figuring out which task we are on and looking up the previous task, which seems problematic.

For me, the difference in order between the calculations and the event doesn't matter, except for whether I can manually prune the data prior to the `table()` call. But I don't know enough people using tinybench in enough ways to offer any suggestion on whether changing the order would break their use cases.

I hate adding flags to fix user-specific problems, so a programmatic solution seems slightly more palatable IMO. But bumping the minor version number might be called for if you decide on that option.

@jdmarshall (Author)

I've gone down a rabbit hole. Send help.

https://github.com/tdunning/t-digest

Unfortunately the whole tinybench library is about the size of this algorithm (although it's in Java so maybe divide that by 2?).

So while this is academically cool I expect this is A Bridge Too Far territory.

@Aslemammad (Member)

Haha, I see, it's getting a bit complicated.
But when you've gathered all your findings, summarize them so we can work on them soon.

@jdmarshall (Author) commented Oct 29, 2023

This one doesn't seem too bad. The actual logic part of it is about 2 computer screens, and it's based on a refinement of t-digest. It's also the only one that seems to be moderately actively maintained.

https://github.com/hville/sample-distribution

The algorithm does have fairly enticing amortized overheads for space and computation, less than the `sort()` call.

But really, just inverting the calc and event-emit phases is probably going to solve most people's problems. If your test can't manage 50 MB of telemetry per app, you've got bigger fish to fry. If you tried to scale this up to some sort of replacement for Gatling or BlazeMeter, then you would really want to look at this.

@polak-jan commented Nov 4, 2023

I have had vitest crash while running benchmarks, and I am pretty sure this was the reason why: the crash was caused by heap memory filling up. On subsequent runs with lower benchmark times I am easily seeing over 5 GB used by the process, which seems to be mostly the samples.

Had to add a loop inside the test cases to artificially slow them down to get the run time I wanted without crashing.

@jdmarshall (Author)

Workaround: It looks like you could clear the extra statistics in the 'cycle' event (though to me 'teardown' sounds like the spot where one would expect to manipulate system state).
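Something like the following, assuming the `cycle` event carries the finished task and the raw timings live on `task.result.samples` (property names may differ between tinybench versions):

```ts
import { Bench } from 'tinybench';

const bench = new Bench({ time: 2000 });

// By the time 'cycle' fires for a task, its summary statistics (mean,
// p75, p99, ...) have already been computed, so the raw sample array
// can be dropped before the next task starts accumulating its own.
bench.addEventListener('cycle', (evt: any) => {
  const task = evt.task;
  if (task?.result?.samples) {
    task.result.samples.length = 0; // release the per-iteration timings
  }
});
```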

@psirenny

> Had to add a loop inside the test cases to artificially slow them down to get the run time I wanted without crashing.

@polak-jan could you explain your workaround a bit more?

@polak-jan

@psirenny I just added a loop around the entire benchmark, so each run of the benchmark is actually X runs. This isn't really ideal, as it can interfere with things like cache and JIT state, but I don't think there is any other reasonable way to get around the limitation right now. And assuming all your benchmarks use the same workaround, they should all be offset by a similar amount, so comparisons should still be mostly accurate.
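For anyone else hitting this, a sketch of that workaround; the inner iteration count is arbitrary and, as noted above, it does change cache/JIT behaviour:

```ts
import { Bench } from 'tinybench';

const bench = new Bench({ time: 2000 });

const INNER_RUNS = 1_000; // arbitrary batch size, tune per suite

// Each recorded sample now covers 1,000 invocations, so a 2-second run
// retains roughly 1,000x fewer samples, at the cost of also measuring
// the loop overhead and running against a warmer cache/JIT state.
bench.add('fast leaf-node task (batched)', () => {
  for (let i = 0; i < INNER_RUNS; i++) {
    JSON.stringify({ a: i }); // placeholder for the real code under test
  }
});

await bench.run(); // top-level await: run as an ES module
console.table(bench.table());
```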
