Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve summary performance by replacing slice with container/list (doubly-linked list) #1301

Open
lujiajing1126 opened this issue Jun 21, 2023 · 3 comments

Comments

@lujiajing1126
Copy link

During benchmark, we found that the underlying summary implementation, i.e. bquant, wastes much CPU on runtime.memorymove.

image

The benchmark here, https://github.com/beorn7/perks/blob/master/quantile/bench_test.go#L7-L23, cannot show the real performance since the recorded element is monotonically increasing. Thus lines containing copy are not reached.

After using container/list, we get a better performance with a series of random float64 inserted, which is much more similar to the real scenario.

image

Though using doubly-linked list introduce obj allocation, it perform double qps under micro-benchmark,

Linked List

goos: darwin
goarch: arm64
pkg: github.com/beorn7/perks/quantile
BenchmarkInsertTargetedSmallEpsilon
BenchmarkInsertTargetedSmallEpsilon-20    	 6934218	       169.8 ns/op	      96 B/op	       3 allocs/op
PASS

Slice

goos: darwin
goarch: arm64
pkg: github.com/beorn7/perks/quantile
BenchmarkInsertTargetedSmallEpsilon
BenchmarkInsertTargetedSmallEpsilon-20    	 3421015	       343.3 ns/op	       0 B/op	       0 allocs/op
PASS
@lujiajing1126
Copy link
Author

The implementation can be found here https://github.com/lujiajing1126/perks/tree/use-linked-list

@beorn7
Copy link
Member

beorn7 commented Jun 21, 2023

The question is how much of the CPU is then burned by the additional GC caused. In the past, we were certainly very paranoid with any kind of allocation. With better GC, this might be unfounded by now. But I guess we need a real-world end-to-end benchmark rather than a micro benchmark…

@lujiajing1126
Copy link
Author

The question is how much of the CPU is then burned by the additional GC caused. In the past, we were certainly very paranoid with any kind of allocation. With better GC, this might be unfounded by now. But I guess we need a real-world end-to-end benchmark rather than a micro benchmark…

Thanks for the remind. We will setup a real-world E2E test to see if it could help and/or how much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants