
Add a byte buffer pool for reusing bytes #292

Closed

Conversation

@panjf2000 commented Dec 19, 2019

Signed-off-by: Andy Pan panjf2000@gmail.com

@panjf2000 (Author)

gin-gonic/gin#2179

@panjf2000 (Author)

@julienschmidt
PTAL

@panjf2000 (Author) commented Dec 19, 2019

One of the Travis CI jobs failed while reporting code coverage to goveralls, and I don't have access to restart that job; maybe you could retry it:

PASS
coverage: 100.0% of statements
ok  	github.com/julienschmidt/httprouter	0.006s
The command "go test -v -covermode=count -coverprofile=coverage.out" exited with 0.
0.15s$ go vet ./...
The command "go vet ./..." exited with 0.
0.03s$ test -z "$(gofmt -d -s . | tee /dev/stderr)"
The command "test -z "$(gofmt -d -s . | tee /dev/stderr)"" exited with 0.
0.87s$ $HOME/gopath/bin/goveralls  -coverprofile=coverage.out -service=travis-ci
Bad response status from coveralls: 422
{"message":"Couldn't find Travis Job 627272871 from https://api.travis-ci.org (Service name: travis-ci). ","error":true}
The command "$HOME/gopath/bin/goveralls  -coverprofile=coverage.out -service=travis-ci" exited with 1.

@dahankzter (Contributor)

Any profiling to see if this makes a difference? Was it Rick Hudson who said that every use of a pool is a bug in the GC?

@panjf2000 (Author)

I've tested it on gin (benchmark code); the results are shown below:
CPU:

      flat  flat%   sum%        cum   cum%
    1140ms 23.22% 23.22%     2500ms 50.92%  github.com/gin-gonic/gin.cleanPathRaw
    1010ms 20.57% 43.79%     1010ms 20.57%  runtime.pthread_cond_signal
     560ms 11.41% 55.19%     1010ms 20.57%  runtime.mallocgc
     340ms  6.92% 62.12%     1060ms 21.59%  github.com/gin-gonic/gin.cleanPath
     150ms  3.05% 65.17%      250ms  5.09%  github.com/gin-gonic/gin.bufApp
     150ms  3.05% 68.23%      530ms 10.79%  github.com/gin-gonic/gin.bufAppRaw
     150ms  3.05% 71.28%      150ms  3.05%  runtime.memmove
     130ms  2.65% 73.93%      620ms 12.63%  runtime.makeslice
     120ms  2.44% 76.37%      660ms 13.44%  runtime.slicebytetostring
     110ms  2.24% 78.62%      110ms  2.24%  runtime.acquirem

Mem:

      flat  flat%   sum%        cum   cum%
  343.01MB 38.99% 38.99%   578.01MB 65.70%  github.com/gin-gonic/gin.cleanPathRaw
  235.01MB 26.71% 65.70%   235.01MB 26.71%  github.com/gin-gonic/gin.bufAppRaw
  217.01MB 24.67% 90.37%   217.01MB 24.67%  github.com/gin-gonic/gin.glob..func4
   52.50MB  5.97% 96.34%    52.50MB  5.97%  github.com/gin-gonic/gin.cleanPath
      28MB  3.18% 99.52%   245.01MB 27.85%  github.com/gin-gonic/gin.cleanPathSyncPool
         0     0% 99.52%   578.01MB 65.70%  github.com/gin-gonic/gin.BenchmarkCleanPath
         0     0% 99.52%    52.50MB  5.97%  github.com/gin-gonic/gin.BenchmarkCleanPathBytesPool
         0     0% 99.52%   245.01MB 27.85%  github.com/gin-gonic/gin.BenchmarkCleanPathSyncPool
         0     0% 99.52%   875.52MB 99.52%  testing.(*B).launch
         0     0% 99.52%   875.52MB 99.52%  testing.(*B).runN

@dahankzter (Contributor)

What is the difference from before the patch, or am I missing how to read the output?

@panjf2000 (Author) commented Dec 20, 2019

cleanPathRaw is the current function and cleanPath is the one that uses the byte pool.
Memory usage has been reduced to about 15% of the original after using the byte pool.

var (
    // GetLen returns byte buffer with fixed length from the pool, exported from httprouter/bytebuffer.
    GetLen = func(n int) *bytebufferpool.ByteBuffer {
        b := bytebufferpool.Get()
Contributor:

This uses the defaultPool of github.com/valyala/bytebufferpool. How about allowing the user to set and configure their own pool? They could have several packages that all use the pool.

@panjf2000 (Author) Dec 20, 2019

This uses the defaultPool of github.com/valyala/bytebufferpool. How about allowing the user to set and configure their own pool?

Actually, the bytebufferpool package doesn't expose APIs for customizing the pool, so there is nothing we can do there; but I think the default pool's configuration has been tuned nearly perfectly by the author of bytebufferpool.

They could have several packages that all use the pool.

I don't think several packages sharing one pool would be a problem; that is what a pool is for (serving multiple callers), isn't it?

@panjf2000 (Author) commented Dec 21, 2019

@dahankzter
Are there any other questions about this PR?

@dahankzter (Contributor)

I don't have any more questions in particular.
I am ambivalent about pooling in general but perhaps it's a good thing in this case.

What does @julienschmidt think?

@panjf2000 (Author)

@julienschmidt
Hello, any conclusion?

@appleboy commented Jan 4, 2020

Waiting for the PR. See the Gin framework

@julienschmidt (Owner) commented Jan 4, 2020

Two general thoughts:

  1. It might be better to use a statically sized buffer on the stack (with some size between 64 and 256 bytes; I doubt that many inputs are longer than that) and avoid heap allocations in the common case.
  2. We're not going to start adding an external dependency for this. Just use a sync.Pool for cases where a large buffer (e.g. 1024 bytes) is required (see the sketch below). Anything larger than that should be allocated only when needed anyway.
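
To make point 2 concrete, here is a minimal sketch of a sync.Pool handing out fixed-size byte slices. The names (getBuf, putBuf) and the 1024-byte size are illustrative assumptions, not code from this PR:

import "sync"

// largeBufPool hands out reusable 1024-byte slices for inputs that don't fit
// the small stack buffer described in point 1.
var largeBufPool = sync.Pool{
    New: func() interface{} { return make([]byte, 1024) },
}

func getBuf() []byte { return largeBufPool.Get().([]byte) }

func putBuf(b []byte) {
    // Note: passing the slice through interface{} copies the slice header to
    // the heap, a caveat that comes up later in this thread.
    largeBufPool.Put(b)
}

A caller would use the stack buffer for short inputs and only fall back to getBuf/putBuf for the rare long ones.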

@SamWhited

We're not going to start adding a dependency for this. Just use a sync.Pool

With the disclaimer that I haven't looked at this PR and only quickly glanced over github.com/valyala/bytebufferpool, it appears these serve two different purposes.

sync.Pool is about garbage collection pressure and buffers in a pool may be GCed at any time.

The bytebufferpool package appears to keep the entire pool alive forever and is more about allocating long-lived buffers with some resistance to memory fragmentation.

Depending on exactly why you're making this PR, it may or may not make sense to use one or the other, but it's worth being aware that they may not be interchangeable.

@julienschmidt (Owner) commented Jan 4, 2020

Indeed. But as I understood it, the aim of this PR is to avoid memory allocations.
Reusing already allocated buffers from a sync.Pool fulfills the same purpose in this case.

But I don't see how keeping a memory pool alive forever could be a good idea here. Assume we have a peak of 10k concurrent requests, while normally we only serve less than a hundred concurrently. Do we really want to keep that much memory allocated if such a peak isn't likely to happen any time soon again? There is no magic bullet, but I think a sync.Pool is actually the much better tool in this case.

But like I said above, avoiding the need to use some sort of pool is most likely far better anyway.

@panjf2000 (Author) commented Jan 5, 2020

But I don't see how keeping a memory pool alive forever could be a good idea here.

bytebufferpool uses sync.Pool internally, so it doesn't keep the pool alive forever; don't worry about that.

It might be better to use a statically sized buffer on the stack.

But that can't be done, because the lengths of paths are dynamic, so it would end up as allocations on the heap anyway.

@julienschmidt

@julienschmidt (Owner) commented Jan 5, 2020

Adding bufferpool is not an option here. The potential gain, if any, simply doesn't justify it.

But that can't be done, because the lengths of paths are dynamic, so it would end up as allocations on the heap anyway.

One uses, e.g., a statically sized slice/array with capacity 128 on the stack; for any len(input) <= 128 one re-slices that existing slice, and for anything larger one either takes another statically sized slice (e.g. with capacity 1024) from a sync.Pool, or skips the pool and simply allocates the slice on the heap.
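
Purely as an illustration of the buffer selection described above (hypothetical names; not httprouter's actual code), the logic could look roughly like this:

func cleanPathSketch(p string) string {
    const stackBufSize = 128

    // Statically sized buffer: as long as it does not escape, the compiler
    // can keep it in this function's stack frame.
    buf := make([]byte, 0, stackBufSize)

    n := len(p) + 1 // worst case: a leading '/' may have to be prepended
    if n > stackBufSize {
        // Too long for the stack buffer: either take a larger, statically
        // sized buffer from a sync.Pool, or simply allocate on the heap.
        buf = make([]byte, 0, n)
    }

    // ... the actual path-cleaning logic would write into buf here ...
    buf = append(buf, p...)

    return string(buf)
}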

@panjf2000 (Author) commented Jan 5, 2020

@julienschmidt
I've got your points and am going to rewrite the code of this PR based on your suggestions; we'll see how that works and keep discussing it further. Thanks!

@panjf2000 (Author)

One uses, e.g., a statically sized slice/array with capacity 128 on the stack; for any len(input) <= 128 one re-slices that existing slice

I just realized that reusing one slice would trigger a race condition, because CleanPath() runs concurrently.

@julienschmidt (Owner) commented Jan 6, 2020

See 8222db1

Original:

BenchmarkPathClean-4   	  630038	      1893 ns/op	     416 B/op	      40 allocs/op

With stack buf (8222db1):

BenchmarkPathClean-4   	  731874	      1550 ns/op	     112 B/op	      17 allocs/op

With bytebufferpool:

BenchmarkPathClean-4   	  428262	      2728 ns/op	     112 B/op	      17 allocs/op

The remaining 17 allocs/op should be for the returned strings.

@julienschmidt (Owner)

I pushed the stack buffer commit. Feel free to implement further optimizations for inputs longer than 128 bytes.
The other interesting question is what the actual stack buffer size should be. 128 is just a random number I picked. What is the usual (maximum) length of a path?

@julienschmidt (Owner)

If you're still wondering how this stack buffer works: if the size is static (and the buffer does not escape), the size is known before the function is called, so space for this buffer can be reserved in the function's stack frame during the call instead of allocating it on the heap.
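
A small illustration of that point (hypothetical functions, not from this PR); go build -gcflags='-m' prints the compiler's escape-analysis decisions for each buffer:

// Stays on the stack: the size is a constant and buf never leaves the
// function; string(buf) copies the bytes into a new string.
func onStack(p string) string {
    buf := make([]byte, 0, 128)
    buf = append(buf, p...)
    return string(buf)
}

// Escapes to the heap: the buffer is returned (or stored somewhere, e.g.
// put into a sync.Pool), so it has to outlive this stack frame.
func escapes(p string) []byte {
    buf := make([]byte, 0, 128)
    buf = append(buf, p...)
    return buf
}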

@panjf2000 (Author) commented Jan 6, 2020

Got it; I will update this PR for the cases where the input strings are longer than 128 bytes.

@julienschmidt (Owner)

One can actually push this concept a bit further and provide differently sized stack buffers via helper functions: 92dde9c (not in master)
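
92dde9c isn't quoted here, but the idea of size-specific helpers could be sketched roughly like this (hypothetical names, assuming the compiler's escape analysis keeps the non-leaking buffers in the callers' stack frames):

// cleanPathInto stands in for the real cleaning logic; it writes into the
// caller-provided buffer and only returns a copy of the result.
func cleanPathInto(p string, buf []byte) string {
    buf = append(buf[:0], p...)
    return string(buf)
}

// Each helper declares a buffer with a different, statically known size, so a
// caller can pick the smallest one that fits its input.
func cleanPath128(p string) string  { return cleanPathInto(p, make([]byte, 0, 128)) }
func cleanPath1024(p string) string { return cleanPathInto(p, make([]byte, 0, 1024)) }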

path.go Outdated
-	return string(buf[:w])
+	cPath := string(buf[:w])
+	putBack(buf)
@julienschmidt (Owner)

You're also putting back the supposed stack buffer. Thus, it is marked as "escaping" and actually allocated on the heap.

@panjf2000 (Author)

That is the point; it seems there is no way to avoid allocations on the heap when using sync.Pool, so I guess we are abandoning sync.Pool?

@julienschmidt (Owner)

I think you could avoid it with something like this:

var poolBuf []byte

...

poolBuf = getLen(n + 1)
buf = poolBuf

...

if poolBuf != nil {
    putBack(poolBuf)
}

Most likely 92dde9c is the more efficient approach. But you're also almost done, so let's benchmark them both.

@panjf2000 (Author) Jan 6, 2020

I don't think that is going to work, because the root cause of buf "escaping" is sync.Pool.Put(interface{}), not putBuf, so sync.Pool will always cause allocations on the heap.
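
For context on that claim (a generic illustration, not this PR's code): storing a []byte in a sync.Pool goes through interface{}, which boxes the slice header on the heap and keeps the backing array reachable from the pool, so the buffer escapes. The usual workaround is to pool a pointer instead:

import "sync"

// Putting a []byte directly boxes the slice header in an interface{} and
// keeps the backing array alive in the pool, so the buffer escapes.
var slicePool sync.Pool

// Pooling a pointer instead means only the pointer is boxed on Put/Get.
var ptrPool = sync.Pool{
    New: func() interface{} { b := make([]byte, 0, 1024); return &b },
}

func recycleSlice(buf []byte) { slicePool.Put(buf) }
func recyclePtr(buf *[]byte)  { ptrPool.Put(buf) }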

path_test.go Outdated
@@ -81,7 +81,7 @@ func TestPathCleanMallocs(t *testing.T) {

	for _, test := range cleanTests {
		allocs := testing.AllocsPerRun(100, func() { CleanPath(test.result) })
-		if allocs > 0 {
+		if allocs > 1 {
@julienschmidt (Owner)

0 allocs was indeed correct.

@julienschmidt (Owner) commented Jan 6, 2020

I added a simple benchmark for long inputs: 8e4b52b

It seems like the idea in 92dde9c is not worth pursuing further:

Single stack buf:

BenchmarkPathClean-4          801126          1480 ns/op         112 B/op         17 allocs/op
BenchmarkPathCleanLong-4         230       5122911 ns/op     3217488 B/op       4683 allocs/op

Helpers with differently sized stack bufs (92dde9c):

BenchmarkPathClean-4          738854          1616 ns/op         112 B/op         17 allocs/op
BenchmarkPathCleanLong-4         169       7185612 ns/op     4724305 B/op       6393 allocs/op

@panjf2000 (Author) commented Jan 6, 2020 via email

@panjf2000 panjf2000 deleted the bytes-pool branch January 7, 2020 04:10
@julienschmidt (Owner)

Thanks for the proposal, which led to an improvement, anyway!

similark pushed a commit to similarweb/httprouter that referenced this pull request May 9, 2023
Signed-off-by: Aaron Schlesinger <aaron@ecomaz.net>