
Possible buffer leak in blocking stub DEADLINE_EXCEEDED race condition #7105

Closed
njhill opened this issue Jun 8, 2020 · 9 comments

njhill (Contributor) commented Jun 8, 2020

One of my unit tests started reporting a netty ByteBuf leak after upgrading from 1.26.x to 1.27.0, and the leak remains in subsequent versions up to 1.29.0. I'm not 100% sure, but I strongly suspect it's related to #6328, and in particular the interplay of that change with the ThreadlessExecutor used by blocking unary calls.

It's a somewhat extreme case; the test was originally there to verify a past, unrelated concurrency issue. It basically makes many consecutive calls with a blocking stub using very short (1ms) deadlines. Though this is likely rare in the wild, I think it does expose a real bug.

Here is an example snippet that reproduces the leak error log(s) for me every time, mid-test, when pointed at a local etcd server:

        // KVGrpc, KVBlockingStub and PutRequest are etcd's generated classes;
        // bs(...) is a small helper that wraps a String as a protobuf ByteString.
        ManagedChannel channel = NettyChannelBuilder.forTarget("localhost:2379")
                .usePlaintext().build();
        KVBlockingStub stub = KVGrpc.newBlockingStub(channel);
        for (int i = 0; i < 6000; i++) {
            try {
                stub.withDeadlineAfter(1L, TimeUnit.MILLISECONDS)
                    .put(PutRequest.newBuilder().setValue(bs("some-value"))
                        .setKey(bs("test/" + UUID.randomUUID().toString())).build());
            } catch (Exception e) {
                // The calls fail with a mix of server- and client-side
                // DEADLINE_EXCEEDED statuses.
                Status status = Status.fromThrowable(e);
            }
        }
[grpc-nio-worker-ELG-1-7] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:363)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
        ...

I can provide more of the leak access records if needed.
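
As a reproduction aid (this is standard Netty configuration, not something from the report itself): the access records in a log like the above are only fully populated when Netty's leak detector runs at its most verbose level, for example by setting it programmatically before any buffers are allocated, or with -Dio.netty.leakDetection.level=paranoid on the JVM command line.

    // Standard Netty leak-detector setting: PARANOID samples every allocation and
    // keeps full access records (same effect as -Dio.netty.leakDetection.level=paranoid).
    io.netty.util.ResourceLeakDetector.setLevel(
        io.netty.util.ResourceLeakDetector.Level.PARANOID);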

njhill added the question label Jun 8, 2020
njhill added a commit to njhill/grpc-java that referenced this issue Jun 9, 2020
As reported in grpc#7105. Not sure if this is how you want it done, but it does fix the problem.

Fixes grpc#7105
Fixes grpc#3557
njhill (Contributor, Author) commented Jun 9, 2020

I found that since 1.27.0, tasks might be submitted to the app executor after the call is closed / the status is set. That either needs to be avoided again or accepted and handled; I made a fix for the latter in #7106 and verified that my reproducer no longer produces the ByteBuf leak with it.
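
To illustrate the failure mode, here is a minimal sketch of a ThreadlessExecutor-style drain loop (my own simplification, not grpc-java's actual ThreadlessExecutor nor the #7106 patch). Without the shutdown handling below, a task submitted after the blocking call has finished draining sits in a queue that no thread will ever run, so any ByteBuf it references is never released. One way to handle late submissions is to run them inline:

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.Executor;
    import java.util.function.BooleanSupplier;

    // Sketch only: a simplified ThreadlessExecutor-like executor. The blocking
    // stub's thread calls waitAndDrain() and runs queued callbacks itself.
    final class SketchThreadlessExecutor implements Executor {
      private final Queue<Runnable> queue = new ArrayDeque<>();
      private boolean shutdown;  // guarded by "this"

      @Override
      public void execute(Runnable task) {
        synchronized (this) {
          if (!shutdown) {
            queue.add(task);
            notifyAll();  // wake the blocked application thread
            return;
          }
        }
        // Late submission after the call already completed: run it inline so it
        // can still release whatever resources (e.g. buffers) it holds.
        task.run();
      }

      /** Runs queued tasks on the calling thread until {@code done} reports true. */
      void waitAndDrain(BooleanSupplier done) throws InterruptedException {
        while (true) {
          Runnable task;
          synchronized (this) {
            while ((task = queue.poll()) == null) {
              if (done.getAsBoolean()) {
                shutdown = true;  // from now on, execute() runs tasks inline
                return;
              }
              // Assumes completion itself enqueues a task (e.g. the onClose
              // callback), which wakes this waiter.
              wait();
            }
          }
          task.run();
        }
      }
    }

The sketch only shows why a late-submitted task must still be executed (or otherwise released); the real fix has to fit grpc-java's existing executor contracts.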

voidzcy self-assigned this Jun 10, 2020
ejona86 (Member) commented Jun 12, 2020

So we think the response is arriving and racing with the deadline?

I think part of the problem is we are calling executeCloseObserverInContext without the transport's involvement.

I think we would want to go with the "avoid doing that" approach, but we'll have to think about how to do it (worst-case, we can have the transports handle the delay #6328 was trying to achieve; that may even be easier). We may want to revert #6328 while we do a fix. I don't see a quick-and-easy patch/workaround opportunity.

ejona86 added bug and removed question labels Jun 12, 2020
ejona86 added this to the 1.31 milestone Jun 12, 2020
njhill (Contributor, Author) commented Jun 15, 2020

So we think the response is arriving and racing with the deadline?

@ejona86 I'm actually not sure since I didn't debug fully, but that seems most likely given the observations.

ejona86 added a commit to ejona86/grpc-java that referenced this issue Jul 7, 2020
This should avoid messages being leaked when a Listener throws an exception and
the executor is shut down immediately after the call completes. This is related
to grpc#7105 but a different scenario and we aren't aware of any user having
observed the previous behavior.

Note also this does _not_ fix the similar case of reordering caused by
delayedCancelOnDeadlineExceeded().
ejona86 (Member) commented Jul 7, 2020

I feel confident there were two bugs introduced with #6328:

  1. Other callbacks scheduled after onClose(). This is what njhill is experiencing, as we aren't guaranteed the executor will be available after onClose(). I'm cleaning up a similar situation in core: "Delay client listener exception notification until normal close" (#7187).
  2. callExecutor is called from multiple threads. Right now callExecutor is only guaranteed to be single-producer, single-consumer, so this is a thread-safety issue. Although SerializingExecutor is multi-producer-safe, so most users would be okay. SerializeReentrantCallsDirectExecutor is not multi-producer-safe.

I think a proper fix would be to jump onto the transport thread and schedule onClose() from there (which would mean adding runOnTransportThread to Stream). From that point forward the ClientStreamListener would need to release resources without calling into the application (so no more usage of callExecutor).
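
As a rough illustration of the "jump onto the transport thread" idea, here is a hypothetical sketch; TransportThread, runOnTransportThread(), and DelayedCloseSketch are stand-ins invented for the example, not grpc-java's real Stream/ClientStreamListener API. The point is that by hopping to the transport thread before scheduling the close, callExecutor keeps a single producer and nothing can be scheduled on it after onClose():

    import java.util.concurrent.Executor;

    // Hypothetical types; grpc-java's real API differs.
    interface TransportThread {
      void runOnTransportThread(Runnable task);
    }

    final class DelayedCloseSketch {
      private final TransportThread transport;
      private final Executor callExecutor;  // assumed single-producer, single-consumer
      private boolean closed;               // only touched on the transport thread

      DelayedCloseSketch(TransportThread transport, Executor callExecutor) {
        this.transport = transport;
        this.callExecutor = callExecutor;
      }

      /** Deadline timer fires on an arbitrary thread; hop to the transport thread first. */
      void cancelOnDeadlineExceeded(Runnable onCloseCallback) {
        transport.runOnTransportThread(() -> {
          if (closed) {
            return;  // the transport already delivered a close; don't schedule a second one
          }
          closed = true;
          // Only the transport thread ever produces into callExecutor, and onClose
          // is the last thing it schedules, so no callback can land after onClose().
          callExecutor.execute(onCloseCallback);
        });
      }

      /** Normal close path, already running on the transport thread. */
      void transportReportedClose(Runnable onCloseCallback) {
        if (!closed) {
          closed = true;
          callExecutor.execute(onCloseCallback);
        }
      }
    }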

Cc @ran-su, @zhangkun83

ejona86 added a commit that referenced this issue Jul 9, 2020
This should avoid messages being leaked when a Listener throws an exception and
the executor is shut down immediately after the call completes. This is related
to #7105 but a different scenario and we aren't aware of any user having
observed the previous behavior.

Note also this does _not_ fix the similar case of reordering caused by
delayedCancelOnDeadlineExceeded().
ericgribkoff modified the milestones: 1.31, 1.32 Jul 28, 2020
ejona86 modified the milestones: 1.32, 1.33 Sep 9, 2020
ejona86 (Member) commented Sep 23, 2020

@ran-su, @zhangkun83, I'm seriously considering reverting #6328, since it has been 3.5 months and we still have the memory leak caused by it.

ran-su (Member) commented Sep 23, 2020

Talked with Kun; we can revert #6328 for now, and I will come back to this issue later (likely next quarter).

ran-su assigned ran-su and unassigned voidzcy Sep 23, 2020
ejona86 (Member) commented Sep 24, 2020

@ran-su, will y'all handle the revert? I expect it'll need a large test run.

ran-su (Member) commented Sep 24, 2020

Sure, I'll run an internal test before sending out a PR.

ejona86 (Member) commented Dec 9, 2020

Note: a user has noticed a bug where onMessage() could be called after onClose() on the client side (b/169277773). The problem was present starting in 1.27 and fixed in 1.33, so we strongly believe it was this bug.

That means this was an even more serious bug than previously believed. I'm very surprised more didn't break. Maybe more did break, and people just couldn't debug it. I'm going to update the release notes to mention that it can produce weird callback orders.

dfawley pushed a commit to dfawley/grpc-java that referenced this issue Jan 15, 2021
This should avoid messages being leaked when a Listener throws an exception and
the executor is shut down immediately after the call completes. This is related
to grpc#7105 but a different scenario and we aren't aware of any user having
observed the previous behavior.

Note also this does _not_ fix the similar case of reordering caused by
delayedCancelOnDeadlineExceeded().
github-actions bot locked as resolved and limited conversation to collaborators Jun 3, 2021