
Possible buffer leak in blocking stub DEADLINE_EXCEEDED race condition #7105

Closed
njhill opened this issue Jun 8, 2020 · 9 comments

njhill (Contributor) commented Jun 8, 2020

One of my unit tests started reporting a netty ByteBuf leak after upgrading from 1.26.x to 1.27.0, and the leak remains in subsequent versions up to 1.29.0. I'm not 100% sure, but I strongly suspect it's related to #6328, and in particular the interplay of that change with the ThreadlessExecutor used by blocking unary calls.

It's a somewhat extreme case; the test was originally there to verify a past, unrelated concurrency issue. It basically makes many consecutive calls with a blocking stub using very short (1ms) deadlines. Though this is likely rare in the wild, I think it does expose a real bug.

Here is an example snippet that reproduces the leak error log(s) for me every time, mid-test, when pointed at a local etcd server:

        // KVGrpc, KVBlockingStub and PutRequest are etcd's generated classes;
        // bs(...) is a small helper that wraps a String as a protobuf ByteString.
        ManagedChannel channel = NettyChannelBuilder.forTarget("localhost:2379")
                .usePlaintext().build();
        KVBlockingStub stub = KVGrpc.newBlockingStub(channel);
        for (int i = 0; i < 6000; i++) {
            try {
                stub.withDeadlineAfter(1L, TimeUnit.MILLISECONDS)
                    .put(PutRequest.newBuilder().setValue(bs("some-value"))
                        .setKey(bs("test/" + UUID.randomUUID().toString())).build());
            } catch (Exception e) {
                // The calls fail with a mix of server- and client-side
                // DEADLINE_EXCEEDED statuses.
                Status status = Status.fromThrowable(e);
            }
        }
[grpc-nio-worker-ELG-1-7] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:363)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
        ...

I can provide more of the leak access records if needed.
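
As a reproduction aid (this is standard Netty configuration, not something from the report itself): the access records in a log like the above are only fully populated when Netty's leak detector runs at its most verbose level, for example by setting it programmatically before any buffers are allocated, or with -Dio.netty.leakDetection.level=paranoid on the JVM command line.

    // Standard Netty leak-detector setting: PARANOID samples every allocation and
    // keeps full access records (same effect as -Dio.netty.leakDetection.level=paranoid).
    io.netty.util.ResourceLeakDetector.setLevel(
        io.netty.util.ResourceLeakDetector.Level.PARANOID);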

njhill added the question label Jun 8, 2020
njhill added a commit to njhill/grpc-java that referenced this issue Jun 9, 2020
As reported in grpc#7105. Not sure if this is how you want it done, but it does fix the problem.

Fixes grpc#7105
Fixes grpc#3557
njhill (Contributor, Author) commented Jun 9, 2020

I found that since 1.27.0, tasks might be submitted to the app executor after the call is closed / the status is set. That either needs to be avoided again or accepted and handled; I made a fix for the latter in #7106 and verified that my reproducer no longer produces the ByteBuf leak with it.
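
To illustrate the failure mode, here is a minimal sketch of a ThreadlessExecutor-style drain loop (my own simplification, not grpc-java's actual ThreadlessExecutor nor the #7106 patch). Without the shutdown handling below, a task submitted after the blocking call has finished draining sits in a queue that no thread will ever run, so any ByteBuf it references is never released. One way to handle late submissions is to run them inline:

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.Executor;
    import java.util.function.BooleanSupplier;

    // Sketch only: a simplified ThreadlessExecutor-like executor. The blocking
    // stub's thread calls waitAndDrain() and runs queued callbacks itself.
    final class SketchThreadlessExecutor implements Executor {
      private final Queue<Runnable> queue = new ArrayDeque<>();
      private boolean shutdown;  // guarded by "this"

      @Override
      public void execute(Runnable task) {
        synchronized (this) {
          if (!shutdown) {
            queue.add(task);
            notifyAll();  // wake the blocked application thread
            return;
          }
        }
        // Late submission after the call already completed: run it inline so it
        // can still release whatever resources (e.g. buffers) it holds.
        task.run();
      }

      /** Runs queued tasks on the calling thread until {@code done} reports true. */
      void waitAndDrain(BooleanSupplier done) throws InterruptedException {
        while (true) {
          Runnable task;
          synchronized (this) {
            while ((task = queue.poll()) == null) {
              if (done.getAsBoolean()) {
                shutdown = true;  // from now on, execute() runs tasks inline
                return;
              }
              // Assumes completion itself enqueues a task (e.g. the onClose
              // callback), which wakes this waiter.
              wait();
            }
          }
          task.run();
        }
      }
    }

The sketch only shows why a late-submitted task must still be executed (or otherwise released); the real fix has to fit grpc-java's existing executor contracts.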

voidzcy self-assigned this Jun 10, 2020
ejona86 (Member) commented Jun 12, 2020

So we think the response is arriving and racing with the deadline?

I think part of the problem is we are calling executeCloseObserverInContext without the transport's involvement.

I think we would want to go with the "avoid doing that" approach, but we'll have to think about how to do it (worst-case, we can have the transports handle the delay #6328 was trying to achieve; that may even be easier). We may want to revert #6328 while we do a fix. I don't see a quick-and-easy patch/workaround opportunity.

ejona86 added bug and removed question labels Jun 12, 2020
ejona86 added this to the 1.31 milestone Jun 12, 2020
njhill (Contributor, Author) commented Jun 15, 2020

So we think the response is arriving and racing with the deadline?

@ejona86 I'm actually not sure since I didn't debug fully, but that seems most likely given the observations.

ejona86 added a commit to ejona86/grpc-java that referenced this issue Jul 7, 2020
This should avoid messages being leaked when a Listener throws an exception and
the executor is shut down immediately after the call completes. This is related
to grpc#7105 but a different scenario and we aren't aware of any user having
observed the previous behavior.

Note also this does _not_ fix the similar case of reordering caused by
delayedCancelOnDeadlineExceeded().
ejona86 (Member) commented Jul 7, 2020

I feel confident there were two bugs introduced with #6328:

  1. Other callbacks scheduled after onClose(). This is what njhill is experiencing, as we aren't guaranteed the executor will be available after onClose(). I'm cleaning up a similar situation in core: "Delay client listener exception notification until normal close" (#7187).
  2. callExecutor is called from multiple threads. Right now callExecutor is only guaranteed to be single-producer, single-consumer, so this is a thread-safety issue. Although SerializingExecutor is multi-producer-safe, so most users would be okay. SerializeReentrantCallsDirectExecutor is not multi-producer-safe.

I think a proper fix would be to jump onto the transport thread and schedule onClose() from there (which would mean adding runOnTransportThread to Stream). From that point forward the ClientStreamListener would need to release resources without calling into the application (so no more usage of callExecutor).
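
As a rough illustration of the "jump onto the transport thread" idea, here is a hypothetical sketch; TransportThread, runOnTransportThread(), and DelayedCloseSketch are stand-ins invented for the example, not grpc-java's real Stream/ClientStreamListener API. The point is that by hopping to the transport thread before scheduling the close, callExecutor keeps a single producer and nothing can be scheduled on it after onClose():

    import java.util.concurrent.Executor;

    // Hypothetical types; grpc-java's real API differs.
    interface TransportThread {
      void runOnTransportThread(Runnable task);
    }

    final class DelayedCloseSketch {
      private final TransportThread transport;
      private final Executor callExecutor;  // assumed single-producer, single-consumer
      private boolean closed;               // only touched on the transport thread

      DelayedCloseSketch(TransportThread transport, Executor callExecutor) {
        this.transport = transport;
        this.callExecutor = callExecutor;
      }

      /** Deadline timer fires on an arbitrary thread; hop to the transport thread first. */
      void cancelOnDeadlineExceeded(Runnable onCloseCallback) {
        transport.runOnTransportThread(() -> {
          if (closed) {
            return;  // the transport already delivered a close; don't schedule a second one
          }
          closed = true;
          // Only the transport thread ever produces into callExecutor, and onClose
          // is the last thing it schedules, so no callback can land after onClose().
          callExecutor.execute(onCloseCallback);
        });
      }

      /** Normal close path, already running on the transport thread. */
      void transportReportedClose(Runnable onCloseCallback) {
        if (!closed) {
          closed = true;
          callExecutor.execute(onCloseCallback);
        }
      }
    }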

Cc @ran-su, @zhangkun83

ejona86 added a commit that referenced this issue Jul 9, 2020
This should avoid messages being leaked when a Listener throws an exception and
the executor is shut down immediately after the call completes. This is related
to #7105 but a different scenario and we aren't aware of any user having
observed the previous behavior.

Note also this does _not_ fix the similar case of reordering caused by
delayedCancelOnDeadlineExceeded().
ericgribkoff modified the milestones: 1.31, 1.32 Jul 28, 2020
ejona86 modified the milestones: 1.32, 1.33 Sep 9, 2020
ejona86 (Member) commented Sep 23, 2020

@ran-su, @zhangkun83, I'm seriously considering reverting #6328, since it has been 3.5 months and we still have the memory leak caused by it.

ran-su (Member) commented Sep 23, 2020

Talked with Kun; we can revert #6328 for now, and I will come back to this issue later (likely next quarter).

ran-su assigned ran-su and unassigned voidzcy Sep 23, 2020
ejona86 (Member) commented Sep 24, 2020

@ran-su, will y'all handle the revert? I expect it'll need a large test run.

ran-su (Member) commented Sep 24, 2020

Sure, I'll run an internal test before sending out a PR.

ejona86 (Member) commented Dec 9, 2020

Note: a user has noticed a bug where onMessage() could be called after onClose() on the client side (b/169277773). The problem was present starting in 1.27 and fixed in 1.33, so we strongly believe it was this bug.

That means this was an even more serious bug than previously believed. I'm very surprised more didn't break. Maybe more did break, and people just couldn't debug it. I'm going to update the release notes to mention that it can produce weird callback orders.

dfawley pushed a commit to dfawley/grpc-java that referenced this issue Jan 15, 2021
This should avoid messages being leaked when a Listener throws an exception and
the executor is shut down immediately after the call completes. This is related
to grpc#7105 but a different scenario and we aren't aware of any user having
observed the previous behavior.

Note also this does _not_ fix the similar case of reordering caused by
delayedCancelOnDeadlineExceeded().
github-actions bot locked as resolved and limited conversation to collaborators Jun 3, 2021