Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

netty: fix a race for channelz at server transport creation #6610

Merged
merged 4 commits into from Jan 16, 2020

Conversation

dapengzhang0
Copy link
Member

@dapengzhang0 dapengzhang0 commented Jan 15, 2020

A race condition was reported by user in #6601:

ServerImpl.start() calls NettyServer.start() while holding ServerImpl.lock. NettyServer.start() awaits a submitted runnable in eventloop. However, this pending runnable may never be executed because the eventloop might be executing some other task, like ServerListenerImpl.transportCreated(), that is trying to acquire ServerImpl.lock causing a deadlock.

This PR resolves the particular issue reported in #6601 for server with a single port, but NettyServer (https://github.com/grpc/grpc-java/blob/v1.26.0/netty/src/main/java/io/grpc/netty/NettyServer.java#L251) and ServerImpl (https://github.com/grpc/grpc-java/blob/v1.26.0/core/src/main/java/io/grpc/internal/ServerImpl.java#L184) in general still have the same potential risk of deadlock, which need further fix.

@dapengzhang0 dapengzhang0 changed the title core,netty: fix a race for channelz at server transport creation netty: fix a race for channelz at server transport creation Jan 15, 2020
Copy link
Member

@ejona86 ejona86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you briefly describe the race. At least the components racing would be very helpful within the commit message. Yes, someone can look at the issue for more info, but honestly right now I don't see a deadlock nor race that isn't already a problem (for example, the future.await() after the bind() can deadlock if called from a Netty thread).

throw new RuntimeException("Interrupted while registering listen socket to channelz", ex);
}

InternalInstrumented<SocketStats> listenSocket = new ListenSocket(channel);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make a comment here that this requires the channel to have completed bind()ing.

@ejona86
Copy link
Member

ejona86 commented Jan 15, 2020

Okay. I'm seeing now there is a loop with ServerImpl.lock from start() and transportCreated(). I think the lock in start() is to avoid callbacks from being seen too early. Hmm... and it is also used for things like getPort and shutdown(), although honestly we don't really support calling shutdown/getPort before start returns, so we could potentially call ts.start() after releasing the lock within ServerImpl.start()

Socket socket = new Socket();
socket.connect(ns.getListenSocketAddress(), /* timeout= */ 8000);
socket.close();
countDownLatch.await(); // SocketStats won't be available until binding complete.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binding is complete when start() returns, so I don't think this comment is accurate.

assertThat(((InetSocketAddress) ns.getListenSocketAddress()).getPort()).isGreaterThan(0);

Socket socket = new Socket();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, but is a roundabout way of making sure the event loop has completed the runnable. Would it be better to Create a single EventLoop and pass that as the boss+worker EventLoopGroup to the NettyServer constructor? Then you can submit() your own Runnable and await its completion. We could use that same EventLoop for all the tests, really, if you'd prefer a @BeforeClass/@AfterClass to manage its lifetime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This roundabout is just copied from the test above it. Actually ns.getListenSocketStats() is not required to be called in the event loop. I'm not sure submit() is the most correct way to do it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one above is actually checking the worker channel options. So creating a connection appropriate and necessary there.

I agree ns.getListenSocketStats() isn't required to be done in the event loop, but I wasn't suggesting to call it from the event loop. I was suggesting:

eventLoop.submit(new Runnable() {
    @Override public void run() {}
}).await();
InternalInstrumented<SocketStats> listenSocket = ns.getListenSocketStats();

It is just a synchronization barrier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with fixed event loop pool.

@dapengzhang0 dapengzhang0 merged commit b8474d6 into grpc:master Jan 16, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Apr 17, 2020
@dapengzhang0 dapengzhang0 deleted the fix-channelz-race branch January 16, 2022 18:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants