
core: fix a race in ServerImpl for multi-transport-server #6614

Closed
wants to merge 2 commits from the fix-multi-server-race branch

Conversation

@dapengzhang0 (Member) commented Jan 16, 2020

As discovered in #6601 and its partial fix #6610, ServerImpl (https://github.com/grpc/grpc-java/blob/v1.26.0/core/src/main/java/io/grpc/internal/ServerImpl.java#L184) and NettyServer (https://github.com/grpc/grpc-java/blob/v1.26.0/netty/src/main/java/io/grpc/netty/NettyServer.java#L251) have a risk of deadlock.

Targeting issue #6641.

There are likely two things that need to be done to make the code healthy:

  1. move ts.start(listener) out of the synchronized block
  2. move multiple-port support from ServerImpl to NettyServer

This PR is working on (1); a rough sketch of the idea follows.
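
To make (1) concrete, here is a minimal sketch, assuming the v1.26.0 names (lock, started, shutdown, transportServers, InternalServer, ServerListenerImpl, Guava's checkState) and leaving out the real method's other bookkeeping; it is not the actual diff in this PR:

```java
// Sketch only: state checks and flag updates stay under the lock, while the
// potentially re-entrant ts.start(listener) calls run after the lock is
// released, so listener callbacks that need the lock cannot deadlock
// against start().
public ServerImpl start() throws IOException {
  final ServerListenerImpl listener = new ServerListenerImpl();
  synchronized (lock) {
    checkState(!started, "Already started");
    checkState(!shutdown, "Shutting down");
    started = true;
  }
  for (InternalServer ts : transportServers) {
    ts.start(listener);  // binds the port; may call back into the listener
  }
  return this;
}
```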

@ejona86 (Member) left a comment:

Hmmm... It is really awkward that we can have bind succeed for one transport and have it fail for another, and then have start() throw and leave things in a half-started state. Previously, when there was only one transport, ts.start() throwing would leave started == false.

```java
  private void awaitStartComplete() {
    boolean startCompleted = false;
    try {
      startCompleted = startComplete.await(30, TimeUnit.SECONDS);
```

Member:

We should only await for start to complete if started == true.

Member Author:

We can assert that is true.
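
A sketch of the guarded helper being discussed (hypothetical shape only, assuming startComplete is a java.util.concurrent.CountDownLatch; the PR's actual code may differ):

```java
// Hypothetical variant: fail fast if start() was never initiated, then
// bound the wait on the startComplete latch.
private void awaitStartComplete() {
  synchronized (lock) {
    checkState(started, "awaitStartComplete() called before start()");
  }
  boolean startCompleted = false;
  try {
    startCompleted = startComplete.await(30, TimeUnit.SECONDS);
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
  }
  checkState(startCompleted, "Timed out waiting for start() to complete");
}
```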

```diff
@@ -420,7 +420,6 @@ public void start(ServerListener listener) throws IOException {
     } catch (IOException e) {
       assertSame(ex, e);
     }
-    verifyNoMoreInteractions(executorPool);
```

Member:

Why was this removed?

Member Author:

Because executorPool.getObject() is called after the change.

```diff
@@ -388,6 +409,7 @@ public ServerTransportListener transportCreated(ServerTransport transport) {
     public void serverShutdown() {
       ArrayList<ServerTransport> copiedTransports;
       Status shutdownNowStatusCopy;
+      awaitStartComplete();
```

Member:

This looks dangerous.

I'd much rather we set activeTransportServers = transportServers.size() within the initial synchronization block of start() to prevent needing to block here.
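
For context, the suggested alternative would look roughly like this inside start() (illustrative only; activeTransportServers and the other names are taken from the discussion, not from an actual diff):

```java
synchronized (lock) {
  checkState(!started, "Already started");
  started = true;
  // Fix the expected number of serverShutdown() callbacks up front, while
  // start() already holds the lock, so serverShutdown() itself never has to
  // block waiting for start() to finish.
  activeTransportServers = transportServers.size();
}
```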

Member Author:

Setting activeTransportServers = transportServers.size() is broken if one of the ts.start() calls throws, because then listener.serverShutdown() will not be called enough times.

Member Author:

It seems we need to redesign multiple-transportServers support in the future; it's just too inconvenient now. For now the fix seems to work. Can you elaborate on what the danger is?

Member Author:

Now I see the danger: serverShutdown() runs in the event loop, so if another ts.start() submits work to the event loop while serverShutdown() is blocked in awaitStartComplete(), it will deadlock again.
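
For readers less familiar with this failure mode, here is a self-contained toy illustration of the pattern (all names invented): a task on a single-threaded "event loop" blocks waiting for work that can only run on that same loop.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy reproduction of the hazard described above: a task running on a
// single-threaded "event loop" blocks on a latch that can only be counted
// down by another task queued on that same event loop.
public final class EventLoopDeadlockDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService eventLoop = Executors.newSingleThreadExecutor();
    CountDownLatch startComplete = new CountDownLatch(1);

    // Plays the role of serverShutdown(): runs on the event loop and waits
    // for "start" to complete, occupying the only event-loop thread.
    eventLoop.execute(() -> {
      try {
        if (!startComplete.await(5, TimeUnit.SECONDS)) {
          System.out.println("Timed out: the countDown task never got to run");
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    // Plays the role of ts.start() signaling completion: it is queued on the
    // same event loop, so it cannot run while the task above is blocked.
    eventLoop.execute(startComplete::countDown);

    eventLoop.shutdown();
    eventLoop.awaitTermination(10, TimeUnit.SECONDS);
  }
}
```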

@dapengzhang0 (Member Author) commented Jan 16, 2020

> Hmmm... It is really awkward that we can have bind succeed for one transport and have it fail for another, and then have start() throw and leave things in a half-started state. Previously, when there was only one transport, ts.start() throwing would leave started == false.

Yes, previously it was broken in multiple ways for multi-transport.

@ejona86 (Member) commented Jun 26, 2020

We gave up on trying to fix Server for multiple ports. The API is just wrong to support partial failures. The multiple-port handling will be moved into the Netty transport, where partial failures can be handled.

@dapengzhang0 (Member Author) commented:

Closing in favor of #7674 and subsequent PRs.

@github-actions bot locked as resolved and limited conversation to collaborators on Jun 3, 2021
@dapengzhang0 deleted the fix-multi-server-race branch on January 16, 2022 at 18:27