core: fix a race in ServerImpl for multi-transport-server #6614
Conversation
Hmmm... It is really awkward that we can have bind succeed for one transport and fail for another; then start() will throw and leave things in a half-started state. Previously, when there was only one transport, ts.start() throwing would leave started == false.
private void awaitStartComplete() {
  boolean startCompleted = false;
  try {
    startCompleted = startComplete.await(30, TimeUnit.SECONDS);
We should only await for start to complete if started == true.
We can assert that is true.
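The latch pattern under discussion can be sketched roughly as follows. This is an assumed shape, not the actual ServerImpl code: class and field names mirror the snippet above, and the `assert started` check reflects the review suggestion.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Minimal sketch (assumed shape, not grpc-java source) of a
// start-completion gate: start() opens the latch once all transport
// servers have been started, and awaitStartComplete() blocks on it.
public class StartGate {
  private final CountDownLatch startComplete = new CountDownLatch(1);
  private volatile boolean started;

  public void start() {
    // ... ts.start(listener) for each transport server would go here ...
    started = true;
    startComplete.countDown();
  }

  // Returns true if start finished within the timeout. Per the review
  // suggestion, this should only be reached once started == true.
  boolean awaitStartComplete() throws InterruptedException {
    assert started : "awaitStartComplete() called before start()";
    return startComplete.await(30, TimeUnit.SECONDS);
  }
}
```

Returning the boolean (instead of the void method in the real snippet) just makes the gate observable in a small sketch.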
@@ -420,7 +420,6 @@ public void start(ServerListener listener) throws IOException {
  } catch (IOException e) {
    assertSame(ex, e);
  }
verifyNoMoreInteractions(executorPool);
Why was this removed?
Because executorPool.getObject() is called after the change.
@@ -388,6 +409,7 @@ public ServerTransportListener transportCreated(ServerTransport transport) {
public void serverShutdown() {
  ArrayList<ServerTransport> copiedTransports;
  Status shutdownNowStatusCopy;
  awaitStartComplete();
This looks dangerous. I'd much rather we set activeTransportServers = transportServers.size() within the initial synchronization block of start() to prevent needing to block here.
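The counter-based alternative being proposed might look roughly like this. All names here (addTransportServer, the Runnable stand-in for a transport server) are illustrative assumptions, not grpc-java API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the proposed alternative: record the transport-server count
// inside start()'s synchronized block, so serverShutdown() can simply
// decrement a counter instead of blocking on a latch.
public class CounterSketch {
  private final Object lock = new Object();
  private final List<Runnable> transportServers = new ArrayList<>();
  private final AtomicInteger activeTransportServers = new AtomicInteger();

  public void addTransportServer(Runnable ts) {
    synchronized (lock) {
      transportServers.add(ts);
    }
  }

  public void start() {
    synchronized (lock) {
      // The count is known up front, before any transport server
      // actually starts, so shutdown never waits for start() to finish.
      activeTransportServers.set(transportServers.size());
    }
    for (Runnable ts : transportServers) {
      ts.run();  // stand-in for ts.start(listener)
    }
  }

  // Returns true when the last transport server has terminated.
  public boolean serverShutdown() {
    return activeTransportServers.decrementAndGet() == 0;
  }
}
```

As the reply below this comment notes, this sketch has a hole: if one ts.start() throws, fewer serverShutdown() calls arrive than the counter expects.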
Setting activeTransportServers = transportServers.size() is broken if one of the ts.start() calls throws, because then listener.serverShutdown() will not be called enough times.
Seems we need to redesign multi-transportServer support in the future. It's just too inconvenient now. For now the fix seems to work. Can you elaborate on what the danger is?
Now I see the danger: because serverShutdown() runs in the event loop, if another ts.start() submits something into the event loop while awaitStartComplete() is blocking in serverShutdown(), then it will deadlock again.
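The self-deadlock described above can be demonstrated with a toy single-threaded "event loop" (this is an illustrative demo, not grpc-java or Netty code): the task that would open the latch is queued behind the task blocking on it, so the await can only time out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Toy demonstration: one task blocks on a latch inside a single-threaded
// executor while the only countDown() is queued behind it on the same
// thread. Without the timeout this would deadlock forever.
public class EventLoopDeadlockDemo {
  static boolean runDemo() throws Exception {
    ExecutorService eventLoop = Executors.newSingleThreadExecutor();
    try {
      CountDownLatch startComplete = new CountDownLatch(1);
      // Plays the role of serverShutdown() calling awaitStartComplete().
      Future<Boolean> blocked =
          eventLoop.submit(() -> startComplete.await(500, TimeUnit.MILLISECONDS));
      // Plays the role of ts.start() submitting work into the event loop;
      // it cannot run until the blocked task above gives up.
      eventLoop.execute(startComplete::countDown);
      return blocked.get();  // false: the await timed out
    } finally {
      eventLoop.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runDemo() ? "completed" : "timed out (would deadlock)");
  }
}
```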
Yes, previously it was broken in multiple ways for multi-transport.
We gave up on trying to fix Server for multiple ports. The API is just wrong to support partial failures. The multiple-port handling will be moved into the Netty transport, where partial failures can be handled.
Close in favor of #7674 and subsequent PRs
As discovered in #6601 and its partial fix #6610, ServerImpl (https://github.com/grpc/grpc-java/blob/v1.26.0/core/src/main/java/io/grpc/internal/ServerImpl.java#L184) and NettyServer (https://github.com/grpc/grpc-java/blob/v1.26.0/netty/src/main/java/io/grpc/netty/NettyServer.java#L251) have a risk of deadlock. Targeting issue #6641.

Two things may need to be done to make the code healthy:

(1) Move ts.start(listener) out of the synchronization block.
(2) Move multi-transport-server support from ServerImpl to NettyServer.

This PR is working on (1).
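Item (1) might look roughly like the following sketch. The class, interface, and method names are illustrative assumptions, not the actual ServerImpl code:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: keep only state changes inside the lock and call each
// transport server's start() outside it, so a start() that blocks or
// calls back into the server cannot deadlock on the server lock.
public class StartOutsideLockSketch {
  interface TransportServer { void start() throws IOException; }

  private final Object lock = new Object();
  private final List<TransportServer> transportServers = new ArrayList<>();
  private boolean started;

  public void addTransportServer(TransportServer ts) {
    synchronized (lock) {
      transportServers.add(ts);
    }
  }

  public void start() throws IOException {
    List<TransportServer> snapshot;
    synchronized (lock) {
      if (started) throw new IllegalStateException("Already started");
      started = true;
      snapshot = new ArrayList<>(transportServers);  // copy under the lock
    }
    for (TransportServer ts : snapshot) {
      ts.start();  // potentially blocking work happens unlocked
    }
  }
}
```

Note this sketch deliberately sidesteps the partial-failure problem debated above: if one ts.start() throws mid-loop, the server is left half-started, which is exactly why the multi-port handling was later moved into the transport.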