New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce the flakiness of the OkHttp tests (though not eliminate). #10973
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are making some of these changes because of TSAN failures, it is very helpful to include the failure report in the commit message in case we ever second-guess ourselves.
synchronized (lock) { | ||
socket.setTcpNoDelay(true); | ||
} | ||
HandshakerSocketFactory.HandshakeResult result = | ||
config.handshakerSocketFactory.handshake(socket, Attributes.EMPTY); | ||
synchronized (handShakeLock) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does nothing. This method is only called once ever per handShakeLock
instance, and the lock isn't used anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Write of size 4 at 0x00008dc081dc by thread T73 (mutexes: write M0):
#0 sun.security.ssl.SSLSessionImpl.invalidate()V SSLSessionImpl.java:781
#1 sun.security.ssl.TransportContext.fatal(Lsun/security/ssl/Alert;Ljava/lang/String;ZLjava/lang/Throwable;)Ljavax/net/ssl/SSLException; TransportContext.java:386
#2 sun.security.ssl.TransportContext.fatal(Lsun/security/ssl/Alert;Ljava/lang/String;Ljava/lang/Throwable;)Ljavax/net/ssl/SSLException; TransportContext.java:303
#3 sun.security.ssl.TransportContext.fatal(Lsun/security/ssl/Alert;Ljava/lang/Throwable;)Ljavax/net/ssl/SSLException; TransportContext.java:298
#4 sun.security.ssl.SSLSocketImpl.handleException(Ljava/lang/Exception;)V SSLSocketImpl.java:1686
#5 sun.security.ssl.SSLSocketImpl.startHandshake(Z)V SSLSocketImpl.java:466
#6 sun.security.ssl.SSLSocketImpl.startHandshake()V SSLSocketImpl.java:427
#7 io.grpc.okhttp.OkHttpProtocolNegotiator.negotiate(Ljavax/net/ssl/SSLSocket;Ljava/lang/String;Ljava/util/List;)Ljava/lang/String; OkHttpProtocolNegotiator.java:100
#8 io.grpc.okhttp.TlsServerHandshakerSocketFactory.handshake(Ljava/net/Socket;Lio/grpc/Attributes;)Lio/grpc/okhttp/HandshakerSocketFactory$HandshakeResult; TlsServerHandshakerSocketFactory.java:58
#9 io.grpc.okhttp.OkHttpServerTransport.startIo(Lio/grpc/internal/SerializingExecutor;)V OkHttpServerTransport.java:171
#10 io.grpc.okhttp.OkHttpServerTransport.lambda$start$0(Lio/grpc/internal/SerializingExecutor;)V OkHttpServerTransport.java:159
#11 io.grpc.okhttp.OkHttpServerTransport$$Lambda$90.run()V ??
Previous read of size 4 at 0x00008dc081dc by thread T76:
#0 sun.security.ssl.SSLSessionImpl.setContext(Lsun/security/ssl/SSLSessionContextImpl;)V SSLSessionImpl.java:768
#1 sun.security.ssl.SSLSessionContextImpl.put(Lsun/security/ssl/SSLSessionImpl;)V SSLSessionContextImpl.java:196
#2 sun.security.ssl.Finished$T12FinishedConsumer.onConsumeFinished(Lsun/security/ssl/ServerHandshakeContext;Ljava/nio/ByteBuffer;)V Finished.java:606
#3 sun.security.ssl.Finished$T12FinishedConsumer.consume(Lsun/security/ssl/ConnectionContext;Ljava/nio/ByteBuffer;)V Finished.java:527
#4 sun.security.ssl.SSLHandshake.consume(Lsun/security/ssl/ConnectionContext;Ljava/nio/ByteBuffer;)V SSLHandshake.java:392
#5 sun.security.ssl.HandshakeContext.dispatch(BLjava/nio/ByteBuffer;)V HandshakeContext.java:443
#6 sun.security.ssl.HandshakeContext.dispatch(BLsun/security/ssl/Plaintext;)V HandshakeContext.java:421
#7 sun.security.ssl.TransportContext.dispatch(Lsun/security/ssl/Plaintext;)V TransportContext.java:183
#8 sun.security.ssl.SSLTransport.decode(Lsun/security/ssl/TransportContext;[Ljava/nio/ByteBuffer;II[Ljava/nio/ByteBuffer;II)Lsun/security/ssl/Plaintext; SSLTransport.java:172
#9 sun.security.ssl.SSLSocketImpl.decode(Ljava/nio/ByteBuffer;)Lsun/security/ssl/Plaintext; SSLSocketImpl.java:1511
#10 sun.security.ssl.SSLSocketImpl.readHandshakeRecord()I SSLSocketImpl.java:1421
#11 sun.security.ssl.SSLSocketImpl.startHandshake(Z)V SSLSocketImpl.java:456
#12 sun.security.ssl.SSLSocketImpl.startHandshake()V SSLSocketImpl.java:427
#13 io.grpc.okhttp.OkHttpProtocolNegotiator.negotiate(Ljavax/net/ssl/SSLSocket;Ljava/lang/String;Ljava/util/List;)Ljava/lang/String; OkHttpProtocolNegotiator.java:100
#14 io.grpc.okhttp.TlsServerHandshakerSocketFactory.handshake(Ljava/net/Socket;Lio/grpc/Attributes;)Lio/grpc/okhttp/HandshakerSocketFactory$HandshakeResult; TlsServerHandshakerSocketFactory.java:58
#15 io.grpc.okhttp.OkHttpServerTransport.startIo(Lio/grpc/internal/SerializingExecutor;)V OkHttpServerTransport.java:171
#16 io.grpc.okhttp.OkHttpServerTransport.lambda$start$0(Lio/grpc/internal/SerializingExecutor;)V OkHttpServerTransport.java:159
#17 io.grpc.okhttp.OkHttpServerTransport$$Lambda$90.run()V ??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a JDK bug. I don't know what JDK was used when TSAN failed, but based on the line numbers it looks to be JDK 11 or earlier. And it adds the new session to the session cache before setting the context, which does seem racy.
https://github.com/openjdk/jdk/blob/da75f3c4ad5bdf25167a3ed80e51f567ab3dbd01/src/java.base/share/classes/sun/security/ssl/SSLSessionContextImpl.java#L175
JDK master does the same, without any extra synchronization. So I think we need to report this and get it fixed.
@@ -117,6 +117,7 @@ public void mtls_succeeds() throws Exception { | |||
ManagedChannel channel = grpcCleanupRule.register(clientChannel(server, channelCreds)); | |||
|
|||
SimpleServiceGrpc.newBlockingStub(channel).unaryRpc(SimpleRequest.getDefaultInstance()); | |||
server.shutdown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do these do? The Cleanup rule will shut down the server just after this line. Do we have to shut down the server before the client?
It isn't like OkHttpServer.acceptConnecitons() is spinning tightly in its loop. accept()
blocks until there is a connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But because the client shuts down before the server, after it falls out of the accept it goes through the logic that is skipped when the shutdown flag is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this and I saw no change in failure rate. (1000 runs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Darn reality not matching the theory .... :(
@@ -293,7 +294,7 @@ public void close() throws SecurityException { | |||
Buffer buffer = createMessageFrame(message); | |||
frameHandler().data(false, 3, buffer, (int) buffer.size(), | |||
(int) buffer.size()); | |||
assertThat(logs).hasSize(1); | |||
assertWithMessage("log messages: " + logs).that(logs).hasSize(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this help? I tried it out and it seems to change:
io.grpc.okhttp.OkHttpClientTransportTest > testClientHandlerFrameLogger FAILED
value of : iterable.size()
expected : 2
but was : 1
iterable was: [java.util.logging.LogRecord@7f12d356]
into:
io.grpc.okhttp.OkHttpClientTransportTest > testClientHandlerFrameLogger FAILED
log messages: [java.util.logging.LogRecord@45026a28]
value of : iterable.size()
expected : 2
but was : 1
iterable was: [java.util.logging.LogRecord@45026a28]
The point of Truth is that it prints the value of what your testing. Why print it twice?
synchronized (lock) { | ||
this.socket = result.socket; | ||
this.attributes = result.attributes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did TSAN show this was needed? This field shouldn't be looked at until listener.transportReady()
. I'd be interested to see what TSAN found.
Doing server.shutdown() after the TlsTest tests, because OkHttpServer.acceptConnections is looping creating transports which TSAN gets unhappy when they conflict with cleanup. Doing a shutdown at the end of each test, so that loop can break before the rule's cleanup, pretty much eliminated those incidents.