Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve client and server handling of STOMP heartbeat failures #32195

Closed
funnelspider opened this issue Feb 5, 2024 · 15 comments
Closed

Improve client and server handling of STOMP heartbeat failures #32195

funnelspider opened this issue Feb 5, 2024 · 15 comments
Assignees
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) type: enhancement A general enhancement
Milestone

Comments

@funnelspider
Copy link

funnelspider commented Feb 5, 2024

I have a spring framework application and a swing application which communicates with the spring app (tomcat) via websockets.

Locally I cannot reproduce this issue, but in a cloud environment it is reproducible.

The swing application has an auto-reconnect job which checks if the websocket is disconnected, and if it is it reconnects.

My assumption is that when the swing client disconnects (because it doesn't receive heartbeats from the server), the server does not notice this and keeps the sessions alive.

The log in tomcat looks like this

[snippet from the log]
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 899019 ms. Closing WebSocketServerSockJsSession[id=838e557837c74ef2a23bcd1bb72cec9c].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 1134777 ms. Closing WebSocketServerSockJsSession[id=0b0d711972344aa09395006e90ba83e1].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 323835 ms. Closing WebSocketServerSockJsSession[id=26f17a9feec3402982b1eba5ad0bde7d].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 805026 ms. Closing WebSocketServerSockJsSession[id=6816a489125d4161991935172939e50a].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 452725 ms. Closing WebSocketServerSockJsSession[id=005db3a02a9d445389441e61a737cbbc].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 605709 ms. Closing WebSocketServerSockJsSession[id=f02769867e3a4504b17c0adce5a29cc4].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 746056 ms. Closing WebSocketServerSockJsSession[id=f3641dd2ed904e519bf41815a9138497].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 617321 ms. Closing WebSocketServerSockJsSession[id=2b5cb43b57b947e0bf4d668b4f669bf4].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 148400 ms. Closing WebSocketServerSockJsSession[id=684868babebc4a31b1c19678b6d60023].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 699036 ms. Closing WebSocketServerSockJsSession[id=c20bdd087a914016ab67d368e85b4db6].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 1122979 ms. Closing WebSocketServerSockJsSession[id=eef41fbe669a4b1a9bb54c429e5d5453].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 347276 ms. Closing WebSocketServerSockJsSession[id=749c7af7447b4326b3dd075f8551cf83].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 1052356 ms. Closing WebSocketServerSockJsSession[id=a152d5653145419eabca743be7264d77].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 663907 ms. Closing WebSocketServerSockJsSession[id=7ffa907967d145bda0aeb47374ce152c].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 828531 ms. Closing WebSocketServerSockJsSession[id=cab6085d5f6b4715850e611e7a556906].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 101438 ms. Closing WebSocketServerSockJsSession[id=c3ca9b8b5f0e4ea2acd2b3cb1ea01a0d].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 570379 ms. Closing WebSocketServerSockJsSession[id=28b0d8b124e242acbb5532dd55c98ed3].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 218556 ms. Closing WebSocketServerSockJsSession[id=72d98d863d1241ea8a1395c589dc2237].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 288750 ms. Closing WebSocketServerSockJsSession[id=67962e5785344dc6a31e0efdfbb806a0].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 734328 ms. Closing WebSocketServerSockJsSession[id=f99634fbeb744bfda64a9c6f367c88c8].
2024-01-30 13:53:53 INFO  SubProtocolWebSocketHandler:503 - No messages received after 863701 ms. Closing WebSocketServerSockJsSession[id=5181f2406c1149219affd012491c2b90].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 487918 ms. Closing WebSocketServerSockJsSession[id=c4f684581c2e42c69771d35f5ed28a5d].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 358939 ms. Closing WebSocketServerSockJsSession[id=141d1c84a5dc4152b272a64be10d970a].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 312150 ms. Closing WebSocketServerSockJsSession[id=12b30f3b3b89445583936d601578c07b].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 476291 ms. Closing WebSocketServerSockJsSession[id=e3f1f65e14ac449c8569ff94eb5538e1].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 934254 ms. Closing WebSocketServerSockJsSession[id=4451c723925b4ab689b979581407822b].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 405862 ms. Closing WebSocketServerSockJsSession[id=0faf90546b7f46b6ab05fb4233ccc4b9].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 265358 ms. Closing WebSocketServerSockJsSession[id=4dc1ec3422b74beb8d77e6af801dfe7f].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 440993 ms. Closing WebSocketServerSockJsSession[id=87672dcc5c6542758156d0ae504bac87].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 499644 ms. Closing WebSocketServerSockJsSession[id=1d3ce99766884f4b8fa6b59ad5f91ce6].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 1075868 ms. Closing WebSocketServerSockJsSession[id=c0d82ff27bc04618bd24e732fd206582].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 523268 ms. Closing WebSocketServerSockJsSession[id=5e9eb26093df44a0b4f9f2104fb55f66].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 981544 ms. Closing WebSocketServerSockJsSession[id=b095f6b892b54f378812649e603c0aa5].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 910673 ms. Closing WebSocketServerSockJsSession[id=5116b157ed24403bba49ce71709bc137].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 558508 ms. Closing WebSocketServerSockJsSession[id=a7f2174b5c154e99920836ebfdddb988].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 183551 ms. Closing WebSocketServerSockJsSession[id=4d7e2af66ab0495ab5b07f4a753dbc5d].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 206850 ms. Closing WebSocketServerSockJsSession[id=d5fc5f07b6064b64abe21f190ece09c6].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 136655 ms. Closing WebSocketServerSockJsSession[id=dedbdc7d5e8947b4a798326594495210].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 793166 ms. Closing WebSocketServerSockJsSession[id=a47f59c5ba4f4391a1e7f22cf509fa70].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 1099473 ms. Closing WebSocketServerSockJsSession[id=1c387dbaf9b14bb8a6581a3544fda7e7].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 851999 ms. Closing WebSocketServerSockJsSession[id=d7cce468c28746f2977bc74df9e9897d].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 710846 ms. Closing WebSocketServerSockJsSession[id=f111560133f04e54a15035bd33af2a0f].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 1240907 ms. Closing WebSocketServerSockJsSession[id=312b15c8b0a9405390710e1e1db882fb].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 1217247 ms. Closing WebSocketServerSockJsSession[id=0b8b10d2f8a84694ae6d767d902605e6].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 769425 ms. Closing WebSocketServerSockJsSession[id=f780343ac0a14247a70671dd6fa05785].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 840296 ms. Closing WebSocketServerSockJsSession[id=15032180aed9426db864a6ac4616b1fc].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 1087685 ms. Closing WebSocketServerSockJsSession[id=88a8d1a67d52424bb7969757514496f8].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 1264368 ms. Closing WebSocketServerSockJsSession[id=4f2c66e9302a4efd8e25e617450eb697].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 1252667 ms. Closing WebSocketServerSockJsSession[id=70021c8bea4940c6a94f628ae57d369e].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 77837 ms. Closing WebSocketServerSockJsSession[id=ba1f324f20704c06a382be73fd4262ff].
2024-01-30 13:53:54 INFO  SubProtocolWebSocketHandler:503 - No messages received after 253665 ms. Closing WebSocketServerSockJsSession[id=23cd562d6f1f4f818b40797818dec8ef].
@Configuration
@EnableWebSocketMessageBroker
public class WebSocketConfig extends AbstractSecurityWebSocketMessageBrokerConfigurer {

@Override
  public void configureMessageBroker(MessageBrokerRegistry config) {
    var taskScheduler = new ThreadPoolTaskScheduler();
    taskScheduler.initialize();

    config.enableSimpleBroker("/endpint1","/endpoint2")
            .setHeartbeatValue(new long[] {10000, 10000})
            .setTaskScheduler(taskScheduler);
    config.setUserDestinationPrefix("/user")
            .setApplicationDestinationPrefixes("/app");
  }
@Override
  public void registerStompEndpoints(StompEndpointRegistry registry) {
    registry.addEndpoint("/endpint1").setAllowedOrigins(allowedOrigins.split(",")).
      setHandshakeHandler(handshakeHandler()).withSockJS().setInterceptors(proxyConfig);
    registry.addEndpoint("/endpoint2").setAllowedOrigins(allowedOrigins.split(",")).withSockJS();
  }
 @Override public void configureWebSocketTransport(final WebSocketTransportRegistration registration) {
    registration.setMessageSizeLimit(500 * 1024);
  }

@Override
  public void configureClientOutboundChannel(ChannelRegistration registration) {
    registration.interceptors(customChannelInterceptor);
  }
@Bean
  CustomHandshakeHandler handshakeHandler() {
    return new CustomHandshakeHandler();
  }

}

And client connects like this:

var stompSession = stompClient.connect(StringUtil.getURLString(model.getServer(), "/api2/endpoint1"), headers, new StompSessionHandlerAdapter() {
 @Override public void afterConnected(StompSession session, StompHeaders connectedHeaders) {... }
@Override public void handleTransportError(StompSession session, Throwable exception) { 
// disconnects websocket on exception 
}).get();

stompSession.subscribe("/user/**", stompFrameHandler)

Is there a way to check on the client that there is a websocket connection active or not ?

I also found this comment where the log above comes from:

	/**
	 * When a session is connected through a higher-level protocol it has a chance
	 * to use heartbeat management to shut down sessions that are too slow to send
	 * or receive messages. However, after a WebSocketSession is established and
	 * before the higher level protocol is fully connected there is a possibility for
	 * sessions to hang. This method checks and closes any sessions that have been
	 * connected for more than 60 seconds without having received a single message.
	 */

What exactly is a 'session that is to slow to send or receive messages' ?

How would i be able to simulate this?

image

It's worth mentioning that when the websocket request in the CLOUD is made the following http request are performed:

/api2/endpoint1/info
/api2/endpoint1/29/4f2c66e9302a4efd8e25e617450eb697/websocket 
/api2/endpoint1/xhr_streaming
/api2/endpoint1/xhr

Locally I only get

/api2/endpoint1/info
/api2/endpoint1/29/4f2c66e9302a4efd8e25e617450eb697/websocket 

I'm using spring framework 6.0.14

Did not have this many issues with 6.0.8

Did something change between the 2 versions?

Thanks

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Feb 5, 2024
@jhoeller jhoeller added the in: web Issues in web modules (web, webmvc, webflux, websocket) label Feb 5, 2024
@funnelspider
Copy link
Author

I found a scenario although not reproducible
It's like this
StompSession tries to send heartbeats. This fails with

java.lang.IllegalStateException: Message will not be sent because the WebSocket session has been closed
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.writeMessagePart(WsRemoteEndpointImplBase.java:450) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:308) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:250) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendPartialString(WsRemoteEndpointImplBase.java:223) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.apache.tomcat.websocket.WsRemoteEndpointBasic.sendText(WsRemoteEndpointBasic.java:48) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.springframework.web.socket.adapter.standard.StandardWebSocketSession.sendTextMessage(StandardWebSocketSession.java:215) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.adapter.AbstractWebSocketSession.sendMessage(AbstractWebSocketSession.java:108) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.sockjs.client.WebSocketClientSockJsSession.sendInternal(WebSocketClientSockJsSession.java:131) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.sockjs.client.AbstractClientSockJsSession.sendMessage(AbstractClientSockJsSession.java:173) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.messaging.WebSocketStompClient$WebSocketTcpConnectionHandlerAdapter.sendAsync(WebSocketStompClient.java:465) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.messaging.simp.stomp.DefaultStompSession$WriteInactivityTask.run(DefaultStompSession.java:718) ~[spring-messaging-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.messaging.WebSocketStompClient$WebSocketTcpConnectionHandlerAdapter.lambda$onWriteInactivity$1(WebSocketStompClient.java:511) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-6.1.1.jar!/:6.1.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1583) ~[?:?]

My handleTransportError implementation is called which tries to disconnect a StompSession if it is connected.
In this case it is connected and so what I do when disconnecting a session is I call
subscription.unsubscribe() and stompSession.disconnect();

Unsubscribe and disconnect both fail with java.lang.IllegalStateException: The remote endpoint was in state [TEXT_PARTIAL_WRITING] which is an invalid state for called method


2024-02-07 17:41:50 ERROR WebSocketClientService:276 - Error unsubscribing from event: Subscription [id=0, destination='/user/**', receiptId='null', handler=d.v.k.t.s.StompFrameHandlerService@29fa6b65]
org.springframework.messaging.MessageDeliveryException: null
	at org.springframework.messaging.simp.stomp.DefaultStompSession.execute(DefaultStompSession.java:293) ~[spring-messaging-6.1.1.jar!/:6.1.1]
	at org.springframework.messaging.simp.stomp.DefaultStompSession.unsubscribe(DefaultStompSession.java:361) ~[spring-messaging-6.1.1.jar!/:6.1.1]
	at org.springframework.messaging.simp.stomp.DefaultStompSession$DefaultSubscription.unsubscribe(DefaultStompSession.java:699) ~[spring-messaging-6.1.1.jar!/:6.1.1]
	at org.springframework.messaging.simp.stomp.DefaultStompSession$DefaultSubscription.unsubscribe(DefaultStompSession.java:691) ~[spring-messaging-6.1.1.jar!/:6.1.1]
	at d.v.k.t.s.WebSocketClientService.lambda$disconnect$1(WebSocketClientService.java:274) ~[!/:2.8.0-19956-SNAPSHOT]
	at java.util.ArrayList.forEach(ArrayList.java:1596) ~[?:?]
	at d.v.k.t.s.WebSocketClientService.disconnect(WebSocketClientService.java:272) ~[!/:2.8.0-19956-SNAPSHOT]
	at d.v.k.t.s.WebSocketClientService$1.handleTransportError(WebSocketClientService.java:216) ~[!/:2.8.0-19956-SNAPSHOT]
	at org.springframework.messaging.simp.stomp.DefaultStompSession.handleFailure(DefaultStompSession.java:508) ~[spring-messaging-6.1.1.jar!/:6.1.1]
	at org.springframework.messaging.simp.stomp.DefaultStompSession$WriteInactivityTask.lambda$run$0(DefaultStompSession.java:720) ~[spring-messaging-6.1.1.jar!/:6.1.1]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
	at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:887) ~[?:?]
	at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2357) ~[?:?]
	at org.springframework.messaging.simp.stomp.DefaultStompSession$WriteInactivityTask.run(DefaultStompSession.java:718) ~[spring-messaging-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.messaging.WebSocketStompClient$WebSocketTcpConnectionHandlerAdapter.lambda$onWriteInactivity$1(WebSocketStompClient.java:511) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-6.1.1.jar!/:6.1.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1583) ~[?:?]
Caused by: java.lang.IllegalStateException: The remote endpoint was in state [TEXT_PARTIAL_WRITING] which is an invalid state for called method
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase$StateMachine.checkState(WsRemoteEndpointImplBase.java:1245) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase$StateMachine.textPartialStart(WsRemoteEndpointImplBase.java:1210) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendPartialString(WsRemoteEndpointImplBase.java:222) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.apache.tomcat.websocket.WsRemoteEndpointBasic.sendText(WsRemoteEndpointBasic.java:48) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.springframework.web.socket.adapter.standard.StandardWebSocketSession.sendTextMessage(StandardWebSocketSession.java:215) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.adapter.AbstractWebSocketSession.sendMessage(AbstractWebSocketSession.java:108) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.sockjs.client.WebSocketClientSockJsSession.sendInternal(WebSocketClientSockJsSession.java:131) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.sockjs.client.AbstractClientSockJsSession.sendMessage(AbstractClientSockJsSession.java:173) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.messaging.WebSocketStompClient$WebSocketTcpConnectionHandlerAdapter.sendAsync(WebSocketStompClient.java:465) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.messaging.simp.stomp.DefaultStompSession.execute(DefaultStompSession.java:290) ~[spring-messaging-6.1.1.jar!/:6.1.1]
	... 21 more

As a consequence of this, the inactivityTasks (read and write) are not cancelled and they keep popping up in the log file
This is the swing app that's running using spring boot 3.2.0

I didn't have these issues with spring boot 2.7.4

Not sure if not cancelling these tasks keeps somehow the connection alive (as the server disconnect did not happen)

@funnelspider
Copy link
Author

Managed to reproduce the "not cancelling the inactivityTasks" from the WebSocketStompClient.WebSocketTcpConnectionHandlerAdapter.

With this scenario, the cancelInactivityTasks() method is not called and so I get every 30 seconds (heart-beats are sent every 10 seconds) the exception
java.lang.IllegalStateException: Server has gone quiet. Closing connection in session id=a256ab6c-0f85-b8b6-8de8-e143dbedba52.

The stack is this

Caused by: java.lang.IllegalStateException: The WebSocket session [3] has been closed and no method (apart from close()) may be called on a closed session
	at org.apache.tomcat.websocket.WsSession.checkState(WsSession.java:978) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.apache.tomcat.websocket.WsSession.getBasicRemote(WsSession.java:529) ~[tomcat-embed-websocket-10.1.15.jar!/:?]
	at org.springframework.web.socket.adapter.standard.StandardWebSocketSession.sendTextMessage(StandardWebSocketSession.java:215) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.adapter.AbstractWebSocketSession.sendMessage(AbstractWebSocketSession.java:108) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.sockjs.client.WebSocketClientSockJsSession.sendInternal(WebSocketClientSockJsSession.java:131) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.sockjs.client.AbstractClientSockJsSession.sendMessage(AbstractClientSockJsSession.java:173) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.web.socket.messaging.WebSocketStompClient$WebSocketTcpConnectionHandlerAdapter.sendAsync(WebSocketStompClient.java:465) ~[spring-websocket-6.1.1.jar!/:6.1.1]
	at org.springframework.messaging.simp.stomp.DefaultStompSession.execute(DefaultStompSession.java:290) ~[spring-messaging-6.1.1.jar!/:6.1.1]
	... 21 more

If you set a breakpoint in the WsSession.checkState() method and manually set the status to CLOSED, you get this exception and the inactivityTasks run further even though the session is closed and the StandardWebSocketSession.sendTextMessage() throws the above exception.

Can anyone help here?

@rstoyanchev
Copy link
Contributor

rstoyanchev commented Feb 9, 2024

@funnelspider, mixing analysis without fully describing the issue results in a report that is difficult to read and leaves a lot to be unsure about. For future reference, please focus on describing the scenario and the outcome in as much details as you can before moving on to analysis and assumptions.

Overall I am uncertain what the actual issue you're reporting is. Whether it is a client issue or a server issue for example, or both. There are however a few things I can answer.

I have checked 6.0.9 to 6.0.14 and there is nothing related to this. There could be other upgrades like WebSocket or HTTP client versions, and others including environment changes.

"No messages received after..." indicates a WebSocket handshake without a STOMP CONNECT frame within 60 seconds, usually because a proxy does not understand or does not allow WebSocket traffic. If CONNECT and CONNECTED cannot be exchanged, there are no heartbeats and server WebSocket sessions can linger. This is what this check guards against.

The log messages are consistent with the SockJS request sequence, which indicates in the cloud it fails to communicate over WebSocket, then tries HTTP streaming (xhr_streaming), and finally falls back on polling (xhr). A proxy is likely precluding WebSocket and streaming. Check what is required for WebSocket and/or streaming in your cloud configuration, or whether those requirements have changed. I also noticed the proxyConfig interceptor in your config that could be relevant.

handleTransportError is called for transport issues, and it makes no sense to unsubscribe or disconnect as the transport is already lost, and it's no longer possible to send messages.

cancelInactivityTasks depends on the WebSocketHandler#afterConnectionClosed being called. This should always be called by the WebSocket client (Tomcat in this case). However, it's possible that with SockJS fallbacks there is some issue. Generally, the SockJsClient does not call WebSocketHandler methods until it has connected successfully over some transport, which also involves the exchange of SockJS frames so we know the transport is usable. However, the stacktrace indicates the STOMP session is trying to send messages over WebSocket, which should not have been successfully and fully connected.

We could ensure cancelInactivityTasks is also called whenever handleTransportError happens to make this more fail-safe. However, it's not entirely clear if that would clear everything up. It would also be very good to understand more about what actually happens. Would be possible to provide log output from the client with org.springframework.web, org.springframework.http, and org.springframework.messaging set to DEBUG level?

@rstoyanchev rstoyanchev added the status: waiting-for-feedback We need additional information before we can continue label Feb 9, 2024
@funnelspider
Copy link
Author

funnelspider commented Feb 9, 2024

I realize mixing the issue description with my own analysis is not the best approach. I was basically trying to give you information when I was discovering them.

Regarding the 'No messages received after' message. Most of them are closed after > 10 minutes. (No messages received after 1122979 ms. Closing WebSocketServerSockJsSession[id=eef41fbe669a4b1a9bb54c429e5d5453)
If the websocket session lingers and if enough sessions are open like this, this could very well cause the OOM Error couldn't it ?

Why not have a scheduled job which clears these after 60 seconds (as the method comment says).
Right now I saw this checkSessions() method is only called when a new message comes in. (called from SubProtocolWebSocketHandler.handleMessage() method)

Regarding the environment configuration, I was assured by our OPS team that nothing had changed in that regard.
A system which was previously working fine with an older version of the application suddenly started having issues. We had such problems in the past but not as frequent.

The proxyInterceptor is used to set some custom headers in the beforeHandshake method. This is unchanged since 2019.

Regarding handleTransportError. I'm not so sure that the connection is closed in all cases because I added a check for isConnected() and it also returns true sometimes.

Maybe the WSSession is closed but not the StompSession.

 public void handleTransportError(StompSession session, Throwable exception) {
          if (session.isConnected()) {
             disconnect() 
         }
}

It would be great if cancelInactivityTasks was called when handleTransportError is called because right now in our case this just pollutes the log file. Closed sessions writing every 30 seconds that they didn't receive any heartbeats.

Yes, I will add the DEBUG logging for org.springframework.web, org.springframework.http, and org.springframework.messaging to the ticket. I already have for org.springframework.messaging

ws.messaging.snippet.log
Here you can see fdcb4147-2a7b-5960-40fd-9ad30f33d5b9 session is established and gets heartbeats.

After a while an exception comes (you can't see it in this log but it's the one from above Caused by: java.lang.IllegalStateException: The WebSocket session [3] has been closed and no method (apart from close()) may be called on a closed session
and the session gets disconnected
A new session is created 2826788c-5e28-ca54-b937-8c0d4abecd4d and after 30 seconds a DEBUG DefaultStompSession:127 - Server has gone quiet. Closing connection in session id=fdcb4147-2a7b-5960-40fd-9ad30f33d5b9. comes from the old (already disconnected) session.

I will come back with logs from the org.springframework.web, org.springframework.http.

LATER EDIT
Before OOM I had these statistics

WebSocketMessageBrokerStats:128 - WebSocketSession[152 current WS(152)-HttpStream(0)-HttpPoll(0), 964 total, 0 closed abnormally (0 connect failure, 0 send limit, 170 transport error)], stompSubProtocol[processed CONNECT(649)-CONNECTED(649)-DISCONNECT(642)], stompBrokerRelay[null], inboundChannel[pool size = 2, active threads = 0, queued tasks = 0, completed tasks = 12297], outboundChannel[pool size = 2, active threads = 0, queued tasks = 0, completed tasks = 3776], sockJsScheduler[pool size = 1, active threads = 1, queued tasks = 169, completed tasks = 37864]

Can you shed some light into what they show ? Especially the sockJsScheduler and if you see something wrong here

One more mention here, I've seen on other systems X closed abnormally. (

2024-01-31 13:03:45 INFO WebSocketMessageBrokerStats:128 - WebSocketSession[406 current WS(406)-HttpStream(0)-HttpPoll(0), 1794 total, 732 closed abnormally (732 connect failure, 0 send limit, 738 transport error)], stompSubProtocol[processed CONNECT(658)-CONNECTED(658)-DISCONNECT(642)], stompBrokerRelay[null], inboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 16524], outboundChannel[pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6154], sockJsScheduler[pool size = 2, active threads = 1, queued tasks = 425, completed tasks = 130457]

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Feb 9, 2024
@funnelspider
Copy link
Author

spring.20240210.log
The requested org.springframework.web, org.springframework.http, and org.springframework.messaging set to DEBUG level logs

@rstoyanchev
Copy link
Contributor

rstoyanchev commented Feb 12, 2024

Regarding the 'No messages received after' message. Most of them are closed after > 10 minutes. ...
If the websocket session lingers and if enough sessions are open like this, this could very well cause the OOM Error couldn't it ?

Why not have a scheduled job which clears these after 60 seconds (as the method comment says).
Right now I saw this checkSessions() method is only called when a new message comes in. (called from SubProtocolWebSocketHandler.handleMessage() method)

The idea is to use a natural trigger vs managing yet another scheduled task. Thinking about it again, it might make more sense to check when new connections are established instead of when messages arrive. I'll make a change along those lines.

Regarding the environment configuration, I was assured by our OPS team that nothing had changed in that regard.

Nevertheless, the "No server messages received..." indicates clearly WebSocket connectivity issues. WebSocket connections are established, but messages do not start flowing, and STOMP negotiation fails to complete. Maybe this issue happens for some clients only and the proxy issue is on their side if they are external, or perhaps there was some other less obvious or indirect environment or network change (some upgrade somewhere).

Maybe the WSSession is closed but not the StompSession.

That's what I expect, correct. At the level of DefaultStompSession the isConnected just means the internal connection reference hasn't been reset, which would happen when afterConnectionClosed is called and that's where cancelInactivityTasks is also called. For a WebSocket connection, this callback should be triggered by the underlying WebSocket client library (Tomcat in this case it seems). For SockJS polling, it's us making sure it is called if polling fails. It doesn't seem to be happening.

WebSocketBrokerStats should be reasonably straight forward to follow, if you look in the toString() method and then follow usages of the individual stats objects. The first log message shows 964 sessions in total established over time; currently 152 WebSocket sessions are open, and no HTTP streaming/polling sessions; 170 transport errors where we were notified by the WebSocket session that writing failed (and presumably sessions were closed); 649 sessions successfully negotiated at the STOMP level, and 642 eventually disconnected at the STOMP level, so 7 remain connected at the STOMP level, while 163 remain not successfully negotiated at the STOMP level.

The SockJS protocol requires the server to send heartbeat messages, by default every 25 seconds but it's configurable. When STOMP is used on top, we disable those from StompSubProtocolHandler by calling sockJsSession.disableHeartbeat() but only after STOMP is successfully negotiates heartbeats at its level. That negotiation fails for a number of sessions and the queued tasks for SockJS scheduler indicate that that it's sending SockJS heartbeats from the server side, but we know a number of WebSocket sessions are not working well.

I've looked at the logs. I was hoping they would help me to see if there is any issue with SockJS transport fallbacks and the close notification, but I did not see any evidence of falling back such as requests to "/xhr_streaming" or "/xhr" as you had indicated earlier, and likewise no IllegalStateException nor any exception or stacktrace. Not sure if there are some parts that are maybe missing.

@rstoyanchev rstoyanchev self-assigned this Feb 12, 2024
@funnelspider
Copy link
Author

funnelspider commented Feb 12, 2024

Thank you for your answer and insight
The logs are only from a test system whilst the xhr_streaming and /xhr were from productive systems. It's going to take some time until we rollout a version to those customers but I'll come back with a feedback.

From the log that I've attached spring.20240210.log it is visible that the session is closed at 03:51:43

2024-02-10 03:51:43 TRACE DefaultStompSession:137 - Sending DISCONNECT session=839c0640-30cc-c1e5-f6da-fb46734b9388 2024-02-10 03:51:43 TRACE StompEncoder:137 - Encoding STOMP DISCONNECT, headers=null

Afterwards, a new websocket connection is established at 03.51.49 (6 seconds later)

2024-02-10 03:51:49 DEBUG LoggingWebSocketHandlerDecorator:46 - New WebSocketClientSockJsSession[id='ef2e1004f6dd400c84e7f9e3d29fd930, url=https://nhc-demo1.noventi-ora.de/jart/api2/local-jart-proxy] 2024-02-10 03:51:49 DEBUG DefaultStompSession:127 - Connection established in session id=67ee2be0-7bc8-1ada-0617-4be2ef61c85a

and afterwards the old session is sending exceptions
2024-02-10 03:52:04 DEBUG DefaultStompSession:127 - Server has gone quiet. Closing connection in session id=839c0640-30cc-c1e5-f6da-fb46734b9388. 2024-02-10 03:52:19 DEBUG DefaultStompSession:127 - Server has gone quiet. Closing connection in session id=839c0640-30cc-c1e5-f6da-fb46734b9388.
...

So maybe two things can be done here,

  1. you already mentioned, calling the checkSessions() when a connection is established and not when a message comes in and
  2. cancelling the inactivity tasks when the socket is closed / handleTransportError is called

@rstoyanchev
Copy link
Contributor

rstoyanchev commented Feb 12, 2024

"Sending DISCONNECT" is not present in the log file you provided. I do see the following:

2024-02-10 03:51:43 DEBUG WebSocketClientSockJsSession:189 - Closing session with CloseStatus[code=1000, reason=null] in WebSocketClientSockJsSession[id='f3a1ea4c69854b9083c05932f2d99463, url=https://fake-url.com/api2/local-jart-proxy]
2024-02-10 03:51:43 DEBUG NativeWebSocketSession:142 - Closing StandardWebSocketSession[id=7946dcfa-2f4e-f949-dae0-acd4040a5f3a, uri=null]

This indicates something called the close() method on the WebSocket session as a result of a call to resetConnection in DefaultStompSession, but that could have been your handleTransportError, and doesn't help me to understand more about the scenario and see why there was no afterConnectionClosed notification. There is no actual transport error visible either, the IllegalStateException presumably, so not sure what led to the closing from the given log output.

rstoyanchev added a commit that referenced this issue Feb 12, 2024
It makes more sense to call this from afterConnectionEstablished as it
relates to the creation of new sessions.

See gh-32195
@rstoyanchev rstoyanchev added this to the 6.1.4 milestone Feb 12, 2024
@rstoyanchev rstoyanchev added type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged or decided on status: feedback-provided Feedback has been provided labels Feb 12, 2024
@rstoyanchev
Copy link
Contributor

rstoyanchev commented Feb 12, 2024

I've made the change 5851cdc related to checkSessions that should become available in a 6.1.4-SNAPSHOT once the CI build completes. I'll consider some change to the client as well, but further details would still be appreciated. Likewise giving the snapshot a try.

@funnelspider
Copy link
Author

You are right. The logs I was mentioning are on level TRACE.
I've attached a snippet.
On the test system, every night at 3:50 (aprox) the router disconnects and gets a new IP address.
I've attached the org.springframework.messaging logs that are on log level TRACE

ws.20240210.log

@rstoyanchev
Copy link
Contributor

Okay, so for this:

2024-02-10 03:51:43 TRACE StompEncoder:137 - Encoding heartbeat
2024-02-10 03:51:43 TRACE DefaultStompSession:137 - Sending UNSUBSCRIBE {id=[0]} session=839c0640-30cc-c1e5-f6da-fb46734b9388
2024-02-10 03:51:43 TRACE StompEncoder:137 - Encoding STOMP UNSUBSCRIBE, headers={id=[0]}
2024-02-10 03:51:43 TRACE DefaultStompSession:137 - Sending DISCONNECT  session=839c0640-30cc-c1e5-f6da-fb46734b9388
2024-02-10 03:51:43 TRACE StompEncoder:137 - Encoding STOMP DISCONNECT, headers=null
2024-02-10 03:51:43 DEBUG WebSocketClientSockJsSession:189 - Closing session with CloseStatus[code=1000, reason=null] in WebSocketClientSockJsSession[id='f3a1ea4c69854b9083c05932f2d99463, url=https://fake-url.com/api2/local-jart-proxy]
2024-02-10 03:51:43 DEBUG NativeWebSocketSession:142 - Closing StandardWebSocketSession[id=7946dcfa-2f4e-f949-dae0-acd4040a5f3a, uri=null]

Any idea what triggers the unubscribe and disconnect? If it is to match the scenarios described in #32195 (comment), then there should be an IllegalStateException from the heartbeat failing and a "WriteInactivityTask failure" message, but I don't see that.

@funnelspider
Copy link
Author

That scenario I could only reproduce in debug mode. (by manually setting the state to CLOSED
I saw it in the logs but that's it unfortunately.

I will leave the DEBUG logs in the application and if I see it again I'll check the logs.
Is DEBUG enough for this ?

rstoyanchev added a commit that referenced this issue Feb 13, 2024
rstoyanchev added a commit that referenced this issue Feb 13, 2024
ConnectionLostException was applies only after the WebSocket library
notified us of a session close. However, read inactivity and heartbeat
send failures are also cases of the connection being lost rather than
closed intentionally.

This commit also ensures resetConnection is called after a heartbeat
write failure, consistent with other places where a transport error
is handled that implies the connection is lost.

See gh-32195
@rstoyanchev
Copy link
Contributor

I've made improvements on the client side as well that should better ensure the cancellation of heartbeat tasks. If you could give 6.1.4-SNAPSHOT a try after the CI build completes that would be much appreciated.

@rstoyanchev rstoyanchev changed the title OOM error because of lots of “org.apache.tomcat.websocket.server.WsFrameServer” instances Improve client and server handling of STOMP heartbeat failures Feb 13, 2024
@funnelspider
Copy link
Author

On the client side I'm using spring boot version 3.2.0. Can I get a test version for this ?
Otherwise I could just get the spring library and manually patch it :)

@rstoyanchev
Copy link
Contributor

rstoyanchev commented Feb 13, 2024

You don't need a whole new Spring Boot release. You can update just the dependency version for Spring Framework, e.g. for Gradle. For Maven it's just setting a similarly named property "spring-framework.version".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

4 participants