Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve read publisher cancel handling to avoid connections in CLOSE_WAIT state with WebSocket on Tomcat #30393

Closed
philsttr opened this issue Apr 28, 2023 · 5 comments
Assignees
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) type: enhancement A general enhancement
Milestone

Comments

@philsttr
Copy link

philsttr commented Apr 28, 2023

There are a couple cases where using WebSockets with WebFlux on Tomcat can leave connections in a CLOSE_WAIT state after closing the websocket session. These connections stick around, and will eventually cause tomcat to reach its connection limit (if set). This prevents tomcat from accepting new connections, and thus leads to the server becoming unresponsive (except for previously established connections)

When running the same test cases with WebFlux on Netty or Undertow, the connections are closed properly.

I have provided an example project (ws-close-waiting.zip) that shows the cases where the connection gets stuck in CLOSE_WAIT on tomcat after the websocket session is closed.

The project has three websocket endpoints, each showing a different case (only 2 cases fail). In each case, the server will close the websocket session (but in different ways) after receiving a message from the client.

  1. /closeZip - Calls session.close(...) while processing the input stream. The input/output stream are merged with the zip operator. This case leaves the connection in CLOSE_WAIT on tomcat.
  2. /closeZipDelayError - Calls session.close(...) while processing the input stream. The input/output stream are merged with the zipDelayError operator. This case properly closes the connection. I included this case for comparison with the first case. I'm not sure what the downsides of using zipDelayError would be though. Advice appreciated.
  3. /exceptionZipDelayError - Propagates an exception on the input stream, but handles that exception with onErrorResume by calling session.close(...). The input/output streams are merged with the zipDelayError operator. This case leaves the connection in CLOSE_WAIT on tomcat. I included this case to show that the zipWithError operator will "fix" some cases (2), but not every case.

I have enabled the following logging:

logging.level.org.springframework.http.server.reactive=debug
logging.level._org.springframework.http.server.reactive.AbstractListenerReadPublisher=trace
logging.level._org.springframework.http.server.reactive.AbstractListenerWriteProcessor=trace
logging.level._org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor=trace
logging.level._org.springframework.http.server.reactive=trace
logging.level.reactor.netty=debug
logging.level.org.apache.tomcat.websocket=debug

In the failing cases (1 and 3), the read publisher logs a cancel message, and I see the following log lines:

2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [37936546] cancel [READING]
2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [37936546] READING -> COMPLETED

In the successful case (2), the read publisher does not log a cancel message. I think the cancelling is the underlying problem. It prevents the server from noticing that the client has closed the connection.

To test each use case, I used netstat to observe connections, and websocat as the websocket client. Specifically...

I started netstat in a loop to observe connections every second...

while true ; do clear; date; sudo netstat -pn | grep 8080; sleep 1; done

Then I used websocat in another terminal as follows:

  1. connect to one of the three websocket endpoints...
    e.g. websocat -v -v ws://localhost:8080/closeZip (or closeZipNoDelay or exceptionZipNoDelay)
    netstat will show something like...
    Fri Apr 28 01:59:55 PM CEST 2023
    tcp        0      0 127.0.0.1:57316         127.0.0.1:8080          ESTABLISHED 232014/./websocat
    tcp6       0      0 127.0.0.1:8080          127.0.0.1:57316         ESTABLISHED 231835/java
    
  2. type something on the websocat console and press enter. websocat will send what you typed as a text websocket message, and leave the connection open. netstat output remains unchanged
  3. press CTRL-D on the websocat console to end the input stream. websocat will exit.
    For the successful cases, the connections will disappear from netstat.
    For the failure cases, netstat will show something like...
    Fri Apr 28 02:01:36 PM CEST 2023
    tcp        0      0 127.0.0.1:57316         127.0.0.1:8080          FIN_WAIT2   -
    tcp6       7      0 127.0.0.1:8080          127.0.0.1:57316         CLOSE_WAIT  231835/java
    
    Eventually the old client side connection (the one in FIN_WAIT2) will go away. But the server connection (the one in CLOSE_WAIT) will remain until the server is shutdown.

Again, when running WebFlux on Netty or Undertow, the connections always go away in all three cases.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Apr 28, 2023
@jhoeller jhoeller added the in: web Issues in web modules (web, webmvc, webflux, websocket) label Dec 29, 2023
@sdeleuze
Copy link
Contributor

That seems to be a pretty low level issue, have you identified anything that let you think this issue is on Spring side and not on Tomcat side?

@sdeleuze sdeleuze added the status: waiting-for-feedback We need additional information before we can continue label Jan 12, 2024
@philsttr
Copy link
Author

philsttr commented Jan 12, 2024

Just that Spring's AbstractListenerReadPublisher is cancelling the stream in the failure cases. But it does not cancel in the success case.

Specifically this

In the failing cases (1 and 3), the read publisher logs a cancel message, and I see the following log lines:

2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [37936546] cancel [READING]
2023-04-28T13:48:29.358+02:00 TRACE 227341 --- [nio-8080-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [37936546] READING -> COMPLETED

In the successful case (2), the read publisher does not log a cancel message. I think the cancelling is the underlying problem. It prevents the server from noticing that the client has closed the connection.

TBH, I'm not an expert at the interaction between Spring and Tomcat. I'd love if someone more familiar with it could take a look at the example project and the reproduction steps I provided.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels Jan 12, 2024
@rstoyanchev rstoyanchev self-assigned this Jan 15, 2024
@rstoyanchev
Copy link
Contributor

rstoyanchev commented Jan 15, 2024

This is related to the resuming and suspending of the Tomcat WebSocket session, and not reading the final close frame (I think) if the session is closed while in the suspended state.

The session is closed while handling an incoming message. Tomcat calls onClose and AbstractListenerWebSocketSession#handleClose completes the reading and the writing sides. The reading side actually defers transitioning to COMPLETED temporarily because it's still in the READING state (and on the same thread). However, as the zip operator is used to combine the reading and writing, when the writing processor completes, the zip operator then cancels reading too, and currently the handling of cancel transitions the read publisher to COMPLETED immediately. As a result when the callstack unwinds back to the reading loop, we just exit without transitioning to the DEMAND state before completing, and so the session remains suspended, and Tomcat never reads any more frames. By comparison, in the success scenario I see the following additional lines at the end:

[nio-8080-exec-2] o.a.t.websocket.server.WsFrameServer     : wsFrameServer.onDataAvailable
[nio-8080-exec-2] o.a.t.websocket.server.WsFrameServer     : Read [6] bytes into input buffer ready for processing
[nio-8080-exec-2] o.a.t.websocket.server.WsFrameServer     : WebSocket frame received. fin [true], rsv [0], OpCode [8], payload length [0]

I can make changes so that cancel handling is similar to completion handling when in the READING state where completion is temporarily delayed until we exit the reading loop. This was was originally done in #26834 but did not cover a cancel signal.

That said I am also wondering if Tomcat should be handling this differently, i.e. when the session is closed should the session be resumed to allow the closing to be completed. @markt-asf any thoughts on that?

@rstoyanchev rstoyanchev added type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged or decided on status: feedback-provided Feedback has been provided labels Jan 16, 2024
@rstoyanchev rstoyanchev added this to the 6.1.4 milestone Jan 16, 2024
@rstoyanchev rstoyanchev changed the title WebSocket server using WebFlux on Tomcat can leave connections in CLOSE_WAIT state Improve read publisher cancel handling to avoid connections in CLOSE_WAIT state with WebSocket on Tomcat Jan 16, 2024
rstoyanchev added a commit that referenced this issue Jan 17, 2024
@philsttr
Copy link
Author

Thanks as always @rstoyanchev !

@markt-asf
Copy link

Late to the party but I'll be looking at this today on the Tomcat side.

I was on the fence as to whether whatever called suspend() should be expected to call resume() if it is going to close the connection. It could be argued either way.

Regardless, it looks like Tomcat needs to (better) handle the case where the WebSocket session is closed by the server but the client confirmation of the close is never received. That is what I am planning on looking at.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: web Issues in web modules (web, webmvc, webflux, websocket) type: enhancement A general enhancement
Projects
None yet
Development

No branches or pull requests

6 participants