FR: Adding ping/pong redundancy #1414

MarlieChiller · 2023-10-30T13:09:04Z

I've been on a bit of a journey to get here (here's the origin).

I am writing a server application that receives client connections, some of which come from poor network environments. I have observed that the websocket ping/pong mechanism (heartbeat) is sometimes not fully reliable as an indicator of the status of the connection. Specifically, in poor network conditions a ping will sometimes go unanswered by the requisite pong, possibly because the client experiences packet loss and never receives the ping.

The consequence of missing solely the pong is that whilst the heartbeat has been missed, the rest of data exchange continues unabated until the ping timeout expires some seconds later, abruptly ending the exchange despite an otherwise healthy connection. Unfortunately, due to the requirements of my application, reconnection is prohibited and the process will need to initialise a new instance. This means that dropping a connection prematurely is costly.

Currently, the heartbeat logic in this library seems to offer no way to add mechanisms seen in other similar connection types, such as retry or backoff, whereby if a heartbeat is missed the server immediately tries again up to X times and closes the connection only if those attempts also fail.

Would adding such functionality to this library (disabled by default of course) violate other websocket protocols or assumed logic? It seems to me that adding this logic would solve a few edge use cases whilst not explicitly breaking existing uses and add some redundancy in these poor network scenarios.

I have read about other users running into similar problems, which they mitigate by adding their own custom heartbeat at the application layer. This seems wrong to me because it duplicates an existing fundamental pattern and creates the chance of a deviation between the protocol level and the application level (e.g. the application reporting that the connection is valid whilst the protocol reports that it has timed out).

Let me know your thoughts!

aaugustin · 2023-10-30T17:15:28Z

When you say:

> whilst the heartbeat has been missed, the rest of data exchange continues unabated

I assume that you mean that the data exchange from client to server continues unabated while a ping from server to client goes missing. Indeed, WebSockets runs over TCP, with strong ordering guarantees, making it impossible for the two effects that you describe to happen on the same half of the connection.

I suppose it's "client to server" and "server to client" in that order because the situation you're describing happens when:

- you run heartbeat from the server side (server sends pings and expects pongs)
- the client running in a browser loses connectivity but doesn't notice
- the client continues sending information happily for some time

In this scenario, even if the server noticed and closed the connection, the client still wouldn't notice, as we assume that connectivity is lost!

The only way to fix that is to run heartbeats in the client. As you may know, browsers do NOT expose an API for protocol-level ping/pongs. As a consequence, you have no choice but to do it at the application level. This is the real reason why many services have to design application-level heartbeats.
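For illustration only, here is a minimal sketch of what such an application-level heartbeat could look like, written in Python with plain `asyncio` so that it is independent of any particular WebSocket library. Everything in it is an assumption made for the example and not part of this library's API: the JSON message shape (`{"type": "ping", "seq": n}` answered by a `pong` with the same `seq`), the `send` callable, the `pongs` queue that the application's own receive loop is expected to fill, and the `HeartbeatTimeout` exception.

```python
import asyncio
import json
from typing import Awaitable, Callable


class HeartbeatTimeout(Exception):
    """Raised when the peer does not answer an application-level ping in time."""


async def app_heartbeat(
    send: Callable[[str], Awaitable[None]],  # hypothetical: sends one text message
    pongs: asyncio.Queue,                    # hypothetical: filled by the receive loop
    interval: float = 20.0,
    timeout: float = 20.0,
) -> None:
    """Send a JSON ping every `interval` seconds and wait for the matching pong."""
    loop = asyncio.get_running_loop()
    seq = 0
    while True:
        await asyncio.sleep(interval)
        seq += 1
        await send(json.dumps({"type": "ping", "seq": seq}))

        # One deadline per ping, so stale pongs cannot extend the wait.
        deadline = loop.time() + timeout
        while True:
            remaining = deadline - loop.time()
            if remaining <= 0:
                raise HeartbeatTimeout(f"no pong for ping {seq}")
            try:
                pong = await asyncio.wait_for(pongs.get(), remaining)
            except asyncio.TimeoutError:
                raise HeartbeatTimeout(f"no pong for ping {seq}") from None
            if pong.get("seq") == seq:
                break  # matching pong received; schedule the next ping
```

The side that must detect the loss runs this task next to its normal receive loop, which puts every decoded message of type `pong` into `pongs`. When `HeartbeatTimeout` is raised, that side knows the connection is unusable and can react, which is exactly what a browser client cannot learn from protocol-level pings. A browser client would implement the same idea in JavaScript.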

If I guessed your scenario accurately, there's nothing we can do to fix it on the server side.

If your scenario is different and there's a chance we can do something on the server side, please explain why, considering the guarantees of TCP, your proposed solution changes something. The way you formulated it sounds like you're proposing a shitty version of TCP because you don't realize that TCP already handles retries for you. If TCP doesn't succeed, we will not succeed :-)

MarlieChiller · 2023-10-30T19:32:57Z

Thanks for the comprehensive response. It sounds like you have diagnosed my issue pretty accurately. Indeed, I did not realise that TCP has retransmission logic built into it which removes the need for the feature. However, I am still not fully understanding how a server -> client ping can be dropped but client -> server data can continue to be streamed... It seems I need to do some reading into the fundamentals of the TCP protocol to fully understand the bidirectional nature of the connection.

It is sounding like either I need to trust the existing heartbeat mechanism to diagnose lost connections or implement application layer heartbeats. Given some of my clients are browser based, this might end up being the solution. Thanks for your input 👍

aaugustin · 2023-10-31T07:57:58Z

Here's a neat StackOverflow answer from 7 years ago that says exactly what I said above: https://stackoverflow.com/questions/35820885/why-do-many-websocket-libraries-implement-their-own-application-level-heartbeats

And here's a great blog post that goes into the details of a real life encounter with this issue: https://making.close.com/posts/reliable-websockets/ This blog post answers your question of "how do I create a half-broken TCP connection"? It has to do with the two-way closing handshake in TCP: if you break it at the wrong point, the connection can remain in a non-functional state and timeouts are very long.

I should add this to the discussion of heartbeats and/or the FAQ. Marking as a doc issue.

Kludex · 2023-10-31T08:07:52Z

As always, thanks @aaugustin . I learned, and I keep learning from your writings. 🙏

Sorry for the noise. 👍

MarlieChiller changed the title ~~Exposing Ping/Pong Frames for Application Layer Logic~~ FR: Adding ping/pong redundancy Oct 30, 2023

aaugustin added the documentation label Oct 31, 2023
