Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disconnection of a NATS server in a cluster is not detected by ping-pong mechanism #4014

Closed
sandykellagher opened this issue Apr 3, 2023 · 1 comment

Comments

@sandykellagher
Copy link
Contributor

Defect
This is more or less a follow on to #3682

We are using three NATS servers in a cluster, and see a bug which prevents the loss of a particular server being detected by the other two servers. This matters because we use queue group subscriptions: when a server goes offline, we need the queue group subscription for services on the dead server to be removed. Otherwise, NATS can end up selecting entities on the dead server as the nominated responder for pub messages to the queue group.

In our system, there is continual traffic exchanged between all three servers during normal operation. And in short, this continuous traffic results in all outbound Pings from one server to another being held off/delayed indefinitely. And because the Pings are held off, the servers don't detect a stale connection, and hence doesn't close the connection and attempt to reconnect.

I believe I understand the issue.

In NATS server client.go:: processPingTimer() there is a check to test whether to delay sending an outgoing ping, in two cases:

  • we recently (within specified pingInterval) received traffic (a message or sub/unsub) from the remote end
  • we recently received a ping from the remote end (Remote ping)

This makes perfect sense: incoming receive messages mean we still have a link and don't need to send a ping.

However, the first test above is derived from the client.last field ("last packet" time) which is set for both incoming and outgoing traffic on this link.

I think the solution is to store a separate client.ping.lastRxMsg field which is updated for receive traffic only, and then use that field in the test of whether to hold off sending a ping.

Also, with this change, I think the same logic can be applied to all connection types (CLIENT, ROUTER, GATEWAY, LEAF). So we can remove the tests to always send ping for a GATEWAY.

OS/Container environment
Linux ARM64 - but not a factor

Steps or code to reproduce the issue
Configure 3 server cluster, and then break the connection to one of the servers

Expected result
The remaining servers should automatically detect loss of connection, with the detection time as configured by ping_interval and ping_max parameters.

Actual result
The remaining servers do eventually detect loss of connection via a WriteBuffer deadline being exceeded, but it takes a very long time (presumably) depending on the volume of traffic and the size of the buffer.

@sandykellagher
Copy link
Contributor Author

I've closed PR #4015 in favour of #4016

derekcollison added a commit that referenced this issue Apr 6, 2023
…connections (#4016)

Alternative fix for issue
#4014: where we always send
pings for ROUTER, GATEWAY and LEAF spoke connections.

This is my own original work that I license to the project

[ x] Link to issue, e.g. Resolves #NNN
 Documentation added (if applicable)
 Tests added
[ x] Branch rebased on top of current main (git pull --rebase origin
main)
[ x] Changes squashed to a single commit (described
[here](http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html))
[x] Build is green in Travis CI
[ x] You have certified that the contribution is your original work and
that you license the work to the project under the [Apache 2
license](https://github.com/nats-io/nats-server/blob/main/LICENSE)
Resolves #

Changes proposed in this pull request:
- in processPingTimer, add explicit test for ROUTER connection, and only
hold off sending ping based on receive ping
@bruth bruth removed the 🐞 bug label Aug 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants