
Partially cut off nats-server in a cluster of three nats-servers #4681

Closed
maxmoehl opened this issue Oct 19, 2023 · 10 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@maxmoehl

Observed behavior

We observed that one nats-server in a cluster of three was no longer receiving or sending all messages. As a result, messages from producers connected to that nats-server were lost.

Our messages are sent at intervals of 20s. We could see that instance 1 was receiving messages, which it then relayed only to instance 2; subscribers on instance 3 did not receive them:

1 <-> 2 ✅
1 <-> 3 ❌
2 <-> 3 ✅

We checked this via the /routez endpoint:

$ curl -i 10.5.1.3:8223/routez # this is instance 3
[...]
    { # instance 2, other connections were inactive
      "ip": "10.2.1.3",
      "port": 4225,
      "in_msgs": 4099266,
      "out_msgs": 1269,
      "in_bytes": 3096262046,
      "out_bytes": 893231,
      "subscriptions": 5,
      "compression": "off"
    }
[...]
    { # instance 1, all connections inactive
      "ip": "10.1.1.3",
      "port": 58410,
      "start": "2023-10-13T05:25:37.293102354Z",
      "last_activity": "2023-10-13T05:25:37.3046772Z",
      "rtt": "1.560618ms",
      "uptime": "4d4h16m14s",
      "idle": "4d4h16m14s",
      "pending_size": 0,
      "in_msgs": 0,
      "out_msgs": 0,
      "in_bytes": 0,
      "out_bytes": 0,
      "subscriptions": 0,
      "compression": "off"
    },
[...]
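
A shorter way to pull just the per-route activity fields out of /routez (assuming jq is available; the field names are the ones shown in the output above) would be something like:

$ curl -s 10.5.1.3:8223/routez | jq '.routes[] | {ip, port, idle, in_msgs, out_msgs, subscriptions}'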

We are not yet sure what caused this, but we upgraded to 2.10.x about a day before we started seeing it, and we have never had issues like this before. Could it be related to the new connection pooling feature introduced in #4001?

The issue occurred on Azure, and we are aware that there was some service degradation in that timeframe, but it is not yet clear whether this issue is related to it.

Expected behavior

All messages are reliably delivered from all producers to all consumers.

If a connection breaks, NATS notices and either falls back to another connection or re-establishes the broken one.

Server and client version

nats-server v2.10.1
nats.go multiple

Host environment

Provider: Microsoft Azure
Host OS: Ubuntu 22.04
Architecture: amd64
Deployed via BOSH, compiled like this:

go build -o "${GOBIN}/nats-server" github.com/nats-io/nats-server/v2

Steps to reproduce

Unfortunately we are not able to reproduce the issue.

@maxmoehl maxmoehl added the defect Suspected defect such as a bug or regression label Oct 19, 2023
@maxmoehl
Author

We captured the state of all cluster connections using lsof -ni4TCP:4225:

# instance 1
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
nats-serv 10536 vcap    9u  IPv4  52734      0t0  TCP 10.1.1.3:4225 (LISTEN)
nats-serv 10536 vcap   21u  IPv4  60685      0t0  TCP 10.1.1.3:4225->10.2.1.3:58844 (ESTABLISHED)
nats-serv 10536 vcap   26u  IPv4  61473      0t0  TCP 10.1.1.3:4225->10.2.1.3:58860 (ESTABLISHED)
nats-serv 10536 vcap   27u  IPv4  60341      0t0  TCP 10.1.1.3:58278->10.2.1.3:4225 (ESTABLISHED)
nats-serv 10536 vcap   41u  IPv4  69436      0t0  TCP 10.1.1.3:58410->10.5.1.3:4225 (ESTABLISHED)
nats-serv 10536 vcap  121u  IPv4  70481      0t0  TCP 10.1.1.3:4225->10.5.1.3:34648 (ESTABLISHED)
# instance 2
COMMAND     PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
nats-serv 10364 vcap    8u  IPv4   52091      0t0  TCP 10.2.1.3:58844->10.1.1.3:4225 (ESTABLISHED)
nats-serv 10364 vcap    9u  IPv4   52080      0t0  TCP 10.2.1.3:4225 (LISTEN)
nats-serv 10364 vcap   14u  IPv4   52417      0t0  TCP 10.2.1.3:4225->10.1.1.3:58278 (ESTABLISHED)
nats-serv 10364 vcap   15u  IPv4   53518      0t0  TCP 10.2.1.3:58860->10.1.1.3:4225 (ESTABLISHED)
nats-serv 10364 vcap   17u  IPv4   61841      0t0  TCP 10.2.1.3:4225->10.5.1.3:36138 (ESTABLISHED)
nats-serv 10364 vcap   18u  IPv4   63850      0t0  TCP 10.2.1.3:4225->10.5.1.3:36146 (ESTABLISHED)
nats-serv 10364 vcap   54u  IPv4 2597256      0t0  TCP 10.2.1.3:4225->10.5.1.3:39910 (ESTABLISHED)
# instance 3
COMMAND     PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
nats-serv 10529 vcap    8u  IPv4   51968      0t0  TCP 10.5.1.3:36138->10.2.1.3:4225 (ESTABLISHED)
nats-serv 10529 vcap    9u  IPv4   49887      0t0  TCP 10.5.1.3:4225 (LISTEN)
nats-serv 10529 vcap   10u  IPv4   51969      0t0  TCP 10.5.1.3:4225->10.1.1.3:58410 (ESTABLISHED)
nats-serv 10529 vcap   11u  IPv4   49121      0t0  TCP 10.5.1.3:34648->10.1.1.3:4225 (ESTABLISHED)
nats-serv 10529 vcap   13u  IPv4   50915      0t0  TCP 10.5.1.3:36146->10.2.1.3:4225 (ESTABLISHED)
nats-serv 10529 vcap   14u  IPv4 2583749      0t0  TCP 10.5.1.3:39910->10.2.1.3:4225 (ESTABLISHED)

You can see that one route connection between instance 1 and instance 3 is missing.
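
A rough way to count the established route connections per peer from that lsof output (a sketch that assumes the column layout shown above) is:

$ lsof -ni4TCP:4225 | awk '/ESTABLISHED/ {gsub(/:[0-9]+/, "", $9); print $9}' | sort | uniq -c

Run on instance 1, this shows three established routes to 10.2.1.3 but only two to 10.5.1.3.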

@maxmoehl maxmoehl changed the title Partially cut of nats-server in a cluster of three nats-servers Partially cut off nats-server in a cluster of three nats-servers Oct 19, 2023
@derekcollison
Member

Could you verify that the IP path between them was still functional? Could the CP have changed something out from underneath you?

@maxmoehl
Author

The initial feedback from Azure was that there were no issues with those VMs during the timeframe in question, but you never know 🤷🏼‍♂️. Unfortunately, we didn't explicitly check that they could ping each other (and the situation has since been resolved by restarting the nats-server).

The weird thing is that it looks like they were able to communicate, since two connections still existed. But somehow they didn't exchange messages over those two connections and also didn't re-establish the failed connection. Another odd thing we noticed is that the age of the connections differed: the inactive connections were more than 4 days old, while the active connections were only between 2 and 3 days old:

{
  "server_id": "NDEAFX56442IOYDGRYKCMZIU3MJGGFLRYBNJ2OPCBELA7X2YKVQWP3LZ",
  "server_name": "",
  "now": "2023-10-17T14:04:55.737670507Z",
  "num_routes": 5,
  "routes": [
    {
      "rid": 3,
      "remote_id": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
      "remote_name": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
      "did_solicit": true,
      "is_configured": true,
      "ip": "10.2.1.3",
      "port": 4225,
      "start": "2023-10-13T05:25:37.278749173Z",
      "last_activity": "2023-10-13T05:25:37.288734699Z",
      "rtt": "447µs",
      "uptime": "4d8h39m18s",
      "idle": "4d8h39m18s",
      "pending_size": 0,
      "in_msgs": 0,
      "out_msgs": 0,
      "in_bytes": 0,
      "out_bytes": 0,
      "subscriptions": 0,
      "compression": "off"
    },
    {
      "rid": 7,
      "remote_id": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
      "remote_name": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
      "did_solicit": true,
      "is_configured": true,
      "ip": "10.2.1.3",
      "port": 4225,
      "start": "2023-10-13T05:25:37.372306553Z",
      "last_activity": "2023-10-13T05:25:37.382811386Z",
      "rtt": "414µs",
      "uptime": "4d8h39m18s",
      "idle": "4d8h39m18s",
      "pending_size": 0,
      "in_msgs": 0,
      "out_msgs": 0,
      "in_bytes": 0,
      "out_bytes": 0,
      "subscriptions": 0,
      "compression": "off"
    },
    {
      "rid": 29680,
      "remote_id": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
      "remote_name": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
      "did_solicit": true,
      "is_configured": true,
      "ip": "10.2.1.3",
      "port": 4225,
      "start": "2023-10-14T22:41:56.893090841Z",
      "last_activity": "2023-10-17T14:04:54.480116381Z",
      "rtt": "483µs",
      "uptime": "2d15h22m58s",
      "idle": "1s",
      "pending_size": 0,
      "in_msgs": 4431612,
      "out_msgs": 3634,
      "in_bytes": 3344004157,
      "out_bytes": 2540714,
      "subscriptions": 5,
      "compression": "off"
    },
    {
      "rid": 4,
      "remote_id": "NDAJBOUSAYDWGDAKOOAG2DVV2PAJXC4CQK4VGLRG75U3RV2ZLFPEGDLX",
      "remote_name": "NDAJBOUSAYDWGDAKOOAG2DVV2PAJXC4CQK4VGLRG75U3RV2ZLFPEGDLX",
      "did_solicit": false,
      "is_configured": false,
      "ip": "10.1.1.3",
      "port": 58410,
      "start": "2023-10-13T05:25:37.293102354Z",
      "last_activity": "2023-10-13T05:25:37.3046772Z",
      "rtt": "1.481113ms",
      "uptime": "4d8h39m18s",
      "idle": "4d8h39m18s",
      "pending_size": 0,
      "in_msgs": 0,
      "out_msgs": 0,
      "in_bytes": 0,
      "out_bytes": 0,
      "subscriptions": 0,
      "compression": "off"
    },
    {
      "rid": 5,
      "remote_id": "NDAJBOUSAYDWGDAKOOAG2DVV2PAJXC4CQK4VGLRG75U3RV2ZLFPEGDLX",
      "remote_name": "NDAJBOUSAYDWGDAKOOAG2DVV2PAJXC4CQK4VGLRG75U3RV2ZLFPEGDLX",
      "did_solicit": true,
      "is_configured": true,
      "ip": "10.1.1.3",
      "port": 4225,
      "start": "2023-10-13T05:25:37.317184158Z",
      "last_activity": "2023-10-13T05:25:37.33085953Z",
      "rtt": "1.744615ms",
      "uptime": "4d8h39m18s",
      "idle": "4d8h39m18s",
      "pending_size": 0,
      "in_msgs": 0,
      "out_msgs": 0,
      "in_bytes": 0,
      "out_bytes": 0,
      "subscriptions": 0,
      "compression": "off"
    }
  ]
}

We are trying to reproduce the issue in a local environment in the meantime.

@derekcollison
Member

It might also be good to upgrade to the latest release, 2.10.3.

@lionello

I'm seeing the same with 2.10.3, see https://natsio.slack.com/archives/C069GSYFP/p1698121874424809

@kozlovic
Member

I have noticed that some of the route connections are shown as neither "solicited" nor "configured". Assuming that your servers point to each other, I would not have expected that to be the case, but I found out that it was. I will have a PR shortly to try to address that. However, it should not matter if one side was not marked as solicited (nor explicit): when the connection broke, the server that DID have the connection marked as solicited should have retried. So I suspect that - at least in the case of the OP - there was some connectivity issue that prevented the connection from being re-established.

derekcollison added a commit that referenced this issue Oct 25, 2023
…#4701)

When a server accepts a route, that route would not be marked as
solicited or configured, even though it should be if it was present in the
configuration.

Prior to pooling, if server A pointed to server B and vice versa, one of
the servers would have the route marked as solicited, and the server that
accepted the route would have upgraded it to solicited as well. With the
pooling code, this was not always happening.

Maybe related to #4681 

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@domdom82

I've tried to reproduce the issue by tcpkill-ing various connections between NATS servers, either from the soliciting or non-soliciting side. The connections are re-established immediately and I haven't found any missing messages so far.
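
For readers who want to try the same thing, tcpkill (from dsniff) can reset matching connections; the interface name below is a placeholder and the address is taken from the outputs above:

$ sudo tcpkill -i eth0 host 10.5.1.3 and port 4225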

It seems the network issue we faced was not simply disrupted connections via RST but rather dropped packets on the network.

My current theory, @derekcollison please correct me if I am wrong:
Looking at the initial picture from @maxmoehl:

# instance 1
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
nats-serv 10536 vcap    9u  IPv4  52734      0t0  TCP 10.1.1.3:4225 (LISTEN)
nats-serv 10536 vcap   21u  IPv4  60685      0t0  TCP 10.1.1.3:4225->10.2.1.3:58844 (ESTABLISHED)
nats-serv 10536 vcap   26u  IPv4  61473      0t0  TCP 10.1.1.3:4225->10.2.1.3:58860 (ESTABLISHED)
nats-serv 10536 vcap   27u  IPv4  60341      0t0  TCP 10.1.1.3:58278->10.2.1.3:4225 (ESTABLISHED)
nats-serv 10536 vcap   41u  IPv4  69436      0t0  TCP 10.1.1.3:58410->10.5.1.3:4225 (ESTABLISHED)
nats-serv 10536 vcap  121u  IPv4  70481      0t0  TCP 10.1.1.3:4225->10.5.1.3:34648 (ESTABLISHED)
  • There is one connection missing between 10.1.1.3:xxxxx and 10.5.1.3:4225.
  • From what I've understood of the conn pool, unless there is head-of-line blocking going on, only one of the three TCP connections will actually be used to relay messages. The other two are idling around.
  • Does an idle connection send any data (heartbeats etc.)? If not, it could well be that the two remaining idle connections were actually dead and nobody noticed, because nothing was sent on them.
  • The server probably tried to re-establish the one "real" connection that got dropped, but could not because of the network issues.
  • This led to a temporary disconnect of the node 10.1.1.3, and subscribers on that node did not get messages from other nodes.

The question I'm currently stuck with: Why did the cluster not auto-heal? We could only fix the issue by restarting the node.

@kozlovic
Member

@domdom82 Here are some answers:

From what I've understood of the conn pool, unless there is head-of-line blocking going on, only one of the three TCP connections will actually be used to relay messages. The other two are idling around.

No. Instead, each connection in the pool handles a set of accounts. Say accounts "A", "C" and "D" may be handled by connection 1, accounts "B", "E" and "G" by connection 2 and "F", "H" and "J" by connection 3. If there is no traffic on any accounts handled by a given connection, then this connection will show as idle with no change in in/out msgs.

Does an idle connection send any data (heartbeats etc.)? If not, it could well be that the two remaining idle connections were actually dead and nobody noticed, because nothing was sent on them.

See comment above regarding data. Still, an idle connection should have heartbeats. Running the server with -D should show that there is a ping timer event, even for those idle connections. I will test later today and make sure that this is the case. If it is, then a stale connection (absence of a PONG response from the peer to a PING request) should be disconnected. As I described in a previous comment, even if a server had a connection from its peer marked as "not solicited" and therefore would not try to reconnect, the other side should have it marked as "solicited" and therefore should try to reconnect. Hence my suspicion that there was really something wrong network-wise.
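
For example, one way to watch for those ping events would be to run the server with debug logging and filter the output (the config path is a placeholder, and the exact log wording may differ between versions):

$ nats-server -c nats.conf -D 2>&1 | grep -i ping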

The one "real" connection that got dropped was probably tried to re-establish but could not because of the network issues.

I would think so, and there would probably be errors in the logs?

The question I'm currently stuck with: Why did the cluster not auto-heal? We could only fix the issue by restarting the node.

That I am not sure about. Maybe the server resolved the hostname to an IP that later changed, and the reconnect was not happening because of that?

A last note: if the user does not need the extra route connections, or if it is suspected that the issue is due to pooling, an option is to disable pooling by setting pool_size: -1 in the cluster{} block. This should cause the server to behave like a pre-v2.10.0 server with regard to route handling.
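
As a sketch, that would look like this in the server configuration (only the pool_size line is new; the rest of the cluster block stays as it is):

cluster {
  # existing settings (name, listen, routes, ...) unchanged
  pool_size: -1   # disable route connection pooling
}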

@domdom82

@kozlovic Thanks for clarifying the connection pool semantics. Since we don't have accounts configured on the cluster, we will turn off pooling for now, just to rule this out as a possibility and to save some overhead from dealing with idle connections.

Unfortunately, the logs of the event are gone, so we won't know for sure. I could try reproducing by adding an iptables rule that drops packets between two nodes, though it would probably still not be exactly the issue we saw.
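
For example, something along these lines on one of the nodes would silently blackhole traffic to and from a peer without sending RSTs (addresses are from the outputs above; remember to delete the rules afterwards):

$ sudo iptables -A INPUT -s 10.1.1.3 -j DROP
$ sudo iptables -A OUTPUT -d 10.1.1.3 -j DROP
# and to remove them again:
$ sudo iptables -D INPUT -s 10.1.1.3 -j DROP
$ sudo iptables -D OUTPUT -d 10.1.1.3 -j DROP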

@kozlovic
Member

@domdom82 I have verified (with the main branch but also v2.10.3) that PINGs are properly sent on all pooled route connections. I have simulated a server not responding to a PING (with a PONG) on one of those pooled connections, and the route is properly detected as "stale" and closed. Even when, in v2.10.3, the detection occurs on the side that does not have the connection marked as "solicited", the other side gets the disconnect and initiates a reconnect. So "auto healing" occurs in that case too.
I am going to close this issue for now, but feel free to re-open if you have a reproducible test case. Thank you!
