Partially cut off nats-server in a cluster of three nats-servers #4681
We captured the state of all cluster connections using the /routez endpoint, and you can see that one connection between 1 and 3 is missing.
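For reference, a sketch of how that state can be captured on each node, assuming the HTTP monitoring port is enabled on the default 8222 (the address is a placeholder, not the reporter's setup):

```sh
# Dump the route state of a server; 10.1.1.3:8222 is an example monitoring address.
curl -s http://10.1.1.3:8222/routez | \
  jq '.routes[] | {rid, remote_name, ip, port, uptime, idle, in_msgs, out_msgs}'
```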
Could you verify that the IP path between them was still functional? Could the CP have changed something out from underneath you?
The initial feedback from Azure was that there were no issues with those VMs during the timeframe in question, but you never know 🤷🏼‍♂️. Unfortunately we didn't explicitly check that they could ping each other (and the situation has since been resolved by restarting the nats-server). The weird thing is that it looks like they were able to communicate, since two connections still existed. But somehow they didn't exchange messages over those two connections and also didn't re-establish the failed connection. Another odd thing we noticed is that the age of the connections differed. The inactive connections were more than 4 days old; the active connections, however, were only between 2 and 3 days old:
{
"server_id": "NDEAFX56442IOYDGRYKCMZIU3MJGGFLRYBNJ2OPCBELA7X2YKVQWP3LZ",
"server_name": "",
"now": "2023-10-17T14:04:55.737670507Z",
"num_routes": 5,
"routes": [
{
"rid": 3,
"remote_id": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
"remote_name": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
"did_solicit": true,
"is_configured": true,
"ip": "10.2.1.3",
"port": 4225,
"start": "2023-10-13T05:25:37.278749173Z",
"last_activity": "2023-10-13T05:25:37.288734699Z",
"rtt": "447µs",
"uptime": "4d8h39m18s",
"idle": "4d8h39m18s",
"pending_size": 0,
"in_msgs": 0,
"out_msgs": 0,
"in_bytes": 0,
"out_bytes": 0,
"subscriptions": 0,
"compression": "off"
},
{
"rid": 7,
"remote_id": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
"remote_name": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
"did_solicit": true,
"is_configured": true,
"ip": "10.2.1.3",
"port": 4225,
"start": "2023-10-13T05:25:37.372306553Z",
"last_activity": "2023-10-13T05:25:37.382811386Z",
"rtt": "414µs",
"uptime": "4d8h39m18s",
"idle": "4d8h39m18s",
"pending_size": 0,
"in_msgs": 0,
"out_msgs": 0,
"in_bytes": 0,
"out_bytes": 0,
"subscriptions": 0,
"compression": "off"
},
{
"rid": 29680,
"remote_id": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
"remote_name": "ND6X72FNHI37RSU4SM6NRQVCFSRUFES3B4AIUEIM7Q7E4MOCJKGWDAO2",
"did_solicit": true,
"is_configured": true,
"ip": "10.2.1.3",
"port": 4225,
"start": "2023-10-14T22:41:56.893090841Z",
"last_activity": "2023-10-17T14:04:54.480116381Z",
"rtt": "483µs",
"uptime": "2d15h22m58s",
"idle": "1s",
"pending_size": 0,
"in_msgs": 4431612,
"out_msgs": 3634,
"in_bytes": 3344004157,
"out_bytes": 2540714,
"subscriptions": 5,
"compression": "off"
},
{
"rid": 4,
"remote_id": "NDAJBOUSAYDWGDAKOOAG2DVV2PAJXC4CQK4VGLRG75U3RV2ZLFPEGDLX",
"remote_name": "NDAJBOUSAYDWGDAKOOAG2DVV2PAJXC4CQK4VGLRG75U3RV2ZLFPEGDLX",
"did_solicit": false,
"is_configured": false,
"ip": "10.1.1.3",
"port": 58410,
"start": "2023-10-13T05:25:37.293102354Z",
"last_activity": "2023-10-13T05:25:37.3046772Z",
"rtt": "1.481113ms",
"uptime": "4d8h39m18s",
"idle": "4d8h39m18s",
"pending_size": 0,
"in_msgs": 0,
"out_msgs": 0,
"in_bytes": 0,
"out_bytes": 0,
"subscriptions": 0,
"compression": "off"
},
{
"rid": 5,
"remote_id": "NDAJBOUSAYDWGDAKOOAG2DVV2PAJXC4CQK4VGLRG75U3RV2ZLFPEGDLX",
"remote_name": "NDAJBOUSAYDWGDAKOOAG2DVV2PAJXC4CQK4VGLRG75U3RV2ZLFPEGDLX",
"did_solicit": true,
"is_configured": true,
"ip": "10.1.1.3",
"port": 4225,
"start": "2023-10-13T05:25:37.317184158Z",
"last_activity": "2023-10-13T05:25:37.33085953Z",
"rtt": "1.744615ms",
"uptime": "4d8h39m18s",
"idle": "4d8h39m18s",
"pending_size": 0,
"in_msgs": 0,
"out_msgs": 0,
"in_bytes": 0,
"out_bytes": 0,
"subscriptions": 0,
"compression": "off"
}
]
}
We are trying to reproduce the issue in a local environment in the meantime.
Also, it might be good to upgrade to the latest 2.10.3.
I'm seeing the same with 2.10.3, see https://natsio.slack.com/archives/C069GSYFP/p1698121874424809
I have noticed that some of the route connections are shown as neither "solicited" nor "configured". Assuming that your servers point to each other, I would have expected that not to be the case, but I found out that it was. I will have a PR shortly to try to address that. However, it should not matter if one side was not marked as solicited (nor explicit), because when the connection broke, the server that DID have the connection marked as solicited should have retried. So I suspect that, at least in the case of the OP, there was some connectivity issue that prevented the connection from being re-established.
…#4701) When a server accepts a route, that route would not be marked as solicited or configured, even though it should be if it was present in the configuration. Prior to pooling, if server A pointed to server B and vice versa, one of the servers would have the route as solicited, and the server that accepted the route would have upgraded it to solicited. With the pooling code, this was not always happening. Maybe related to #4681. Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
I've tried to reproduce the issue by tcpkill-ing various connections between NATS servers, either from the soliciting or non-soliciting side. The connections are re-established immediately and I haven't found any missing messages so far. It seems the network issue we faced was not simply disrupted connections via RST but rather dropped packets on the network. My current theory, @derekcollison please correct me if I am wrong:
The question I'm currently stuck with: Why did the cluster not auto-heal? We could only fix the issue by restarting the node.
@domdom82 Here are some answers:
No. Instead, each connection in the pool handles a set of accounts. Say accounts "A", "C" and "D" may be handled by connection 1, accounts "B", "E" and "G" by connection 2 and "F", "H" and "J" by connection 3. If there is no traffic on any accounts handled by a given connection, then this connection will show as idle with no change in in/out msgs.
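For context, a minimal sketch of a 2.10 cluster block with the pooling setting spelled out, assuming the default pool size of 3 (names and addresses are placeholders, not the reporter's configuration):

```conf
# Example cluster block; with pool_size 3, each peer is expected to have three
# route connections, which is why /routez above lists multiple routes per remote_id.
cluster {
  name: my-cluster
  listen: 0.0.0.0:4225
  pool_size: 3
  routes: [
    nats-route://10.1.1.3:4225
    nats-route://10.2.1.3:4225
  ]
}
```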
See comment above regarding data. Still, an idle connection should have heartbeats. Running the server with
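As a hedged sketch, assuming the option referred to above is the server's debug and trace logging, this is one way to surface route PING/PONG traffic in the logs:

```sh
# -D enables debug logging, -V enables protocol tracing (PING/PONG, route traffic);
# the config path is a placeholder.
nats-server -c /etc/nats/nats.conf -DV
```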
I would think so, and there would probably be errors in the logs?
Of that I am not sure. Maybe the server resolved the hostname to an IP that was later changed, and the reconnect was not happening due to that? A last note: if the user does not need the extra route connections, or if it is suspected that the issue is due to pooling, an option is to disable pooling by setting
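A sketch of what that could look like, assuming the setting meant here is pool_size in the cluster block (a value of -1 reverts to a single route connection per peer, as before 2.10):

```conf
# Example cluster block with route pooling disabled; names and addresses are placeholders.
cluster {
  name: my-cluster
  listen: 0.0.0.0:4225
  pool_size: -1                       # -1 disables route pooling
  routes: [
    nats-route://10.1.1.3:4225
    nats-route://10.2.1.3:4225
  ]
}
```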
@kozlovic Thanks for clarifying the connection pool semantics. Since we don't have accounts configured on the cluster, we will turn off the pooling for now, just to rule this out as a possibility and to save some overhead from dealing with idle connections. Unfortunately, the logs of the event are gone, so we won't know for sure. I could try reproducing by adding an iptables rule that drops packets between two nodes, though it would probably still not be exactly the issue we saw.
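A sketch of the kind of packet-drop rule mentioned here, using the cluster port 4225 from the /routez output and an example peer address:

```sh
# On node 1, silently drop incoming cluster traffic from node 3 (10.2.1.3 is an example address).
iptables -A INPUT -p tcp -s 10.2.1.3 --dport 4225 -j DROP

# Remove the rule again after the test.
iptables -D INPUT -p tcp -s 10.2.1.3 --dport 4225 -j DROP
```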
@domdom82 I have verified (with the main branch but also v2.10.3) that PINGs are properly sent on all pooled route connections. I have simulated a server not responding to a PING (with a PONG) for one of those pooled connections, and the route is properly detected as "stale" and closed. Even when, in v2.10.3, the detection occurs on the side that does not have the connection marked as "solicited", the other side gets the disconnect and initiates a reconnect. So "auto healing" occurs in that case too.
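For reference, the stale detection described here is driven by the server's ping settings; a sketch with what I believe are the documented defaults written out explicitly (not the reporter's configuration):

```conf
# Heartbeat settings governing stale-connection detection (defaults shown).
ping_interval: "2m"   # how often a PING is sent on idle connections
ping_max: 2           # unanswered PINGs before the connection is closed as stale
```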
Observed behavior
We observed that one nats-server in a cluster of three was no longer receiving / sending all messages. This led to messages from producers connected to that nats-server being lost.
Our messages are sent at intervals of 20s. We could see that instance 1 was receiving messages, which it then relayed only to instance 2. Subscribers on instance 3 did not receive the messages:
1 <-> 2 ✅
1 <-> 3 ❌
2 <-> 3 ✅
We checked this via the /routez endpoint.
We are not yet sure what caused this issue, but we upgraded to 2.10.x about a day before we started seeing it and never had issues like this before. Could it be related to the new connection pooling feature introduced with #4001?
The issue occurred on Azure, and we are aware that there was some service degradation in that timeframe, but it is not yet clear whether this issue is related to it.
Expected behavior
All messages are reliably delivered from all producers to all consumers.
If a connection breaks, NATS notices and falls back to another connection or re-establishes the broken one.
Server and client version
nats-server v2.10.1
nats.go multiple
Host environment
Provider: Microsoft Azure
Host OS: Ubuntu 22.04
Architecture: amd64
Deployed via bosh, compiled like this:
Steps to reproduce
Unfortunately we are not able to reproduce the issue.