
Client connections load balancing among nats cluster nodes #1556

Closed
pananton opened this issue Aug 12, 2020 · 12 comments

pananton commented Aug 12, 2020

Hello! Currently the mechanism for load balancing client connections to a NATS cluster is based on random node selection. The problem (already discussed in some other issues, like #1359) is what happens when we restart the NATS cluster. To keep serving clients we restart the cluster one node after another, and this leads to approximately the following distribution (step-by-step example for a 3-node cluster):

  1. original connection distribution is 24 / 24 / 24
  2. after node1 restart : 0 / 36 / 36
  3. after node2 restart: 18 / 0 / 54
  4. after node3 restart: 45 / 27 / 0

As you can see, the original uniform distribution of 1/3 | 1/3 | 1/3 tends toward 2/3 | 1/3 | 0. I suggest adding some modifications that would allow developers to work around this problem (for example, we run our NATS cluster 24/7 and never stop it completely, but sometimes need to upgrade the server version or our hardware). There could be several solutions:


  1. Implement clever load balancing (where a NATS cluster node tells the client to try to connect to another known node which has fewer active clients)
  2. Provide a command which can be sent to the cluster to make it disconnect existing clients - we could use it after a cluster restart is finished to redistribute client connections among nodes.
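
The drift in the numbered example above can be reproduced with a small deterministic sketch, assuming each restarted node's clients split perfectly evenly among the remaining nodes (all names here are illustrative, not any NATS API):

```go
package main

import "fmt"

// rollingRestart models the drift described above: when a node is
// restarted, its clients reconnect and (assuming ideal randomness)
// split evenly among the remaining nodes. Returns the distribution
// after each restart.
func rollingRestart(conns []int) [][]int {
	var steps [][]int
	n := len(conns)
	for i := 0; i < n; i++ {
		moved := conns[i]
		conns[i] = 0
		for j := 0; j < n; j++ {
			if j != i {
				conns[j] += moved / (n - 1)
			}
		}
		steps = append(steps, append([]int(nil), conns...))
	}
	return steps
}

func main() {
	// Three nodes, 24 clients each, restarted one after another.
	for _, s := range rollingRestart([]int{24, 24, 24}) {
		fmt.Println(s)
	}
	// Prints: [0 36 36], [18 0 54], [45 27 0]
}
```

Each rolling restart pushes more weight onto the last node restarted, which is exactly the 45 / 27 / 0 endpoint of the example.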
@derekcollison
Member

We have been aware of this issue and have been exploring ideas here for some time, but nothing has come up that we really love yet as a solution. It is on our list though for sure.

derekcollison self-assigned this Aug 12, 2020
@pananton
Author

Ok, thanks. Btw, thank you for adding a feature to the NATS server that will immediately finish a request when no replier is listening. Waiting for the 2.2 release)

@BradErz

BradErz commented Feb 1, 2021

> We have been aware of this issue and have been exploring ideas here for some time, but nothing has come up that we really love yet as a solution. It is on our list though for sure.

@derekcollison has any progress or solution been found towards this?

We are struggling with the same situation, and I don't want to do hacky things on top of the NATS client to solve this 😂. Any suggestions?

@derekcollison
Member

Post 2.2 release we will be focusing on some client upgrades, this being one of them. But we need to get 2.2 out the door first.

@domdom82

Bump. Is there any movement on this? We're facing the exact same problem. So far we have only thought of building our NATS clients in a way that they reconnect after some time, to give them a chance to pick another NATS server and improve the distribution of client connections.

@derekcollison
Member

Apologies for the delays here; it has been a very interesting few months for us, in a positive way.

So prior to any intelligent client work, I took a look at this today during a break. I wanted to see if there might be a way, possible today, to balance a cluster after an upgrade. I believe there is, and I will test this myself at some point in the next few days, but here is the thought.

For this experiment let's assume 10 connections on each of 3 servers: A, B, and C. Also assume accurate randomness etc., which we know is not the case, but it will help illustrate what I think is possible.

START
A:10, B:10, C:10

POST UPGRADE A
A:0, B:15, C:15

POST UPGRADE B
A:8, B:0, C:22

POST UPGRADE C
A:19, B:11, C:0

Servers all support a setting for max connections, max_connections, in the config. And this value can be updated and the server reloaded without a server restart. If the new max_connections is lower than the currently connected count, clients will be randomly chosen to be disconnected.

So again, assuming accurate and true randomness of the next server picked, if we set A to max_connections: 10:

POST LIMIT:10 to A
A:10, B:15, C:5

POST LIMIT:10 to B (keep A's limit in place, but none on C)
A:10, B:10, C:10

Release limits on A and B.

Again, this was me doodling during a lunch break, but I did verify code-wise that max_connections can be set and reloaded without a server restart, and that current overages will be disconnected randomly.

So I think this would work today.
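
The doodle above can be checked with a small deterministic model. This only sketches the behavior described (excess clients above a lowered cap reconnect to servers that still have headroom, splitting roughly evenly); `applyCap` and `noCap` are hypothetical names for illustration, not a NATS API:

```go
package main

import (
	"fmt"
	"sort"
)

// noCap stands in for "no max_connections limit set".
const noCap = 1 << 30

// applyCap models lowering max_connections on one server: the excess
// clients are disconnected and, assuming they pick randomly among the
// servers that still have headroom, split roughly evenly across them,
// with the remainder going to the least-loaded servers first.
func applyCap(conns, caps []int, target int) []int {
	out := append([]int(nil), conns...)
	excess := out[target] - caps[target]
	if excess <= 0 {
		return out
	}
	out[target] = caps[target]

	// Servers that can still accept connections.
	var eligible []int
	for j := range out {
		if j != target && out[j] < caps[j] {
			eligible = append(eligible, j)
		}
	}
	if len(eligible) == 0 {
		return out // nowhere to go; those clients would be dropped
	}
	// Least-loaded servers receive the remainder first.
	sort.Slice(eligible, func(a, b int) bool {
		return out[eligible[a]] < out[eligible[b]]
	})
	for i, j := range eligible {
		share := excess / len(eligible)
		if i < excess%len(eligible) {
			share++
		}
		out[j] += share
	}
	return out
}

func main() {
	conns := []int{19, 11, 0} // distribution after upgrading C
	conns = applyCap(conns, []int{10, noCap, noCap}, 0)
	fmt.Println(conns) // [10 15 5]  - limit of 10 applied to A
	conns = applyCap(conns, []int{10, 10, noCap}, 1)
	fmt.Println(conns) // [10 10 10] - limit of 10 applied to B as well
}
```

Under those assumptions the model lands on exactly the 10/15/5 and then 10/10/10 states from the walkthrough, after which the limits on A and B can be released.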

@domdom82

@derekcollison interesting thought. In our case we don't know the number of client connections upfront, so we cannot use a fixed number: we would drop any client beyond number_of_servers * max_connections (e.g. if we had 3 servers with 100 connections each, the 301st client would no longer be accepted - bad for us).

I think it could work by having a nanny job that continuously monitors the cluster distribution and adjusts the max_connections value to e.g. (sum(client_conns_server1, client_conns_server2, client_conns_server3) / 3) * 1.10, to leave some room for additional clients connecting (10% per server) but still be able to reduce major imbalances, like one server having no client connections at all.

WDYT?

@derekcollison
Member

After a full sweep upgrade, count all the current connections and divide by the number of servers; that will give you your target per-server balance number, which can guide the temp max_connections settings outlined above.

Right now the client (the Go client at least) will see the error from the server and terminate rather than do a normal reconnect. @tbeets and @wallyqs will look into this and get it fixed. It's non-fatal, and if the client knows it has other server options to perform a reconnect to, that should happen.

@tbeets
Contributor

tbeets commented Mar 25, 2022

PR for client to auto reconnect in this scenario is in the hopper:

nats-io/nats.go#935

@Ann-Geo

Ann-Geo commented Oct 8, 2022

> After a full sweep upgrade, count all the current connections and divide by the number of servers; that will give you your target per-server balance number, which can guide the temp max_connections settings outlined above.
>
> Right now the client (the Go client at least) will see the error from the server and terminate rather than do a normal reconnect. @tbeets and @wallyqs will look into this and get it fixed. It's non-fatal, and if the client knows it has other server options to perform a reconnect to, that should happen.

Hi, any updates on this thread? We are having the same issue: client connections will not redistribute across restarted failed servers in the cluster.

@derekcollison
Member

The work here is twofold. The server side will be manual and can be done today with tooling and scripts, etc. We may automate some of this in our service offering, NGS.

The other option would be hints sent from the servers that a v2 client could respond to and do the right thing. That work looks to happen possibly in 2023.

/cc @ColinSullivan1

@zbindenren

This would be interesting for us too. After a server upgrade:

[screenshot: per-server connection counts after the upgrade]

The 1 connection is from the nats utility.

Even if we had to manually launch a SYS command to redistribute, that would be fine for us.


7 participants