
Client connections load balancing among nats cluster nodes #1556

Closed
pananton opened this issue Aug 12, 2020 · 12 comments

pananton commented Aug 12, 2020

Hello! Currently the mechanism for load balancing client connections to a NATS cluster is based on random node selection. The problem (already discussed in some other issues, like #1359) is what happens when we restart the NATS cluster. To keep serving clients we restart the cluster one node after another, and this leads to approximately the following distribution (step-by-step example for a 3-node cluster):

  1. original connection distribution is 24 / 24 / 24
  2. after node1 restart : 0 / 36 / 36
  3. after node2 restart: 18 / 0 / 54
  4. after node3 restart: 45 / 27 / 0

As you can see, the original uniform distribution of 1/3 | 1/3 | 1/3 tends toward 2/3 | 1/3 | 0. I suggest adding some modifications that would allow developers to work around this problem (for example, we run our NATS cluster 24/7 and never stop it completely, but sometimes need to upgrade the server version or our hardware). There could be several solutions:


  1. Implement clever load balancing (where a NATS cluster node tells the client to try to connect to another known node which has fewer active clients)
  2. Provide a command which can be sent to the cluster to make it disconnect existing clients - we could use it after a cluster restart is finished to redistribute client connections among nodes.
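
The drift in the numbered example above can be reproduced with a small deterministic sketch, assuming each restarted node's clients split perfectly evenly among the remaining nodes (all names here are illustrative, not any NATS API):

```go
package main

import "fmt"

// rollingRestart models the drift described above: when a node is
// restarted, its clients reconnect and (assuming ideal randomness)
// split evenly among the remaining nodes. Returns the distribution
// after each restart.
func rollingRestart(conns []int) [][]int {
	var steps [][]int
	n := len(conns)
	for i := 0; i < n; i++ {
		moved := conns[i]
		conns[i] = 0
		for j := 0; j < n; j++ {
			if j != i {
				conns[j] += moved / (n - 1)
			}
		}
		steps = append(steps, append([]int(nil), conns...))
	}
	return steps
}

func main() {
	// Three nodes, 24 clients each, restarted one after another.
	for _, s := range rollingRestart([]int{24, 24, 24}) {
		fmt.Println(s)
	}
	// Prints: [0 36 36], [18 0 54], [45 27 0]
}
```

Each rolling restart pushes more weight onto the last node restarted, which is exactly the 45 / 27 / 0 endpoint of the example.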
@derekcollison
Member

We have been aware of this issue and have been exploring ideas here for some time, but nothing has come up that we really love yet as a solution. It is on our list though for sure.

derekcollison self-assigned this Aug 12, 2020
@pananton
Author

Ok, thanks. Btw, thank you for adding a feature to the NATS server that will immediately finish a request when no replier is listening. Waiting for the 2.2 release)

@BradErz

BradErz commented Feb 1, 2021

> We have been aware of this issue and have been exploring ideas here for some time, but nothing has come up that we really love yet as a solution. It is on our list though for sure.

@derekcollison has any progress or solution been found towards this?

We are struggling with the same situation, and I don't want to do hacky things on top of the NATS client to solve this 😂. Any suggestions?

@derekcollison
Member

Post 2.2 release we will be focusing on some client upgrades, this being one of them. But we need to get 2.2 out the door first.

@domdom82

Bump. Is there any movement on this? We're facing the exact same problem. So far we have only thought of building our NATS clients in a way that they reconnect after some time, to give them a chance to pick another NATS server and improve the distribution of client connections.

@derekcollison
Member

Apologies for the delays here; it has been a very interesting few months for us, in a positive way.

So prior to any intelligent client work, I took a look at this today during a break. I wanted to see if there might be a way, possible today, to balance a cluster after an upgrade. I believe there is, and I will test this myself at some point in the next few days, but here is the thought.

For this experiment let's assume 10 connections on each of 3 servers: A, B, and C. Also assume accurate randomness etc., which we know is not the case, but it will help illustrate what I think is possible.

START
A:10, B:10, C:10

POST UPGRADE A
A:0, B:15, C:15

POST UPGRADE B
A:8, B:0, C:22

POST UPGRADE C
A:19, B:11, C:0

Servers all support a setting for max connections, max_connections, in the config. And this value can be updated and the server reloaded without a server restart. If the new max_connections is lower than the currently connected count, clients will be randomly chosen to be disconnected.

So again, assuming accurate and true randomness of the next server picked, if we set A to max_connections: 10:

POST LIMIT:10 to A
A:10, B:15, C:5

POST LIMIT:10 to B (keep A's limit in place, but none on C)
A:10, B:10, C:10

Release limits on A and B.

Again, this was me doodling during a lunch break, but I did verify code-wise that max_connections can be set and reloaded without a server restart, and that current overages will be disconnected randomly.

So I think this would work today.
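
The doodle above can be checked with a small deterministic model. This only sketches the behavior described (excess clients above a lowered cap reconnect to servers that still have headroom, splitting roughly evenly); `applyCap` and `noCap` are hypothetical names for illustration, not a NATS API:

```go
package main

import (
	"fmt"
	"sort"
)

// noCap stands in for "no max_connections limit set".
const noCap = 1 << 30

// applyCap models lowering max_connections on one server: the excess
// clients are disconnected and, assuming they pick randomly among the
// servers that still have headroom, split roughly evenly across them,
// with the remainder going to the least-loaded servers first.
func applyCap(conns, caps []int, target int) []int {
	out := append([]int(nil), conns...)
	excess := out[target] - caps[target]
	if excess <= 0 {
		return out
	}
	out[target] = caps[target]

	// Servers that can still accept connections.
	var eligible []int
	for j := range out {
		if j != target && out[j] < caps[j] {
			eligible = append(eligible, j)
		}
	}
	if len(eligible) == 0 {
		return out // nowhere to go; those clients would be dropped
	}
	// Least-loaded servers receive the remainder first.
	sort.Slice(eligible, func(a, b int) bool {
		return out[eligible[a]] < out[eligible[b]]
	})
	for i, j := range eligible {
		share := excess / len(eligible)
		if i < excess%len(eligible) {
			share++
		}
		out[j] += share
	}
	return out
}

func main() {
	conns := []int{19, 11, 0} // distribution after upgrading C
	conns = applyCap(conns, []int{10, noCap, noCap}, 0)
	fmt.Println(conns) // [10 15 5]  - limit of 10 applied to A
	conns = applyCap(conns, []int{10, 10, noCap}, 1)
	fmt.Println(conns) // [10 10 10] - limit of 10 applied to B as well
}
```

Under those assumptions the model lands on exactly the 10/15/5 and then 10/10/10 states from the walkthrough, after which the limits on A and B can be released.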

@domdom82

@derekcollison interesting thought. In our case we don't know the number of client connections upfront, so we cannot use a fixed number: we would drop any client beyond number_of_servers * max_connections (e.g. if we had 3 servers with 100 connections each, the 301st client would no longer be accepted - bad for us).

I think it could work by having a nanny job that continuously monitors the cluster distribution and adjusts the max_connections value to e.g. (sum(client_conns_server1, client_conns_server2, client_conns_server3) / 3) * 1.10, to leave some room for additional clients connecting (10% per server) but still be able to reduce major imbalances, like one server having no client connections at all.

WDYT?

@derekcollison
Member

After a full sweep upgrade, count all the current connections and divide by the number of servers; that will give you your target per-server balance number, which can guide the temp max_connections settings outlined above.

Right now the client (the Go client at least) will see the error from the server and terminate rather than do a normal reconnect. @tbeets and @wallyqs will look into this and get it fixed. It's non-fatal, and if the client knows it has other server options to perform a reconnect to, that should happen.

@tbeets
Contributor

tbeets commented Mar 25, 2022

PR for client to auto reconnect in this scenario is in the hopper:

nats-io/nats.go#935

@Ann-Geo

Ann-Geo commented Oct 8, 2022

> After a full sweep upgrade, count all the current connections and divide by the number of servers; that will give you your target per-server balance number, which can guide the temp max_connections settings outlined above.
>
> Right now the client (the Go client at least) will see the error from the server and terminate rather than do a normal reconnect. @tbeets and @wallyqs will look into this and get it fixed. It's non-fatal, and if the client knows it has other server options to perform a reconnect to, that should happen.

Hi, any updates on this thread? We are having the same issue: client connections will not redistribute across restarted failed servers in the cluster.

@derekcollison
Member

The work here is twofold. The server side will be manual and can be done today with tooling and scripts, etc. We may automate some of this in our service offering, NGS.

The other option would be hints sent from the servers that a v2 client could respond to and do the right thing. That work looks to happen possibly in 2023.

/cc @ColinSullivan1

@zbindenren

This would be interesting for us too. After a server upgrade:

[screenshot: per-server connection counts after the upgrade]

The 1 connection is from the nats utility.

Even if we had to manually launch a SYS command to redistribute, that would be fine for us.


7 participants