Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADDED] $SYS server request to 'kick' or 'LDM' a client connection #4298

Merged
merged 1 commit into from Aug 11, 2023

Conversation

jnmoyne
Copy link
Contributor

@jnmoyne jnmoyne commented Jul 10, 2023

  • Link to issue, e.g. Resolves #NNN
  • Branch rebased on top of current main (git pull --rebase origin main)
  • Changes squashed to a single commit (described here)
  • Build is green in Travis CI
  • You have certified that the contribution is your original work and that you license the work to the project under the Apache 2 license

Resolves #1556

Changes proposed in this pull request:

Adds tw new $SYS server API endpoints:

  • $SYS.REQ.SERVER.%s.KICK (where %s is the server_id) which 'kicks' (effectiveley 'rebalance' as the client application reconnects itself right away (potentially to another server in the cluster)). The service takes a JSON payload containing either an "id" or a "name" field. "id" disconnects the client connection id, "name" disconnects all of the clients connected to the server with that name.

  • $SYS.REQ.SERVER.%s.LDM (where %s is the server_id) and takes a JSON payload containing either an "id" or a "name" field. "id" sends an LDM Info message to the client connection id, "name" sends an LDM Info message to all of the clients connected to the server with that name.

This features allow administrators to manually 're-balance' client connections between the servers in the cluster (e.g. after a rolling upgrade of the servers where one server ends up with no client connections after the upgrade), by kicking some of the client connections from one of the 'overloaded' (in comparison to other servers) servers in the cluster, causing them to re-estalibsh their connection to (hopefully) another server.

@jnmoyne jnmoyne requested a review from a team as a code owner July 10, 2023 20:31
server/server.go Outdated Show resolved Hide resolved
@derekcollison
Copy link
Member

Let's think a bit more about all of this for 2.11, lot more moving parts and want to have a cohesive answer across the board.

server/events.go Outdated Show resolved Hide resolved
Copy link
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the whole I think this looks good, but it needs at least a unit test to prove it works (probably using nats.go's LameDuckModeHandler) and preferably a second unit test to prove it's only reachable on the system account.

Copy link
Contributor

@ripienaar ripienaar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesnt look like it gives feedback about what was done, if a server disconnected a client or not or which handled these?

Looking at the recently added reloadConfig example it uses the zReq() helper to craft responses for which servers handled it and errors are sent back to the caller.

From the perspective of using this (CLI) I cant see how to make this a friendly UX, getting the server info and affected changes back means i can tell a user success/fail etc

server/events.go Outdated Show resolved Hide resolved
Copy link
Contributor

@ripienaar ripienaar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - docs on exported structs and funcs would be nice, in go a function / struct doc starts with its name

server/server.go Outdated Show resolved Hide resolved
server/server.go Outdated Show resolved Hide resolved
@jnmoyne
Copy link
Contributor Author

jnmoyne commented Aug 10, 2023

Ok should be all good now. @neilalexander please merge if you are good with those last changes and the new test.

@neilalexander
Copy link
Member

Test & race fix look good, please squash down to one commit and I'll hit merge in the morning!

Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>
Copy link
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@neilalexander neilalexander merged commit d474e3b into dev Aug 11, 2023
2 checks passed
@neilalexander neilalexander deleted the jnm/sys_kick branch August 11, 2023 08:39
@dw612
Copy link

dw612 commented Oct 26, 2023

Hi what would be the nats version to have this feature?

@derekcollison
Copy link
Member

2.10.x will have that feature.

@dw612
Copy link

dw612 commented Oct 27, 2023

Hi I used 2.10.3-alpine image for my nats statefulset, and used nats-pub tool to send this message "Published [$SYS.REQ.SERVER.NAVK5EQIMAJ4S7SMDSERITSXY2Q4MCDAMSFJYCURWDQC4OQTVVJ6A.KICK] : '{"name":"rebalance-worker"}'"
The server does not seem to be kicking any connections. What did I miss?

@derekcollison
Copy link
Member

This is an internal feature and works by CID.

@dw612
Copy link

dw612 commented Oct 27, 2023

Oh that means it has no released image yet? Is there a estimated time for the release?

@derekcollison
Copy link
Member

It's in there. You have to supply CID.

@dw612
Copy link

dw612 commented Oct 27, 2023

Sorry could you elaborate? CID do you mean by container id? I was following the above "$SYS.REQ.SERVER.%s.KICK (%s is the server_id)"

@jnmoyne
Copy link
Contributor Author

jnmoyne commented Oct 27, 2023

The payload of the request should be
{"cid": xxx}

Where xxx is the connection id (a number) for that connection to the server.

The initial version of the PR also allowed to kick by name but that was removed before the PR merge as it's too much of a potential footgun (since it would kick all of the connections with the name)

@ripienaar
Copy link
Contributor

You can use “nats server request kick” and its help will ask the required info. Add —trace to see how it works

@dw612
Copy link

dw612 commented Oct 30, 2023

Oh I was looking to kick all connection to a server - given the server id, not just a particular connection between a client and a server. The reason is after spinning up new NATS server replicas, I want to disconnect all the existing connections to the original replica and have them auto distribute evenly across all replicas. Is this feature not supported yet?

@ripienaar
Copy link
Contributor

Just shut down the server? If you don’t want connection on it at all there is no point in hanging around. So shut it with the LDM signal to the process

@derekcollison
Copy link
Member

Since these are system wide operations, the controls at an individual server level will be very low level and raw. The plan is to provide system level directives through the NATS cli to balance things like connections and JetStream leaders.

@04116
Copy link

04116 commented Nov 2, 2023

hi team,
as proposal, the solution here is provide a way to disconnect and let client reconnect? Depend on st like proxy to balance number of clients on server?

Do we handle when client still reconnect to the previous server?

@ripienaar
Copy link
Contributor

Clients who have a list of servers - either in co fig or using topology discovery - will connect to a random next server but not the one they were just on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants