
[FIXED] MQTT: rapid load-balanced (re-)CONNECT to cluster causes races #4734

Merged — 1 commit merged into main on Nov 3, 2023

Conversation

levb (Contributor) commented Nov 2, 2023

The tests explain the condition. TL;DR: a rapid, load-balanced sequence of connects/disconnects from the same client to a cluster was causing failures.

This PR:

  • eliminates the need to load the session message from JetStream when processing a persist notification from another server; the client ID that was previously obtained from the loaded message is now available in the ACK subject
  • eliminates the separate goroutine/ipQueue for processing session persist messages; they are now processed inline
  • ignores deleteMsg errors caused by outdated cached session data
  • fixes an incorrect ACK subject when transferring retained messages from 2.9 to 2.10
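The first bullet is the core of the change: once the client-ID hash is a token of the persist-ACK subject, the receiving server can read it straight off the subject instead of issuing a loadMsg. A minimal sketch of that extraction, assuming the subject layout `$MQTT.JSA.{js-id}.SP.{client-id-hash}.{uuid}` discussed in the review below (the function name and the sample tokens are illustrative, not the server's actual identifiers):

```go
package main

import (
	"fmt"
	"strings"
)

// clientIDHashFromSPSubject is a hypothetical helper: it recovers the
// client-ID hash from a session-persist ACK subject of the form
// $MQTT.JSA.{js-id}.SP.{client-id-hash}.{uuid}, so no JetStream
// loadMsg round-trip is needed to identify the session's owner.
func clientIDHashFromSPSubject(subject string) (string, bool) {
	tokens := strings.Split(subject, ".")
	// Expect exactly: $MQTT, JSA, {js-id}, SP, {client-id-hash}, {uuid}
	if len(tokens) != 6 || tokens[0] != "$MQTT" || tokens[1] != "JSA" || tokens[3] != "SP" {
		return "", false
	}
	return tokens[4], true
}

func main() {
	hash, ok := clientIDHashFromSPSubject("$MQTT.JSA.SRV1.SP.abc123.uuid456")
	fmt.Println(hash, ok) // abc123 true

	// The pre-fix 5-token form ($MQTT.JSA.{js-id}.SP.{nuid}) does not match,
	// which is the interoperability concern raised in the review below.
	_, ok = clientIDHashFromSPSubject("$MQTT.JSA.SRV1.SP.oldnuid")
	fmt.Println(ok) // false
}
```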

@derekcollison please note the highly variable test run times, O(1 ms) to O(100 ms), even when targeting the same server. Not sure what's going on there. (go test -v ./server --run TestMQTTClusterConnectDisconnect prints them out):
[image: test run times]

Same test failing in main:
[image: failing test output]

@levb levb requested a review from kozlovic November 2, 2023 00:31
@levb levb requested a review from a team as a code owner November 2, 2023 00:31
@levb levb marked this pull request as draft November 2, 2023 00:37
levb commented Nov 2, 2023

Actually, the new test still failed when I ran it with --count 50; investigating.

levb commented Nov 2, 2023

While testing with --count=X I saw that persistent sessions were hitting a similar race condition when the session persist notification was not processed in time; it was causing invalid seq errors.

Persist notifications were processed in a separate goroutine fed from an ipQueue so that they could load session messages for inspection. However, the only data we need from the loaded message is the client ID (hash), which we can add to the JS ACK subject and use directly. This is what the last commit does.

This PR shortens the notification processing time by cutting out a goroutine/queue hop and, more importantly, a loadMsg call.

server/mqtt.go (resolved review thread)
kozlovic (Member) left a comment:

Question about interoperability, some nits, and a recommendation about error checking.

Diff in server/mqtt.go:

        as.processJSAPIReplies, &sid, &subs); err != nil {
            return nil, err
        }

        // We will listen for replies to session persist requests so that we can
        // detect the use of a session with the same client ID anywhere in the cluster.
    -   if err := as.createSubscription(mqttJSARepliesPrefix+"*."+mqttJSASessPersist+".*",
    +   // `$MQTT.JSA.{js-id}.SP.{client-id-hash}.{uuid}`
    +   if err := as.createSubscription(mqttJSARepliesPrefix+"*."+mqttJSASessPersist+".*.*",
kozlovic (Member) commented:
So does that mean that a server with this fix will not be able to co-operate with other servers (say, current v2.10.3)? That is, a message that a current server persists with a reply subject of $MQTT.JSA.<serverId>.SP.<nuid> would not be received by a server with this fix. Maybe it's ok, but I am just raising this to make sure that you thought about it.

levb (Contributor, Author) replied Nov 2, 2023:

Right, thanks for raising it. I did consider the compatibility issue and kinda punted on it, because of the "edge condition" nature of the use case. I considered serializing the clientID into the uuid token using a different separator, but that felt too hacky for a permanent solution to a temporary edge case. Nothing else I could think of would make this PR backwards-interoperable, i.e. broadcasting to the <2.10.(N) servers in a way that they'd understand. I could easily add another listening subscription to pick up their messages, but that'd be one-way only.

All in all, 1/5 leave as is and require that all servers in an MQTT cluster are upgraded/downgraded at approximately the same time. (Note for others, this is not affecting the session store itself, just the ACK change notifications.)

levb (Contributor, Author) added:

Having said that, I think using say, $MQTT.JSA.{js-id}.SP.{client-id-hash}_{uuid} would work just fine.
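The alternative floated here keeps the pre-fix token count by fusing the hash and the uuid into one token with a `_` separator, at the cost of extra parsing on the receiver. A hypothetical sketch of that receiver-side split (the function name and sample tokens are illustrative, not actual server code):

```go
package main

import (
	"fmt"
	"strings"
)

// splitSPLastToken sketches the alternative encoding discussed above:
// the subject's last token is {client-id-hash}_{uuid}, so the subject
// keeps the same number of tokens as before the fix, and the receiver
// splits the last token to recover the hash.
func splitSPLastToken(subject string) (hash, uuid string, ok bool) {
	tokens := strings.Split(subject, ".")
	if len(tokens) == 0 {
		return "", "", false
	}
	last := tokens[len(tokens)-1]
	i := strings.IndexByte(last, '_')
	// Require a non-empty hash before the separator.
	if i <= 0 {
		return "", "", false
	}
	return last[:i], last[i+1:], true
}

func main() {
	h, u, ok := splitSPLastToken("$MQTT.JSA.SRV1.SP.abc123_uuid456")
	fmt.Println(h, u, ok) // abc123 uuid456 true
}
```

Note the trade-off the reviewer points out next: this keeps the subscription's token count unchanged, but moves complexity into token parsing, and it assumes `_` never appears in the hash itself.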

levb (Contributor, Author) commented Nov 2, 2023:

@kozlovic you agree with ^^? (leaving as is?)

kozlovic (Member) replied:

The advantage is that you could revert some of the createSubscription changes and keep the same number of tokens. But you would need to do more processing to extract the client ID from the last token. Up to you.

Resolved review threads: server/mqtt.go (×5), server/stream.go
@levb levb changed the title [FIXED] MQTT rapid cluster CONNECT race to delete session [FIXED] MQTT: rapid load-balanced (re-)CONNECT to cluster causes races Nov 2, 2023
Further resolved review threads: server/mqtt.go, server/stream.go
@levb levb requested a review from kozlovic November 2, 2023 20:18
kozlovic (Member) left a review:

LGTM

Commit messages:
  • Inline persistent sess notification processing
  • PR feedback: nit _EMPTY_
  • PR feedback: more robust error handling, _EMPTY_
  • PR feedback: error handling
@levb levb merged commit 091aa85 into main Nov 3, 2023
4 checks passed
@levb levb deleted the lev-mqtt-delsess-error branch November 3, 2023 13:44
3 participants