[FIXED] Race condition in some leader failover scenarios leading to messages being potentially sourced more than once. #4592

jnmoyne · 2023-09-27T07:56:37Z

Tests added
Branch rebased on top of current main (git pull --rebase origin main)
Changes squashed to a single commit (described here)
Build is green in Travis CI
You have certified that the contribution is your original work and that you license the work to the project under the Apache 2 license

Changes proposed in this pull request:

Fixes a race condition in some leader failover scenarios leading to messages being potentially sourced more than once.

In some failure scenarios where the current leader of a stream sourcing from other stream(s) gets shutdown while publications are happening on the stream(s) being sourced leads to setLeader(true) being called on the new leader for the sourcing stream before all the messages having been sourced by the previous leader are completely processed such that when the new leader does it's reverse scan from the last message in it's view of the stream in order to know what sequence number to start the consumer for the stream being sourced from, such that the last message(s) sourced by the previous leader get sourced again, leading to some messages being sourced more than once.

The existing TestNoRaceJetStreamSuperClusterSources test would sidestep the issue by relying on the deduplication window in the sourcing stream. Without deduplication the test is a flapper.

This avoid the race condition by adding a small delay before scanning for the last message(s) having been sourced and starting the sources' consumer(s). Now the test (without using the deduplication window) never fails because more messages than expected have been received in the sourcing stream.

(Also adds a guard to give up if setupSourceConsumers() is called and we are no longer the leader for the stream (that check was already present in setupMirrorConsumer() so assuming it was forgotten for setupSourceConsumers())

derekcollison · 2023-09-27T15:09:51Z

We going to pull this in for 2.10.2?

server/norace_test.go

server/stream.go

…essages being potentially sourced more than once. - In some failure scenarios where the current leader of a stream sourcing from other stream(s) gets shutdown while publications are happening on the stream(s) being sourced leads to `setLeader(true)` being called on the new leader for the sourcing stream before all the messages having been sourced by the previous leader are completely processed such that when the new leader does it's reverse scan from the last message in it's view of the stream in order to know what sequence number to start the consumer for the stream being sourced from, such that the last message(s) sourced by the previous leader get sourced again, leading to some messages being sourced more than once. The existing `TestNoRaceJetStreamSuperClusterSources` test would sidestep the issue by relying on the deduplication window in the sourcing stream. Without deduplication the test is a flapper. This avoid the race condition by adding a small delay before scanning for the last message(s) having been sourced and starting the sources' consumer(s). Now the test (without using the deduplication window) never fails because more messages than expected have been received in the sourcing stream. - Fix test TestJetStreamWorkQueueSourceRestart that expects the sourcing stream to get all of the expected messages right away by adding a small sleep before checking the number of messages pending on the consumer for that stream. Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>

derekcollison · 2023-09-28T17:53:32Z

If this runs green are we good to pull in?

jnmoyne · 2023-09-28T18:18:47Z

Yes

derekcollison

LGTM

Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>

Backport #4592 to 2.9.23 --------- Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>

jnmoyne requested a review from a team as a code owner September 27, 2023 07:56

jnmoyne force-pushed the jnm/fix_27Sept23 branch from eedcf9c to f273609 Compare September 27, 2023 08:34

jnmoyne marked this pull request as draft September 27, 2023 09:01

derekcollison reviewed Sep 27, 2023

View reviewed changes

server/norace_test.go Outdated Show resolved Hide resolved

derekcollison reviewed Sep 27, 2023

View reviewed changes

server/stream.go Outdated Show resolved Hide resolved

derekcollison reviewed Sep 27, 2023

View reviewed changes

server/stream.go Outdated Show resolved Hide resolved

jnmoyne force-pushed the jnm/fix_27Sept23 branch 3 times, most recently from 625dd00 to 9d3136c Compare September 27, 2023 23:15

jnmoyne marked this pull request as ready for review September 27, 2023 23:39

derekcollison self-requested a review September 27, 2023 23:41

derekcollison requested changes Sep 27, 2023

View reviewed changes

server/stream.go Outdated Show resolved Hide resolved

server/stream.go Outdated Show resolved Hide resolved

jnmoyne force-pushed the jnm/fix_27Sept23 branch from 9d3136c to 96cfebe Compare September 27, 2023 23:53

derekcollison self-requested a review September 28, 2023 00:03

derekcollison reviewed Sep 28, 2023

View reviewed changes

server/stream.go Outdated Show resolved Hide resolved

jnmoyne force-pushed the jnm/fix_27Sept23 branch from 96cfebe to 71f9688 Compare September 28, 2023 17:51

derekcollison self-requested a review September 28, 2023 18:19

derekcollison approved these changes Sep 28, 2023

View reviewed changes

derekcollison merged commit 783edaa into main Sep 28, 2023
2 checks passed

derekcollison deleted the jnm/fix_27Sept23 branch September 28, 2023 18:22

jnmoyne added a commit that referenced this pull request Oct 11, 2023

Backport #4592 to 2.9

decbcdf

Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>

jnmoyne mentioned this pull request Oct 11, 2023

Backport #4592 to 2.9 #4651

Merged

wallyqs pushed a commit that referenced this pull request Oct 11, 2023

Backport #4592 to 2.9 (#4651)

05fe77f

Backport #4592 to 2.9.23 --------- Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FIXED] Race condition in some leader failover scenarios leading to messages being potentially sourced more than once. #4592

[FIXED] Race condition in some leader failover scenarios leading to messages being potentially sourced more than once. #4592

jnmoyne commented Sep 27, 2023 •

edited

derekcollison commented Sep 27, 2023

derekcollison commented Sep 28, 2023

jnmoyne commented Sep 28, 2023

derekcollison left a comment

[FIXED] Race condition in some leader failover scenarios leading to messages being potentially sourced more than once. #4592

[FIXED] Race condition in some leader failover scenarios leading to messages being potentially sourced more than once. #4592

Conversation

jnmoyne commented Sep 27, 2023 • edited

Changes proposed in this pull request:

derekcollison commented Sep 27, 2023

derekcollison commented Sep 28, 2023

jnmoyne commented Sep 28, 2023

derekcollison left a comment

Choose a reason for hiding this comment

jnmoyne commented Sep 27, 2023 •

edited