Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED] Race condition in some leader failover scenarios leading to messages being potentially sourced more than once. #4592

Merged
merged 1 commit into from Sep 28, 2023

Conversation

jnmoyne
Copy link
Contributor

@jnmoyne jnmoyne commented Sep 27, 2023

  • Tests added
  • Branch rebased on top of current main (git pull --rebase origin main)
  • Changes squashed to a single commit (described here)
  • Build is green in Travis CI
  • You have certified that the contribution is your original work and that you license the work to the project under the Apache 2 license

Changes proposed in this pull request:

Fixes a race condition in some leader failover scenarios leading to messages being potentially sourced more than once.

In some failure scenarios where the current leader of a stream sourcing from other stream(s) gets shutdown while publications are happening on the stream(s) being sourced leads to setLeader(true) being called on the new leader for the sourcing stream before all the messages having been sourced by the previous leader are completely processed such that when the new leader does it's reverse scan from the last message in it's view of the stream in order to know what sequence number to start the consumer for the stream being sourced from, such that the last message(s) sourced by the previous leader get sourced again, leading to some messages being sourced more than once.

The existing TestNoRaceJetStreamSuperClusterSources test would sidestep the issue by relying on the deduplication window in the sourcing stream. Without deduplication the test is a flapper.

This avoid the race condition by adding a small delay before scanning for the last message(s) having been sourced and starting the sources' consumer(s). Now the test (without using the deduplication window) never fails because more messages than expected have been received in the sourcing stream.

(Also adds a guard to give up if setupSourceConsumers() is called and we are no longer the leader for the stream (that check was already present in setupMirrorConsumer() so assuming it was forgotten for setupSourceConsumers())

@jnmoyne jnmoyne requested a review from a team as a code owner September 27, 2023 07:56
@jnmoyne jnmoyne marked this pull request as draft September 27, 2023 09:01
@derekcollison
Copy link
Member

We going to pull this in for 2.10.2?

server/stream.go Outdated Show resolved Hide resolved
server/stream.go Outdated Show resolved Hide resolved
@jnmoyne jnmoyne force-pushed the jnm/fix_27Sept23 branch 3 times, most recently from 625dd00 to 9d3136c Compare September 27, 2023 23:15
@jnmoyne jnmoyne marked this pull request as ready for review September 27, 2023 23:39
server/stream.go Outdated Show resolved Hide resolved
server/stream.go Outdated Show resolved Hide resolved
server/stream.go Outdated Show resolved Hide resolved
…essages being potentially sourced more than once.

- In some failure scenarios where the current leader of a stream sourcing from other stream(s) gets shutdown while publications are happening on the stream(s) being sourced leads to `setLeader(true)` being called on the new leader for the sourcing stream before all the messages having been sourced by the previous leader are completely processed such that when the new leader does it's reverse scan from the last message in it's view of the stream in order to know what sequence number to start the consumer for the stream being sourced from, such that the last message(s) sourced by the previous leader get sourced again, leading to some messages being sourced more than once.

The existing `TestNoRaceJetStreamSuperClusterSources` test would sidestep the issue by relying on the deduplication window in the sourcing stream. Without deduplication the test is a flapper.

This avoid the race condition by adding a small delay before scanning for the last message(s) having been sourced and starting the sources' consumer(s). Now the test (without using the deduplication window) never fails because more messages than expected have been received in the sourcing stream.

- Fix test TestJetStreamWorkQueueSourceRestart that expects the sourcing stream to get all of the expected messages right away by adding a small sleep before checking the number of messages pending on the consumer for that stream.

Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>
@derekcollison
Copy link
Member

If this runs green are we good to pull in?

@jnmoyne
Copy link
Contributor Author

jnmoyne commented Sep 28, 2023

Yes

Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit 783edaa into main Sep 28, 2023
2 checks passed
@derekcollison derekcollison deleted the jnm/fix_27Sept23 branch September 28, 2023 18:22
jnmoyne added a commit that referenced this pull request Oct 11, 2023
Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>
@jnmoyne jnmoyne mentioned this pull request Oct 11, 2023
wallyqs pushed a commit that referenced this pull request Oct 11, 2023
Backport #4592 to 2.9.23

---------

Signed-off-by: Jean-Noël Moyne <jnmoyne@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants