
Add test case for concurrent expected last subject sequence #4319

Merged: 3 commits merged into main from conc-last-expect-seq on Jul 18, 2023

Conversation

@bruth (Member) commented Jul 18, 2023

Resolves: #4320

Signed-off-by: Byron Ruth <byron@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
@bruth marked this pull request as ready for review on July 18, 2023 at 18:59
@bruth requested a review from a team as a code owner on July 18, 2023 at 18:59
@derekcollison (Member)

Need to have @neilalexander or @wallyqs take a look and approve. I can handle merging, etc.

@derekcollison merged commit 80fb29f into main on Jul 18, 2023
1 check passed
@derekcollison deleted the conc-last-expect-seq branch on July 18, 2023 at 19:37
@kruegernet commented Jul 18, 2023

It's hard for me to parse the git history as an outsider, so if I may ask: was this once working and then broken by a regression, or may this have existed as a bug for longer than that? Are there implications for deployed, production systems that may have relied on concurrency control here?

@bruth (Member, Author) commented Jul 18, 2023

I tested back to 2.9.15, but I suspect this always existed. In my testing, the probability of this behavior occurring depended on how close in time the two concurrent requests were. In the specific test on my laptop (for reference), anything above 2ms between requests would not exhibit the behavior. In a production cluster with more latency among nodes, the window may be in the 10s or 100s of milliseconds.

The issue was that the expected-sequence check was not part of the Raft consensus for the write; it was only a pre-check. As a result, if concurrent requests were all "in flight" performing the same pre-flight check, they could all succeed.
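
For illustration, here is a minimal sketch of that race using the nats.go client; the URL, subject, and sequence value are hypothetical, and it assumes a stream already exists that captures `orders.*`:

```go
package main

import (
	"fmt"
	"sync"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// Both writers believe the last sequence for this subject is 0,
	// i.e. no message has been published to it yet.
	const subject = "orders.1"
	const lastSeq = 0

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Each publish carries the Nats-Expected-Last-Subject-Sequence header
			// via the ExpectLastSequencePerSubject publish option.
			_, err := js.Publish(subject, []byte(fmt.Sprintf("writer-%d", id)),
				nats.ExpectLastSequencePerSubject(lastSeq))
			fmt.Printf("writer %d: err=%v\n", id, err)
		}(i)
	}
	wg.Wait()
	// With the check performed as part of the Raft proposal, exactly one publish
	// is accepted and the other fails with a "wrong last sequence" error; before
	// this change, two near-simultaneous publishes could both be accepted.
}
```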

@bruth (Member, Author) commented Jul 18, 2023

> Are there implications for deployed, production systems that may have relied on concurrency control here?

The implication is that the intended behavior is now corrected, rather than a "last writer wins" situation 😄 This should not have any performance impact, since the logic for the check has only been moved to the Raft layer; when concurrent requests come in (with the header), they will now be rejected correctly at the point where the "decision to accept" occurs.

@kruegernet

What I meant with the implications question was: could it be the case that existing production systems relying on sequence-protected concurrent-write guarantees have data corruption or inconsistent state? Evidently, as you describe it, the latency window for unpredictable winning writes could be in the 10s or 100s of milliseconds. So maybe issue an advisory for folks who may not watch the issues and changelogs super carefully? In other words, had I relied on the stated guarantee without testing it carefully, and gone to production with my code before observing this issue, I might have problems with KV write integrity in a production system.
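
For context, the KV compare-and-set path is one place this guarantee is exercised. A minimal sketch using the nats.go client, with made-up bucket and key names and assuming the bucket already exists, of the update pattern that depends on the expected-last-subject-sequence check:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

// casUpdate performs an optimistic (compare-and-set style) update: kv.Update
// sends the expected last subject sequence for the key, so a concurrent writer
// that commits first causes this call to fail instead of being silently overwritten.
func casUpdate(kv nats.KeyValue, key string, value []byte) error {
	entry, err := kv.Get(key)
	if err != nil {
		return err
	}
	if _, err := kv.Update(key, value, entry.Revision()); err != nil {
		return err // caller can re-read and retry
	}
	return nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Assumes a KV bucket named "CONFIG" already exists.
	kv, err := js.KeyValue("CONFIG")
	if err != nil {
		log.Fatal(err)
	}

	if err := casUpdate(kv, "feature-flag", []byte("on")); err != nil {
		log.Printf("update lost the race or failed: %v", err)
	}
}
```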

@bruth (Member, Author) commented Jul 19, 2023

Yes, absolutely, awareness will be raised for this issue. In terms of impact, it depends on whether or not the message being published resulted from a deterministic operation among the concurrent actors. This will be highlighted in a less terse way in a blog post describing the implications, so folks can determine the impact it may have for them.

Successfully merging this pull request may close these issues.

Serializability of Expected-Last-Subject-Sequence is not guaranteed in clustered stream