Add test case for concurrent expected last subject sequence #4319
Conversation
Signed-off-by: Byron Ruth <byron@nats.io>
Signed-off-by: Derek Collison <derek@nats.io>
Need to have @neilalexander or @wallyqs take a look and approve. I can handle merging, etc.
It's hard for me to parse the git history as an outsider, so if I may ask: was this once working and then broken by a regression, or may this bug have existed for longer than that? Are there implications for deployed production systems that may have relied on concurrency control here?
I tested back to 2.9.15, but I suspect this always existed. In my testing, the probability of this behavior occurring depended on how close together the two concurrent requests were. In the specific test on my laptop (for reference), I observed that anything above 2ms between requests would not exhibit the behavior. In a production cluster with more latency among nodes, the window may be in the 10s or 100s of milliseconds. The issue was that the expected-sequence check was not part of the Raft consensus for the write; it was only a pre-check. As a result, if concurrent requests were all "in flight" performing the same pre-flight check, they could all succeed.
The implication is that the intended behavior is now corrected, rather than a "last writer wins" situation 😄 This should not have any performance impact, since the logic for the check has only been moved to the Raft layer; a concurrent request (with the header) will now be rejected correctly at the point the "decision to accept" occurs.
What I meant with the implications question was: could existing production systems that relied on sequence-protected concurrent write guarantees now have data corruption or inconsistent state? As you describe it, the latency window for unpredictable winning writes could evidently be in the 10s or 100s of milliseconds. So perhaps issue an advisory for folks who may not watch the issues and changelogs very carefully? In other words, had I not tested this carefully and instead relied on the stated guarantee, I might have had KV write-integrity issues in a production system, had I gone to production with my code prior to observing this issue.
Yes, absolutely; awareness will be raised for this issue. In terms of impact, it depends on whether the message being published resulted from a deterministic operation among the concurrent actors. The implications will be highlighted in a less terse way in a blog post, so folks can determine the impact it may have for them.
Resolves: #4320