fix: support DiscardNew policy for Jetstream streams #1624
base: main
…rdNew Signed-off-by: Quentin Faidide <quentin.faidide@gmail.com>
@@ -114,7 +114,7 @@ func (jss *jetStreamSvc) CreateBuffersAndBuckets(ctx context.Context, buffers, b
 Name: streamName,
 Subjects: []string{streamName}, // Use the stream name as the only subject
 Retention: nats.RetentionPolicy(v.GetInt("stream.retention")),
-Discard: nats.DiscardOld,
+Discard: nats.DiscardNew,
can we make it overridable by the user?
Maybe we can set it when users specify "drop on full"? It's pretty much the exact same behaviour.
According to @yhl25, though, DiscardNew can't be used; I'm still trying to understand why in the other issue.
We need to figure out why it won't work; ideally it should.
+1 to making it configurable by the user, but let's set it to DiscardOld by default, since we use the Limits policy by default. Also, please add a comment saying DiscardNew can only be used with the WorkQueue policy.
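A minimal sketch of what a user-overridable setting with a DiscardOld default could look like. The local `DiscardPolicy` type stands in for `nats.DiscardPolicy` so the snippet compiles without the nats.go dependency; the function name `parseDiscardPolicy` and the accepted strings `"old"` and `"new"` are assumptions made for illustration, not anything taken from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// DiscardPolicy is a local stand-in for nats.DiscardPolicy (assumption:
// used here only so the sketch builds with the standard library alone).
type DiscardPolicy int

const (
	DiscardOld DiscardPolicy = iota // stands in for nats.DiscardOld
	DiscardNew                      // stands in for nats.DiscardNew
)

// parseDiscardPolicy maps a user-supplied setting to a discard policy.
// An empty value falls back to DiscardOld, the default suggested in review.
// The accepted strings are hypothetical.
func parseDiscardPolicy(s string) (DiscardPolicy, error) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "", "old":
		return DiscardOld, nil
	case "new":
		return DiscardNew, nil
	default:
		return DiscardOld, fmt.Errorf("unknown discard policy %q", s)
	}
}

func main() {
	for _, in := range []string{"", "new", "bogus"} {
		p, err := parseDiscardPolicy(in)
		fmt.Printf("input=%q policy=%d err=%v\n", in, p, err)
	}
}
```

Rejecting unknown values instead of silently falling back keeps a typo in the config from quietly selecting the default.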
Since we cannot use DiscardNew with the Limits policy, we should not let the pipeline even start; validation should fail.
Indeed, I will try to implement that, and try to reproduce the issue @yhl25 was mentioning with stuck messages.
Maybe we should also document the risk of data loss during a surge when using DiscardOld at high throughput, and be really transparent about why it happens. Right now, users like me who fiddle with the config might be in big trouble if some UDF/Sink creates a silent data loss scenario in production.
Something I currently experience with the "surge pipeline situation" with
The
Overall, the number of messages stays nearly stable (it decreases at a very slow rate on the source buffer) because the huge pile of retries tends to immediately refill any missing data. So either this is what gave the impression of undelivered messages staying in the pipe, or I still haven't reproduced it.
Signed-off-by: Quentin Faidide <quentin.faidide@gmail.com>
I added the doc changes to remove the retention policy parameter and default to WorkQueue as discussed.
What do you guys think we should do? I've tried to reproduce the issue a few times with no luck. I'm going to retry, but let me know your input.
As per nats-io/nats-server#5148 (comment), the issue seems to have been resolved in 2.10.12.
So what's the plan: do we change the new "compatible" jetstream configmap to specify only the new version with the fix, or do we wait for someone to try to reproduce this error enough times to convince us that it's fixed?
We can make this change a configurable option, with the defaults set to what is currently being used, since that is battle-tested. Eventually this could become the default, but before we do that, we need to make sure it works as expected after a decent amount of running in production.
nats-io/nats-server#5270 seems to fix the problem. The 2.10.14 release of JetStream seems very promising for WorkQueue.
might fix #1551 #1554
We would need to ensure that there are no adverse consequences in the handling of the new write error that would occur in the surge scenario.