
fix: support DiscardNew policy for Jetstream streams #1624

Draft
wants to merge 3 commits into
base: main

Conversation

QuentinFAIDIDE
Contributor

Might fix #1551 and #1554.
We would need to ensure that there are no adverse consequences in how the new write error is handled in the surge scenario.

…rdNew

Signed-off-by: Quentin Faidide <quentin.faidide@gmail.com>
@QuentinFAIDIDE marked this pull request as draft on April 1, 2024 at 16:14
@@ -114,7 +114,7 @@ func (jss *jetStreamSvc) CreateBuffersAndBuckets(ctx context.Context, buffers, b
 Name:      streamName,
 Subjects:  []string{streamName}, // Use the stream name as the only subject
 Retention: nats.RetentionPolicy(v.GetInt("stream.retention")),
-Discard:   nats.DiscardOld,
+Discard:   nats.DiscardNew,
Contributor

can we make it overridable by the user?

Contributor Author

Maybe we can set it when users specify "drop on full"? It's pretty much the same behaviour.
Though according to @yhl25, DiscardNew can't be used; I'm still trying to understand why in the other issue.

Contributor

We need to figure out why it won't work; ideally it should.

Contributor

+1 to making it configurable for the user, but let's set it to DiscardOld by default since we use the Limits policy by default. Also, please add a comment saying DiscardNew can only be used with the WorkQueue policy.
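For illustration, here is a minimal sketch of what a user-overridable discard policy could look like in the stream creation code, assuming a hypothetical `stream.discardNew` key next to the existing `stream.retention` one; the function and key names are illustrative, not the actual Numaflow implementation:

```go
package isbsvc

import (
	"github.com/nats-io/nats.go"
	"github.com/spf13/viper"
)

// createStream is a hypothetical sketch of the suggestion above: default to DiscardOld
// (consistent with the default Limits retention policy) and only switch to DiscardNew
// when the user explicitly asks for it.
// NOTE: DiscardNew can only be used with the WorkQueue retention policy.
func createStream(js nats.JetStreamContext, v *viper.Viper, streamName string) error {
	discard := nats.DiscardOld
	if v.GetBool("stream.discardNew") { // assumed config key, for illustration only
		discard = nats.DiscardNew
	}
	_, err := js.AddStream(&nats.StreamConfig{
		Name:      streamName,
		Subjects:  []string{streamName}, // use the stream name as the only subject
		Retention: nats.RetentionPolicy(v.GetInt("stream.retention")),
		Discard:   discard,
	})
	return err
}
```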

Contributor

Since we cannot use DiscardNew with the Limits policy, we should not let the pipeline even start; validation should fail.
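A rough sketch of the validation suggested here, under the assumption that the retention and discard policies are available at configuration-validation time; the helper name is hypothetical, not the actual Numaflow validation code:

```go
package isbsvc

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

// validateStreamPolicies rejects an unsupported combination up front, so the pipeline
// fails validation instead of starting with a stream config JetStream cannot honour.
func validateStreamPolicies(retention nats.RetentionPolicy, discard nats.DiscardPolicy) error {
	if discard == nats.DiscardNew && retention != nats.WorkQueuePolicy {
		return fmt.Errorf("invalid stream config: DiscardNew requires the WorkQueue retention policy, got retention=%v", retention)
	}
	return nil
}
```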

Contributor Author

Indeed, I will try to implement that, and also try to reproduce the stuck-messages issue @yhl25 was mentioning.

Contributor Author

Maybe we should also document the risk of data loss during a surge when using DiscardOld at high throughput, and be really transparent about why it happens. Right now, users like me who fiddle with the config could be in big trouble if some UDF/Sink creates a silent data-loss scenario in production.

@QuentinFAIDIDE changed the title from "fix: change jetstream buffers discard policy from DiscardOld to DiscardNew" to "fix: support DiscardNew policy for Jetstream streams" on Apr 2, 2024
@QuentinFAIDIDE
Contributor Author


Something I'm currently experiencing in the "surge pipeline situation" with DiscardNew, which may or may not be what @yhl25 is referring to:

  • After letting the sink fail for a moment, I let it work again and it does have a decent Ack rate.
  • The buffer before the sink stays at 30k and the CPU usage for message-variator-udf is high, while the logs keep repeating:
2024/04/03 18:51:32 | ERROR | {...,"msg":"Retrying failed messages","pipeline":"super-odd-8","vertex":"msg-variator","protocol":"uds-grpc-map-udf","errors":{"nats: maximum messages exceeded":31}...}

The {"nats: maximum messages exceeded":31} seems to be the new jetstream side "buffer full" error thrown because we now use DiscardNew. The number of these errors tends to slowly decrease and then go up again, indicating it's writing some, and then some other datum arrive from the source .

  • The buffer the source is writing to experiences the same down-and-up pattern, but with a bufferFull! error, which is the normal error we usually get when buffers are full.

Overall, the number of messages stays nearly stable (it decreases at a very slow rate on the source buffer) because the huge pile of retries tends to immediately refill any freed capacity. So either this is what gave the impression of undelivered messages staying in the pipe, or I still haven't reproduced it.
I'm going to let it sit for some time and confirm the retries drain with no losses. I'll keep you updated.
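A sketch of how the writer could recognise this new publish error and treat it like the existing bufferFull condition instead of retrying blindly; the helper names are hypothetical, and the string match is an assumption based on the log line above:

```go
package jetstream

import (
	"strings"

	"github.com/nats-io/nats.go"
)

// isStreamFullErr reports whether a publish failed because a DiscardNew stream hit its
// message limit. Matching on the error text from the logs is an assumption; a real
// implementation would prefer a typed error or a JetStream API error code if exposed.
func isStreamFullErr(err error) bool {
	return err != nil && strings.Contains(err.Error(), "maximum messages exceeded")
}

// publishOrDrop sketches the "drop on full" behaviour: on a full stream, give up on the
// message (or back off) rather than piling up retries that keep the buffer saturated.
func publishOrDrop(js nats.JetStreamContext, subject string, data []byte) error {
	_, err := js.Publish(subject, data)
	if isStreamFullErr(err) {
		return nil // buffer full: discard, matching the "drop on full" edge setting
	}
	return err
}
```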

Signed-off-by: Quentin Faidide <quentin.faidide@gmail.com>
Signed-off-by: Quentin Faidide <quentin.faidide@gmail.com>
@QuentinFAIDIDE
Contributor Author

I added the doc changes to remove the retention policy parameter and default to WorkQueue as discussed.
I was not able to reproduce the "stuck messages" issue yet.
My guess is that one of the following is true:

  1. The new issue is caused by the new "buffer over capacity" error from Jetstream, which is the only behaviour change since we activated the WorkQueue/DiscardNew setting. The issue would then be a subset of the fixed lost-data issue, because the full-buffer management is supposed to prevent that error from ever being returned in the first place.
  2. The new issue lies in Jetstream (feels unlikely but I may be wrong).
  3. The new issue is due to some other Jetstream behaviour change with WorkQueue/DiscardNew that I am not aware of. Is there likely to be any other than the new error?

What do you think we should do? I've tried to reproduce the issue a few times with no luck; I'm going to retry, but let me know your input.

@vigith
Contributor

vigith commented Apr 5, 2024

As per nats-io/nats-server#5148 (comment), the issue seems to have been resolved in 2.10.12.

@QuentinFAIDIDE
Contributor Author

So what's the plan: do we change the new "compatible" Jetstream ConfigMap to specify only the new version with the fix, or do we wait for someone to try to reproduce this error enough times to convince us that it's fixed?

@vigith
Contributor

vigith commented Apr 10, 2024

So what's the plan: do we change the new "compatible" Jetstream ConfigMap to specify only the new version with the fix, or do we wait for someone to try to reproduce this error enough times to convince us that it's fixed?

We can make this change a configurable option, with the default set to what is currently being used, since that is battle-tested. Eventually this could become the default, but before we do that we need to make sure it works as expected over a decent amount of time running in production.

@vigith
Contributor

vigith commented Apr 12, 2024

nats-io/nats-server#5270 seems to fix the problem. The 2.10.14 release of nats-server looks very promising for WorkQueue.
