Jetstream KV watcher does not watch after NATS server restart #1094

Closed
tpoxa opened this issue Sep 27, 2022 · 19 comments · Fixed by #1097
Labels
bug Confirmed reproducible bug

Comments

@tpoxa

tpoxa commented Sep 27, 2022

The KV watcher stops doing its job after the NATS server is restarted.

nats.go version v1.17.0
NATS server v2.9.1

Steps or code to reproduce the issue:

  • Implement simple watch-all logic (snippet below).
  • Run the NATS server and the client, make updates to keys in a bucket, and confirm the updates arrive.
  • Restart the NATS server.
watcher, err := flowsKV.WatchAll(nats.Context(ctx))
if err != nil {
	return err
}
defer watcher.Stop()
for {
	select {
	case update := <-watcher.Updates():
		if update == nil {
			// nil marks the end of the initial values
			break
		}
		// handle the KV update here
	case <-ctx.Done():
		return nil
	}
}

Expected result:

After the NATS server restarts, the watcher resumes delivering the latest KV updates.

Actual result:

The watcher delivers no updates.
From time to time this error occurs: nats: consumer not active on connection [101] for subscription on "$KV.bucket.>"


js.Subscribe() without the nats.OrderedConsumer() option behaves correctly, but that option is built into kv.Watch...

Thanks.

@tpoxa tpoxa added the bug Confirmed reproducible bug label Sep 27, 2022
@ripienaar
Contributor

Once reconnected, it just logs consumer not active forever rather than recovering.

@tpoxa
Author

tpoxa commented Sep 27, 2022

I am not sure what you mean, @ripienaar.
This simple snippet (below) works without throwing any errors after the NATS server restart, but it requires copying a lot of code from kv.go (detecting the type of operation and the value of a change, unmarshalling).

As long as I don't use the nats.OrderedConsumer() option for js.Subscribe, it works fine.
The problem is that this option is baked into kv.Watch().

sub, err := js.Subscribe("$KV.bucket.>", func(msg *nats.Msg) {
	spew.Dump(msg.Data) // raw KV entry payload; operation and key must be decoded manually
})
if err != nil {
	return err
}
defer sub.Unsubscribe()

<-ctx.Done()

@ripienaar
Contributor

I mean that a normal KV watch fails in that way.

@tpoxa
Author

tpoxa commented Sep 27, 2022

Ah, yeah, right. Failing would also be fine, so that it can be handled and the watcher restarted.

@ripienaar
Contributor

ripienaar commented Sep 27, 2022

$ nats --context nats_development kv watch X '>'
[2022-09-27 10:55:45] PUT X > y: z
19:49:37 Disconnected due to: EOF, will attempt reconnect
19:49:41 Reconnected [nats://0.0.0.0:44855]
19:49:54 Unexpected NATS error from server nats://0.0.0.0:44855: nats: consumer not active
19:50:04 Unexpected NATS error from server nats://0.0.0.0:44855: nats: consumer not active
19:50:14 Unexpected NATS error from server nats://0.0.0.0:44855: nats: consumer not active
19:50:24 Unexpected NATS error from server nats://0.0.0.0:44855: nats: consumer not active
19:50:34 Unexpected NATS error from server nats://0.0.0.0:44855: nats: consumer not active
19:50:44 Unexpected NATS error from server nats://0.0.0.0:44855: nats: consumer not active
19:50:54 Unexpected NATS error from server nats://0.0.0.0:44855: nats: consumer not active
19:51:04 Unexpected NATS error from server nats://0.0.0.0:44855: nats: consumer not active

Here I restarted the server mid-watch; the watcher never worked again and the consumer was not recreated. Something is not happy with the ordered consumer recovery logic.

@derekcollison
Member

@ripienaar are we confident this is a client issue that does not involve the server, or are we unsure?

@jnmoyne
Contributor

jnmoyne commented Sep 28, 2022

I can reproduce the problem simply by creating an ordered push consumer: natscli version 0.0.33 works, while version 0.0.34 (the current version) does not, regardless of server version (2.8.4 or 2.9). The code path of my test (using nats bench) did not change between those two natscli versions, but the bundled nats.go version did, so I believe that is when the problem was introduced (i.e. somewhere between 1.16 and 1.17).

The symptom: you create an ephemeral ordered push consumer on a stream in a client application, then kill and restart the server; the client app reconnects, but the ordered push consumer no longer receives any messages (and you see those 'consumer not active' errors).

Looking at the traffic while triggering this problem with the two versions of nats.go, I noticed the following: before killing the server I see Idle Heartbeat messages being published to the inbox (of my client app holding the ephemeral) with either version, but after killing and restarting the server:

  • With the old (working) version of nats.go, I again see those Idle Heartbeat messages.
  • With the new (not working) version of nats.go, I do not see any Idle Heartbeat messages after the server restart.

@ripienaar
Contributor

After the restart, the watch's consumer is gone; nats.go never recreates it or issues any consumer info requests (checked with a -DVV trace). This looks like a client issue.

@piotrpio piotrpio self-assigned this Sep 28, 2022
@piotrpio
Collaborator

It turns out this happens because of the recent change to the ordered consumer (#989). The ordered consumer now always has mem_storage set to true, which causes the consumer not to survive a server restart.

@ripienaar @derekcollison I'm not sure what the approach here should be - at the very least I think we should allow the user to change the value of MemoryStorage when using an ordered consumer (right now that is not possible even with the ConsumerMemoryStorage() option, because the value gets overwritten). Also, the KV watcher should probably have MemoryStorage set to false and Replicas set to stream.Replicas.

My question is: which setting should be the default? This was a breaking change in the client, so we might want to revert it and do a patch release, unless there is a good reason to change the behavior.
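
For illustration only - not the actual kv.go change - here is a minimal sketch of what a watch consumer created with the settings proposed above could look like, assuming nats.go's ConsumerConfig exposes the MemoryStorage and Replicas fields and using the bucket/stream names from the report ("bucket" / KV_bucket):

package example

import "github.com/nats-io/nats.go"

// createWatchConsumer sketches a file-backed watch consumer whose replica
// count follows the bucket's backing stream, instead of the R1/in-memory
// defaults now applied by the ordered consumer. Names are illustrative.
func createWatchConsumer(js nats.JetStreamContext) error {
	si, err := js.StreamInfo("KV_bucket")
	if err != nil {
		return err
	}
	_, err = js.AddConsumer("KV_bucket", &nats.ConsumerConfig{
		DeliverSubject: nats.NewInbox(), // push consumer
		DeliverPolicy:  nats.DeliverLastPerSubjectPolicy,
		FilterSubject:  "$KV.bucket.>",
		MemoryStorage:  false,              // survive a server restart
		Replicas:       si.Config.Replicas, // follow the stream's replication
	})
	return err
}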

@ripienaar
Contributor

It seems to me that, during a single invocation of an ordered consumer, it would know its state - which message it received last.

So if the consumer is lost, it can recreate it and continue from the last known position.

It would have to do this anyway, regardless of storage type, since many things can happen to consumers.

If it did that, it would be reliable even with memory storage.
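
As an application-level illustration of that idea (not the eventual library fix), one could track the last delivered stream sequence from each message's metadata and, whenever the subscription has to be rebuilt, resubscribe just past it; the subject comes from the report, everything else is a hypothetical sketch:

package example

import (
	"sync/atomic"

	"github.com/nats-io/nats.go"
)

// watchFrom (re)creates the push subscription starting just after the last
// stream sequence already seen, so a lost consumer can be rebuilt without
// losing position. lastSeq is shared state owned by the caller.
func watchFrom(js nats.JetStreamContext, lastSeq *uint64) (*nats.Subscription, error) {
	opts := []nats.SubOpt{nats.DeliverAll()}
	if seq := atomic.LoadUint64(lastSeq); seq > 0 {
		opts = []nats.SubOpt{nats.StartSequence(seq + 1)} // resume from last known position
	}
	return js.Subscribe("$KV.bucket.>", func(msg *nats.Msg) {
		if meta, err := msg.Metadata(); err == nil {
			atomic.StoreUint64(lastSeq, meta.Sequence.Stream) // remember our position
		}
		// handle the KV update here
	}, opts...)
}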

@Jarema
Member

Jarema commented Sep 28, 2022

I think more clients may be vulnerable to this issue.

@ripienaar That's basically applying what we already do on sequence mismatch to reconnects as well, so it should be a pretty straightforward fix.

@ripienaar
Contributor

Yes, since everyone copies the Go client, I suspect you are right.

@ripienaar
Contributor

You would do the recreate any time the consumer is gone, right, not just on reconnects?

@Jarema
Member

Jarema commented Sep 28, 2022

Well, as we know, the server does not notify the client that a consumer is gone, so I would assume two scenarios:

  • reconnect -> recreate the consumer from the last known stream sequence
  • missing idle heartbeats -> probably call consumer info to make sure it's actually a missing consumer and not something else -> recreate the consumer from the last known sequence

These scenarios will occur far more often with the recent change making the ordered consumer R1 and in-memory, so we should make sure all clients recreate ordered consumers properly.

It's not only about copying Go; the ADR also needs some additional info about recreation.
I will update it if we all agree that this is the proper behaviour.
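
A rough application-side sketch of those two paths, assuming the missed-heartbeat condition surfaces through the connection's async error handler as nats.ErrConsumerNotActive (the resubscribe callback is a hypothetical helper, e.g. the watchFrom sketch above):

package example

import (
	"errors"
	"log"

	"github.com/nats-io/nats.go"
)

// connectWithRecovery wires up the two scenarios above: on reconnect, recreate
// the consumer from the last known sequence; on a "consumer not active" async
// error, confirm via ConsumerInfo that the consumer is really gone, then
// recreate it as well.
func connectWithRecovery(url string, resubscribe func(*nats.Conn)) (*nats.Conn, error) {
	return nats.Connect(url,
		nats.ReconnectHandler(func(nc *nats.Conn) {
			resubscribe(nc) // scenario 1: reconnect -> recreate from last known sequence
		}),
		nats.ErrorHandler(func(nc *nats.Conn, sub *nats.Subscription, err error) {
			if sub != nil && errors.Is(err, nats.ErrConsumerNotActive) { // scenario 2: missed heartbeats
				if _, ciErr := sub.ConsumerInfo(); ciErr != nil { // is the consumer really missing?
					log.Printf("consumer gone, recreating: %v", ciErr)
					resubscribe(nc)
				}
			}
		}),
	)
}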

@ripienaar
Contributor

Sounds good to me.

I also noticed that we treat timeouts on create - e.g. you are connected but a foreign JetStream domain is down - as a critical failure that returns a timeout error to the ordered consumer.

We should probably make sure that kind of failure is also handled with a retry rather than a hard fail.
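
For illustration, a sketch of treating a create timeout as retryable at the application level, assuming the timeout surfaces as nats.ErrTimeout (the subject comes from the report; retry count and backoff are arbitrary):

package example

import (
	"errors"
	"time"

	"github.com/nats-io/nats.go"
)

// subscribeWithRetry retries the subscribe (and the consumer create behind it)
// when the request times out, e.g. while a foreign JetStream domain is briefly
// unreachable, instead of treating the timeout as fatal.
func subscribeWithRetry(js nats.JetStreamContext, cb nats.MsgHandler) (*nats.Subscription, error) {
	var lastErr error
	for i := 0; i < 5; i++ {
		sub, err := js.Subscribe("$KV.bucket.>", cb, nats.OrderedConsumer())
		if err == nil {
			return sub, nil
		}
		if !errors.Is(err, nats.ErrTimeout) {
			return nil, err // not a timeout: give up immediately
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return nil, lastErr
}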

@piotrpio
Collaborator

Thanks for the comments on this - yeah, recreating the consumer sounds good. @ripienaar when working on this I'll also change the timeout-on-create behavior.

@derekcollison
Member

IIRC ordered consumers create new consumers when they miss a sequence number or fail to receive heartbeats, so that should continue to work. It obviously does not, but I'm not sure why the missed-heartbeat code that kicks in to create a new consumer is not working.

@kozlovic
Member

@derekcollison Because this was never implemented: #789. The activityCheck simply sends an error to the async error callback (if one is provided). There was no special handling for ordered consumers (as you can see, the activityCheck was introduced in the OrderedConsumer PR, but recreating the consumer was not).
It was working before because the server - after a restart - would resume heartbeats, so if anything had changed the library would recreate the consumer once a sequence mismatch was detected.

@derekcollison
Member

Ah! Thank you, got it now. I think we keep them in memory but add logic to recreate on missed heartbeats. Apologies that this was not done.
