Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jetstream could not pull message after nats-server restart #2397

Closed
carr123 opened this issue Aug 3, 2021 · 50 comments
Closed

jetstream could not pull message after nats-server restart #2397

carr123 opened this issue Aug 3, 2021 · 50 comments
Assignees

Comments

@carr123
Copy link

carr123 commented Aug 3, 2021

i was testing jetstream on nats-server v2.3.2. one sender and one receiver program are running for quite a long time.

this is what my stream look like :

	_, err = js.AddStream(&nats.StreamConfig{
		Name:      streamName,
		Subjects:  []string{streamSubjects},
		Storage:   nats.FileStorage,
		Replicas:  3,
		Retention: nats.WorkQueuePolicy,
		Discard:   nats.DiscardNew,
		MaxMsgs:   -1,
		MaxAge:    time.Hour * 24 * 365,
	})

this is how i create the consumer:

	if _, err := js.AddConsumer(streamName, &nats.ConsumerConfig{
		Durable:       durableName,
		DeliverPolicy: nats.DeliverAllPolicy,
		AckPolicy:     nats.AckExplicitPolicy,
		ReplayPolicy:  nats.ReplayInstantPolicy,
		FilterSubject: subjectName,
		AckWait:       time.Second * 30,
		MaxDeliver:    -1,
		MaxAckPending: 1000,
	}); err != nil && !strings.Contains(err.Error(), "already in use") {
		log.Println("AddConsumer fail")
		return
	}

this is what the subscriber look like:

	sub, err := js.PullSubscribe("ORDERS.created", durableName, nats.Bind("ORDERS", durableName))
	if err != nil {
		fmt.Println(" PullSubscribe:", err)
		return
	}
       msgs, err := sub.Fetch(1000, nats.MaxWait(10*time.Second))

when i restart my nats-server cluster nodes(upgrade to nats-server 2.3.3),
the consumer can no longer pull messages even if i restart my consumer program.
the Fetch call just return : "nats: timeout", but i'm sure there are lots of message in the working queue.
only if i delete the consumer by calling js.DeleteConsumer(streamName, durableName), recreate it, my program can
resume fetching messages.
actually, every time i restart nats-server nodes, my consumer program encouter the same problem.

there is another issue, after i restart nats-server nodes, restart my program, it sometimes reports : "PullSubscribe: nats: JetStream system temporarily unavailable"

I expect nats-server nodes restarting action not impacting jetstream clients fetching messages.

@carr123
Copy link
Author

carr123 commented Aug 3, 2021

reproduce this issue:

  1. start a client program fetching messages in a loop
  2. make sure there are NO messages in the jetsteam
  3. restart each nats-server cluster node one by one.
  4. send some messages to jetstream system
  5. you can notice the former consumer client can not fetch any messages even restart it.

@derekcollison
Copy link
Member

Just to confirm this is with current 2.3.3 release yes?

@carr123
Copy link
Author

carr123 commented Aug 3, 2021

Just to confirm this is with current 2.3.3 release yes?

i test it on v2.3.3. i guess former nats-server version has the same problem.
i saw the bug weeks ago, i didn't know how to reproduce it, otherwise i would have reported it.

@derekcollison
Copy link
Member

Will see what @wallyqs or @variadico can come up with.

@wallyqs
Copy link
Member

wallyqs commented Aug 3, 2021

@carr123 you can try the following helper to wait for JS to be ready when it is temporarily unavailable due to a restart: https://github.com/nats-io/nats.go/blob/master/test/js_test.go#L4311-L4328
In some of the restart tests and you could for example back off when get either a 503 or timeout before retrying: https://github.com/nats-io/nats.go/blob/master/test/js_test.go#L5527-L5537

@carr123
Copy link
Author

carr123 commented Aug 3, 2021

@wallyqs hi, i add some sleep codes, it seems bring no difference.

sub, err := js.PullSubscribe("ORDERS.created", durableName, nats.Bind("ORDERS", durableName))
if err != nil {
	fmt.Println("PullSubscribe fail:", err)
	return
}

for {
	if !sub.IsValid() {
		log.Println("sub invalid !!!")
		return
	}

	msgs, err := sub.Fetch(1000, nats.MaxWait(10*time.Second))
	if err == nats.ErrNoResponders || err == nats.ErrTimeout {
		log.Println("Fetch fail 1:", err)
		time.Sleep(100 * time.Millisecond)
		continue
	}

	if err != nil {
		log.Println("Fetch fail 2:", err)
		return
	}

	for _, msg := range msgs {
		msg.Ack()
		log.Println("recv data:", string(msg.Data))
	}
}

i still get "Fetch fail 1: nats: timeout" while there are messages in the working queue, no matter how long i wait or how many times i restart my client program.
only adding "js.DeleteConsumer()" before "js.AddConsumer()", then restarting my program can work.
it seems something is wrong in the Consumer state on the nats-server side. creating a new consumer will continue fetching messages.

@carr123
Copy link
Author

carr123 commented Aug 5, 2021

my nats-server 3 nodes cluster just upgrade to v2.3.4 from v2.3.3, problem still there. even worse.
after i replace and restart nats-server binary for all 3 nodes, i sent some messages to jsm working queue for testing purpose.
my client program failed to fetch messages from now on.
i quit and restart my client program , the js.PullSubscribe reports error: "nats: JetStream system temporarily unavailable", it stays
"unavailable" for a couple of minutes, and seems never become available.
i had to use my final solution, calling js.DeleteConsumer() before js.AddConsumer(), then JetStream become available, js.PullSubscribe succeeded. fetch operation can work too, all messages are fetched and removed from server.
i sent more messages, all are then consumed as expected.

@derekcollison
Copy link
Member

Which Go client version are you using? Maybe you and @wallyqs can sync up.

@carr123
Copy link
Author

carr123 commented Aug 5, 2021

client program, go version go1.16.6
"github.com/nats-io/nats.go" use the latest code, downloaded few minutes age.
nats-server v2.3.4

here are the testing results:

  1. no matter how many messages are there in jsm before nats-servers restart,
    client can not resume fetching messages.
  2. "nats: JetStream system temporarily unavailable", this error doesn't always appear every time i perform nats-server restart tests. sometimes when it is unavailable, it seems not temporarily but permantly.
  3. consumer recreation can solve above 2 problems
  4. it seems consumer infomation on the server side not resilient to cluster restart

@wallyqs
Copy link
Member

wallyqs commented Aug 5, 2021

@carr123 could you share a runnable example gist to try to reproduce? That would help narrowing down the issue you are running into 🙏

@kozlovic
Copy link
Member

kozlovic commented Aug 5, 2021

Also, for v2.3.4, we wrote a test trying to reproduce the exact same steps you have described in the first comment of this report. Please have a look at

func TestJetStreamPanicDecodingConsumerState(t *testing.T) {
and see if that represents pretty well your scenario. This tests passes, so clearly there is something different in your environment.

When you see the problem, I would suggest that you run nats c info and then select the existing stream/consumer and post the output to see if there is something unusual. Other outputs that may be of interest:
nats c report, nats s report. Also nats server report jetstream (for this one, you need to specify a user from the system account (if you have one) using nats -s nats://<system user>:<password>:<a host>:<port> server report jetstream.

Finally, I would recommend that you setup various callbacks in the NATS connection to track when it gets disconnected, if it gets async errors or completely closed and you may not know about it. Example on how to set those callbacks can be seen here.

@carr123
Copy link
Author

carr123 commented Aug 6, 2021

@wallyqs @kozlovic

hello, this is my testing code and cluster config.

https://github.com/carr123/natsjsmdemo

@derekcollison
Copy link
Member

Thanks will take a look in morning.

@derekcollison
Copy link
Member

I took a look and updated the Go client underneath the server and double checked that the test @kozlovic wrote matches what your program is doing and that test passes 100%.

So, something about your setup is different or alluding us for some reason.

@derekcollison
Copy link
Member

@derekcollison
Copy link
Member

That config is trying to express and "allow" block, meaning what the users can publish and subscribe to. Without any permissions you can do everything, what your block is saying is those users can not publish or subscribe to anything.

@carr123
Copy link
Author

carr123 commented Aug 9, 2021

@derekcollison i follow your instruction, remove the "permission " block. problem still there.
I restart the cluster nodes manually when reproducing the issue.

@derekcollison
Copy link
Member

ok something then we are missing that is different with your setup then any of our tests etc.

Will ask @wallyqs and @variadico to try to schedule some time to get on a video call with you and walk through what you are seeing.

@wallyqs
Copy link
Member

wallyqs commented Aug 9, 2021

Hello @carr123 could you DM me in Slack to setup something to take a look? https://slack.nats.io/

@carr123
Copy link
Author

carr123 commented Aug 10, 2021

@wallyqs i'm living in asia, maybe different timezone from you. :)
i test it in my office at work time, there's no camera available, and i have no access to slack possibly due to company's network security policy.
if you want to install some tool in my environment, just give me a download link.
if you want too see how i perform the test, i will make a video recording later.
thanks.

@derekcollison
Copy link
Member

Video would be great, including all setup etc.

Thanks.

@carr123
Copy link
Author

carr123 commented Aug 11, 2021

@derekcollison @wallyqs
hello, i make 2 video files, that show how i perform the test.

https://github.com/carr123/natsjsmdemo/blob/main/out1.mp4
https://github.com/carr123/natsjsmdemo/blob/main/out2.mp4

BTW, there are 3 streams in my nats server, 2 for business usage, 1 for my test purpose.
i wonder whether the 3 streams influence each other.
or, i made too many tests these days, made too many config file changes , left the whole nats-server a mess.

@derekcollison
Copy link
Member

Thanks! @wallyqs and @variadico will take a look.

@carr123
Copy link
Author

carr123 commented Aug 11, 2021

@derekcollison I just stop the cluster, delete all jetstream folders on all the 3 server nodes. make a completely clean environment.
all the server config files keep the same as before. then perform the jetstream test.
luckly, my client program this time can successfully fetch messages after server restart.
i modified the server config files and restart servers many times during the passed month.
maybe there are some dirty data in the jetstream folder that cause the fetching issue.
thanks for your attention. i will make more test.

@derekcollison
Copy link
Member

Let's keep this open til we really figure out what you are experiencing and make sure we understand what is going on.

@carr123
Copy link
Author

carr123 commented Aug 12, 2021

yes, you are right. the problem appear again.
12 hours ago, i made a clean nat-server deployment, did some restart test, everything was ok.
then i keep the reveiver program running, the sender program send 10000 messages every 3 hours, it's not heavy workload.
12 hours passed, i get to office, notice the same amount of messages was recevied as they were sent. nothing wrong.
then testing work started, this time i kill all the 3 nodes one by one, start them one by one, nc.SetDisconnectErrHandler callback indeed received "EOF“,after that all nodes are alive again. i send 5000 message, client program fail to fetch any one, printing fetch timeout. i restart client program, still fetch timeout.
i uncomment the code "DeleteConsumer( ) ", messags are fetched again.
after that, everything seems go back to normal. i send more and more messages, all are received as long as nats-server no been restarted.

i find there are 3 kinds of errors after nats-server restart:

  1. client program can start and try to fetch messages, but always fetch timeout while there are many messages in the working queue.
  2. js.AddConsumer( ) return err: "context deadline exceeded" each time i start the client program, i regard it as init faild and exit program.
  3. sub.Fetch() always return err: "nats: no messages" in every fetch call

@derekcollison
Copy link
Member

ok I introduced a bug that is in the Go client (main branch) that will cause pull subscribers to fail after a reconnect of the client connection. So it could be that. We should have a fix in later tonight or tomorrow.

What is the git tag/version of the Go client you are using?

@carr123
Copy link
Author

carr123 commented Aug 12, 2021

@derekcollison i use the latest code. i have network problem to use "go get github.com/nats-io/nats.go" command.
so i download it manually from "https://github.com/nats-io/nats.go/archive/refs/heads/master.zip" about 7 days ago.
if you find the cause, that's a good news.

@kozlovic
Copy link
Member

@carr123 @derekcollison The bug was introduced only 3 days ago, and again, does not match the experience that unless @carr123 deletes the JS consumer on the server, then restarting the application does not help. (the bug would affect only a running application that reconnects, not an application that is restarted).

@carr123
Copy link
Author

carr123 commented Aug 18, 2021

@derekcollison hi, i wonder if you already have a clue about this issue.
i recently perform a test on windows 10, different environment from the former linux server.
problem still found. i know little about nats source codes, i hava no idea whether my config files have problem or the server side have problem.
this's my client programs and server configs.

https://github.com/carr123/jsmwindows

@derekcollison
Copy link
Member

In the example above, how are you stopping the servers?
What do you mean by changing the account password? Do you reload the server config?

@carr123
Copy link
Author

carr123 commented Aug 24, 2021

on windows 10, servers are running in CMD window, stopping it just issue a CTRL+C
changing the account password means i want to pick up a better password for some account , then restart all servers.
all operations are normal.
just restarting the servers one time may not reproduce the issue. i test it manually for many times, the issue will finally show up.
i mean if you want to see it , repeated manual testing will lead you to it.

@ripienaar
Copy link
Contributor

@carr123 your test programs have nats.MaxReconnects(10), are you 100% sure you dont hit 10 reconnect and then it stalls? I'd make that -1 while testing

@carr123
Copy link
Author

carr123 commented Aug 25, 2021

@carr123 your test programs have nats.MaxReconnects(10), are you 100% sure you dont hit 10 reconnect and then it stalls? I'd make that -1 while testing

@ripienaar @wallyqs @derekcollison hi, i tried, nats.MaxReconnects(-1) still doesn't work.
i find a way to reproduce this issue easier on windows 10 when i perform the restart test.
i start 3 cmd.exe windows running nats cluster nodes locally.
if i close the cmd.exe window by clicking the close button on the top-right, not by issuing an CTRL+C command,
then the problem show up more often.

@carr123
Copy link
Author

carr123 commented Aug 27, 2021

problem still there in nats-server-v2.4.0-windows-amd64, client package use github.com/nats-io/nats.go v1.12.0
testing on a clean cluster, no old data there.
send some messages after server restart, check messages are indeed accepted through http://127.0.0.1:8222/jsz, total_messages.
client can not see the messages even after restarting client program.
using js.DeleteConsumer as i said before can make things go back to normal.

@derekcollison
Copy link
Member

We would need to do a video call with you and be able to inspect your system to help you at this point.

We have tried and no one over here can reproduce. Not saying you are not seeing an issue, we just would need to be able to actually inspect your system to continue.

@carr123
Copy link
Author

carr123 commented Aug 30, 2021

@derekcollison hello, i buy a windows server from a cloud service provider. testing programs are deployed.
and an email has sent to derek@nats.io from thatcat@sina.com, you can check it.
thanks.

@ColinSullivan1
Copy link
Member

@carr123 Thank you so much for your patience with this issue and going to the lengths you have to help us debug this - much appreciated. We were able to reproduce this and already have some ideas about what could be causing the issue. We'll keep you posted.

@carr123
Copy link
Author

carr123 commented Aug 31, 2021

@ColinSullivan1 hi, you can download it from http://101.200.84.208/natstest.zip
this download URL will expire in 6 days.
if you can't download it, let me know, then i will send the file to you by email attachment when i could.

@derekcollison
Copy link
Member

Thanks, link does not seem to work though.. Just hangs. I can see the server is up and pingable but clicking on link or even curl fails.

@carr123
Copy link
Author

carr123 commented Aug 31, 2021 via email

@derekcollison
Copy link
Member

We got it and can reproduce it from time to time. Thanks for your patience, I am looking at it now.

derekcollison added a commit that referenced this issue Sep 1, 2021
When we had partial state due to server failure or being shutdown ungracefully we could enter into a stream reset state.
The stream reset state is harsh but worked, however there was a bug that would not restart consumers that were attached.
Also if no state exists, or state was truncated, we can detect that and not go through a full reset.

Signed-off-by: Derek Collison <derek@nats.io>
@derekcollison derekcollison self-assigned this Sep 1, 2021
@wallyqs
Copy link
Member

wallyqs commented Sep 1, 2021

Hi @carr123 one quick note is that I noticed that you were using QUIT to stop the server. To restart the server gracefully TERM, INT and USR2 would help avoid reaching this state as well: https://github.com/nats-io/nats.docs/blob/030567b573ee75c71bfe3f810c49425391599637/nats-server/nats_admin/signals.md#signals
Thanks again for all the info helping reproduce this issue.

@wallyqs
Copy link
Member

wallyqs commented Sep 1, 2021

if i close the cmd.exe window by clicking the close button on the top-right,

@carr123 we were not handling this event properly and so this would have become a bad exit that hit another bug from the server when restarting. The behavior when stopping the server this way has been improved in the main branch :)

@carr123
Copy link
Author

carr123 commented Sep 1, 2021 via email

@derekcollison
Copy link
Member

We agree 100% in terms of bad exits etc. But Wally found out the normal behavior on Windows was not correct.

@derekcollison
Copy link
Member

Feel free to try this out. Docker synadia/nats-server:nightly

@derekcollison
Copy link
Member

Should be fixed, and thanks again for your patience.

@kozlovic
Copy link
Member

kozlovic commented Sep 2, 2021

Although the Go client issue reported here: nats-io/nats.go#809, may then add to the issue that @carr123 is now experiencing with v2.4.0 and client v1.12.0. I am working on a client fix at the moment.

@carr123
Copy link
Author

carr123 commented Sep 2, 2021

hi, i just download nats-server main branch, build binary, do the node restart test.
Go client v1.12.0
nats-server platform: windows and linux, amd64
everything runs well. the cluster can survive random node restart both on windows and linux.
thanks for your work !

@bruth bruth removed the 🐞 bug label Aug 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants