jetstream could not pull message after nats-server restart #2397
reproduce this issue:
|
Just to confirm this is with current 2.3.3 release yes? |
I tested it on v2.3.3. I guess earlier nats-server versions have the same problem. |
Will see what @wallyqs or @variadico can come up with. |
@carr123 you can try the following helper to wait for JS to be ready when it is temporarily unavailable due to a restart: https://github.com/nats-io/nats.go/blob/master/test/js_test.go#L4311-L4328 |
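For readers who cannot open the linked helper, the general idea of waiting for JetStream to become ready can be sketched as a simple retry loop. This is only a minimal sketch and not the helper from the nats.go test suite: `waitUntilReady`, `errNotReady`, and the `probe` callback are hypothetical names, and the assumption is that with nats.go the probe would wrap a cheap JetStream call such as `js.AccountInfo()`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical stand-in for a transient error such as
// "nats: JetStream system temporarily unavailable".
var errNotReady = errors.New("jetstream not ready")

// waitUntilReady calls probe up to attempts times, sleeping for interval
// after each failed attempt. It returns nil as soon as probe succeeds,
// or the last error (wrapped) if every attempt fails.
func waitUntilReady(probe func() error, attempts int, interval time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = probe(); lastErr == nil {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("not ready after %d attempts: %w", attempts, lastErr)
}

func main() {
	calls := 0
	// Hypothetical probe that fails twice and then succeeds. In a real
	// client this could wrap a call like js.AccountInfo() (assumption).
	probe := func() error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	}

	err := waitUntilReady(probe, 10, 10*time.Millisecond)
	fmt.Println("error:", err, "probe calls:", calls)
}
```

Keeping the probe as a plain `func() error` means the same loop can be reused before `PullSubscribe` or before the first `Fetch`, whichever call reports the temporary unavailability.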
@wallyqs Hi, I added some sleep calls, but it seems to make no difference.
I still get "Fetch fail 1: nats: timeout" while there are messages in the working queue, no matter how long I wait or how many times I restart my client program. |
I just upgraded my 3-node nats-server cluster from v2.3.3 to v2.3.4. The problem is still there, even worse. |
Which Go client version are you using? Maybe you and @wallyqs can sync up. |
Client program: go version go1.16.6. Here are the test results:
|
@carr123 could you share a runnable example gist to try to reproduce? That would help narrow down the issue you are running into 🙏 |
Also, for v2.3.4, we wrote a test trying to reproduce the exact same steps you have described in the first comment of this report. Please have a look at nats-server/server/jetstream_cluster_test.go Line 7763 in 7112ae0
When you see the problem, I would suggest that you run … Finally, I would recommend that you set up various callbacks in the NATS connection to track when it gets disconnected, gets async errors, or is completely closed without you knowing about it. An example of how to set those callbacks can be seen here. |
hello, this is my testing code and cluster config. |
Thanks will take a look in morning. |
I took a look and updated the Go client underneath the server and double-checked that the test @kozlovic wrote matches what your program is doing, and that test passes 100%. So something about your setup is different or eluding us for some reason. |
Can you try removing the permission blocks that are empty? https://github.com/carr123/natsjsmdemo/blob/main/nats1.conf#L21 |
That config is trying to express an "allow" block, meaning what the users can publish and subscribe to. Without any permissions you can do everything; what your block is saying is that those users cannot publish or subscribe to anything. |
@derekcollison I followed your instructions and removed the "permission" block. The problem is still there. |
OK, then we are missing something that is different with your setup than any of our tests etc. Will ask @wallyqs and @variadico to try to schedule some time to get on a video call with you and walk through what you are seeing. |
Hello @carr123 could you DM me in Slack to setup something to take a look? https://slack.nats.io/ |
@wallyqs i'm living in asia, maybe different timezone from you. :) |
Video would be great, including all setup etc. Thanks. |
@derekcollison @wallyqs https://github.com/carr123/natsjsmdemo/blob/main/out1.mp4 BTW, there are 3 streams in my nats server, 2 for business usage, 1 for my test purpose. |
Thanks! @wallyqs and @variadico will take a look. |
@derekcollison I just stopped the cluster and deleted all JetStream folders on all 3 server nodes to make a completely clean environment. |
Let's keep this open til we really figure out what you are experiencing and make sure we understand what is going on. |
Yes, you are right. The problem appeared again. I find there are 3 kinds of errors after a nats-server restart:
|
ok I introduced a bug that is in the Go client (main branch) that will cause pull subscribers to fail after a reconnect of the client connection. So it could be that. We should have a fix in later tonight or tomorrow. What is the git tag/version of the Go client you are using? |
@derekcollison I use the latest code. I have network problems using the "go get github.com/nats-io/nats.go" command. |
@carr123 @derekcollison The bug was introduced only 3 days ago, and again, it does not match the reported behavior, where restarting the application does not help unless @carr123 deletes the JS consumer on the server. (The bug would affect only a running application that reconnects, not an application that is restarted.) |
@derekcollison hi, i wonder if you already have a clue about this issue. |
In the example above, how are you stopping the servers? |
On Windows 10, the servers run in CMD windows; I stop them by pressing CTRL+C. |
@carr123 your test programs have |
@ripienaar @wallyqs @derekcollison Hi, I tried it; nats.MaxReconnects(-1) still doesn't work. |
The problem is still there in nats-server-v2.4.0-windows-amd64; the client uses github.com/nats-io/nats.go v1.12.0. |
We would need to do a video call with you and be able to inspect your system to help you at this point. We have tried and no one over here can reproduce. Not saying you are not seeing an issue, we just would need to be able to actually inspect your system to continue. |
@derekcollison Hello, I bought a Windows server from a cloud service provider. The test programs are deployed there. |
@carr123 Thank you so much for your patience with this issue and going to the lengths you have to help us debug this - much appreciated. We were able to reproduce this and already have some ideas about what could be causing the issue. We'll keep you posted. |
@ColinSullivan1 hi, you can download it from http://101.200.84.208/natstest.zip |
Thanks, link does not seem to work though.. Just hangs. I can see the server is up and pingable but clicking on link or even curl fails. |
Hi, Wally downloaded it, so I closed the download link. I have now reopened it, so it should be accessible: http://101.200.84.208/natstest.zip
|
We got it and can reproduce it from time to time. Thanks for your patience, I am looking at it now. |
When we had partial state due to server failure or being shutdown ungracefully we could enter into a stream reset state. The stream reset state is harsh but worked, however there was a bug that would not restart consumers that were attached. Also if no state exists, or state was truncated, we can detect that and not go through a full reset. Signed-off-by: Derek Collison <derek@nats.io>
Hi @carr123 one quick note is that I noticed that you were using |
@carr123 we were not handling this event properly and so this would have become a bad exit that hit another bug from the server when restarting. The behavior when stopping the server this way has been improved in the |
A resilient system should account for bad exits such as a process kill or an unexpected power-off; a node could exit at any time.
It's hard to design distributed systems.
Thanks for your work!
On Sep 2, 2021, at 06:16, Waldemar Quevedo ***@***.***> wrote:
if i close the cmd.exe window by clicking the close button on the top-right,
@carr123 we were not handling this event properly and so this would have become a bad exit that hit another bug from the server when restarting. The behavior when stopping the server this way has been improved in the main branch :)
|
We agree 100% in terms of bad exits etc. But Wally found out the normal behavior on Windows was not correct. |
Feel free to try this out. Docker synadia/nats-server:nightly |
Should be fixed, and thanks again for your patience. |
Although the Go client issue reported here: nats-io/nats.go#809, may then add to the issue that @carr123 is now experiencing with v2.4.0 and client v1.12.0. I am working on a client fix at the moment. |
Hi, I just downloaded the nats-server main branch, built the binary, and ran the node restart test. |
I was testing JetStream on nats-server v2.3.2. One sender program and one receiver program had been running for quite a long time.
This is what my stream looks like:
This is how I create the consumer:
This is what the subscriber looks like:
When I restarted my nats-server cluster nodes (upgrading to nats-server 2.3.3),
the consumer could no longer pull messages, even after I restarted my consumer program.
The Fetch call just returns "nats: timeout", but I'm sure there are lots of messages in the working queue.
Only if I delete the consumer by calling js.DeleteConsumer(streamName, durableName) and recreate it can my program
resume fetching messages.
Actually, every time I restart the nats-server nodes, my consumer program encounters the same problem.
There is another issue: after I restart the nats-server nodes and restart my program, it sometimes reports: "PullSubscribe: nats: JetStream system temporarily unavailable"
I expect that restarting nats-server nodes should not impact JetStream clients fetching messages.
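The delete-and-recreate workaround described above could be automated roughly as follows. This is a hedged sketch, not code from the report: `pullLoop`, `errTimeout`, and the `fetch` and `recreate` callbacks are hypothetical names, and the assumption is that `fetch` would wrap the subscription's Fetch call while `recreate` would wrap `js.DeleteConsumer(streamName, durableName)` followed by creating the consumer again. Note that a Fetch timeout also happens legitimately when the queue is empty, so counting consecutive timeouts is only a crude signal.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout is a hypothetical stand-in for the "nats: timeout" error
// returned by Fetch.
var errTimeout = errors.New("timeout")

// pullLoop calls fetch up to rounds times. After maxTimeouts consecutive
// timeouts it calls recreate (the delete-and-recreate workaround) and
// resets the counter. It returns the total number of messages received
// and how many times the consumer was recreated.
func pullLoop(fetch func() (int, error), recreate func() error, rounds, maxTimeouts int) (received, recreated int, err error) {
	timeouts := 0
	for i := 0; i < rounds; i++ {
		n, ferr := fetch()
		switch {
		case ferr == nil:
			received += n
			timeouts = 0
		case errors.Is(ferr, errTimeout):
			timeouts++
			if timeouts >= maxTimeouts {
				if rerr := recreate(); rerr != nil {
					return received, recreated, rerr
				}
				recreated++
				timeouts = 0
			}
		default:
			// Any other error is treated as fatal in this sketch.
			return received, recreated, ferr
		}
	}
	return received, recreated, nil
}

func main() {
	// Simulated consumer that is stuck until it is recreated.
	stuck := true
	fetch := func() (int, error) {
		if stuck {
			return 0, errTimeout
		}
		return 5, nil
	}
	recreate := func() error {
		stuck = false
		return nil
	}

	received, recreated, err := pullLoop(fetch, recreate, 6, 3)
	fmt.Println(received, recreated, err) // prints "15 1 <nil>"
}
```

In the simulation the first three fetches time out, the consumer is recreated once, and the remaining three fetches each return five messages.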