
With persistence enabled, delivery of incoming messages can hang #423

Closed
someburner opened this issue May 26, 2020 · 7 comments

@someburner

An application I'm working on requires that we have persistence enabled for the broker. While testing persistence, I found that occasionally my go program would start up and appear to "hang" (runs but not receiving messages) when it should be receiving stored and/or new messages from the broker (mosquitto on localhost in this case). Sometimes after restarting the program, the messages come through, other times they don't. But once they do come through, they come through as expected, so it doesn't look like I'm missing messages.

Initially I had my client calling Subscribe after a successful Connect in the same method. After reading another issue here, I tried putting the subscribe inside my onConnect callback. This seemed to lower the rate of the issue happening, but on occasion it would still occur.

After enabling the MQTT.DEBUG logs it looked like everything was fine in terms of data to/from the broker. And once the "hang" did happen, ping req/response were still going through. The startup log is a bit different every time due to goroutines executing at different times, so it's difficult to use those to pinpoint an issue, but I noticed that in the failure case the broker was receiving a message before it had successfully subscribed.

Given all of that, I tried first calling AddRoute on the topics before calling Connect. After that change, I was unable to get my application to hang. In my case I'm subscribing to 2 different wildcard topics and the issue was reproducible with 1.2.0 and current master.

I can provide test code and logs, but I wanted to make sure this is not just an undocumented quirk in the library. If it isn't, it seems that there's a bug with the library not able to handle incoming messages immediately upon connecting unless AddRoute has been called beforehand.

@MattBrittan
Contributor

I'll add some notes below that may explain some of the behaviour, but as the options play a big role in issues like this please do submit the code/logs (if messages are not being received then logs from mosquitto can be helpful too). Without full details it's difficult to say too much (but I do have quite a few instances that have been running for six months plus on dodgy connections with persistence enabled, so it can work). Please note that there have been some fairly major changes to the package recently, so it would be great if you could use @master when producing the logs (the interactions between the goroutines are quite different from previous versions, and obviously I've not had the current code running for long!).

it seems that there's a bug with the library not able to handle incoming messages immediately upon connecting unless AddRoute has been called beforehand.

The library should handle these messages, but I would expect them to be dropped. If you subscribe with QOS 1+ and CleanSession = false then the broker may begin delivering messages immediately (on topics to which you have previously subscribed). Thus these messages can be received before your call to Subscribe executes and, unless you have called AddRoute or SetDefaultPublishHandler, the package does not know what to do with the messages and will drop them. I have not seen this cause a crash but it is possible that a bug has been introduced.

Initially I had my client calling Subscribe after a successful Connect in the same method

This is fine if AutoReconnect is false (or QOS is 1+ and CleanSession is false), but otherwise there is a potential that the connection will be lost and automatically re-established (and as you do not recreate the subscription you will not get any messages). Personally I call Subscribe in the onConnect callback because that ensures the subscription is active.
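To illustrate the pattern being recommended, here is a minimal, hypothetical sketch (the broker address, client ID, and topic are placeholders, not taken from this issue): the route is registered before Connect so that messages the broker pushes immediately after CONNACK have a handler, and Subscribe is (re)issued from the onConnect callback so the subscription survives automatic reconnects. It needs a live broker to actually do anything.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	handler := func(c mqtt.Client, m mqtt.Message) {
		fmt.Printf("received %s: %s\n", m.Topic(), m.Payload())
	}

	opts := mqtt.NewClientOptions().
		AddBroker("tcp://localhost:1883"). // placeholder: local mosquitto
		SetClientID("example-client").
		SetCleanSession(false) // broker queues QoS 1+ messages while we are away

	// Subscribe (or re-subscribe) once the connection is up, so the
	// subscription is re-created after any automatic reconnect.
	opts.SetOnConnectHandler(func(c mqtt.Client) {
		if t := c.Subscribe("sensors/#", 1, handler); t.Wait() && t.Error() != nil {
			fmt.Fprintln(os.Stderr, "subscribe failed:", t.Error())
		}
	})

	client := mqtt.NewClient(opts)

	// Register the route before connecting: queued messages that the
	// broker delivers straight after CONNACK then have somewhere to go.
	client.AddRoute("sensors/#", handler)

	if t := client.Connect(); t.Wait() && t.Error() != nil {
		fmt.Fprintln(os.Stderr, "connect failed:", t.Error())
		return
	}

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
	client.Disconnect(250)
}
```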

@someburner
Author

someburner commented May 27, 2020

The library should handle these messages, but I would expect them to be dropped. If you subscribe with QOS 1+ and CleanSession = false then the broker may begin delivering messages immediately (on topics to which you have previously subscribed). Thus these messages can be received before your call to Subscribe executes and, unless you have called AddRoute or SetDefaultPublishHandler, the package does not know what to do with the messages and will drop them. I have not seen this cause a crash but it is possible that a bug has been introduced.

Okay, I have not been using SetDefaultPublishHandler, but to minimize code I'd prefer not to handle a default case where I have to check the topics. I definitely do want to receive those messages, and using AddRoute works for that, but perhaps this should be a documented feature, or the library should log when it gets messages it doesn't know where to deliver, assuming that is what is happening.

This is fine if AutoReconnect is false (or QOS is 1+ and CleanSession is false), but otherwise there is a potential that the connection will be lost and automatically re-established (and as you do not recreate the subscription you will not get any messages). Personally I call Subscribe in the onConnect callback because that ensures the subscription is active.

I'm able to reproduce the bug regardless of the auto-reconnect setting, and the bug happens immediately on startup. It also does not seem to matter where I put the Subscribe call so much as when it is called. Adding a Sleep in that method gets it to happen pretty much every time, but it's important to note it can still happen without the delay.

I've attached a zip of a small test repo to reproduce the bug complete with a publish shell script, mosquitto.conf, and some instructions on reproducing. Should only take a minute to reproduce.
pahobug.zip

@MattBrittan
Contributor

I definitely do want to receive those messages, and using AddRoute works for that, but perhaps this should be a documented feature, or the library should log when it gets messages it doesn't know where to deliver, assuming that is what is happening.

I agree that improving the documentation for this would be beneficial. I generally set up a default handler that simply logs anything it receives as an error (on the basis that I should not receive any unexpected messages); there may be cases where unexpected messages are not an error (so building this into the library is probably not the right approach).

I'm able to reproduce the bug regardless of auto-connect setting,

That comment related to your calling Subscribe after a successful Connect (really just a general suggestion to call Subscribe from within onConnect, because that way you can be sure the subscription is active whatever happens).

Before I take a detailed look at your code, can you please confirm what "the bug" is (i.e. a hang, lost messages, or a failure to subscribe)? Your initial issue mentioned a number of problems, so I want to be sure what I am looking for. If the issue is simply that you are losing messages when you do not call AddRoute prior to connecting, then that is what I would expect to happen.

@someburner
Author

someburner commented May 28, 2020

there may be cases where unexpected messages are not an error (so building this into the library is probably not the right approach).

Totally agree. I think the important thing is just to make sure the client doesn't lock up when it gets the persisted messages without having a route or default publisher, and then just document the right way to receive these messages if you want them.

Before I take a detailed look at your code, can you please confirm what "the bug" is (i.e. a hang, lost messages, or a failure to subscribe)? Your initial issue mentioned a number of problems, so I want to be sure what I am looking for. If the issue is simply that you are losing messages when you do not call AddRoute prior to connecting, then that is what I would expect to happen.

Yeah I didn't ever really explain the bug in full:

In the test, I am constantly publishing QoS1 messages to a persisted topic. The bug occurs when I am starting up my go program after having already connected to the broker once (so it knows it needs to store/send me messages if I close the app and reopen).

In the test code (with the delay taken out; you'll see that in the comments in the onConnect callback), sometimes the paho library seems to connect, subscribe, and receive persisted messages just fine. But sometimes, in what we'll call the failure case, the paho library seems to stall in delivering messages to my callback(s), permanently. This happens pretty much immediately after the connect/subscribe. After that, no messages come through in my onMsg handler(s), or into the paho library as far as I can tell, despite the paho logs showing it is sending/receiving pings to the broker.

It appears the paho go client is successfully sending the SUBSCRIBE and getting a SUBACK, but due to these "stray" persisted messages that come in beforehand, delivery to onMsg breaks down somewhere. The thing that's weird to me is that the broker seems to stop offloading messages, so if we don't hit the failure case on the next run, the messages come through. I suppose this is because the broker is not getting ack'd.

To me it seems like a stalled goroutine, or sending on a nil channel, but I'm not really sure.

The desired behavior can be seen by using AddRoute before connecting/subscribing. Then the persisted messages start coming in before (and I think also after) the subscribe/suback logs. I have broken this up into a different callback to make it easier to see which callback is getting called and when.

@MattBrittan
Contributor

This issue is caused by a number of factors:

  • Mosquitto is configured with max_inflight_messages 1
  • The connection is QOS1+ and CleanSession is false
  • There are messages queued up for delivery which Mosquitto will send immediately, and these are received before a handler is in place (you ensure this by adding a delay).
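For reference, a minimal mosquitto.conf that sets up these conditions might look like the following (only max_inflight_messages 1 is confirmed in this thread; the other lines are assumptions for a local test broker):

```
# assumed minimal test config; only max_inflight_messages 1 is confirmed above
persistence true
max_inflight_messages 1
listener 1883
allow_anonymous true
```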

My previous comment:

unless you have called AddRoute or SetDefaultPublishHandler, the package does not know what to do with the messages and will drop them.

is both correct and somewhat misleading. The library will completely ignore packets for which there is no handler; this means that no PUBACK message is sent back to the broker. Because of the max_inflight_messages 1 setting, mosquitto will not send any additional messages until the first one has been acknowledged (and that never happens), so message delivery grinds to a halt.

This could be addressed by acknowledging such messages but that would lead to lost messages (if we do not acknowledge them mosquitto will resend them when it is next restarted). So while the current behaviour is not ideal it may be the best option available to minimise data loss.

For now I have raised PR #424 which adds debug messages when this occurs to highlight the issue and also suggests adding routes before calling connect in this situation.

@someburner
Author

Great analysis, thank you for looking into it. That makes sense with what I was seeing.

This could be addressed by acknowledging such messages but that would lead to lost messages (if we do not acknowledge them mosquitto will resend them when it is next restarted). So while the current behaviour is not ideal it may be the best option available to minimise data loss.

For now I have raised PR #424 which adds debug messages when this occurs to highlight the issue and also suggests adding routes before calling connect in this situation.

I think that's reasonable. Should be pretty clear to anyone that may experience this in the future.

@MattBrittan
Contributor

Closing this because I believe it was resolved (and has been inactive for quite a while).
