
Order matters Ack on QoS > 0 #504

Closed
thomas-tacquet wants to merge 1 commit

Conversation

thomas-tacquet
Contributor

@thomas-tacquet thomas-tacquet commented May 5, 2021

I noticed that when order matters is disabled (OrderMatters = false), if subscribed clients receive messages with a Quality of Service (QoS) > 0 and their callbacks are slow, closing the connection before the callbacks finish means m.Ack() is called after the connection has been closed, leading to a send on a closed channel in net.go here:

This channel is closed here: https://github.com/eclipse/paho.mqtt.golang/blob/master/client.go#L606

So I moved m.Ack() outside the goroutine: the ACK is still sent (receiving the QoS message and acknowledging it is paho's job), but it is now sent even if the callback has not finished.

So I noticed that when order matters is disabled, if some of the
subscribed clients receive messages with a Quality of Service (QoS) > 0
and their callbacks are slow, closing the connection before the callbacks end
means m.Ack() is called after the connection has been closed, leading to a send
on a closed channel in net.go, in function 'ackFunc' (line 451 for QoS 2,
line 458 for QoS 1).

So I moved m.Ack() outside the goroutine: the ACK is still sent (receiving the QoS
message and acknowledging it is paho's job), but it is now sent even if the callback
has not finished.
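
A minimal, self-contained sketch of the failure mode (illustrative only, not the actual paho code or channel names): once connection teardown has closed the channel, a late ack panics with "send on closed channel".

```go
package main

import "fmt"

func main() {
	// Stand-in for the client's internal channel that the ack is sent on
	// (hypothetical; the real code lives in net.go / client.go).
	incomingPubChan := make(chan string, 1)

	// Connection teardown closes the channel first...
	close(incomingPubChan)

	// ...then a slow callback finishes and the ack fires.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // "send on closed channel"
		}
	}()
	incomingPubChan <- "PUBACK" // panics: send on a closed channel
}
```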
@stoxx

stoxx commented May 5, 2021

I think sending the ACK before processing (saving to disk or anything else) has finished is clearly wrong.

This is dangerous. Messages can get lost if they are ACKed before they are saved.

@thomas-tacquet
Contributor Author

I think sending the ACK before processing (saving to disk or anything else) has finished is clearly wrong.

This is dangerous. Messages can get lost if they are ACKed before they are saved.

Thank you for your review 👍 I can't clearly picture a case where this causes such behaviour, because with order matters set to false the callbacks are started safely in their own goroutines; I'm trying to reproduce the problem without success.

As only the sender should close a channel, never the receiver, maybe incomingPubChan should be closed in ackFunc instead?
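
For example, a rough sketch (hypothetical names and signature, not the existing ackFunc) of the usual Go pattern: the receiving side never closes the data channel, and the sender instead selects on a separate shutdown channel and drops the ack if the comms routine has gone away.

```go
package main

import "fmt"

// packet is a hypothetical stand-in for an acknowledgement to be queued.
type packet struct{ id uint16 }

// sendAck queues an ack unless shutdown has already been signalled on done.
// Only the sending side would ever close ackChan; the comms routine closes
// done to announce it is gone, and the ack is simply dropped.
func sendAck(ackChan chan<- packet, done <-chan struct{}, p packet) bool {
	select {
	case ackChan <- p:
		return true
	case <-done:
		return false // comms shut down; drop the ack instead of panicking
	}
}

func main() {
	ackChan := make(chan packet) // unbuffered and unread here, so the send blocks
	done := make(chan struct{})
	close(done) // simulate the connection having gone down
	fmt.Println(sendAck(ackChan, done, packet{id: 1})) // prints false, no panic
}
```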

@MattBrittan
Contributor

MattBrittan commented May 5, 2021

Agree with @stoxx - there are a few reasons that the call to m.Ack() is deliberately delayed until the callback exits:

  • Without this it is impossible to guarantee that messages will be processed in order (ordered delivery only really makes sense when the broker is configured to limit inflight messages to 1 - for mosquitto that is max_inflight_messages 1 in mosquitto.conf). With this change the ACK may be sent before the handler runs, and you could end up processing multiple messages simultaneously, which defeats the aim of OrderMatters = true (see the sketch after this list).
  • Should the handler be unable to process the message, it has the option of terminating the application (or leaving the goRoutine hanging), thus preventing the ACK from being sent; this means that the broker will resend the message when the connection is re-established (preventing message loss). This is not a great approach (we are currently looking into how best to handle this in the MQTT v5 client), but at least it provides one way of not acknowledging a message. Based upon this ongoing discussion I'm tending towards reversing the previous decision regarding manual ACKs, because it appears that the consensus has shifted somewhat.
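
A simplified sketch of the two dispatch modes (hypothetical types, not the real matchAndDispatch code), showing that in both cases the ack only happens once the handler has returned:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical stand-ins for the message and handler types.
type message string
type handler func(message)

// dispatch illustrates the two modes: ordered delivery runs the handler
// inline, unordered delivery runs it in its own goroutine, but in both
// cases the ack is only sent once the handler has returned.
func dispatch(wg *sync.WaitGroup, ordered bool, h handler, m message, ack func(message)) {
	if ordered {
		h(m)
		ack(m)
		return
	}
	wg.Add(1)
	go func() {
		defer wg.Done()
		h(m)   // slow handlers no longer block other messages...
		ack(m) // ...but the ack still waits for the handler to finish
	}()
}

func main() {
	var wg sync.WaitGroup
	slow := func(m message) { time.Sleep(10 * time.Millisecond); fmt.Println("handled", m) }
	ack := func(m message) { fmt.Println("acked", m) }
	dispatch(&wg, false, slow, "msg-1", ack)
	dispatch(&wg, false, slow, "msg-2", ack)
	wg.Wait()
}
```

Sending the ack without waiting for the handler to return is effectively what this PR proposes, which is why the second point above still applies even with OrderMatters = false.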

I do agree that you have found an issue here, but would ask that you please raise it as an issue so we can have a look at the best way of handling it. Unfortunately it's likely that there will be potential downsides to all options due to the way this client is structured, but I think we can come up with a solution that avoids breaking existing uses (my current thought is that if the handler completes after the connection has gone down then the ACK should be dropped; this will lead to duplicate messages, but avoiding that is difficult because the broker will automatically resend the message as soon as the connection comes back up anyway; we could also update the store, but that may have been closed as well).

*Update - Sorry I was confused by "So I noticed when using orders matters"; your proposed change only applies when OrderMatters = false. This means the first objection does not apply but the second objection remains valid. Note that whatever fix is applied will also need to apply to the defaultHandler code further down in the function.

@thomas-tacquet
Contributor Author

@MattBrittan thank you for your helpful explanations 👍 I'm going to write an issue, some test cases to highlight the encountered issue, and maybe some proposals.

@MattBrittan
Contributor

Thanks @thomas-tacquet; unfortunately this client is complex due to historical design decisions. This makes fixing issues like this in a way that does not break something else fairly difficult/time consuming.

I made major changes for @1.3.0 to try and remove a range of potential deadlocks and will have introduced this issue at that point (@1.2 did not close incomingPubChan, which was not really ideal either as it left hanging goRoutines). From the comment:

// WARNING the function returned must not be called if the comms routine is shutting down or not running
// (it needs outgoing comms in order to send the acknowledgement). Currently this is only called from
// matchAndDispatch which will be shutdown before the comms are

I can see that I spotted the risk but missed the fact that matchAndDispatch is doing exactly what I was warning about! At the time, usage of OrderMatters = false was fairly low, so most tests do not set this option; I have since encouraged its use due to the number of logged issues that turn out to be caused by long-running message handlers blocking keep alives.

Having thought about this further I think the only viable solution is going to be to drop ACK calls if the connection has been lost, but note that it is important that matchAndDispatch shuts down cleanly/quickly even if handlers are still running. I'll leave this to you for now; thanks for offering to do a fuller write-up!
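
Roughly, a sketch of that direction (hypothetical names, not a proposed patch): the dispatch loop returns as soon as its input channel is closed, while any still-running handlers later try to ack via a select that drops the ack once the comms routine has signalled shutdown.

```go
package main

import (
	"fmt"
	"sync"
)

type message string

// dispatchLoop is a hypothetical sketch: it returns as soon as incoming is
// closed, even if handler goroutines are still running, and each handler
// drops its ack if commsDone has been closed in the meantime.
func dispatchLoop(incoming <-chan message, acks chan<- message, commsDone <-chan struct{},
	handle func(message), wg *sync.WaitGroup) {
	for m := range incoming {
		m := m
		wg.Add(1)
		go func() {
			defer wg.Done()
			handle(m)
			select {
			case acks <- m: // comms still up: queue the ack
			case <-commsDone: // connection lost: drop the ack, the broker will resend
				fmt.Println("ack dropped for", m)
			}
		}()
	}
	// incoming closed: return immediately; handlers finish on their own
}

func main() {
	incoming := make(chan message, 1)
	acks := make(chan message) // unbuffered and unread here, so a send would block
	commsDone := make(chan struct{})
	var wg sync.WaitGroup

	incoming <- "msg-1"
	close(incoming)
	close(commsDone) // connection goes down before the handler gets to ack

	dispatchLoop(incoming, acks, commsDone, func(m message) { fmt.Println("handled", m) }, &wg)
	wg.Wait()
}
```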

@MattBrittan
Contributor

Logged an issue for this so that it's visible to anyone else who encounters the problem. I'll take a look at this later in the week if you have not had a chance.

@MattBrittan
Contributor

Please take a look at PR #508 and let me know your thoughts. I'm not all that happy with the approach, but it's the best option I could come up with without a major rewrite of this section of code (and a rewrite is likely to break other things!).

@MattBrittan
Contributor

Closing this pull request because it's superseded by #508; thanks very much for raising this PR.

@MattBrittan MattBrittan closed this Jun 6, 2021