occasional deadlock when connections are lost #509
This is using paho 1.3.1. Stack traces are attached (HANG_01_01.txt and HANG_01_02.txt); both files are from the same run, with HANG_01_02.txt taken a bit later than HANG_01_01.txt.
More notes from the stack trace:
- ackFunc()
- stopCommsWorkers()
- internalConnLost
- startCommsWorkers (goroutine 1)
- startCommsWorkers (goroutine 2)
- startOutgoingComms
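For context, goroutine dumps like these can be captured by sending SIGQUIT to a Go process, or programmatically via the standard runtime/pprof package. A minimal standalone sketch (not from this repository; the original traces may have been gathered differently):

```go
package main

import (
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// stand-in for a worker that has wedged on a channel operation
	go func() { select {} }()

	// give the goroutine time to start, then write every goroutine's
	// stack to stderr (debug=2 prints full runtime-style traces)
	time.Sleep(100 * time.Millisecond)
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```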
Thanks for the detailed notes (this kind of issue can be difficult to duplicate and track down!). I'm assuming you are using … If I am understanding this correctly, what is happening is: the connection drops while an incoming publish is being delivered; m.Ack() blocks because the comms workers are no longer reading oboundP, which leaves startComms() blocked writing to incomingPubChan, so the shutdown can never complete.
That being the case, I think the best solution will be to modify client.go such that sending to incomingPubChan also watches for shutdown, so delivery cannot block the comms workers forever.
I'm not really happy with this solution, but things have become pretty convoluted in the attempt to maintain backwards compatibility, and delivering incoming publish messages seems to be what a user would expect. Thoughts?
Note: The symptoms may be similar to this issue, but there have been major changes since that was raised (this commit), so I don't think the two can be compared. A range of deadlock-type issues have been raised, and those have been fixed where possible, but often there has been insufficient information to track them down (or to assess whether the issue was in this library or in the user's code).
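For illustration, a minimal standalone sketch of the kind of change being proposed, assuming the send is made abortable via a shutdown channel (deliverIncoming and all names here are hypothetical, not the library's API):

```go
package main

import "fmt"

// deliverIncoming sketches the proposed shape of the fix: the send to
// incomingPubChan also watches a shutdown signal, so delivery can never
// block forever even if the receiving handler is wedged.
func deliverIncoming(incomingPubChan chan<- int, shutdown <-chan struct{}, msg int) bool {
	select {
	case incomingPubChan <- msg:
		return true // delivered to the message router as normal
	case <-shutdown:
		return false // comms is stopping; abandon delivery rather than deadlock
	}
}

func main() {
	pub := make(chan int)           // unbuffered and unread: simulates a wedged handler
	shutdown := make(chan struct{}) // closed when the connection is lost
	close(shutdown)
	fmt.Println("delivered:", deliverIncoming(pub, shutdown, 1))
}
```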
Hi @MattBrittan, sorry for the delayed response. Yes, this has been difficult to reproduce. That seems like a reasonable approach. Thanks!
…e multiple other operations are in progress. Ref eclipse#509
Sorry for the delays on this; I wanted to write a test that duplicated the issue so I could be confident that the fix worked. This took a while (and the test needs to be run a fair number of times to be sure the issue will arise), but I can now reliably replicate the issue and confirm that the fix works (10,000 iterations of the test ran without issue). I have committed the change, so I would appreciate it if you could try it.
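For anyone hunting a similar intermittent deadlock, a minimal sketch of the repeat-under-a-watchdog pattern described above (runScenario, TestLostConnStress, and the timeout are illustrative, not the library's actual test):

```go
package stress_test

import (
	"testing"
	"time"
)

// runScenario stands in for one connect/lose-connection cycle; in a real
// test it would drive the client against a broker that drops the link
// mid-flight.
func runScenario() { time.Sleep(time.Microsecond) }

// TestLostConnStress repeats the scenario many times under a watchdog,
// failing if any iteration wedges; intermittent goroutine-interleaving
// bugs usually only surface after many iterations.
func TestLostConnStress(t *testing.T) {
	for i := 0; i < 10000; i++ {
		done := make(chan struct{})
		go func() {
			runScenario()
			close(done)
		}()
		select {
		case <-done:
		case <-time.After(5 * time.Second):
			t.Fatalf("iteration %d appears deadlocked", i)
		}
	}
}
```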
Unfortunately, I have lost access to the misbehaving broker that I was using when I ran into this bug, so I'm not quite sure how to reproduce it anymore. Thank you for working to resolve this issue. I appreciate it!
Normally I'd leave this open, but as @dehort no longer has access to the environment within which it arose, there is no way to confirm that the initial issue is actually fixed (I was able to confirm that there was a problem, which is now fixed, but cannot be 100% sure that this was the issue @dehort encountered). Thanks very much for the detailed information provided in this issue; without it I doubt I would have found the bug that has now been fixed (it was in code that I have reviewed a number of times in the past but was difficult to spot because it arose from the interaction of multiple goroutines). I have just released v1.3.5, which includes the fix.
- Disconnect hang: eclipse/paho.mqtt.golang#501
- Occasional deadlock when connections are lost: eclipse/paho.mqtt.golang#509
I am seeing an issue where occasionally my paho client will deadlock. When this happens, it will not be able to send or receive messages.
This appears to be triggered when the connection to the broker is lost.
This is what I see in the goroutine stack trace dump, at a high level:
- ackFunc is blocked writing to the comms oboundP channel
- the messages reader (incomingPubChan) is blocked because m.Ack() is blocked (ackFunc)
- startComms() cannot report the error (outError from startComms()) because it is waiting to write to incomingPubChan
- as a result, the error channels (errChan, outError, commsErrors) are never drained

The client doesn't appear to receive any errors from paho when this happens. Eventually the client will see the "publish was broken by timeout" error if it continues to send messages. This makes sense because the publish will eventually be unable to write to obound, so the timeout will get triggered. https://github.com/eclipse/paho.mqtt.golang/blob/master/client.go#L721 This code makes me think the client thinks it is connected.
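The circular wait described above can be reduced to two unbuffered channels. A minimal standalone sketch (the channel names echo paho's, but this is not library code); running it makes the Go runtime abort with "fatal error: all goroutines are asleep - deadlock!", whereas in the real client other live goroutines prevent the runtime from detecting the cycle:

```go
package main

import "fmt"

func main() {
	incomingPubChan := make(chan int) // comms -> message handler
	oboundP := make(chan int)         // handler (ack) -> comms

	// handler: receives one publish, then tries to ack it
	go func() {
		msg := <-incomingPubChan
		oboundP <- msg // blocks forever: comms never reads oboundP below
	}()

	// comms: delivers one publish, then tries to deliver another without
	// ever draining oboundP, mirroring the ordering seen in the trace
	incomingPubChan <- 1
	incomingPubChan <- 2 // blocks forever: handler is stuck in its ack

	fmt.Println("never reached")
}
```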
I will attach the stack traces that I have gathered.
This appears to be the same issue as #328