Disconnect hang #501
I can see how this could happen; if the connection is lost whilst a Disconnect is in progress then things could hang. I have not encountered this myself as none of my applications call Disconnect (other than on shutdown).
What we are doing is calling Disconnect when we get an onDisconnectListener callback. We do this because we saw cases where we would otherwise have some orphaned resources. Following the deadlock here, I can also see that when we get the onDisconnect and call Disconnect, the client still thinks we are connected, so I'm not sure what to do differently at the application layer. I would expect Disconnect to return early if we were already properly disconnected, and I'm surprised that it thinks we are still connected despite having fired the onDisconnect listener. For what it's worth, autoReconnect is false.
Are you able to provide further details on this (code and logs)? I've put a fair amount of work into the cleanup code but there is always the possibility of bugs (and you use this package in a different way than I do so these are issues I would not see myself).
onDisconnect is called here; the line above (assuming no autoReconnect) is … One thing that could be happening is that …
… Also reduce noise from tests. Ref issue eclipse#501
Thanks @MattBrittan. I'm doing some serious code reading to see why the status doesn't line up; maybe it's something in our layer. I can't test this out easily as the hang was in production at a client's site and it happened a couple of times in ~25 days of running, so I need to make sure I understand the root cause 100%. Your fix looks promising if the issue is what I believe it to be, but I'm not certain yet. The state var being off is making me suspicious: we don't reuse clients, so we aren't calling disconnect/connect, just disconnect and then building a new client.
@gregpassmore unfortunately these issues can be very difficult to trace; concurrency is hard, and the way this client has evolved over the years makes it harder (I will be grateful if you do spot a bug!). While I'm happy to put some more work into this client, I really want to focus on the V5 client because it has a much simpler design which reduces the chance that we will see issues like this (I just need to get persistence implemented).
@MattBrittan Thanks for doing this Matt. After reading our code and the library, I still don't understand how it's happening in our code, but I think this fix would definitely help us. Any idea when 1.3.4 might be expected? We'd prefer to include a tagged release.
I have released 1.3.4 (the changes are fairly minor so I did not see much benefit to waiting any longer).
@gregpassmore - have you experienced this issue following the upgrade to v1.3.4? Note that I have just released v1.3.5 which fixes another rare deadlock (which would cause …
- Disconnect hang: eclipse/paho.mqtt.golang#501
- Occasional deadlock when connections are lost: eclipse/paho.mqtt.golang#509
Closing this due to inactivity - I believe that 1.3.4 fixed the issue (and 1.3.5 deals with another potential cause).
I'm experiencing a spurious hang on disconnect. Our code only calls Disconnect from one place and it is under lock. We have many MQTT connections and they reconnect every hour (external dependency), so we have a lot of connect/disconnects. This is rare but is a fatal error for us... I'm tempted to wrap our application layer call to disconnect in something that won't hang indefinitely but it would leave this routine and the resources orphaned. Other recent changes in paho have done these channel writes with a select and timeout, is that a good solution here as well?