Disconnect hang #501

Closed
z-br opened this issue Apr 28, 2021 · 10 comments

@z-br commented Apr 28, 2021

I'm experiencing a sporadic hang on disconnect. Our code only calls Disconnect from one place, and that call is made under a lock. We have many MQTT connections and they reconnect every hour (an external dependency), so we go through a lot of connects/disconnects. This is rare but is a fatal error for us... I'm tempted to wrap our application-layer call to Disconnect in something that won't hang indefinitely, but that would leave this goroutine and its resources orphaned (see the sketch after the stack trace below). Other recent changes in paho have done these channel writes with a select and a timeout; is that a good solution here as well?

goroutine 8765946 [chan send, 6863 minutes]:
github.com/eclipse/paho%2emqtt%2egolang.(*client).Disconnect(0x4a097c0, 0xfa)
 /root/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.3.2/client.go:437 +0x190
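
For illustration, a minimal sketch of that kind of application-layer guard, assuming a paho mqtt.Client; disconnectWithTimeout is a hypothetical helper (not part of the library) and, as noted, it trades the hang for an orphaned goroutine:

```go
package app

import (
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// disconnectWithTimeout is a hypothetical application-layer guard: it runs
// Disconnect in its own goroutine and stops waiting after a deadline. The
// underlying Disconnect call may still be blocked, so that goroutine (and
// the client's resources) can be left orphaned, as noted above.
func disconnectWithTimeout(c mqtt.Client, quiesce uint, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		c.Disconnect(quiesce) // quiesce is in milliseconds
		close(done)
	}()
	select {
	case <-done:
		return true // Disconnect returned normally
	case <-time.After(timeout):
		return false // gave up waiting; the Disconnect goroutine is orphaned
	}
}
```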

@MattBrittan (Contributor) commented

I can see how this could happen; if the connection is lost whilst a Disconnect is in progress, things could hang on c.oboundP <- &PacketAndToken{p: dm, t: dt} here. I think the best solution would be to select on the send and client.commsStopped (adding a timeout could also be worthwhile; I'd prefer to avoid too many timeouts as they can mask underlying issues).
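
A rough sketch of that guarded send, using local stand-ins for the library's internals (the names only mirror the comment above; this is not the code that ended up in the fix):

```go
package app

import (
	"errors"
	"time"
)

// packetAndToken is a local stand-in for the library's PacketAndToken pair,
// included only so the sketch compiles on its own.
type packetAndToken struct{}

// queueOrBail shows the idea: instead of a bare `oboundP <- pkt`, select on
// the send, a "comms stopped" signal, and an optional timeout so the caller
// cannot block forever once the comms goroutines have exited.
func queueOrBail(oboundP chan<- *packetAndToken, pkt *packetAndToken,
	commsStopped <-chan struct{}, timeout time.Duration) error {
	select {
	case oboundP <- pkt:
		return nil // disconnect packet queued for sending
	case <-commsStopped:
		return errors.New("comms stopped before the disconnect packet was queued")
	case <-time.After(timeout):
		return errors.New("timed out queueing the disconnect packet")
	}
}
```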

I have not encountered this myself, as none of my applications call Disconnect (other than on shutdown).

@z-br (Author) commented Apr 29, 2021

What we are doing is calling Disconnect when we get an onDisconnectListener callback. We do this because we saw cases where we would have some orphaned resources. Following the deadlock here, I can also see that when we get the onDisconnect and call Disconnect, the client still thinks we are connected, so I'm not sure what to do differently at the application layer. I would expect Disconnect to return early if we were already properly disconnected, and I'm surprised that it thinks we are still connected despite having fired the onDisconnect listener. For what it's worth, autoReconnect is false.
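
For context, roughly what that setup looks like; this is a sketch with placeholder values (broker URL, quiesce time), not the actual application code:

```go
package app

import (
	"log"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// newClient sketches the pattern described above: autoReconnect is off and
// the connection-lost handler calls Disconnect to release the client's
// resources. The broker URL and quiesce value are placeholders.
func newClient() mqtt.Client {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // placeholder broker
		SetAutoReconnect(false).
		SetConnectionLostHandler(func(c mqtt.Client, err error) {
			log.Printf("connection lost: %v; disconnecting to clean up", err)
			// This is the application-layer Disconnect that was seen hanging.
			c.Disconnect(250)
		})
	return mqtt.NewClient(opts)
}
```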

@MattBrittan (Contributor) commented

> What we are doing is calling Disconnect when we get an onDisconnectListener callback. We do this because we saw cases where we would have some orphaned resources.

Are you able to provide further details on this (code and logs)? I've put a fair amount of work into the cleanup code, but there is always the possibility of bugs (and you use this package in a different way than I do, so there are issues I would not see myself).

> I can also see that when we get the onDisconnect, and call Disconnect, the client still thinks we are connected,

onDisconnect is called here; the line above it (assuming no autoreconnect) is c.setConnected(disconnected), so I'm struggling to work out how what you are reporting could occur (I've never liked the way c.status works, but it's not something I'm going to try to change now!).

One thing that could be happening is that onDisconnect is called in a goroutine, so if you are calling Connect() on the same Client elsewhere then a race condition results, which could cause the issue you are seeing (it would be interesting to see whether the race detector picks up any issues).
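
If it helps, the race detector can be enabled on the usual go commands (the package path below is a placeholder):

```
go test -race ./...
go run -race ./cmd/yourapp
```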

MattBrittan added a commit to ChIoT-Tech/paho.mqtt.golang that referenced this issue Apr 29, 2021

@MattBrittan (Contributor) commented

@gregpassmore give @master a go and see if PR #502 fixes this. I did add a test but used a bit of a hack to replicate the situation I think you are facing.
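
For anyone following along, pulling master into a module looks something like this (pin back to a tagged release once one is cut):

```
go get github.com/eclipse/paho.mqtt.golang@master
```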

@z-br (Author) commented Apr 29, 2021

Thanks @MattBrittan. I'm doing some serious code reading to see why the status doesn't line up; maybe it's something in our layer. I can't test this out easily: the hang was in production at a client and it happened a couple of times in ~25 days of running, so I need to make sure I understand the root cause 100%. Your fix looks promising if the issue is what I believe it to be, but I'm not certain yet. The state var being off is making me suspicious; we don't reuse clients, so we aren't calling disconnect/connect, just disconnect and then building a new client.
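
For illustration, that flow looks roughly like the following (a sketch reusing the hypothetical newClient constructor from the earlier example, not the actual code):

```go
package app

import mqtt "github.com/eclipse/paho.mqtt.golang"

// reconnect sketches the "disconnect, then build a fresh client" pattern
// described above; the old client is never reused after Disconnect.
func reconnect(old mqtt.Client) (mqtt.Client, error) {
	if old != nil {
		old.Disconnect(250) // the call that can hang; 250ms quiesce is a placeholder
	}
	c := newClient() // build a brand-new client rather than reusing `old`
	token := c.Connect()
	token.Wait()
	return c, token.Error()
}
```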

@MattBrittan (Contributor) commented

@gregpassmore unfortunately these issues can be very difficult to trace; concurrency is hard, and the way this client has evolved over the years makes it harder (I will be grateful if you do spot a bug!). While I'm happy to put some more work into this client, I really want to focus on the V5 client because it has a much simpler design, which reduces the chance that we will see issues like this (I just need to get persistence implemented).

@z-br (Author) commented May 3, 2021

@MattBrittan Thanks for doing this, Matt. After reading both our code and the library, I still don't understand how it's happening in our code, but I think this fix would definitely help us. Any idea when 1.3.4 might be expected? We'd prefer to include a tagged release.

@MattBrittan (Contributor) commented

I have released 1.3.4 (the changes are fairly minor so I did not see much benefit to waiting any longer).

@MattBrittan (Contributor) commented

@gregpassmore - have you experienced this issue following the upgrade to v1.3.4? Note that I have just released v1.3.5, which fixes another rare deadlock (one that would also cause Disconnect to hang).

mihaitodor added a commit to mihaitodor/benthos that referenced this issue Jul 2, 2021
- Disconnect hang: eclipse/paho.mqtt.golang#501
- Occasional deadlock when connections are lost: eclipse/paho.mqtt.golang#509

@MattBrittan (Contributor) commented

Closing this due to inactivity - I believe that 1.3.4 fixed the issue (and 1.3.5 deals with another potential cause).
