
MQTT unable to recover when TCP disconnect #468

Closed
williamshaocharles opened this issue Dec 18, 2020 · 7 comments
Comments

williamshaocharles commented Dec 18, 2020

Hello,

Recently we hit an issue with the MQTT Go client: sometimes the client is unable to recover after a network break. It happens occasionally, roughly once a day, so it is very hard to reproduce. We are using the latest master branch. On the broker side, we use the EMQX broker. Here is a log snippet:

[DEBUG] 10:20:59.613822 [pinger] ping check 10.46959611
[DEBUG] 10:21:01.899854 [net] putting puback msg on obound
[DEBUG] 10:21:01.900034 [store] memorystore del: message 116 was deleted
[DEBUG] 10:21:01.900142 [net] done putting puback msg on obound
[DEBUG] 10:21:01.900333 [net] obound priority msg to write, type *packets.PubackPacket
[DEBUG] 10:21:01.900410 [net] logic waiting for msg on ibound
[ERROR] 10:21:01.900538 [net] outgoing oboundp reporting error write tcp 192.168.31.194:59638->192.168.2.79:1883: write: connection reset by peer
[DEBUG] 10:21:01.900612 [net] startIncomingComms: got msg on ibound
[DEBUG] 10:21:01.900979 [net] startIncomingComms: received publish, msgId: 118
[DEBUG] 10:21:01.900636 [net] outgoing waiting for an outbound message
[DEBUG] 10:21:01.900850 [net] startIncoming Received Message
[DEBUG] 10:21:04.613783 [pinger] ping check 15.469548675
[DEBUG] 10:21:09.613852 [pinger] ping check 20.469578741
[DEBUG] 10:21:12.386195 [net] putting puback msg on obound
[DEBUG] 10:21:12.386395 [store] memorystore del: message 118 was deleted
[DEBUG] 10:21:12.386497 [net] done putting puback msg on obound
[DEBUG] 10:21:12.386751 [net] obound priority msg to write, type *packets.PubackPacket
[ERROR] 10:21:12.387096 [net] outgoing oboundp reporting error write tcp 192.168.31.194:59638->192.168.2.79:1883: write: broken pipe
[DEBUG] 10:21:12.387224 [net] outgoing waiting for an outbound message
[ERROR] 10:21:12.387104 [client] Connect comms goroutine - error triggered write tcp 192.168.31.194:59638->192.168.2.79:1883: write: connection reset by peer
[DEBUG] 10:21:12.387360 [client] internalConnLost called
[DEBUG] 10:21:12.387421 [client] stopCommsWorkers called
[DEBUG] 10:21:12.387586 [client] startCommsWorkers output redirector finnished
[DEBUG] 10:21:12.387690 [net] outgoing waiting for an outbound message
[DEBUG] 10:21:12.387760 [net] outgoing waiting for an outbound message
[DEBUG] 10:21:12.387850 [pinger] keepalive stopped
[ERROR] 10:21:12.397670 [client] Connect comms goroutine - error triggered write tcp 192.168.31.194:59638->192.168.2.79:1883: write: broken pipe
[DEBUG] 10:21:12.402813 [client] stopCommsWorkers waiting for workers
[DEBUG] 10:21:12.403000 [client] internalConnLost waiting on workers
[DEBUG] 10:21:12.403918 [client] internalConnLost called
[DEBUG] 10:21:12.404068 [client] stopCommsWorkers called
[DEBUG] 10:21:12.404137 [client] stopCommsWorkers done (not running)
[DEBUG] 10:21:12.404228 [net] logic waiting for msg on ibound
[DEBUG] 10:21:12.404310 [net] startIncomingComms: got msg on ibound
[DEBUG] 10:21:12.404425 [store] memorystore del: message 1618 was deleted
[DEBUG] 10:21:12.404505 [net] startIncomingComms: received puback, id: 1618
[DEBUG] 10:21:12.404593 [net] logic waiting for msg on ibound
[DEBUG] 10:21:12.404668 [net] startIncomingComms: ibound complete
[DEBUG] 10:21:12.404739 [net] startIncomingComms goroutine complete
[DEBUG] 10:21:12.404669 [net] incoming complete
[DEBUG] 10:21:12.405468 [client] enter Publish
[DEBUG] 10:21:12.405605 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:21:22.865802 [net] putting puback msg on obound
[DEBUG] 10:21:22.866153 [store] memorystore del: message 117 was deleted
[DEBUG] 10:21:42.406371 [client] enter Publish
[DEBUG] 10:21:42.406707 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:22:12.407490 [client] enter Publish
[DEBUG] 10:22:12.407775 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:22:42.409085 [client] enter Publish
[DEBUG] 10:22:42.409382 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:23:12.410144 [client] enter Publish
[DEBUG] 10:23:12.410444 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:23:42.411153 [client] enter Publish
[DEBUG] 10:23:42.411454 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:24:12.399703 [client] enter Publish
[DEBUG] 10:24:12.400017 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:24:42.400796 [client] enter Publish

***[ERROR] 10:21:12.387104 [client] Connect comms goroutine - error triggered write tcp 192.168.31.194:59638->192.168.2.79:1883: write: connection reset by peer
[DEBUG] 10:21:12.387360 [client] internalConnLost called
[DEBUG] 10:21:12.387421 [client] stopCommsWorkers called

Here the network break is reported, so the internalConnLost and stopCommsWorkers functions are called. Note this is the first time stopCommsWorkers is called.

*** [DEBUG] 10:21:12.404068 [client] stopCommsWorkers called
[DEBUG] 10:21:12.404137 [client] stopCommsWorkers done (not running)

Here stopCommsWorkers is called again. The second call returns immediately, as indicated by 'not running' in the log.

***[DEBUG] 10:21:12.404228 [net] logic waiting for msg on ibound
[DEBUG] 10:21:12.404310 [net] startIncomingComms: got msg on ibound
[DEBUG] 10:21:12.404425 [store] memorystore del: message 1618 was deleted
[DEBUG] 10:21:12.404505 [net] startIncomingComms: received puback, id: 1618
[DEBUG] 10:21:12.404593 [net] logic waiting for msg on ibound
[DEBUG] 10:21:12.404668 [net] startIncomingComms: ibound complete
[DEBUG] 10:21:12.404739 [net] startIncomingComms goroutine complete
[DEBUG] 10:21:12.404669 [net] incoming complete

Here the last incoming message is processed.

***[DEBUG] 10:21:12.405468 [client] enter Publish
[DEBUG] 10:21:12.405605 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:21:22.865802 [net] putting puback msg on obound
[DEBUG] 10:21:22.866153 [store] memorystore del: message 117 was deleted

Here a puback is sent; however, the outgoing channel is still occupied by the last failed puback message, so the new puback blocks.

Because of this blockage, the MQTT client is unable to re-establish the connection even after the network has recovered.

Thanks a lot for taking the time to look at this issue. If needed, we can have a phone call to explain it further.

Best regards,
William

williamshaocharles (Author)

mqtt_client(1).log

Here is the complete log.

MattBrittan (Contributor)

Thanks for the issue report (it's great to get full logs for issues like this and your analysis pointed me to the right spot in the log).

Unfortunately I am struggling to duplicate your error. I have tried bombarding the client with messages whilst processing disconnects (after upgrading Mosquitto to resolve some issues with that!) but could not get it to fail. I suspect that this is because the issue will only occur in a very particular set of circumstances (events need to happen in a set order with fairly precise timings). I guess that this is backed up by how infrequently you are seeing this.

However having spent quite a bit of time going over the code I think the issue comes down to the fact that matchAndDispatch writes to client.oboundP rather than a channel under the control of the comms routines (it looks like I missed this in the refactor last year). This creates a potential deadlock when an incoming publish is processed (and generates an ACK) whilst a shutdown is in progress.

I have raised and merged PR #470 - please give this a go and let me know if it resolves the issue. Unfortunately due to my failure to create a test that duplicates this I cannot confirm if my fix will work.

MattBrittan (Contributor)

Note that PR #472 resolves an issue introduced in PR #470 (so if you have pulled that code please update again).

williamshaocharles (Author)

Hi Matt,

Thanks so much for such a quick response, I really appreciate it. We'll pull your changes, verify whether the issue goes away in our environment, and get back to you.

Thanks again!

William

williamshaocharles (Author)

Hi Matt,

The issue hasn't happened since we updated to the version with your fix. We'll continue watching it. In the meantime, I'll close this issue. If it happens again, I'll reopen it.

Thanks again!

William

williamshaocharles (Author)

As mentioned in my previous comment, I'm closing this issue now.

MattBrittan (Contributor)

Thanks - I have issued a new (minor) release so that other users will get this fix.
