MQTT unable to recover when TCP disconnect #468
Comments
Here is the complete log.
Thanks for the issue report (it's great to get full logs for issues like this, and your analysis pointed me to the right spot in the log). Unfortunately I am struggling to duplicate your error. I have tried bombarding the client with messages while processing disconnects (after upgrading Mosquitto to resolve some issues with that!) but could not get it to fail. I suspect this is because the issue only occurs in a very particular set of circumstances (events need to happen in a set order with fairly precise timings), which would be backed up by how infrequently you are seeing it. However, having spent quite a bit of time going over the code, I believe I have identified the likely cause and have raised and merged PR #470; please give this a go and let me know if it resolves the issue. Because I have not been able to create a test that duplicates the failure, I cannot confirm that the fix will work.
Hi Matt, thanks so much for such a quick response, I really appreciate it. We'll pull your changes, verify whether the issue disappears in our environment, and get back to you. Thanks again! William
Hi Matt, the issue hasn't happened since we updated to the version with your fix. We'll continue watching it. In the meantime, I'll close this issue; if it happens again, I'll reopen it. Thanks again! William
As noted in my previous comment, I'll close this issue now.
Thanks - I have issued a new (minor) release so that other users will get this fix.
Hello,
Recently we hit an issue with the MQTT Go client: sometimes the client is unable to recover after a network disconnection. It happens occasionally, about once a day, so it is very hard to reproduce. We are using the latest master branch. On the broker side, we use the EMQX broker. Here is the log snippet:
[DEBUG] 10:20:59.613822 [pinger] ping check 10.46959611
[DEBUG] 10:21:01.899854 [net] putting puback msg on obound
[DEBUG] 10:21:01.900034 [store] memorystore del: message 116 was deleted
[DEBUG] 10:21:01.900142 [net] done putting puback msg on obound
[DEBUG] 10:21:01.900333 [net] obound priority msg to write, type *packets.PubackPacket
[DEBUG] 10:21:01.900410 [net] logic waiting for msg on ibound
[ERROR] 10:21:01.900538 [net] outgoing oboundp reporting error write tcp 192.168.31.194:59638->192.168.2.79:1883: write: connection reset by peer
[DEBUG] 10:21:01.900612 [net] startIncomingComms: got msg on ibound
[DEBUG] 10:21:01.900979 [net] startIncomingComms: received publish, msgId: 118
[DEBUG] 10:21:01.900636 [net] outgoing waiting for an outbound message
[DEBUG] 10:21:01.900850 [net] startIncoming Received Message
[DEBUG] 10:21:04.613783 [pinger] ping check 15.469548675
[DEBUG] 10:21:09.613852 [pinger] ping check 20.469578741
[DEBUG] 10:21:12.386195 [net] putting puback msg on obound
[DEBUG] 10:21:12.386395 [store] memorystore del: message 118 was deleted
[DEBUG] 10:21:12.386497 [net] done putting puback msg on obound
[DEBUG] 10:21:12.386751 [net] obound priority msg to write, type *packets.PubackPacket
[ERROR] 10:21:12.387096 [net] outgoing oboundp reporting error write tcp 192.168.31.194:59638->192.168.2.79:1883: write: broken pipe
[DEBUG] 10:21:12.387224 [net] outgoing waiting for an outbound message
[ERROR] 10:21:12.387104 [client] Connect comms goroutine - error triggered write tcp 192.168.31.194:59638->192.168.2.79:1883: write: connection reset by peer
[DEBUG] 10:21:12.387360 [client] internalConnLost called
[DEBUG] 10:21:12.387421 [client] stopCommsWorkers called
[DEBUG] 10:21:12.387586 [client] startCommsWorkers output redirector finnished
[DEBUG] 10:21:12.387690 [net] outgoing waiting for an outbound message
[DEBUG] 10:21:12.387760 [net] outgoing waiting for an outbound message
[DEBUG] 10:21:12.387850 [pinger] keepalive stopped
[ERROR] 10:21:12.397670 [client] Connect comms goroutine - error triggered write tcp 192.168.31.194:59638->192.168.2.79:1883: write: broken pipe
[DEBUG] 10:21:12.402813 [client] stopCommsWorkers waiting for workers
[DEBUG] 10:21:12.403000 [client] internalConnLost waiting on workers
[DEBUG] 10:21:12.403918 [client] internalConnLost called
[DEBUG] 10:21:12.404068 [client] stopCommsWorkers called
[DEBUG] 10:21:12.404137 [client] stopCommsWorkers done (not running)
[DEBUG] 10:21:12.404228 [net] logic waiting for msg on ibound
[DEBUG] 10:21:12.404310 [net] startIncomingComms: got msg on ibound
[DEBUG] 10:21:12.404425 [store] memorystore del: message 1618 was deleted
[DEBUG] 10:21:12.404505 [net] startIncomingComms: received puback, id: 1618
[DEBUG] 10:21:12.404593 [net] logic waiting for msg on ibound
[DEBUG] 10:21:12.404668 [net] startIncomingComms: ibound complete
[DEBUG] 10:21:12.404739 [net] startIncomingComms goroutine complete
[DEBUG] 10:21:12.404669 [net] incoming complete
[DEBUG] 10:21:12.405468 [client] enter Publish
[DEBUG] 10:21:12.405605 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:21:22.865802 [net] putting puback msg on obound
[DEBUG] 10:21:22.866153 [store] memorystore del: message 117 was deleted
[DEBUG] 10:21:42.406371 [client] enter Publish
[DEBUG] 10:21:42.406707 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:22:12.407490 [client] enter Publish
[DEBUG] 10:22:12.407775 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:22:42.409085 [client] enter Publish
[DEBUG] 10:22:42.409382 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:23:12.410144 [client] enter Publish
[DEBUG] 10:23:12.410444 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:23:42.411153 [client] enter Publish
[DEBUG] 10:23:42.411454 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:24:12.399703 [client] enter Publish
[DEBUG] 10:24:12.400017 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:24:42.400796 [client] enter Publish
***[ERROR] 10:21:12.387104 [client] Connect comms goroutine - error triggered write tcp 192.168.31.194:59638->192.168.2.79:1883: write: connection reset by peer
[DEBUG] 10:21:12.387360 [client] internalConnLost called
[DEBUG] 10:21:12.387421 [client] stopCommsWorkers called
Here the broken network connection is reported, and as a result the internalConnLost and stopCommsWorkers functions are called. Note that this is the first time stopCommsWorkers is called.
*** [DEBUG] 10:21:12.404068 [client] stopCommsWorkers called
[DEBUG] 10:21:12.404137 [client] stopCommsWorkers done (not running)
Here stopCommsWorkers is called again. This second call returns immediately, as indicated by 'not running' in the log.
***[DEBUG] 10:21:12.404228 [net] logic waiting for msg on ibound
[DEBUG] 10:21:12.404310 [net] startIncomingComms: got msg on ibound
[DEBUG] 10:21:12.404425 [store] memorystore del: message 1618 was deleted
[DEBUG] 10:21:12.404505 [net] startIncomingComms: received puback, id: 1618
[DEBUG] 10:21:12.404593 [net] logic waiting for msg on ibound
[DEBUG] 10:21:12.404668 [net] startIncomingComms: ibound complete
[DEBUG] 10:21:12.404739 [net] startIncomingComms goroutine complete
[DEBUG] 10:21:12.404669 [net] incoming complete
Here the last incoming message is processed.
***[DEBUG] 10:21:12.405468 [client] enter Publish
[DEBUG] 10:21:12.405605 [client] sending publish message, topic: mkt/up/ls/2156001/HS005
[DEBUG] 10:21:22.865802 [net] putting puback msg on obound
[DEBUG] 10:21:22.866153 [store] memorystore del: message 117 was deleted
Here a new puback is queued; however, the outgoing channel is still occupied by the previously failed puback message, so the new puback blocks. Because of this blockage, the MQTT client is unable to restart the connection even though the network has already recovered.
Thanks a lot for taking the time to look at this issue. If needed, we can set up a phone call to walk through it.
Best regards,
William