Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client reconnects frequently due to not receiving ping resp (stuck in putting msg on onPubChan) #367

Closed
narasimma23 opened this issue Oct 17, 2019 · 5 comments

Comments

@narasimma23
Copy link

2019/10/17 08:04:59 connected successfully...!

DEBUG: 08:04:59 [client] enter Subscribe
DEBUG: 08:04:59 [client] SUBSCRIBE: dup: false qos: 1 retain: false rLength: 0 MessageID: 0 topics: [tenant/+/dealer/+/hw/+/up]
DEBUG: 08:04:59 [client] exit Subscribe
DEBUG: 08:04:59 [net] obound priority msg to write, type *packets.SubscribePacket
DEBUG: 08:04:59 [net] outgoing waiting for an outbound message
DEBUG: 08:04:59 [net] Received Message
DEBUG: 08:04:59 [net] logic got msg on ibound
WARN:08:04:59 [store] memorystore del: message 1 not found
DEBUG: 08:04:59 [net] received suback, id: 1
DEBUG: 08:04:59 [net] granted qoss [0]
DEBUG: 08:04:59 [net] logic waiting for msg on ibound
2019/10/17 08:04:59 [Subscribe successful!]
DEBUG: 08:05:04 [pinger] ping check 4.999670036
DEBUG: 08:05:09 [pinger] ping check 9.999687229
DEBUG: 08:05:14 [pinger] ping check 14.999720018
DEBUG: 08:05:19 [pinger] ping check 19.999703669
DEBUG: 08:05:24 [pinger] ping check 24.999680719
DEBUG: 08:05:29 [pinger] ping check 29.999712045
DEBUG: 08:05:34 [pinger] ping check 34.999756461
DEBUG: 08:05:34 [pinger] keepalive sending ping
DEBUG: 08:05:34 [net] Received Message
DEBUG: 08:05:34 [net] logic got msg on ibound
DEBUG: 08:05:34 [net] received pingresp
DEBUG: 08:05:34 [net] logic waiting for msg on ibound
DEBUG: 08:05:35 [net] Received Message
DEBUG: 08:05:35 [net] logic got msg on ibound
DEBUG: 08:05:35 [net] received publish, msgId: 0
DEBUG: 08:05:35 [net] putting msg on onPubChan
DEBUG: 08:05:39 [pinger] ping check 4.99970203
DEBUG: 08:05:44 [pinger] ping check 9.999761822
DEBUG: 08:05:49 [pinger] ping check 14.999733028
DEBUG: 08:05:54 [pinger] ping check 19.999695588
DEBUG: 08:05:55 [net] Received Message
DEBUG: 08:05:59 [pinger] ping check 24.999749212
DEBUG: 08:06:04 [pinger] ping check 29.999763178
DEBUG: 08:06:09 [pinger] ping check 34.999693891
DEBUG: 08:06:09 [pinger] keepalive sending ping
DEBUG: 08:06:14 [pinger] ping check 4.999982142
DEBUG: 08:06:19 [pinger] ping check 9.999991527
CRIT:08:06:19 [pinger] pingresp not received, disconnecting
ERROR: 08:06:19 [net] error triggered, stopping
WARN:08:06:19 [net] incoming dropped a received message during shutdown
DEBUG: 08:06:19 [net] done putting msg on incomingPubChan
DEBUG: 08:06:19 [net] logic waiting for msg on ibound
DEBUG: 08:06:19 [net] outgoing stopped
WARN:08:06:19 [net] logic stopped
DEBUG: 08:06:19 [net] incoming stopped
DEBUG: 08:06:19 Connection lost: pingresp not received, disconnecting
DEBUG: 08:06:19 [client] enter reconnect
DEBUG: 08:06:19 [client] about to write new connect msg
DEBUG: 08:06:19 [client] socket connected to broker
DEBUG: 08:06:19 [client] Using MQTT 3.1.1 protocol
DEBUG: 08:06:19 [net] connect started
DEBUG: 08:06:20 [net] received connack
DEBUG: 08:06:20 [client] client is reconnected
DEBUG: 08:06:20 [net] incoming started
2019/10/17 08:06:20 connected successfully...!

DEBUG: 08:06:20 [pinger] keepalive starting
DEBUG: 08:06:20 [net] logic started
DEBUG: 08:06:20 [net] logic waiting for msg on ibound
DEBUG: 08:06:20 [client] enter Subscribe
DEBUG: 08:06:20 [client] SUBSCRIBE: dup: false qos: 1 retain: false rLength: 0 MessageID: 0 topics: [tenant/+/dealer/+/hw/+/up]
DEBUG: 08:06:20 [net] outgoing started
DEBUG: 08:06:20 [net] outgoing waiting for an outbound message
DEBUG: 08:06:20 [net] obound priority msg to write, type *packets.SubscribePacket
DEBUG: 08:06:20 [client] exit Subscribe
DEBUG: 08:06:20 [net] outgoing waiting for an outbound message
DEBUG: 08:06:20 [net] Received Message
DEBUG: 08:06:20 [net] logic got msg on ibound
WARN:08:06:20 [store] memorystore del: message 1 not found
DEBUG: 08:06:20 [net] received suback, id: 1
DEBUG: 08:06:20 [net] granted qoss [0]
DEBUG: 08:06:20 [net] logic waiting for msg on ibound
2019/10/17 08:06:20 [Subscribe successful!]
DEBUG: 08:06:25 [pinger] ping check 4.999849245
DEBUG: 08:06:30 [pinger] ping check 9.999894398
DEBUG: 08:06:35 [pinger] ping check 14.999852209
DEBUG: 08:06:35 [net] Received Message
DEBUG: 08:06:35 [net] logic got msg on ibound
DEBUG: 08:06:35 [net] received publish, msgId: 0
DEBUG: 08:06:35 [net] putting msg on onPubChan
DEBUG: 08:06:39 [net] Received Message
DEBUG: 08:06:40 [pinger] ping check 19.999948479
DEBUG: 08:06:45 [pinger] ping check 24.999919526
DEBUG: 08:06:50 [pinger] ping check 29.999882864
DEBUG: 08:06:50 [pinger] keepalive sending ping
DEBUG: 08:06:55 [pinger] ping check 4.999650423
DEBUG: 08:07:00 [pinger] ping check 9.999707226
CRIT:08:07:00 [pinger] pingresp not received, disconnecting
ERROR: 08:07:00 [net] error triggered, stopping
WARN:08:07:00 [net] incoming dropped a received message during shutdown
DEBUG: 08:07:00 [net] done putting msg on incomingPubChan
DEBUG: 08:07:00 [net] logic waiting for msg on ibound
WARN:08:07:00 [net] logic stopped
DEBUG: 08:07:00 [net] incoming stopped
DEBUG: 08:07:00 [net] outgoing stopped
DEBUG: 08:07:00 Connection lost: pingresp not received, disconnecting
DEBUG: 08:07:00 [client] enter reconnect
DEBUG: 08:07:00 [client] about to write new connect msg
DEBUG: 08:07:00 [client] socket connected to broker
DEBUG: 08:07:00 [client] Using MQTT 3.1.1 protocol
DEBUG: 08:07:00 [net] connect started
DEBUG: 08:07:00 [net] received connack
DEBUG: 08:07:00 [client] client is reconnected
DEBUG: 08:07:00 [pinger] keepalive starting
2019/10/17 08:07:00 connected successfully...!

DEBUG: 08:07:00 [net] logic started
DEBUG: 08:07:00 [net] logic waiting for msg on ibound
DEBUG: 08:07:00 [client] enter Subscribe
DEBUG: 08:07:00 [net] outgoing started
DEBUG: 08:07:00 [net] outgoing waiting for an outbound message
DEBUG: 08:07:00 [client] SUBSCRIBE: dup: false qos: 1 retain: false rLength: 0 MessageID: 0 topics: [tenant/+/dealer/+/hw/+/up]
DEBUG: 08:07:00 [client] exit Subscribe
DEBUG: 08:07:00 [net] obound priority msg to write, type *packets.SubscribePacket
DEBUG: 08:07:00 [net] incoming started
DEBUG: 08:07:00 [net] outgoing waiting for an outbound message
DEBUG: 08:07:00 [net] Received Message
DEBUG: 08:07:00 [net] logic got msg on ibound
WARN:08:07:00 [store] memorystore del: message 1 not found
DEBUG: 08:07:00 [net] received suback, id: 1
DEBUG: 08:07:00 [net] granted qoss [0]
DEBUG: 08:07:00 [net] logic waiting for msg on ibound
2019/10/17 08:07:00 [Subscribe successful!]
DEBUG: 08:07:05 [pinger] ping check 4.999579558
DEBUG: 08:07:10 [pinger] ping check 9.999610958
^C

@MattBrittan
Copy link
Contributor

A bit more info would be needed assist with this (ideally code that demonstrates the issue). It's difficult to troubleshoot from the logs alone because the issue may be in your code.

(stuck in putting msg on onPubChan)
Based upon this I suspect that the issue is in your code; within the call back you are registering when you call Subscribe.

Please check whether your call back is being called when the message is received and confirm that it is exiting. I would expect to see something like the log you posted in cases where the call back is not returning (e.g. it's getting blocked pushing the new message onto a channel).
If you are interested in the details: matchAndDispatch, the routine that empties incomingPubChan, waits for the callback to complete before continuing; this means that if the callback does not return then the net routine gets blocked. The end result is that ping responses do not get processed so the connection times out.

@narasimma23
Copy link
Author

narasimma23 commented Oct 17, 2019

Hi Matt,

Here is the test code, which is running fine for a day or sometime in docker, but after that it's starting reconnecting frequently, shared logs already...
i'm using own Mosquitto broker hosted in ec2 machine...

during this issue the msgHandler is not called, might be the matchAndDispatch routine is not clearing the channel..

this issue occurs during long run...

test.txt

@MattBrittan
Copy link
Contributor

Thanks @narasimma23,

That code looks fine to me; just want to confirm that the code is exactly what you were running when you got the errors or have you removed some bits? (I note that your log includes "Subscribe successful!" but I cannot see that in the code or library; but I may be looking at a later version).

What version of paho.mqtt.golang are you using? (if you are using go modules then it will probably be v1.2 but running go get github.com/eclipse/paho.mqtt.golang@master will get you the latest fixes; its possible that one of the fixes implemented since 1.2 will address this problem).

One other question - what kind of message volume are you processing? (i.e. multiple messages per second, one a minute?).

If the code is exactly what you are running, and you are using the latest version, then it looks like there is an issue with the library. This may take a bit of time to track down (I run very similar code that runs for weeks without issue so it will be a case of finding whats different about your situation). Can I suggest that you add a few extra debugging messages into matchAndDispatch (router.go) because I'm guessing that this is stopping so it seems a good place to start (using something like DEBUG.Println("[test] ", "messageZZZZ")). I'd suggest:

"matchAndDispatch starting" - before the for (line 138)
"matchAndDispatch got msg" - before the sent := false (line 141)
"matchAndDispatch Calling handler" - before the handler(client, m) (line 172)
"matchAndDispatch Handler called" - after the handler(client, m) (line 172)
"matchAndDispatch Msg Processed" - before '''case <-r.stop:''' (line 176)
"matchAndDispatch complete" - before the return (line 177)

Hopefully this will help track the issue down (unfortunately the message flow is quite complex so this may take a few attempts).

Matt

@MattBrittan
Copy link
Contributor

MattBrittan commented Oct 17, 2019

Having said that I had not seen this issue previously one of my systems stopped sending data an hour ago with the last thing logged being:

[pinger] pingresp not received, disconnecting
[net] error triggered, stopping

Traffic did not resume after that. Its possible this is the same issue (that system had been stable for a month or so but I deployed an update yesterday); I'll add some extra logging and see if this happens again (did not have debug logging enabled).

*Update: It's happened again. I suspect that I had not updated this to the latest revision until a couple of days ago. Have added extra logging and will see if it happens again (do have some idea as to what could be happening).
*Update2: So the issue was in my code but only exposed by a change to the library. In a connection handler I was publishing a message and waiting for it to complete sending before returning. This used to work but #352 changed a few channels from buffered to unbuffered (when autoreconnect was true the default used to be a 100 element buffer). This led to a deadlock (easily resolved by moving some of my code into a goroutine).

@MattBrittan
Copy link
Contributor

Closing due to age/inactivity. Please feel free to reopen if this issue is occurring with the current release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants