
struck on Connection timeout error #54

Closed
golthitarun opened this issue May 27, 2021 · 10 comments · Fixed by #56

Comments

@golthitarun

golthitarun commented May 27, 2021

Describe the bug
When a connection timeout occurs during the client connection, the code blocks in the cleanup method.

I was able to make it run by removing the mutex lock that was recently added here:

c.mu.Lock()

To reproduce
Trigger a connection timeout in the Connect method.

Debug output
No output. While debugging I observed the code being stuck in Go's gopark method: https://golang.org/pkg/runtime/?m=all#gopark

Expected behaviour
An error response

Software used:

Additional context

@golthitarun
Author

@fracasula it looks like you added that change recently, can you help me look into it?

@golthitarun golthitarun changed the title Connection timeout struck struck on Connection timeout error May 27, 2021
@MattBrittan
Contributor

Looks like the issue is due to the use of defer here:

c.mu.Lock()
defer c.mu.Unlock()

and then calling cleanup within the same function (e.g. here). This means that cleanup will be called before the deferred c.mu.Unlock() and thus will block here.

As cleanup is only called when Connect() already holds mu, the lock within it seems unnecessary.
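The pattern described above can be reduced to a minimal sketch (hypothetical client type, not paho.golang's actual code). sync.Mutex is not reentrant, so a cleanup that calls Lock() while Connect still holds the mutex via a deferred Unlock() blocks forever; here cleanup uses TryLock (Go 1.18+) only to demonstrate that the lock is still held at that point:

```go
package main

import (
	"fmt"
	"sync"
)

type client struct {
	mu sync.Mutex
}

func (c *client) cleanup() {
	// In the buggy version this was c.mu.Lock(), which deadlocks because
	// Connect still holds the mutex (its Unlock is deferred).
	if !c.mu.TryLock() {
		fmt.Println("cleanup: mutex already held by Connect; Lock() here would deadlock")
		return
	}
	c.mu.Unlock()
}

func (c *client) Connect() error {
	c.mu.Lock()
	defer c.mu.Unlock() // runs only after Connect returns

	// ... the connection times out ...
	c.cleanup() // called while c.mu is still locked
	return fmt.Errorf("connection timeout")
}

func main() {
	c := &client{}
	fmt.Println(c.Connect()) // prints "connection timeout"
}
```

The fix suggested in the thread follows from this: since cleanup only runs with the mutex already held, it should not take the lock at all.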

@golthitarun
Author

golthitarun commented May 27, 2021

Looks like there is also an issue with deferring the WaitGroup's Wait method in the cleanup function; it's basically stuck waiting.

I added the defer waitgroup.Wait() to the start of the Connect method and it worked for me. I have the change in my local commit here -> golthitarun@324831c

I can create a PR if this looks good, let me know.
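A minimal sketch of why moving the Wait to the start of Connect can help (hypothetical fields, not the actual linked commit): deferred calls run in LIFO order, so a Wait deferred first runs after the deferred Unlock, letting a worker that still needs the mutex finish before Connect waits on it:

```go
package main

import (
	"fmt"
	"sync"
)

type client struct {
	mu      sync.Mutex
	workers sync.WaitGroup
}

func (c *client) Connect() {
	c.workers.Add(1)
	defer c.workers.Wait() // registered first -> runs last, after Unlock

	c.mu.Lock()
	defer c.mu.Unlock() // registered second -> runs first

	go func() {
		defer c.workers.Done()
		c.mu.Lock() // this worker needs the mutex to shut down
		fmt.Println("worker: acquired mutex, exiting")
		c.mu.Unlock()
	}()
	fmt.Println("Connect: returning")
}

func main() {
	c := &client{}
	c.Connect() // returns without deadlocking
	fmt.Println("done")
}
```

Had the Wait stayed inside cleanup (i.e. run while the mutex was still held), the worker above could never acquire the lock and Wait would block forever.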

@fracasula
Contributor

The issue seems to be with the mutex, like @MattBrittan pointed out. It's already acquired by Connect(), so there is no need to do that in the cleanup function too. I'll try to write a quick test to replicate the issue and make sure that the method terminates properly while waiting for the workers.

@fracasula
Contributor

fracasula commented May 28, 2021

I was able to replicate with a simple test and found 2 issues so far:

  1. we get a panic when the ctx is canceled before Connect is called; this is the affected code: https://github.com/eclipse/paho.golang/blob/master/paho/client.go#L236
  2. even if we remove the first 3 lines from the cleanup function and add a defer c.workers.Wait() in Connect, we get a deadlock; the worker that doesn't terminate seems to be the Incoming method

I'll investigate further and I also think we should handle all context errors in Connect like this:

c.debug.Println("waiting for CONNACK")
var cap *packets.Connack
select {
case <-connCtx.Done():
	if e := connCtx.Err(); e != nil {
		switch e {
		case context.DeadlineExceeded:
			c.debug.Println("timeout waiting for CONNACK")
		case context.Canceled:
			c.debug.Println("context canceled")
		default:
			c.debug.Println(fmt.Sprintf("unknown context error: %v", e))
		}
		cleanup()
		return nil, e
	}
case cap = <-c.caCtx.Return:
}

@MattBrittan
Contributor

Well spotted on the context handling issue - it looks like the same issue exists in a few other places (e.g. here, here, here, etc.).

This is definitely a bug - the parent context may be cancelled for another reason, and this should result in the process terminating (so if Err() ever returns a non-nil value then the operation should end, regardless of the specific error). As the docs say:

// If Done is not yet closed, Err returns nil.
// If Done is closed, Err returns a non-nil error explaining why:
// Canceled if the context was canceled
// or DeadlineExceeded if the context's deadline passed.
// After Err returns a non-nil error, successive calls to Err return the same error.
Err() error

I think it's enough to just add the context error to the log (e.g. c.debug.Println(fmt.Sprintf("terminate due to context: %v", e))); having a switch that only impacts debugging code seems excessive.

@fracasula
Contributor

Yes I agree @MattBrittan, it does make sense to return if ctx.Err() is not nil 👍

Also, thanks for the other links, I'll have a look at those too.

As for the deadlock, I was finally able to identify it. The Incoming() loop calls c.Error() when it can't read the packet here. Then c.Error() calls c.close. Unfortunately c.close tries to acquire the lock taken by Connect, and it deadlocks.

Here's the stack:

github.com/eclipse/paho.golang/paho.(*Client).close(client.go:456)
github.com/eclipse/paho.golang/paho.(*Client).Error(client.go:479)
github.com/eclipse/paho.golang/paho.(*Client).Incoming(client.go:349)
github.com/eclipse/paho.golang/paho.(*Client).Connect.func3(client.go:208)

@MattBrittan
Contributor

I've been working on a somewhat similar issue in the v3 library and had meant to check how the v5 client dealt with this :-) ... Managing mutexes with lots of goroutines is not much fun.

Your stack trace is from a modified version, so the line numbers don't match up with the current master, but I can see the issue. I think this one will be a bit of work to resolve because I can't really see a 'nice' way. The simple approach would be to set a connecting flag whilst the operation is in progress (checked in close()) and free the mutex, but that might create other issues (and is not particularly elegant!).
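A rough sketch of that connecting-flag idea (hypothetical fields and names, and explicitly a workaround rather than the fix eventually adopted): Connect records that it is in progress and releases the mutex before the long-running work, and close() checks the flag, so an error path reached from a worker cannot deadlock on the lock:

```go
package main

import (
	"fmt"
	"sync"
)

type client struct {
	mu         sync.Mutex
	connecting bool
}

func (c *client) close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.connecting {
		// A connect is in flight; skip the teardown that would otherwise
		// race with (or deadlock against) Connect's own cleanup.
		fmt.Println("close: connect in progress, skipping teardown")
		return
	}
	fmt.Println("close: tearing down")
}

func (c *client) Connect() {
	c.mu.Lock()
	c.connecting = true
	c.mu.Unlock() // free the mutex so close() can take it without deadlocking

	// ... start Incoming and friends; a read error in a worker calls close() ...
	c.close()

	c.mu.Lock()
	c.connecting = false
	c.mu.Unlock()
	fmt.Println("Connect: finished")
}

func main() {
	c := &client{}
	c.Connect()
}
```

The downside, as noted, is that releasing the mutex mid-Connect opens a window where other callers can observe a half-connected client, which is why this approach was flagged as inelegant.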

@fracasula
Contributor

Please have a look at this PR: #56

alsm closed this as completed in #56 Jun 3, 2021
@golthitarun
Author

Thank you @fracasula
