Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connection: fix: reader go-routine is leaked on connection close #70

Merged
merged 1 commit into from Apr 19, 2022

Conversation

fho
Copy link
Contributor

@fho fho commented Apr 13, 2022

When a message was sent and it's response was received while the
connection was closed or an error happened, the reader go-routine could
get stuck and be leaked.

The reader go routine tries to send a received message to the unbuffered
c.rpc channel via the dispatch0() and dispatchN() methods.
The call() method reads from the rpc channel.
If an error happened while the dispatch method sends a message to the
rpc channel, the call() method could terminate because it read an error
from c.errors or because c.errors was closed.

To prevent the scenario:

  • the reader go-routine now closes c.rpc when it terminates,
  • The call() method, reads from c.rpc until a message was received or it
    is closed. When c.rpc is closed, it reads an error from c.errors or
    wait until c.errors is closed.
    When it reads an error, it returns it. If it is closed it returns
    ErrClosed.

This ensures that the messages is read from c.rpc before call() returns.
It also ensures that when a message was received that it is processed.
Previously it could happen that the message was silently ignored because
c.errors returned an error or was closed.

This fixes #69.
A testcase can be found in PR #71

@lukebakken
Copy link
Contributor

@fho no rush at all, but when you get time could you check out the integration test failures? I'm checking it out too when I have time.

@fho
Copy link
Contributor Author

fho commented Apr 14, 2022

@lukebakken please create seperate commits for the changes that you are doing, instead of editing mine, also please don't force-push my branch.
It would be best if we could reset it to the original state, merging the other branch containing the testcase into this one (if you prefer) and adding your changes in new commits.

I'll have a look regarding the ci failures next week, one seems to be because of missing go.sum changes.

@lukebakken
Copy link
Contributor

I've reverted the squash, but please note that the changes will probably be squashed when merged into main.

I added a step to GitHub actions to run go mod download go.uber.org/goleak. I'm not sure if that's the best solution to the missing dependency.

@fho
Copy link
Contributor Author

fho commented Apr 19, 2022

I've reverted the squash, but please note that the changes will probably be squashed when merged into main.

thanks!

I added a step to GitHub actions to run go mod download go.uber.org/goleak. I'm not sure if that's the best solution to the missing dependency.

Running go mod download explicitly should not be needed.
If there is not special reason why you would like to do it explicitly, I would recommend to let Go handle it.

could you check out the integration test failures?

The tests failed because the go routine leak detector found a heartbeat go-routine still running when the new TestReaderGoRoutineTerminatesWhenMsgIsProcessedDuringClose finished.

Those go-routine leaks already exists in the main branch and happen because some testcases did not close the connection they opened.
I created a separate PR (#73) to introduce the goleak detector and fix the issue in the existing testcases.
This makes it easier to distinguish between existing issues and new ones caused by changes in this PR.

@lukebakken
Copy link
Contributor

If there is not special reason why you would like to do it explicitly, I would recommend to let Go handle it

I had to add it explicitly because the CI runs failed without it 🤷

@lukebakken
Copy link
Contributor

lukebakken commented Apr 19, 2022

Thanks for your contributions. I merged #73 into main and then merged main into this branch. I do still see found unexpected goroutines on successful test runs, and I'm not quite sure the best way to track down the cause as there is no indication of which test they are associated with.

@fho
Copy link
Contributor Author

fho commented Apr 19, 2022

Thanks for your contributions. I merged #73 into main and then merged main into this branch. I do still see found unexpected goroutines on successful test runs, and I'm not quite sure the best way to track down the cause as there is no indication of which test they are associated with.

I'll have a look.

I'm doing it like described here:

go test -tags=integration -c -o tests
for test in $(go test -tags integration -list . | grep -E "^(Test|Example)"); do ./tests -test.v -test.run "^$test\$" && echo -n "." || echo -e "\n$test failed"; done

@fho
Copy link
Contributor Author

fho commented Apr 19, 2022

hrm, I could not reproduce the test failure locally so far.
I ran go test -v -count=1 -tags integration -cpu 1,2 multiple times and it succeeds, without any unexpected running go routines.

Another test fails on my machine sometimes that does not fail on CI though :-)

    integration_test.go:1432: expected error only on publish, got error on channel.open: Exception (504) Reason: "channel/connection is not open"
--- FAIL: TestRepeatedChannelExceptionWithPublishAndMaxProcsIssue46 (0.06s)

@fho
Copy link
Contributor Author

fho commented Apr 19, 2022

@lukebakken 💦 I found the leaked go-routine, it was in ExampleConnection_reconnect.

The tests now succeeded.

@lukebakken
Copy link
Contributor

I found the leaked go-routine, it was in ExampleConnection_reconnect.

Well done! You found it just by reviewing the code?

When a message was sent and it's response was received while the
connection was closed or an error happened, the reader go-routine could
get stuck and be leaked.

The reader go routine tries to send a received message to the unbuffered
c.rpc channel via the dispatch0() and dispatchN() methods.
The call() method reads from the rpc channel.
If an error happened while the dispatch method sends a message to the
rpc channel, the call() method could terminate because it read an error
from c.errors or because c.errors was closed.

To prevent the scenario:
- the reader go-routine now closes c.rpc when it terminates,
- The call() method, reads from c.rpc until a message was received or it
  is closed. When c.rpc is closed, it reads an error from c.errors or
  wait until c.errors is closed.
  When it reads an error, it returns it. If it is closed it returns
  ErrClosed.

This ensures that the messages is read from c.rpc before call() returns.
It also ensures that when a message was received that it is processed.
Previously it could happen that the message was silently ignored because
c.errors returned an error or was closed.

tests: add testcase to ensure reader routine terminates

Add a testcase for the bug that the reader go-routine tries to send a
message to the buffered rpc channel but call() terminated because it
read an error from the errors chan or the errors chan was closed.
It cause that reader routine gets stuck forever and does not terminate
when the connection is closed.
More information: rabbitmq#69.

This testcase does not reproduce the issue reliably, but it is triggered
in ~80% of executions.

Bump GH actions versions

add missing go.sum file

tests/TestRequiredServerLocale: close connection on testcase termination

The TestRequiredServerLocale testcase was not closing the connection
that it opened.
This caused the goleak detector in the
TestReaderGoRoutineTerminatesWhenMsgIsProcessedDuringClose testcase to
complain about a leaked heartbeat go-routine.

tests: remove duplicate goleak invocation

goleak is now called in TestMain().
The invocation in the
TestReaderGoRoutineTerminatesWhenMsgIsProcessedDuringClose testcase can
be removed.

tests/ExampleConnection_reconnect: close connection on termination

ExampleConnection_reconnect was not closing the opened connection on
termination. This caused the goleak checker to complain about a leaked
heartbeat go routine.

Close the connection.
@lukebakken lukebakken merged commit 5c7877f into rabbitmq:main Apr 19, 2022
@lukebakken
Copy link
Contributor

Thank you!

@lukebakken lukebakken self-assigned this Apr 19, 2022
@fho
Copy link
Contributor Author

fho commented Apr 19, 2022

@lukebakken I somewhen tried running the tests via make tests and then I got the test failure.
ExampleConnection_reconnect only does something when the AMQP_URI is defined, which make tests sets explicitly.

Before I was running it via the commands from #70 (comment) or simply via go tests -tags integration.
I did not had the AMQP_URI environment variable defined, the tests must fallback to some default URI that worked with my local docker container in that case. So the example testcase did not triggered the error before.

When I understood that I defined the env variable, run the tests as described in #70 (comment) and then I found the culprit. :-)

@fho
Copy link
Contributor Author

fho commented Apr 19, 2022

@lukebakken thank you too :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

reader go routine hangs and leaks when Connection.Close() is called multiple times
2 participants