
Do not skip flaky test in CI #85

Merged
merged 1 commit into rabbitmq:main from lukebakken/gh-77-flaky-tests on May 20, 2022

Conversation

lukebakken
Contributor

Part of #77

@lukebakken lukebakken added this to the 1.4.0 milestone May 18, 2022
@lukebakken lukebakken self-assigned this May 18, 2022
@lukebakken lukebakken mentioned this pull request May 19, 2022
@lukebakken
Contributor Author

cc @fho

Well, this sure is interesting. With only 1 CPU in use, the TestRepeatedChannelExceptionWithPublishAndMaxProcsIssue46 test never fails, even without these changes. As soon as you add CPUs, the likelihood of failure goes up. My guess is that this was never a problem until Go 1.5 because of how GOMAXPROCS used to work (it defaulted to 1 before then). On the GitHub runners there may only be 1 virtual CPU available, so we wouldn't see this behavior without specifying -cpu 1,2.

Anyway, on my 16-core laptop, I tested this out with the following command:

```
go test -cpu 1,2,4,8,16 -failfast -race -v -tags integration -run TestRepeatedChannelExceptionWithPublishAndMaxProcsIssue46
```

This is a typical sequence of errors in the RabbitMQ logs:

```
2022-05-19 20:17:24.475000-07:00 [error] <0.29923.2> Channel error on connection <0.29739.2> (127.0.0.1:62619 -> 127.0.0.1:5672, vhost: '/', user: 'guest'), channel 22:
2022-05-19 20:17:24.475000-07:00 [error] <0.29923.2> operation basic.publish caused a channel exception not_found: no exchange 'not-existing-exchange' in vhost '/'
2022-05-19 20:17:24.476000-07:00 [error] <0.29739.2> Error on AMQP connection <0.29739.2> (127.0.0.1:62619 -> 127.0.0.1:5672, vhost: '/', user: 'guest', state: running), channel 22:
2022-05-19 20:17:24.476000-07:00 [error] <0.29739.2>  operation basic.publish caused a connection exception channel_error: "expected 'channel.open'"
2022-05-19 20:17:24.477000-07:00 [info] <0.29739.2> closing AMQP connection <0.29739.2> (127.0.0.1:62619 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
2022-05-19 20:17:24.477000-07:00 [debug] <0.29942.2> Closing all channels from connection '127.0.0.1:62619 -> 127.0.0.1:5672' because it has been closed
```

The changes in this PR account for the fact that you must not use a channel after any channel exception (which was not the case prior to these changes). Notice the connection exception caused by the Go client sneaking a basic.publish frame in after the channel was closed.

So my guess at this time is that the increased concurrency available via GOMAXPROCS > 1 makes it likely that "bad" data will reach RabbitMQ even after a channel-level exception has occurred. Since the TestRepeatedChannelExceptionWithPublishAndMaxProcsIssue46 test behaved abusively, failures became much more likely.

@lukebakken lukebakken marked this pull request as ready for review May 20, 2022 03:51
Part of #77

Break on channel error, logging

Add enough checks so that test passes, comment out some logging.
@lukebakken
Contributor Author

Cool, a 👍 from @fho is as good as any review. Merging once the tests pass. Have a great weekend.

@lukebakken lukebakken merged commit 0290987 into rabbitmq:main May 20, 2022
@lukebakken lukebakken deleted the lukebakken/gh-77-flaky-tests branch May 20, 2022 15:07
@fho
Contributor

fho commented May 20, 2022

@lukebakken 😅 I actually only read your comment so far.
I'll have a deeper look next week and let you know if I find something post-mortem.
Have a nice weekend! :-)

@lukebakken
Contributor Author

Sounds good, thanks.

@fho fho left a comment
Contributor


The PR changes the TestRepeatedChannelExceptionWithPublishAndMaxProcsIssue46 testcase so that it no longer fails on the same condition as before.

The actual bug, that a connection becomes unusable after a message is published on a channel for a non-existing exchange, still happens every now and then.
I can reproduce the issue with the previous version of the TestRepeatedChannelExceptionWithPublishAndMaxProcsIssue46 testcase via go test -count=100 -cpu 24 -race -v -tags integration -run=TestRepeatedChannelExceptionWithPublishAndMaxProcsIssue46 ./... on top of ffeb46d

Three review comments on integration_test.go and one on channel.go (all resolved)
@lukebakken
Contributor Author

lukebakken commented May 23, 2022

Thanks for taking the time to look at this @fho

The actual bug, that a connection becomes unusable after a message is published on a channel for a non-existing exchange

Publishing to an exchange that doesn't exist causes a channel-level exception from RabbitMQ. After such an exception, you can't continue to use that channel. The test used to continue publishing to the same channel, which violates the protocol.

In addition, the test might also sneak a basic.publish to RabbitMQ after RabbitMQ closes the channel (due to the channel exception), in which case you cause a connection-level exception, which is why the test must also check for this.

So, the original test was flawed but it worked because the tests always used to run with GOMAXPROCS=1. I'm not exactly sure why but my guess is that the channel closed message from RabbitMQ was processed by this library prior to trying to publish anything else 🤷

Really, this test exposes why a RabbitMQ publisher must use publisher confirms, or at a bare minimum the mandatory flag, and must not continue using a channel after a channel-level exception.

We can continue discussion in a new issue if you'd like. I'll wait to publish version 1.4.0 of this library in case I'm way off-base here!

@fho
Contributor

fho commented May 24, 2022

@lukebakken thanks for the extended explanation.

In addition, the test might also sneak a basic.publish to RabbitMQ after RabbitMQ closes the channel (due to the channel exception), in which case you cause a connection-level exception, which is why the test must also check for this.

My understanding of the testcase is that it should ensure that the connection stays usable and that it is always possible to open a new channel. The error caused by publishing a message to a non-existing exchange should only render the channel unusable.
Is that correct?

After the testcase changes, these conditions are no longer ensured:

  • The connection is now allowed to break/close.
  • Opening a new channel is now allowed to fail with 504, instead of being required to always succeed.

Therefore I think the original issue can no longer be caught after those changes.

@lukebakken
Contributor Author

@fho I just woke up in the middle of the night and thought again about the test and realized there is indeed a bug somewhere in the library. Then of course you've already commented to say as much.

I'll open a new issue in the morning. Thanks!

lukebakken added a commit that referenced this pull request May 24, 2022
Follow-up to #85

When this test fails, RabbitMQ logs the following connection exception:

```
2022-05-24 11:00:12.747989+00:00 [error] <0.19502.2> Channel error on connection <0.19347.2> (172.17.0.1:46318 -> 172.17.0.2:5672, vhost: '/', user: 'guest'), channel 20:
2022-05-24 11:00:12.747989+00:00 [error] <0.19502.2> operation basic.publish caused a channel exception not_found: no exchange 'not-existing-exchange' in vhost '/'
2022-05-24 11:00:12.748614+00:00 [error] <0.19347.2> Error on AMQP connection <0.19347.2> (172.17.0.1:46318 -> 172.17.0.2:5672, vhost: '/', user: 'guest', state: running), channel 20:
2022-05-24 11:00:12.748614+00:00 [error] <0.19347.2>  operation basic.publish caused a connection exception channel_error: "expected 'channel.open'"
```
@lukebakken
Contributor Author

#87

lukebakken added a commit that referenced this pull request May 25, 2022
Follow-up to #85

When this test fails, RabbitMQ logs the same connection exception shown above.
lukebakken added a commit that referenced this pull request May 26, 2022
* Revert test to demonstrate actual bug

Follow-up to #85

When this test fails, RabbitMQ logs the same connection exception shown above.

* Extend number of iterations

* Looks like this has a good chance of fixing the issue related to #85

* Indicate that a channel is closed immediately after decoding a channelClose frame

* Close the channel prior to the Once call in the same manner as the Connection

* No need to set closed again, wording

* Convert to the correct error type, thanks to @Gsantomaggio

* Conversion fixes