
Implementation of Context-based methods for Channel #389

Open
wants to merge 17 commits into base: master
Conversation

@AlexisMontagne (Author)

Related to #266

Implement a second set of RPC methods on the Channel that take a context.Context as their first argument.

@michaelklishin (Collaborator) left a comment

Thank you so much for working on this!

I like this but would like to suggest a couple of amendments to this PR.

  • How about we rename *Context functions (e.g. CloseContext) to *WithContext (so, CloseWithContext and such)?
  • It would be great to add a few basic tests for key channel operations (open, close, queue declaration, etc.) that use these new functions.

@streadway (Owner)

This is great. To bikeshed the names a bit, I think the existing names like QosContext etc. make sense because they follow the standard library's convention for context-aware I/O, like Dial and DialContext.

With the async nature of client or server channel closure, cancelling the protocol is prone to subtle bugs. Please add some test coverage for the context cancellation cases.

@AlexisMontagne (Author)

AlexisMontagne commented Feb 25, 2019

@michaelklishin @streadway thanks for the feedback. I wanted to make sure the implementation would be OK with you.

I will add some tests and documentation for the new public methods.

To avoid receiving messages out of sequence, I close the current channel if an RPC call is cancelled. Do you think that is the right thing to do?

@michaelklishin (Collaborator)

OK, if this follows an existing naming convention then we should use those names.

@AlexisMontagne I'm not sure I understand the last question of yours, sorry.

@michaelklishin michaelklishin dismissed their stale review February 25, 2019 16:38

The naming follows an existing convention.

@AlexisMontagne (Author)

@michaelklishin sorry, I submitted my comment by mistake.

I meant: in order to deal with canceled RPC calls, should we close the channel or keep it open? cf. https://github.com/streadway/amqp/pull/389/files#diff-861e77702f49a71d32be4b3c7600818fR185

@streadway (Owner)

I meant: in order to deal with canceled RPC calls, should we close the channel or keep it open?

My first reaction is to keep the channel open. AMQP channel lifecycle is complicated and shares client and server resources. If the context methods only relate to RPC round trips instead of server resources, it'd be easier to document.

@streadway (Owner)

When looking at error values, can you suggest a way of promoting a client-side context.Context.Err() into an *amqp.Error?

@AlexisMontagne (Author)

AlexisMontagne commented Feb 25, 2019

@streadway you can find a standardized *amqp.Error here: 95dafc3

I meant: in order to deal with canceled RPC calls, should we close the channel or keep it open?

My first reaction is to keep the channel open. AMQP channel lifecycle is complicated and shares client and server resources. If the context methods only relate to RPC round trips instead of server resources, it'd be easier to document.

If we keep the channel open, I'm worried stale messages could cause issues.

Consider the following case:

T1: Channel X sends a frame for RPC call Y
T2: The context for RPC call Y is canceled
T3: Channel X sends a frame for RPC call Z
T4: The dispatcher forwards the response for RPC call Y to the rpc channel

Is there any way to correlate request and response frames (a sequence ID?) in order to discard frames received for canceled calls?

@michaelklishin (Collaborator)

@AlexisMontagne other clients address this by not allowing channel sharing between threads/routines/etc. There are no sequence identifiers in the protocol, only response order guarantees for synchronous methods.

Note that sharing a channel for concurrent publishing, for example, is inherently unsafe because a published message is 2 or 3 frames on the wire and you can end up with incorrect frame interleaving.

While closing the channel might be a decent option, since there is no way to undo a sent frame, the above makes me wonder whether

  • Context support in this case would really achieve the goals it does in other libraries
  • Concurrent access to a channel should be advertised as supported (it may be in some cases for consumers, but as with any mutable state, eliminating sharing avoids a lot of hard-to-diagnose issues)

Keeping a channel open and documenting known limitations would be closer to what virtually every other RabbitMQ client does.

WDYT?

@streadway (Owner)

Note that sharing a channel for concurrent publishing, for example, is inherently unsafe because a published message is 2 or 3 frames on the wire and you can end up with incorrect frame interleaving.

This library is concurrency-safe for all methods, including publishing: a channel-specific mutex is held until all frames of a publish are sent, before any other frames may be sent.

@streadway (Owner)

  • Context support in this case would really achieve the goals it does in other libraries

The primary use case for the context package is to represent a distributed, cancelable transaction.

It sounds like this is a discussion of "does a cancelled context propagate to the server or not?"

If we were to close a channel on cancellation during any RPC, would we also close the connection should we not be able to open a channel (RPC on channel id 0)?

Let's start with discarding the in-flight reply after cancellation to keep the channel open, and let the application determine how to propagate a cancellation to the server based on what the application considers a transaction.

@michaelklishin (Collaborator)

Then it's up to the client to filter out the responses it cannot, or no longer needs to, handle. The Java client and Bunny do this by, yes, ignoring the responses they can detect as stale or mismatching the pending continuation (pending operation).

The only scenario where one cannot open a channel is if you run into the channel_max limit (2047 in modern RabbitMQ versions by default) and that is pretty rare and usually indicates a channel leak. Closing a connection in that case would still be pretty disruptive and unexpected :/

@AlexisMontagne (Author)

@streadway I updated the pull request according to your last message, and added a test to ensure that messages received after the call is canceled are dropped.

@michaelklishin (Collaborator) left a comment

I got one of the new tests to hang and also hit this failure in one of the runs:

=== RUN   TestTimeoutQosContextCanceled
panic: Fail in goroutine after TestTimeoutQosContextTimeoutDropping has completed

goroutine 397 [running]:
testing.(*common).Fail(0xc0000e4500)
	/usr/local/Cellar/go/1.11.5/libexec/src/testing/testing.go:546 +0x1c6
testing.(*common).FailNow(0xc0000e4500)
	/usr/local/Cellar/go/1.11.5/libexec/src/testing/testing.go:568 +0x39
testing.(*common).Fatalf(0xc0000e4500, 0x149810b, 0x13, 0xc000115da0, 0x1, 0x1)
	/usr/local/Cellar/go/1.11.5/libexec/src/testing/testing.go:634 +0x95
github.com/streadway/amqp.(*server).recv(0xc0001dc870, 0x1, 0x14fd940, 0xc0004d11e8, 0x0, 0x0)
	/path/to/gopath/src/github.com/streadway/amqp/client_test.go:105 +0x1a6
github.com/streadway/amqp.TestTimeoutQosContextTimeoutDropping.func1(0xc0001dc870)
	/path/to/gopath/src/github.com/streadway/amqp/client_test.go:766 +0x170
created by github.com/streadway/amqp.TestTimeoutQosContextTimeoutDropping
	/path/to/gopath/src/github.com/streadway/amqp/client_test.go:758 +0x10a
exit status 2

while running tests in a loop with

for i in {1..10}; do go test -v -cpu 1,3 -tags integration -race; done

@streadway (Owner) left a comment

This is tricky to get right because of the message ordering requirement for protocol sequencing.

I don't think select { default: } dropping messages on send/recv to the rpc is the right thing to do because it ties the ordering to the runtime scheduler.

Can you come up with a different approach where the reply from every cancelled request message is explicitly consumed, either in dispatch() or call()?

connection.go Outdated
-	c.rpc <- m
+	select {
+	case c.rpc <- m:
+	default:
+	}
@streadway (Owner)

Why does this now drop the message if there is no longer a receiver on the buffered rpc channel?

channel.go Outdated
-	ch.rpc <- msg
+	select {
+	case ch.rpc <- msg:
+	default:
+	}
@streadway (Owner)

Same as on the connection, this seems like it could drop replies for unintended reasons.

channel.go Outdated
default:
}

if req.wait() {
@streadway (Owner)

The reply from a cancelled request may be in flight and not reside on the rpc channel at this point, so it will not drain.

Take this scenario, where each call has a different context:

client send(a)
client send(b)
client wait(a) - cancelled - returned
server reply(a)
client wait(b) - recv reply(a)

@AlexisMontagne (Author)

I don't think this scenario is possible. Every RPC call is surrounded by a mutex.

Therefore that could end up as:

client send(a)
client wait(a) - cancelled - returned
server reply(a) - message dropped
client send(b)
client wait(b) 
server reply(b) 
recv reply(b) 

Or (the case I described here: #389 (comment)):

client send(a)
client wait(a) - cancelled - returned
client send(b)
client wait(b) 
server reply(a) - message dropped
recv reply(a)  

And that causes issues.

But I cannot figure out any way to handle this case with the current protocol. IMHO the safest and most deterministic way of dealing with this situation is to close the channel in case of cancellation.

@streadway (Owner)

I still think the most obvious behavior is not to close the channel when a context is Done(). I believe a done context is closer to a late no-wait than to a client error.

This is a matter of sequencing the consumption of the replies in the same order as the requests were written. It looks like communication between the writer goroutine (application) and reader goroutine (dispatch) is how the sequencing would be implemented.

@AlexisMontagne (Author)

@streadway what do you think of #389 (comment)?

@AlexisMontagne (Author)

@streadway I think I got what you mean. Tell me what you think of the last commit.

Are we really sure the server will always reply to the client? Will it never give up on replying to any message?

All my reasoning was based on the assumption that the server could be flaky and skip replies.

@streadway (Owner)

AMQP requires TCP, which offers transport reliability. The heartbeater will close the connection on read timeout, which will surface an error should we experience any message loss.

I've found RabbitMQ to be very reliable: it doesn't drop messages or show flaky behavior, so we can count on every protocol request receiving a reply.

A counter is one approach, but that assumes contexts will be cancelled in order, which I don't think we can depend on. Consider the message exchange:

req(a)
req(b)
req(c)
req(d)
cancel(a) 0->1
cancel(c) 1->2
reply(a) - dropped 2->1
reply(b) - dropped 1->0
reply(c) - dispatch(b)
reply(d) - dispatch(d)

@AlexisMontagne (Author)

@streadway for some reason I was thinking a mutex surrounded every call() invocation 🤔.

I updated my PR to handle the use case you described (and added a test to ensure it).

@AlexisMontagne (Author)

@streadway it should be good now. I wrapped the "sending" part of the call method in a mutex.

It also makes frame sending thread-safe. Previously, https://github.com/streadway/amqp/blob/master/channel.go#L219 was not protected by any Channel-wide mutex.

@michaelklishin The added tests should now pass with the -race option.

@michaelklishin (Collaborator)

@AlexisMontagne thank you for fixing the problematic test.

@michaelklishin (Collaborator) left a comment

I admire the amount of effort you've put into the docs, and perhaps 50 to 65% of them can be retained. Unfortunately, the docs are not always correct and reinvent RabbitMQ doc guides where it's not really necessary. I'd recommend linking to the relevant RabbitMQ guides more. If something is missing in the guides (e.g. a guide on bindings besides the tutorials), I'd be happy to work on that. This will benefit all RabbitMQ users regardless of the client.

Thank you!

@AlexisMontagne (Author)

@michaelklishin I did not write any of this documentation; I reused what was already in the code for the context-free methods.

@michaelklishin (Collaborator)

@AlexisMontagne OK, thanks. I will revisit that together with the new functions then.

@michaelklishin michaelklishin dismissed their stale review March 12, 2019 15:47

The docs were copied from existing functions

@michaelklishin (Collaborator)

@streadway any opinion on this final version?

@streadway streadway mentioned this pull request Mar 15, 2019
@streadway (Owner) left a comment

This PR represents four different changes: support for in-flight cancellation of RPC messages, context propagation to RPCs, an exported API change, and linter fixes. I'd like to merge each of these changes separately.

Can you please split this up?

The linter fixes can get merged right away.

The in-flight cancellation (call() + dispatch() coordination) is the next change; it shouldn't affect anything, but it needs to be released so we can capture any deadlocks or panics from unknown causes.

The context propagation is next and this can be done via non-exported methods first:

func (ch *Channel) Qos(...) error {
  return ch.qosContext(context.Background(), ...)
}

After releasing those three changes, we can make sure there is enough time for adoption before exporting the Context-based methods along with the copied documentation strings.

}

select {
case <-ch.NotifyClose(make(chan *Error)):
@streadway (Owner)

NotifyClose will not have been registered prior to Channel.Qos, so there will never be a message to receive.

defer cancel()

if err := ch.QosContext(ctx, 2, 0, false); err != ErrCanceled {
	t.Fatalf("wrong timeout error: %v", err)
@streadway (Owner)

When writing test messages, it's important to communicate the why behind a test failure. The messages in Errorf and Fatalf should clearly communicate why the condition is considered "wrong" and what that means for someone to fix it.

Please review all your test Fatalf and Errorf calls to answer "if the test name and this message is all I know about the library, do I know what to do next?"

srv.channelOpen(1)

srv.recv(1, &basicQos{})
time.Sleep(150 * time.Millisecond)
@streadway (Owner)

Please coordinate client/server conditions with channels or wait groups rather than with the wall clock. I have witnessed TravisCI pause for over 500ms during test execution, which would cause this test to fail with a false negative.

t.Fatalf("could not open channel: %v (%s)", ch, err)
}

ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
@streadway (Owner)

You can use WithCancel to avoid depending on time.

srv.recv(1, &basicQos{})
srv.recv(1, &channelFlow{})

time.Sleep(10 * time.Millisecond)
@streadway (Owner)

Please remove the dependency on time for this and other tests.

@@ -19,6 +19,11 @@ const (
ExchangeHeaders = "headers"
)

// Constants for custom error code
const (
Canceled = 542
@streadway (Owner)

@michaelklishin how do other clients represent client-specific error codes? This number is not defined in the AMQP spec, so I wonder what the closest one would be, or whether we need a different error type for client-specific errors.

func (ch *Channel) call(ctx context.Context, req message, res ...message) error {
	select {
	case <-ctx.Done():
		return ErrCanceled
@streadway (Owner)

From the context package documentation:

        // If Done is closed, Err returns a non-nil error explaining why:
        // Canceled if the context was canceled
        // or DeadlineExceeded if the context's deadline passed.

This return value should also represent the underlying reason, and be either ErrCanceled or ErrDeadlineExceeded.

if req.wait() {
	select {
	case <-ctx.Done():
		atomic.StoreUint32(&status, 1)
@streadway (Owner)

There is a race condition between receiving ctx.Done(), storing status = 1, and loading status in dispatch().

Instead, let's collect the in-flight contexts, so we will coordinate by communicating over Context.Done() rather than sharing the status state.

I think this would mean a responses slice of []atomic.Value and storing and loading context.Context types.

@streadway (Owner)

Alternatively, the responses slice could contain a buffered chan of messages and that would be used instead of rpc.


ch.responsesM.Lock()

if len(ch.responses) == 0 {
@streadway (Owner)

This should never happen. I think we ought to panic to reveal any bugs.

status, ch.responses = ch.responses[0], ch.responses[1:]
ch.responsesM.Unlock()

if atomic.LoadUint32(status) == 0 {
@streadway (Owner)

If we were using channels instead of state, this would be:

select {
case <-status.Done():
default:
	ch.rpc <- msg
}

@streadway (Owner)

I admire the amount of effort you've put into the docs, and perhaps 50 to 65% of them can be retained. Unfortunately, the docs are not always correct and reinvent RabbitMQ doc guides where it's not really necessary. I'd recommend linking to the relevant RabbitMQ guides more. If something is missing in the guides (e.g. a guide on bindings besides the tutorials), I'd be happy to work on that. This will benefit all RabbitMQ users regardless of the client.

@michaelklishin This would be great! The method docs were written in 2012 and tried to strike a balance between sharing enough context about AMQP principles, RabbitMQ extensions/constraints, and client library choices. I'd very much welcome documentation review and improvements from the RabbitMQ team!

@michaelklishin (Collaborator)

@streadway and I discussed this on a Hangout. Besides splitting this PR into a few smaller ones (@AlexisMontagne, thank you for your truly exemplary patience with us!), we'd like to revisit the internal concurrency model used by this client, both to simplify Context adoption and to make it closer to other clients. It has a chance of improving CPU resource utilisation on some workloads, too ;)

@streadway is going to file a new issue about that, but basically operation dispatch should be synchronised per channel and not over the entire connection. Channels are assumed to be "independent entities" in the protocol. Publishing on a shared [protocol] channel is not supported, so other clients guarantee total ordering on a channel, not across channels. A single channel in the server never dispatches its operations concurrently either. This would make Context support more straightforward.

@michaelklishin (Collaborator)

Hey folks,

I'm posting this on behalf of the core team.

As you have noticed, this client hasn't seen a lot of activity recently. Many users are unhappy about that, and we fully recognize that it's a popular library that should be maintained more actively. There are also many community members who have contributed pull requests that haven't been merged for various reasons.

Because this client has a long tradition of "no breaking public API changes", certain reasonable changes will likely never be accepted. This is frustrating to those who have put their time and effort into trying to improve this library.

We would like to thank @streadway for developing this client and maintaining it for a decade; that's a remarkable contribution to the RabbitMQ ecosystem. We think now is a good time to get more contributors involved.

Team RabbitMQ has adopted a "hard fork" of this client
in order to give the community a place to evolve the API. Several RabbitMQ core team members
will participate but we think it very much should be a community-driven effort.

What do we mean by "hard fork" and what does it mean for you? The entire history of the project
is retained in the new repository but it is not a GitHub fork by design. The license remains the same
2-clause BSD. The contribution process won't change much (except that we hope to review and accept PRs
reasonably quickly).

What does change is that this new fork will accept reasonable breaking API changes according
to Semantic Versioning (or at least our understanding of it). At the moment the API is identical
to that of streadway/amqp but the package name is different. We will begin reviewing PRs
and merging them if they make sense in the upcoming weeks.

If your PR hasn't been accepted or reviewed, you are welcome to re-submit it for rabbitmq/amqp091-go.
RabbitMQ core team members will evaluate the PRs currently open for streadway/amqp as time allows,
and pull in those that don't have any conflicts. We cannot promise that every PR will be accepted, but at least we are open to changing the API going forward.

Note that it is a high season for holidays in some parts of the world, so we may be slower
to respond in the next few weeks but otherwise, we are eager to review as many currently open PRs
as practically possible soon.

Thank you for using RabbitMQ and contributing to this client. On behalf of the RabbitMQ core team,
@ChunyiLyu and @michaelklishin.
