Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the publishWithContext interface will not return when it times out #178

Closed
liuxiaomeiG opened this issue Feb 24, 2023 · 9 comments · Fixed by #181
Closed

the publishWithContext interface will not return when it times out #178

liuxiaomeiG opened this issue Feb 24, 2023 · 9 comments · Fixed by #181
Assignees

Comments

@liuxiaomeiG
Copy link

When there is a memory alarm in the rabbit, the publishWithContext interface will not return when it times out, and will be stuck all the time. In addition, you cannot manually close the channel and connection at this time

@Zerpet Zerpet self-assigned this Feb 28, 2023
@Zerpet
Copy link
Contributor

Zerpet commented Feb 28, 2023

This happens because rabbitmq stops reading from the connection socket when there's a memory alarm; therefore, the I/O call to write the publish frame to the socket never returns. Unfortunately, the io.Writer does not provide a way to pass down the context. In summary, if the library is waiting on a blocking I/O operation, the context cancellation is ignored.

The same applies to Channel.Close and Connection.Close. Those methods write a frame to the socket to communicate the server a clean channel/connection shutdown. That is how the AMQP 0.9.1 Spec defines the Close method.

We could add an extension to the protocol, and provide Connection.ForceClose(error) function to forcibly close the underlying TCP connection and return immediately. However, I'm not sure how a forceful connection closure would help in this situation. RabbitMQ won't read anything from the connection until the memory alarm is cleared, and for the memory alarm to clear, usually is a matter of consumer apps consuming some messages.

@lukebakken
Copy link
Contributor

Missed heartbeats should eventually close the connection, right?

@Zerpet
Copy link
Contributor

Zerpet commented Mar 1, 2023

I don't think so, for the same reason as above. The heartbeater will try to write to the socket, and it will be blocked on I/O

amqp091-go/connection.go

Lines 707 to 710 in 190c143

case at := <-sendTicks:
// When idle, fill the space with a heartbeat frame
if at.Sub(lastSent) > interval-time.Second {
if err := c.send(&heartbeatFrame{}); err != nil {

The problem is the WriteFrame() function, that does blocking I/O:

err := c.writer.WriteFrame(f)

I have an idea to set a deadline on the heartbeater I/O calls, to address this situation, but that can break other things in some obscure, hard to diagnose, ways.

@liuxiaomeiG it would be super helpful if you could dump the stack trace of all go routines in your program when this situation happens. This post gives some guidance to do this:
https://stackoverflow.com/questions/19094099/how-to-dump-goroutine-stacktraces

@lukebakken
Copy link
Contributor

It looks like we don't continue to set any deadlines after the connection is initialized. My understanding (after reading the docs briefly) is that prior to each write a deadline should be set for the write to complete. If the deadline passes, the blocking operation will throw an exception.

@Zerpet
Copy link
Contributor

Zerpet commented Mar 3, 2023

Indeed we don't set deadlines. It's a but tricky to do because a connection can have multiple channels, and if we set deadlines on each I/O operation on every channel, deadlines will overlap, because the deadline is set in the TCP connection i.e. in the net.Conn object returned in Dial. I think a safe compromise would be to set and renew an I/O deadline in the heartbeater.

@lukebakken lukebakken self-assigned this Mar 3, 2023
@Zerpet
Copy link
Contributor

Zerpet commented Mar 8, 2023

hey everyone 👋 I have a test to reproduce this issue. However, I was slightly wrong about the heartbeater. Even when the connection is blocked (because memory alarm), the heartbeater keeps sending and receiving heartbeats. I verified that heartbeats keep flowing via WireShark and a network traffic capture.

The problem is that Channel.Close and Connection.Close try to do a graceful shutdown by sending a close frame to the server. Since the connection is blocked, the server does not read this frame, and the calls to Close() never return.

I'd like to verify what happens in other clients, so that we take a consistent solution across our client implementations.

@lukebakken
Copy link
Contributor

I'm pretty sure in this case the .NET client will throw on operation timeout exception.

@Zerpet
Copy link
Contributor

Zerpet commented Mar 9, 2023

The Java client also provides a timeout mechanism. This is going to be a bit tricky to do in this client, because Dial does not keep the refence to net.Conn, making it challenging to set a deadline a.k.a. timeout. I'm going to explore some options, but I'm leaning towards a new CloseCtx function, which takes a context and observes context cancellation.

@Zerpet
Copy link
Contributor

Zerpet commented Mar 14, 2023

@liuxiaomeiG we introduced a new function Connection.CloseDeadline(), which accepts a time.Time type parameter to set a deadline to close the connection. I hope this helps your app to not block during a memory alarm.

Regarding publishWithContext, this function is expected to block on I/O during a memory alarm. It will unblock once the memory alarm is cleared. I understand this can be counter-intuitive, given that the function accepts a context.Context and it is expected to observe context cancellation. However, the problem is not related to context cancellation, but blocking I/O.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants