Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only create keepalive timeout if there isn't one pending #2304

Closed
wants to merge 1 commit into from

Conversation

nicknotfun
Copy link

From https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md we have the keepalive time and timeout; the intention being that a keepalive ping is sent at-least every time and if there is no bytes of response within timeout the connection is considered dead.

The current grpc-js implementation has an issue in that if time is less than the actual ping response time (due to network, GC, whatever at the remote end) - a new ping would be scheduled, with a new timeout, and the old reference to the timeout is lost and thus will never be cancelled. This inevitably drops the connection.

As the intention of the gRFC [in my reading] was 'any byte' of response, I think it is then reasonable to only have a single timeout active at a time, and despite potentially multiple ping() calls, any response will close that timeout until the next ping.

Unsure on intended testing as I couldn't find an example current test- but from detailed gRPC tracing in our application I believe we are running into this issue, an example log:

28-12-2022T16:37:41 D 2022-12-28T21:37:41.594Z | keepalive | (44) 127.0.0.1:3052 Received ping response
28-12-2022T16:37:42 D 2022-12-28T21:37:42.594Z | keepalive | (44) 127.0.0.1:3052 Sending ping with timeout 10000ms
28-12-2022T16:37:42 D 2022-12-28T21:37:42.885Z | call_stream | [7] receive HTTP/2 data frame of length 17
28-12-2022T16:37:42 D 2022-12-28T21:37:42.885Z | call_stream | [7] parsed message of length 17
28-12-2022T16:37:42 D 2022-12-28T21:37:42.885Z | call_stream | [7] filterReceivedMessage of length 17
28-12-2022T16:37:42 D 2022-12-28T21:37:42.885Z | call_stream | [7] pushing to reader message of length 12
28-12-2022T16:37:42 D 2022-12-28T21:37:42.885Z | keepalive | (44) 127.0.0.1:3052 Received ping response
28-12-2022T16:37:43 D 2022-12-28T21:37:43.594Z | keepalive | (44) 127.0.0.1:3052 Ping timeout passed without response
28-12-2022T16:37:43 D 2022-12-28T21:37:43.594Z | subchannel | (44) 127.0.0.1:3052 READY -> TRANSIENT_FAILURE

It is not clear which keepalive is responsible for that timeout, however the connection did get a ping response less than a second prior so I don't think it should be considered a timeout (our settings are 1000ms/10000ms for time and timeout)

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Dec 28, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: nicknotfun / name: Nick Cooper (7b43638)

@nicknotfun
Copy link
Author

nicknotfun commented Dec 28, 2022

Note, an alternative would be to quash any new ping if one is in-flight (early exit instead of conditional create) - I wasn't sure which is preferred. As the data should be reliable, i'd also expect the first ping to eventually get response.

One reason not to do this is it makes the keepalive-load from a client predictable and constant.

@nicknotfun
Copy link
Author

Oh, i'm also willing to rework this to be closer to the spec overall if desired, namely a keepalive should be scheduled relative to the last ack.

Copy link
Member

@murgatroid99 murgatroid99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good change, with just style issue to fix.

@@ -121,7 +121,7 @@ export class Subchannel {
/**
* Timer reference tracking when the most recent ping will be considered lost
*/
private keepaliveTimeoutId: NodeJS.Timer;
private keepaliveTimeoutId?: NodeJS.Timer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field should be possibly null instead of optional. See for example session and channelzSocketRef in this class.

@murgatroid99
Copy link
Member

Update: I opened #2308, which is a big refactor of the Subchannel class that moves all of this code around. I integrated this change into it, so I don't think it makes sense to keep this PR open. If necessary, we can further refine the keepalive behavior after that PR is merged.

@nicknotfun
Copy link
Author

Sounds good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants