Race condition at-or-near keep-alive expiration #3141

Closed
mweberxyz opened this issue Apr 19, 2024 · 6 comments · Fixed by #3145
Labels
bug Something isn't working

Comments

@mweberxyz
Contributor

mweberxyz commented Apr 19, 2024

Bug Description

I've managed to (fairly) reliably replicate an error when a request is made right around the time a connection is approaching its keep-alive timeout. I have seen it happen in both:

  • Node 20
  • Node 21

using each of the following clients:

  • Node-native fetch
  • Undici 6.13.0
  • Node http.get

Reproducible By

Run server.js:

const { createServer } = require("node:http");

const server = createServer(
  { keepAlive: true, keepAliveTimeout: 5000 },
  (req, res) => {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(
      JSON.stringify({
        data: "Hello World!",
      })
    );
  }
);

server.listen(3000, async () => {
  const endpoint = `http://localhost:${server.address().port}`;
  console.log(endpoint);
});

Then run client-undici.js:

const undici = require("undici");

let timeout = 4993;
const start = async () => {
  timeout += 0.5;
  console.log({ timeout });

  const res = await undici.fetch("http://127.0.0.1:3000");
  await res.text();
  setTimeout(start, timeout);
};

start();

OR client-http.js:

const http = require("http");
let timeout = 4995;
const start = async () => {
  timeout += 0.5;
  console.log({ timeout });

  http.get("http://127.0.0.1:3000", (res) => {
    res.on("data", () => {});
    res.on("end", () => {
      setTimeout(start, timeout);
    });
  });
};

start();

A few runs might need to be tried, and the starting let timeout = ... might need to be adjusted a bit, but give it a few tries and eventually you'll see a crash like:

      Error.captureStackTrace(err, this)
            ^

TypeError: fetch failed
    at Object.fetch (/Users/matt/code/replicate-undici-3133/node_modules/undici/index.js:109:13)
    at async Timeout.start [as _onTimeout] (/Users/matt/code/replicate-undici-3133/client.js:8:15) {
  [cause]: Error: read ECONNRESET
      at TCP.onStreamRead (node:internal/stream_base_commons:217:20) {
    errno: -54,
    code: 'ECONNRESET',
    syscall: 'read'
  }
}

Node.js v21.7.3

To speed up finding the sweet spot, run with NODE_DEBUG=net; if you see the socket being torn down and then the connection being re-established successfully, the timeout needs to be reduced.

Expected Behavior

Node does not crash when making a request near keep-alive expiration.

Logs & Screenshots

Using a similar approach, I managed to catch a similar error in the wild over the internet, though it was much harder to reproduce reliably.

[Screenshot: a similar "fetch failed" / read ECONNRESET stack trace captured against a remote server]

Full NODE_DEBUG=* logs were captured for each of the clients.

Environment

macOS arm64
Node v20.12.1/v21.7.3

Additional context

Not sure if this is purely a core Node problem, or similar bugs in both node:http and undici, but I figured I'd start with the report here since I stumbled upon it while trying to replicate #3133.

mweberxyz added the bug label on Apr 19, 2024
@mweberxyz
Contributor Author

mweberxyz commented Apr 20, 2024

Looking closer at the logs, I don't think there's anything undici can do. The socket is being closed by the server after the 2nd request has been sent, as far as I can tell.

...
FETCH 27254: sending request to GET http://127.0.0.1:3000//
TIMER 27254: 4990 list empty
NET 27254: destroy
NET 27254: close
NET 27254: close handle
NET 27254: emit close
FETCH 27254: request to GET http://127.0.0.1:3000// errored - read ECONNRESET
node:internal/deps/undici/undici:12500
      Error.captureStackTrace(err, this);
            ^

TypeError: fetch failed
    at node:internal/deps/undici/undici:12500:13
    at async Timeout.start [as _onTimeout] (/Users/matt/code/replicate-undici-3133/client.js:6:15) {
  [cause]: Error: read ECONNRESET
      at TCP.onStreamRead (node:internal/stream_base_commons:217:20) {
    errno: -54,
    code: 'ECONNRESET',
    syscall: 'read'
  }
}

Node.js v21.7.3

@mcollina
Member

One option could be to always add the retry interceptor for GETs (and just retry them).

For POSTs and other non-idempotent requests it is a bit more tricky.
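
For illustration, here is one way that idea could be wired up with undici's RetryAgent, restricted to safe methods. This is a sketch only, assuming undici 6.x; the option values are examples, not proposed defaults.

const { Agent, RetryAgent, setGlobalDispatcher } = require("undici");

// Retry only safe/idempotent methods, and only on the error seen in this issue.
const agent = new RetryAgent(new Agent(), {
  maxRetries: 1,
  methods: ["GET", "HEAD"],
  errorCodes: ["ECONNRESET"],
});

setGlobalDispatcher(agent); // undici.fetch() now dispatches through the retrying agent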

@mweberxyz
Contributor Author

What is the goal of keepAliveTimeoutThreshold? With a Keep-Alive: timeout=5 returned by the server, and a default keepAliveTimeoutThreshold of 1000ms, is it unexpected that a request is sent on the existing connection after 4900ms?

@mcollina
Member

I think we should always be keeping 2s (or more) to account for slow networks. So if the server gives us 5s, we should actually consider it 3s.
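
For illustration, that arithmetic against the 5s hint from the reproduction above (the variable names here are made up for the example, not undici internals):

// Illustrative only; these names are not undici internals.
const serverHintMs = 5000               // Keep-Alive: timeout=5
const keepAliveTimeoutThreshold = 2000  // the suggested default
const localTimeoutMs = serverHintMs - keepAliveTimeoutThreshold
console.log(localTimeoutMs) // 3000 -- drop the connection well before the server does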

@mweberxyz
Contributor Author

Regardless of the default, it appears not to be working as expected, and the cause looks to be in lib/util/timers:

If I stub out the fast timers implementation with native timers, there are no errors:

module.exports = {
  setTimeout (callback, delay, opaque) {
    return setTimeout(callback, delay, opaque)
  },
  clearTimeout (timeout) {
    return clearTimeout(timeout)
  }
}
I will continue looking into lib/util/timers.

@mweberxyz
Contributor Author

mweberxyz commented Apr 20, 2024

The problem is that timers always fire at least 1000ms after the provided timeout in the best case, and 2000ms after it in the "best worst case", i.e. just unlucky with the scheduling time, but no event loop contention.

Test case

const timers = require("./timers.js")

// needed because timers in `./timers.js` are unref-ed
setTimeout(() => {
    console.log("10s elapsed, exiting")
    process.exit(0)
}, 10000)

for(const target of [1000, 1500, 1990, 1999, 2000, 2001, 2500, 2990, 2999, 3000, 4000, 5000]) {
    const start = process.hrtime.bigint()
    timers.setTimeout(() => {
        const end = process.hrtime.bigint()
        console.log(`target = ${target}, actual = ${(end - start) / 1_000_000n}`)
    }, target)
}

Result

% node test-latency.js
target = 1000, actual = 2002
target = 1500, actual = 3020
target = 1990, actual = 3020
target = 1999, actual = 3020
target = 2000, actual = 3020
target = 2001, actual = 3021
target = 3000, actual = 4022
target = 2500, actual = 4023
target = 2990, actual = 4023
target = 2999, actual = 4024
target = 4000, actual = 5026
target = 5000, actual = 6028
10s elapsed, exiting

Which means that for a server-sent 5000ms keep-alive and the default 1000ms keepAliveTimeoutThreshold, the target is 4000ms, and the actual timeout always fires too late: at 5000ms or more.
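
To make the mismatch concrete (hypothetical variable names, grounded in the measurement above):

// Illustrative only; names are not undici internals.
const serverKeepAliveMs = 5000              // Keep-Alive: timeout=5
const keepAliveTimeoutThreshold = 1000      // current default
const targetMs = serverKeepAliveMs - keepAliveTimeoutThreshold // 4000

// From the run above, a 4000ms fast timer actually fired at ~5026ms,
// i.e. after the server may already have started closing the socket.
console.log({ targetMs, observedFireMs: 5026, serverClosesAtMs: serverKeepAliveMs })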

Happy to propose a solution via PR, or @ronag can jump in if you prefer.

Edit: I think the best course of action is:

  1. increase the default keepAliveTimeoutThreshold to 2000ms
  2. double the "tick rate" of the fast timeout code by reducing the setTimeout calls from 1000ms to 500ms
  3. subtract 500ms from the requested timeout duration, so that in the best case the timer fires exactly when asked for

Taking all three changes together, given a 5000ms keep-alive: in the best case we close the connection after 3000ms; in the best-worst case (i.e. the timeout is scheduled just after a previous timeout was scheduled) it takes 3500ms; and that leaves 1500ms of breathing room in the worst-worst case where the event loop is blocked.
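
A simplified sketch of what points 2 and 3 would look like in a coarse-tick timer. This is illustrative code in the spirit of lib/util/timers, not the actual implementation; naming and bookkeeping are reduced to the essentials.

const TICK_MS = 500 // proposed tick, down from 1000ms

let fastNow = Date.now()
let tick = null
const fastTimers = []

function onTick () {
  fastNow = Date.now()

  for (let i = 0; i < fastTimers.length; i++) {
    const timer = fastTimers[i]
    if (timer.expires === 0) {
      // the delay clock only starts on the first tick after registration
      timer.expires = fastNow + timer.delay
    } else if (fastNow >= timer.expires) {
      fastTimers.splice(i--, 1)
      timer.callback(timer.opaque)
    }
  }

  tick = fastTimers.length > 0 ? setTimeout(onTick, TICK_MS) : null
}

function fastSetTimeout (callback, delay, opaque) {
  // subtract one tick so the "starts on the next tick" lag is cancelled out
  // and the best case fires roughly when requested
  const timer = { callback, opaque, delay: Math.max(delay - TICK_MS, 0), expires: 0 }
  fastTimers.push(timer)
  if (tick === null) {
    tick = setTimeout(onTick, TICK_MS)
  }
  return timer
}

// quick check: a 3000ms request now fires after roughly 3000ms
const start = Date.now()
fastSetTimeout(() => console.log('fired after', Date.now() - start, 'ms'), 3000, null)

(In this sketch the native timers are not unref-ed, so the demo keeps the process alive; the real module unrefs them, as noted in the test case above.)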

mweberxyz added a commit to mweberxyz/nodejs-undici that referenced this issue Apr 21, 2024
ronag pushed a commit that referenced this issue Apr 22, 2024
* fix: tweak keep-alive timeout implementation

Closes #3141

* fix test impacted by default keepAliveTimeoutThreshold change

* adjust timer registration strategy

* only use fast timers with delay > 1000ms