
requestTimeout and long running uploads #46574

Open
julien-f opened this issue Feb 8, 2023 · 17 comments
Labels
http Issues or PRs related to the http subsystem.

Comments

@julien-f
Contributor

julien-f commented Feb 8, 2023

Version

v18.14.0

Platform

Linux prometheus 6.1.9-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 2 00:21:48 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

http

What steps will reproduce the bug?

Create an HTTP server with a big requestTimeout (a day in this example):

require('node:http').createServer({
  requestTimeout: 24 * 60 * 60 * 1e3, // a day
}, (req, res) => {
  console.log(req.method, req.url);

  req.resume().on('end', () => {
    res.end('Ok');
  }).on('error', console.warn);
}).listen(8080);

Make a long-running request to it:

$ curl -T /dev/urandom http://localhost:8080 -v
*   Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> PUT /urandom HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.85.0
> Accept: */*
> Transfer-Encoding: chunked
> Expect: 100-continue
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* Mark bundle as not supporting multiuse
< HTTP/1.1 408 Request Timeout
< Connection: close
< 
* Send failure: Connection reset by peer
* Closing connection 0
curl: (55) Send failure: Connection reset by peer

How often does it reproduce? Is there a required condition?

It happens every time, and I haven't been able to pinpoint the exact value at which it becomes problematic.

What is the expected behavior?

It should interrupt the request after 24 hours.

What do you see instead?

It interrupts the request after ~30 seconds.

Additional information

I can work around this by disabling the timeout completely (requestTimeout: 0), but setting a value, even a high one, seems preferable.

Any other suggestions on how to handle long running uploads?
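
For reference, one way to handle long-running uploads without a global cap is to disable requestTimeout and enforce a per-socket inactivity timeout instead. A minimal sketch (the 10-minute idle limit and the destroy-on-idle behaviour are illustrative choices, not something prescribed in this thread):

require('node:http').createServer({
  requestTimeout: 0, // disable the global request timeout entirely
}, (req, res) => {
  console.log(req.method, req.url);

  // Illustrative idle limit: drop the connection only if no data arrives
  // for 10 minutes, instead of capping the total upload duration.
  req.socket.setTimeout(10 * 60 * 1e3, () => {
    req.destroy(new Error('upload idle timeout'));
  });

  req.resume().on('end', () => {
    res.end('Ok');
  }).on('error', console.warn);
}).listen(8080);

This trades the absolute request deadline for an inactivity-based one, which tends to fit uploads of unpredictable size.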

@VoltrexKeyva VoltrexKeyva added the http Issues or PRs related to the http subsystem. label Feb 8, 2023
@theanarkh
Contributor

Try this.

require('node:http').createServer({
  requestTimeout: 24 * 60 * 60 * 1e3, // a day
  connectionsCheckingInterval: 24 * 60 * 60 * 1e3
}, (req, res) => {
  console.log(req.method, req.url);

  req.resume().on('end', () => {
    res.end('Ok');
  }).on('error', console.warn);
}).listen(8080);

@julien-f
Contributor Author

julien-f commented Feb 9, 2023

@theanarkh Why? What do you expect to see with this?

From what I understand, there is still an issue in how requestTimeout is handled, and even if connectionsCheckingInterval works, it will check all connections every 24 hours, which will not guarantee that a specific connection can stay open for 24 hours.

julien-f added a commit to vatesfr/xen-orchestra that referenced this issue Feb 9, 2023
Fixes https://xcp-ng.org/forum/post/58146

Caused by nodejs/node#46574

It caused requests to time out after 0-30 seconds, which broke all uploads.
@julien-f
Contributor Author

@ShogunPanda Any opinions on this?

@ShogunPanda
Contributor

Inside the connection handling algorithm I always use uv_hrtime to track connections.

@nodejs/libuv Do you think this might explain it?

@bnoordhuis
Member

@ShogunPanda explain what exactly?

@ShogunPanda
Contributor

@bnoordhuis I re-read it and probably got it wrong. I'll try to reproduce it and see what happens. I'll get back here shortly.

@StephenKelly94

StephenKelly94 commented Feb 14, 2023

I have been experiencing the same issue. I set a requestTimeout to 2 hours, and was seeing the same result (connection reset by peer after ~30 seconds).

I noticed, though, that after 2 hours of system uptime it started to work correctly. After this I set the requestTimeout to 5 minutes and, sure enough, it started working after 5 minutes of uptime. So perhaps it's related to system uptime?

So I opted to set the requestTimeout to 0, and this fixes the issue for now.

@ShogunPanda
Contributor

Unfortunately I cannot reproduce this. I tried both on a Mac and on Linux.
Given that you mentioned system uptime and time of day, can you provide more info so I can take a look?

@julien-f
Contributor Author

I don't see anything related to the uptime of my machine.

@ShogunPanda What behavior do you get with my test script above? On my system (and others, like Ubuntu), curl gets an ECONNRESET before 30 seconds.

@ShogunPanda
Contributor

@julien-f In my case the server keeps receiving the request without issues.

@StephenKelly94

StephenKelly94 commented Mar 1, 2023

So I altered @julien-f's script to match more closely how we encountered it.

Node Version: 18.12.1

const http = require('node:http');
const srv = http.createServer((req, res) => {
    console.log(req.method, req.url);

    req.resume().on('end', () => {
        res.end('Ok');
    }).on('error', console.warn);
}).listen(10080);
srv.requestTimeout = 10 * 60 * 1000; // 10 minute timeout

I get the following output before 10 minutes of uptime:

$ uptime
07:44:46 up 5 min,  0 users,  load average: 29.40, 19.92, 8.74
$ curl -T /dev/urandom http://localhost:10080 | cat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  211M    0     0    0  211M      0  20.0M --:--:--  0:00:10 --:--:-- 20.6M
curl: (55) Send failure: Broken pipe
$ curl -T /dev/urandom http://localhost:10080 | cat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  256M    0     0    0  256M      0  20.8M --:--:--  0:00:12 --:--:-- 21.3M
curl: (55) Send failure: Broken pipe

After 10 minutes of uptime it looks like this:

$ uptime
 07:56:26 up 16 min,  0 users,  load average: 2.90, 6.76, 8.16
$ curl -T /dev/urandom http://localhost:10080 | cat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13.3G    0     0    0 13.3G      0  21.9M --:--:--  0:10:25 --:--:-- 22.3M
curl: (55) Send failure: Broken pipe
$ curl -T /dev/urandom http://localhost:10080 | cat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.8G    0     0    0 12.8G      0  21.9M --:--:--  0:10:01 --:--:-- 21.2M
curl: (55) Send failure: Broken pipe

I'm not sure if I'm missing anything or if setting the requestTimeout after the server is instantiated makes a difference.

MathieuRA pushed a commit to vatesfr/xen-orchestra that referenced this issue Mar 7, 2023
Fixes https://xcp-ng.org/forum/post/58146

Caused by nodejs/node#46574

It caused requests to time out after 0-30 seconds, which broke all uploads.
@nsgundy

nsgundy commented Apr 3, 2023

Running into this as well, it seems. It's an ARMv7 architecture. So far I have only reproduced it manually by uploading a 230 MB file, which takes about 10 seconds to complete, handled by express and multer (certainly not a minimal reproduction setup). The timeout can happen anywhere within the transfer. This only seems to happen when the uptime of the system is between 2 and 5 minutes. Running Node 18.15.0. requestTimeout is unaltered at 300 seconds.

@sheepa

sheepa commented Apr 3, 2023

Getting this issue as well; there seems to be some upper limit to requestTimeout above which it stops working and any connection gets timed out whenever connectionsCheckingInterval fires.

EDIT: Can't say for sure what's causing this, or under what circumstances, but requestTimeout is definitely bugged.

@das7pad
Copy link

das7pad commented Aug 31, 2023

Hello,

We are seeing a lot of unexpected status 408 responses as well. Most notably in our CI environment running Node.js 18.17.1 in Google Cloud Build on n1-highcpu-32 VMs (x86_64). We see 408s for pretty much all requests, even very simple GET requests. The test HTTP servers use default options/timeouts. Our busiest test pipeline saw at least one unexpected 408 request timeout in about 50% of the builds.

We traced this down to a bug in the logic behind connectionsCheckingInterval: the connection cleanup routine uses uv_hrtime() readings to track the last activity per connection. It uses uv_hrtime() - requestTimeout * 1e6 (and, respectively, uv_hrtime() - headersTimeout * 1e6) as the reference for staleness, stored in an unsigned 64-bit integer.
Per the docs of uv_hrtime(), it returns arbitrary but monotonic clock readings.

uint64_t uv_hrtime(void)

    Returns the current high-resolution real time. This is expressed in
    nanoseconds. It is relative to an arbitrary time in the past. It is not
    related to the time of day and therefore not subject to clock drift. The
    primary use is for measuring performance between intervals.

    Note: Not every platform can support nanosecond resolution; however,
    this value will always be in nanoseconds.

By definition, uv_hrtime() readings may be less than the configured request or header timeouts. In turn, the subtractions may overflow and flag all connections as stale. I'll add a few example readings below.

The default request timeout of 300000 ms corresponds to 300000000000 ns, so 242653195462 - 300000000000 = -57346804538, or 18446744016362747078 as uint64.

  • 239934240956
  • 233476307396
  • 233536410987
  • 233606060330
  • 238853987548
  • 239820072736
  • 240095063948
  • 240250386896
  • 240297219170
  • 240032647966
  • 238283618445
  • 242653195462
  • 221756831153
  • 233199677409
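
For illustration, the wrap-around from the readings above can be reproduced with BigInt arithmetic in Node.js; this is only a sketch of the arithmetic, not the actual C++ code in the HTTP server:

// One of the uv_hrtime() readings above, in nanoseconds
const now = 242653195462n;
// Default requestTimeout of 300000 ms, converted to nanoseconds
const requestTimeoutNs = 300000n * 1000000n;

// The signed result is negative...
console.log(now - requestTimeoutNs); // -57346804538n
// ...but interpreted as an unsigned 64-bit integer it wraps around, so every
// connection appears older than the cutoff and is flagged as stale.
console.log(BigInt.asUintN(64, now - requestTimeoutNs)); // 18446744016362747078n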

The overflow bug has been fixed on the main and 20.x branches already: #48291.
The patch applies cleanly to 18.17.1 and our CI pipeline is happy again!
WDYT about backporting the fix to Node.js 18 via 18.x-staging? I'm happy to open a PR.

Also, WDYT about updating the comment in the above patch to explain that this affects ALL environments, not just IoT/embedded devices? Something like: s/On IoT or embedded devices the uv_hrtime() may return the timestamp/By definition, uv_hrtime() may return a timestamp/? Again, I'm happy to open a PR for main, include the comment tweak in the backport PR for 18.x-staging, and open another one for 20.x-staging (if needed).

@sheepa

sheepa commented Aug 31, 2023

So the #48291 fix will set the timeout to infinite if the hrtime reading is too low? If a shorter timeout was requested, is that not unexpected?

@das7pad

das7pad commented Sep 1, 2023

You can think of uv_hrtime() as a counter that is incremented for every nanosecond of program runtime. The counter is initialized with a random number x, picked from 0 < x < Date.now()*1e6. For two consecutive calls of uv_hrtime() we get values y and z with x <= y <= z (the counter only increments, so we observe either the same value or a greater one). Here y is our last_message_start_, assigned when a new message starts, and z is now, as read by the connection expiry check. If uv_hrtime() returns now < timeout, we can deduce that last_message_start_ < timeout as well, via x <= y <= z, i.e. x <= last_message_start_ <= now.
In other words, when the counter has not yet been incremented timeout times, no prior counter reading can have been greater than timeout, so subtracting timeout from now would yield a negative number or zero (i.e. it would point at or before the program start).
This in turn allows us to deduce that no connection can have been stale for more than timeout nanoseconds yet, so we do not need to check any connections for staleness with respect to this timeout. We flag a timeout as "do not check" by setting it to zero.
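
In JavaScript terms, that guard amounts to roughly the following. This is a sketch only; the actual #48291 fix lives in the C++ connection-checking code, and the names here (stalenessCutoff, nowNs, timeoutNs) are made up for illustration:

// nowNs: current uv_hrtime() reading; timeoutNs: configured timeout,
// both in nanoseconds as BigInt.
function stalenessCutoff(nowNs, timeoutNs) {
  if (nowNs < timeoutNs) {
    // The counter has not yet advanced by timeoutNs, so no request can be
    // that old; 0 means "do not check this timeout".
    return 0n;
  }
  return nowNs - timeoutNs; // safe: cannot go negative and wrap around
}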

@das7pad

das7pad commented Sep 25, 2023

#48291 has been backported to Node 18.18 via 977e9a3. We were able to revert our workaround now.
