
Multiple parallel long fetches with overridden, non-zero timeouts fail all the time #1887

Closed · xconverge opened this issue Jan 26, 2023 · 22 comments · Fixed by #1904
Labels: bug (Something isn't working)

@xconverge (Contributor)

Bug Description

When running multiple LONG-running fetches in parallel, some of them seem to just die and hit a timeout that doesn't really seem relevant:

name: HeadersTimeoutError
    message: Headers Timeout Error
    stack: >-
      HeadersTimeoutError: Headers Timeout Error
          at onParserTimeout (/src/archive/undici/lib/client.js:908:28)
          at callTimer (/src/archive/undici/node_modules/@sinonjs/fake-timers/src/fake-timers-src.js:743:24)
          at doTickInner (/src/archive/undici/node_modules/@sinonjs/fake-timers/src/fake-timers-src.js:1311:29)
          at doTick (/src/archive/undici/node_modules/@sinonjs/fake-timers/src/fake-timers-src.js:1392:20)
          at Object.tick (/src/archive/undici/node_modules/@sinonjs/fake-timers/src/fake-timers-src.js:1400:20)
          at Server.<anonymous> (/src/archive/undici/test/fetch/fetch-timeouts.js:32:11)
          at Server.emit (node:events:513:28)
          at parserOnIncoming (node:_http_server:1065:12)
          at HTTPParser.parserOnHeadersComplete (node:_http_common:117:17)
    code: UND_ERR_HEADERS_TIMEOUT

Reproducible By

Tweaking minutesPerRequest, numParallelRequests, and timeout here makes for some pretty confusing results. When timeout is 0, everything seems to work correctly. I suspect there are a few timers that are overlapping each other or getting shared somehow.

Simply set timeout to 0 and see that the test passes, then set timeout to 30 minutes and see that it fails:

"use strict";

const { test } = require("tap");

const { fetch, Agent } = require("../..");
const { createServer } = require("http");
const FakeTimers = require("@sinonjs/fake-timers");

// Tweaking request length can have an impact too potentially (Setting to 1 seems to pass, and setting it to 10 min fails on my machine)
const minutesPerRequest = 10;

// On my machine setting this to 0 passes, and setting it to something very high fails with a UND_ERR_HEADERS_TIMEOUT
const timeout = 1000 * 60 * (minutesPerRequest + 5);
// const timeout = 0;

// Tweaking num requests to either 1 or multiple has different impacts
const numParallelRequests = 5;

test("Fetch very long parallel requests, timeout overridden", (t) => {
  const msToDelay = 1000 * 60 * minutesPerRequest;

  t.setTimeout(undefined);
  t.plan(numParallelRequests);

  const clock = FakeTimers.install();
  t.teardown(clock.uninstall.bind(clock));

  const server = createServer((req, res) => {
    setTimeout(() => {
      res.end("hello");
    }, msToDelay);
    clock.tick(msToDelay + 1);
  });
  t.teardown(server.close.bind(server));

  server.listen(0, () => {
    for (let i = 0; i < numParallelRequests; i++) {
      fetch(`http://localhost:${server.address().port}`, {
        path: "/",
        method: "GET",
        dispatcher: new Agent({
          headersTimeout: timeout,
          connectTimeout: timeout,
          bodyTimeout: timeout,
        }),
      })
        .then((response) => response.text())
        .then((response) => {
          t.equal("hello", response);
        })
        .catch((err) => {
          // This should not happen, a timeout error should not occur
          t.error(err);
        });
    }

    clock.tick(msToDelay - 1);
  });
});

Expected Behavior

The fetches should succeed.

Logs & Screenshots

Environment

Node v18.13.0

Additional context

@mcollina (Member)

Does this problem happen only with fake timers or "live" too?

@xconverge (Contributor, Author) commented Jan 26, 2023

Does this problem happen only with fake timers or "live" too?

I am battling something brutal with 5x longish 20-minute requests not working on a real, deployed machine. 1 or 2 of the shorter requests work, but the other few lingering ones just fully "die" and eventually trigger the headers timeout error if that is set. Running the requests individually causes no issues. I don't necessarily experience the same issue on my local machine yet, so I am still working on figuring out repro cases.

So there is definitely SOMETHING happening live. I will run this test right now and edit this comment with the results of running it without fake timers.

I tried changing the timeouts to 0 and still experience the issue, so I suspect there is more to this than just this specific example, but I am only able to do one step at a time.

I have had to switch back to axios for these requests, and everything is fine with axios while I try to resolve this on the side.

@xconverge (Contributor, Author)

I am unable to get this same test to fail on my machine without fake timers... I will keep trying to find ways to reproduce what I am seeing on my server, but any guidance on why the test is failing that might be a clue would be great.

@xconverge (Contributor, Author) commented Jan 30, 2023

I have to give up and go back to axios for this. I have tried so many different things to reproduce this locally and debug it, but I can't, and I understand this is not that actionable/helpful, unfortunately. I can't change/control the server I am accessing, and I can't reproduce anything like this with any API servers (using benchmarks/server as a base) I spin up (on a remote machine on my LAN) with various levels of timeouts, data sizes returned, etc.

I am doing 4 parallel 5-to-25-minute requests from a single Node app on a Linux (Ubuntu) server on EC2 to one server that is running on Google Compute Engine. I can get a response for 1 or 2 of the 4 requests depending on what I have changed, but never all 4 at the same time. The sockets all show as established (ss command). If I do all 4 requests in series (waiting for the previous to complete before starting the next), they all work fine.

On my local dev machine (Ubuntu), all 4 long fetches happen fine; the code all works fine, no issues at all.

I suspect there is something specific to the server I am accessing, or some weird edge of an edge case here, but I have just run out of ideas and things to try, unfortunately.

Things I have tried on the EC2 server (after first checking that all 4 requests worked with the change locally), all without luck; several are sketched as Agent options after the list:

  1. 1 agent shared for all 4 requests
  2. 1 agent per request
  3. pipelining: 0
  4. pipelining: 10
  5. header and body timeouts of 0
  6. header and body timeouts of very high numbers ~60 minutes
  7. Client() API instead of fetch
  8. blocking: true flag
  9. reset: true flag
  10. keepAliveMaxTimeout increased to 60 minutes
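
For reference, several of these knobs expressed as undici Agent options; a sketch only, with the extreme values that were tried, none of which resolved the issue:

const { Agent } = require("undici");

const agent = new Agent({
  pipelining: 0,                       // (3) one request per socket, no pipelining
  headersTimeout: 0,                   // (5) disable the headers timeout
  bodyTimeout: 0,                      // (5) disable the body timeout
  keepAliveMaxTimeout: 60 * 60 * 1000, // (10) allow hour-long keep-alive
});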

The query looks like this, pretty straightforward other than the SSL pinning:

    fetch(`https://serverxyz`, {
      method: "GET",
      headers: {
        authorization: xyz,
      },
      dispatcher: new Agent({
        headersTimeout: 0,
        bodyTimeout: 0,
        connect: {
          ca,
          // SSL pinning: reject any certificate whose fingerprint is not allowlisted
          checkServerIdentity: (
            hostname: string,
            cert: PeerCertificate
          ): Error | undefined => {
            const fingerprint = cert.fingerprint
            if (fingerprints.includes(fingerprint)) {
              return undefined
            } else {
              return Error(`SSL Pin Failed`)
            }
          },
        },
      }),
    }).then((response) => {
      // xyz ... not getting to here on the remote for some of the requests
    })

@mcollina (Member)

My experience with these kinds of issues tells me the fakeTimers problem is not really related.

What this looks like: a firewall is truncating the connection in a way that we do not recognize, generating an error that we ignore.

@xconverge (Contributor, Author)

I just ran a bash script from the server, fetching these requests in parallel with curl: a total of 8 requests, ranging from 2 minutes to 20 minutes, all at the same time, same as I call from my node app. All 8 finished fine with no issues.

I do think there is an error that is being ignored somehow, but I am unsure how to even debug that.

The only other thing I can think to try right now is to replace it with node-fetch and simply say "yep it works" or "no it doesn't". If there is anywhere I can put some more debugging, I am open to it!

@xconverge (Contributor, Author) commented Jan 30, 2023

Aha! node-fetch acts nearly identically to undici and some of the requests fail!

This looks very interesting to me, so I will take a look to see if it makes a difference for me:

https://github.com/node-fetch/node-fetch#custom-highwatermark

#923

Initial testing of changing from highWaterMark: 0 to highWaterMark: 5 made no difference; it doesn't even seem to be getting that far.
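
For context, the node-fetch option referenced above is its documented per-request highWaterMark (in bytes); a minimal sketch, assuming node-fetch v2's require-style API and a placeholder URL:

const fetch = require("node-fetch");

// Raise the response body stream's buffer above the 16 KiB default.
fetch("https://example.com", { highWaterMark: 1024 * 1024 })
  .then((res) => res.text())
  .then((body) => console.log(body.length));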

It kind of looks like the last request that is asked for completely overrides the earlier request. I added quite a bit of logging, and it really seems like the requests just disappear. I don't know where to go from here with my debugging. :/

@mcollina (Member) commented Feb 1, 2023

Can you try undici.request instead of fetch()?

@xconverge (Contributor, Author)

Can you try undici.request instead of fetch()?

undici.request behaves the same as fetch for this case after some testing

@mcollina (Member) commented Feb 1, 2023

I wonder what axios is doing to make the request succeed all the time.

@xconverge (Contributor, Author) commented Feb 1, 2023

I wonder what axios is doing to make the request succeed all the time.

Yeah, I am really out of my depth here. How would I go about instrumenting the .js to really make sure keep-alives are working as intended for these long queries? Tracing around kKeepAliveTimeoutValue and trying to understand the keepalive handling within the codebase has me a bit lost. I did already try to remove this as a variable by using pipelining: 0, but that might not have been the right way to go about it.

https://github.com/axios/axios/blob/54d3facb3b032665e6ae84e157073702b5c2e4d9/lib/adapters/http.js#L575

It just really seems like the shortest request is causing the others (longer ones) to be destroyed without any notice/events being thrown/shown to the user.

I am also looking at the util.destroy(socket, ...) calls; maybe I will put a log statement in destroy() and see if that's getting called. A less invasive idea is sketched below.
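
A less invasive idea than patching destroy(): subscribe to undici's diagnostics channels and watch sockets come and go. A sketch; the channel names are from undici's DiagnosticsChannel docs, assuming the undici version in use publishes them:

const diagnosticsChannel = require("diagnostics_channel");

// Log every socket undici establishes, without modifying undici itself.
diagnosticsChannel.channel("undici:client:connected").subscribe(({ socket }) => {
  console.log("socket connected", socket.remoteAddress, socket.remotePort);
});

// Log connection failures, which would otherwise be easy to miss.
diagnosticsChannel.channel("undici:client:connectError").subscribe(({ error }) => {
  console.log("connect error", error.message);
});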

@xconverge (Contributor, Author) commented Feb 2, 2023

I am surprised how readable axios is; I had never looked into it. I am doing a tiny bit of digging, but it is going to be something small. It is really a shame I still can't find a local/developer-friendly repro. I can repro it EVERY time on my prod environment though, so at least it is not ephemeral or intermittent.

This is nuanced and seems to differ a bit from how undici handles it: axios/axios#1752
This is also nuanced, and the attached issue is interesting: axios/axios#3916

@mcollina (Member) commented Feb 2, 2023

I think those long connections are timing out at the TCP level due to inactivity. We could think about providing an option to enable TCP keepalive.

https://nodejs.org/api/net.html#socketsetkeepaliveenable-initialdelay
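
For reference, a minimal sketch of that Node API on a raw socket (host, port, and delay are placeholders):

const net = require("net");

const socket = net.connect(80, "example.com");
// Ask the OS to send TCP keepalive probes after 60s of idle time, so
// middleboxes/firewalls see traffic and don't silently drop the connection.
socket.setKeepAlive(true, 60 * 1000);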

@xconverge (Contributor, Author)

I think those long connections are timing out at the TCP level due to inactivity. We could think about providing an option to enable TCP keepalive.

https://nodejs.org/api/net.html#socketsetkeepaliveenable-initialdelay

I will try hard-coding that somewhere when I get a chance.

@xconverge (Contributor, Author)

Great news, thanks for sticking with me here

I added a

  const keepAliveVal = 1000 * 60
  socket.setKeepAlive(true, keepAliveVal);
  console.log(`set socket keepalive to ${keepAliveVal}`)

right here

const socket = client[kSocket]

And tested it: success! My requests all go through and the performance is on par with axios.

I verified the keep-alive behavior before/after with ss -te.

Excellllllent! Now the question is: would you be open to me opening a PR, and how should it relate to the existing ClientOptions parameters?

https://undici.nodejs.org/#/docs/api/Client?id=parameter-clientoptions

keepAliveMaxTimeout number | null (optional) - Default: 600e3 - The maximum allowed keepAliveTimeout when overridden by keep-alive hints from the server. Defaults to 10 minutes.
keepAliveTimeout number | null (optional) - Default: 4e3 - The timeout after which a socket without active requests will time out. Monitors time between activity on a connected socket. This value may be overridden by keep-alive hints from the server. See MDN: HTTP - Headers - Keep-Alive directives (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive#directives) for more details. Defaults to 4 seconds.
keepAliveTimeoutThreshold number | null (optional) - Default: 1e3 - A number subtracted from server keep-alive hints when overriding keepAliveTimeout to account for timing inaccuracies caused by e.g. transport latency. Defaults to 1 second.

@mcollina (Member) commented Feb 2, 2023

It should be a new parameter: tcpKeepAliveInterval: number | null - Default: null.
@xconverge you can go ahead and start a PR, and we can discuss the default value for this later.

WDYT @ronag? Should this be enabled by default? It might actually help with some ECONNRESETs.

@ronag (Member) commented Feb 3, 2023

It should be a new parameter: tcpKeepAliveInterval: number | null - Default: null. @xconverge you can go ahead and start a PR, and we can discuss the default value for this later.

WDYT @ronag? Should this be enabled by default? It might actually help with some ECONNRESETs.

Is there any downside to it? I don't mind having it as the default.

@ronag (Member) commented Feb 3, 2023

@xconverge would you mind opening a PR?

@ronag (Member) commented Feb 3, 2023

Actually, I think this might already work?

setGlobalDispatcher(new Agent({
  connect: {
    keepAlive: true
  }
}))

@ronag (Member) commented Feb 3, 2023

We just need to override the defaults in connect.js and document that we have different defaults from net/tls.connect.
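
Until those defaults change, one user-land workaround is the documented function form of ClientOptions.connect; a sketch for plain HTTP only (a real connector would need tls.connect for https):

const net = require("net");
const { Agent } = require("undici");

const agent = new Agent({
  // Custom connector: undici calls this instead of its built-in one.
  connect(opts, callback) {
    const socket = net.connect(opts.port || 80, opts.hostname);
    socket.setKeepAlive(true, 60 * 1000); // first keepalive probe after 60s idle
    socket.once("connect", () => callback(null, socket));
    socket.once("error", (err) => callback(err, null));
  },
});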

@xconverge (Contributor, Author) commented Feb 3, 2023

@ronag I made the latest changes in the PR, but yes, I now see exactly what you are saying.

Let me run a test with the current version of undici with:

setGlobalDispatcher(new Agent({
  connect: {
    keepAlive: true,
    keepAliveInitialDelay: 60000
  }
}))

If this is true, then should this just be a documentation exercise, or do we still want to continue with some logic/parsing of an additional option where we set both of these by default if not defined explicitly by the user? Since it is 2 flags being controlled by 1 undici-specific parameter, it might be kind of weird, but doable (i.e. a user sets keepAlive: false and tcpKeepAliveInterval to non-zero); that just needs a decent-sized validation before overriding them, but I am OK with that. A sketch of that mapping is below.
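
A sketch of the mapping being discussed; the helper name is hypothetical, and explicit user settings win over the derived defaults:

// Expand one undici-level option into the two net.connect flags,
// unless the user already set them explicitly.
function withTcpKeepAlive(connectOpts = {}, tcpKeepAliveInterval = null) {
  if (tcpKeepAliveInterval == null) return connectOpts;
  return {
    keepAlive: true,
    keepAliveInitialDelay: tcpKeepAliveInterval,
    ...connectOpts, // user-provided values override the derived ones
  };
}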

@xconverge (Contributor, Author) commented Feb 3, 2023

Setting these in socket = tls.connect({}) doesn't work because we are upgrading/assigning the socket, but manually calling socket.setKeepAlive with them after connect() works, so I think this will still require a code change (see the sketch after the quoted docs):

keepAlive <boolean>: If set to true, it enables keep-alive functionality on the socket immediately after the connection is established, similarly to what is done in socket.setKeepAlive([enable][, initialDelay]). Default: false.
keepAliveInitialDelay <number>: If set to a positive number, it sets the initial delay before the first keepalive probe is sent on an idle socket. Default: 0.
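
What did work, as a sketch of the approach described above (connectOptions is a placeholder): apply the flags to the socket object once it exists, instead of passing them to tls.connect():

const tls = require("tls");

const socket = tls.connect(connectOptions);
// The keepAlive options passed to tls.connect() are lost when the socket
// is upgraded/assigned, so set TCP keepalive on the socket itself.
socket.setKeepAlive(true, 60 * 1000);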
