
Help debugging ECONNRESET problems #1062

Closed · 1 task done

rarkins opened this issue Feb 11, 2020 · 35 comments

@rarkins
Contributor

rarkins commented Feb 11, 2020

What would you like to discuss?

My tool uses got@9.6.0 and regularly gets these types of errors from npmjs:

{
    "name": "RequestError",
    "code": "ECONNRESET",
    "host": "registry.npmjs.org",
    "hostname": "registry.npmjs.org",
    "method": "GET",
    "path": "/@typescript-eslint%2Feslint-plugin",
    "protocol": "https:",
    "url": "https://registry.npmjs.org/@typescript-eslint%2Feslint-plugin",
    "gotOptions": {
      "path": "/@typescript-eslint%2Feslint-plugin",
      "protocol": "https:",
      "slashes": true,
      "auth": null,
      "host": "registry.npmjs.org",
      "port": null,
      "hostname": "registry.npmjs.org",
      "hash": null,
      "search": null,
      "query": null,
      "pathname": "/@typescript-eslint%2Feslint-plugin",
      "href": "https://registry.npmjs.org/@typescript-eslint%2Feslint-plugin",
      "headers": {
        "user-agent": "Renovate Bot (GitHub App 2740)",
        "authorization": "** redacted **",
        "cache-control": "no-cache",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate"
      },
      "hooks": {
        "beforeError": [],
        "init": [],
        "beforeRequest": [],
        "beforeRedirect": [],
        "beforeRetry": [],
        "afterResponse": []
      },
      "retry": {
        "methods": {},
        "statusCodes": {},
        "errorCodes": {}
      },
      "decompress": true,
      "throwHttpErrors": true,
      "followRedirect": true,
      "stream": false,
      "form": false,
      "json": true,
      "cache": false,
      "useElectronNet": false,
      "method": "GET"
    },
    "message": "read ECONNRESET",
    "stack": "RequestError: read ECONNRESET\n    at ClientRequest.request.once.error (/home/ubuntu/renovateapp/node_modules/got/source/request-as-event-emitter.js:178:14)\n    at Object.onceWrapper (events.js:286:20)\n    at ClientRequest.emit (events.js:203:15)\n    at ClientRequest.EventEmitter.emit (domain.js:448:20)\n    at ClientRequest.origin.emit.args (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/source/index.js:37:11)\n    at TLSSocket.socketErrorListener (_http_client.js:392:9)\n    at TLSSocket.emit (events.js:198:13)\n    at TLSSocket.EventEmitter.emit (domain.js:448:20)\n    at emitErrorNT (internal/streams/destroy.js:91:8)\n    at emitErrorAndCloseNT (internal/streams/destroy.js:59:3)"
  },

I'm wondering, does the retry part above indicate that there's actually no retrying? Or could it indicate an incomplete stringifying of the object, etc?

Additionally, is there any additional debugging of the err object I can do once such errors are thrown?

Checklist

  • I have read the documentation.
@szmarczak
Collaborator

szmarczak commented Feb 11, 2020

I'm wondering, does the retry part above indicate that there's actually no retrying? Or could it indicate an incomplete stringifying of the object, etc?

Could be both. Try

console.log('methods', [...error.gotOptions.retry.methods]);
console.log('statusCodes', [...error.gotOptions.retry.statusCodes]);
console.log('errorCodes', [...error.gotOptions.retry.errorCodes]);

Also the stack trace seems to be incomplete. Can you try executing Error.stackTraceLimit = Infinity; (or just use longjohn)?

Additionally, is there any additional debugging of the err object I can do once such errors are thrown?

Unfortunately no.

I've got some questions anyway:

  1. Can you reproduce the program locally?
  2. Are you sure there are no small network outages?
  3. Are you using an HTTP proxy?
  4. Does it happen only for specific requests? If so, which one? Please provide the most accurate information you can (request path, etc.)
  5. What Node.js version are you running? Have you tried the latest one?

Please try the following:

  1. Set your DNS server to 1.1.1.1 or 8.8.8.8 or
  2. Disable caching if present or
  3. Pass agent: false option.

If none of the above helps, you may want to try Got 10. There are some breaking changes compared to Got 9, but it's got some nice features: fixed retry functionality (twice), DNS cache, Brotli support, Node.js 13 support, better async stacktraces, experimental pagination API and other bug fixes. I'm willing to send a PR with the upgrade if necessary :)

@szmarczak
Collaborator

Got 10 has some bugs but I don't think they would affect Renovate. Anyway, half of the issues are gonna be fixed in an upcoming rewrite of Got, because the current design has some limitations and the performance can be improved.

@szmarczak
Collaborator

Also Got 10 provides native TypeScript typings :)

@rarkins
Contributor Author

rarkins commented Feb 12, 2020

Hi @szmarczak, I appreciate the very quick response!

FYI this is only happening in production, so I am a little constrained as to how "debuggy" I can get or how much risk I can take, because there are well over 100k installed repos and any mistake can have large consequences (ask me how..)

I am adding specific logging for errorCodes and statusCodes like you suggest, and will wait for some results.

Also the stack trace seems to be incomplete. Can you try executing Error.stackTraceLimit = Infinity; (or just use longjohn)?

I need to check that this is OK to do in production. I notice longjohn says not to, but do you think they're being overly cautious? Do you expect got to throw a more detailed error?

  • Can you reproduce the program locally?

No. I've run Renovate locally with some of the commonly failing ones, but they download fine.

  • Are you sure there are no small network outages?

This would seem unlikely. The problem occurs hour after hour, and has "clusters" of particular package failures that indicate it's not random.

  • Are you using an HTTP proxy?

No

  • Does it happen only for specific requests? If so, which one? Please provide the most accurate information you can (request path, etc.)

There seems to be a correlation that the ones that fail are mostly large. e.g. https://registry.npmjs.org/@typescript-eslint%2Feslint-plugin is 14MB. However there are other even larger ones that have had zero failures.

  • What Node.js version are you running? Have you tried the latest one?

Running Node 12. Node 13 is not really an option considering it's unstable. I am a few releases behind on Node 12 though, so I will upgrade that in production today.

  • Set your DNS server to 1.1.1.1 or 8.8.8.8 or

I'd prefer not to, unless we have a reason to think it's the problem.

  • Disable caching if present or

No got caching present.

  • Pass agent: false option.

Can you give an example of this in a got@9 command? e.g. is it got(url, { agent: false })?

@rarkins
Contributor Author

rarkins commented Feb 12, 2020

Unfortunately my logger doesn't want to share errorCodes and statusCodes:

    "errorCodes": "[Circular]",
    "statusCodes": "[Circular]",

Any trick for this? Here is the code in case that helps: https://github.com/renovatebot/renovate/blob/237eeffe6fa74335c817d308431ee9551b431c3b/lib/datasource/npm/get.ts#L226-L235

We use bunyan logger but have a custom serializer for err, largely for customizing got output: https://github.com/renovatebot/renovate/blob/master/lib/logger/err-serializer.ts
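For reference, here is a sketch of the kind of serializer tweak that would handle this (as explained below, got 9 stores these as Sets) — not the actual Renovate serializer, just an illustration of the idea:

// Hypothetical bunyan-style err serializer tweak: got 9 normalizes
// retry.methods/statusCodes/errorCodes into Sets, which don't survive
// JSON serialization cleanly, so convert them to plain arrays first.
function errSerializer(err) {
  const result = { ...err, message: err.message, stack: err.stack };
  const retry = err.gotOptions && err.gotOptions.retry;
  if (retry) {
    result.gotOptions = {
      ...err.gotOptions,
      retry: {
        methods: retry.methods ? [...retry.methods] : undefined,
        statusCodes: retry.statusCodes ? [...retry.statusCodes] : undefined,
        errorCodes: retry.errorCodes ? [...retry.errorCodes] : undefined,
      },
    };
  }
  return result;
}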

@rarkins
Contributor Author

rarkins commented Feb 12, 2020

Something else surprising to me now. I have increased retries to 5: https://github.com/renovatebot/renovate/blob/237eeffe6fa74335c817d308431ee9551b431c3b/lib/datasource/npm/get.ts#L111

How likely is it that we still end up with the same number of ECONNRESET errors? i.e. 5 times in total it connects, starts transfer, and then gets terminated by the remote end? This is why I am wondering if it's really retrying or not.

@szmarczak
Collaborator

I need to check that this is OK to do in production. I notice longjohn say not to, but do you think they're being overly cautious? Do you expect got to throw a more detailed error?

Yup. If you believe that an unlimited stack trace is bad, you can do Error.stackTraceLimit = 20; which should be enough. (All it does is increase the maximum number of frames kept in the stack trace.)

I've run Renovate locally with some of the commonly failing ones, but they download fine.

This could be related to the network speed and/or the highWaterMark option. Try setting the highWaterMark (it may be missing in the types) option to 104857600 (100MB) and see if that resolves the problem.

There seems to be a correlation that the ones that fail are mostly large.

Do they look like @scope/package? Or are there some global packages too?

Running Node 12. Node 13 is not really an option consider it's unstable. I am a few releases behind on Node 12 though so will upgrade that in production today.

Well, I know from experience that the http modules are most stable in the newest Node.js versions...

I'd prefer not to, unless we have a reason to think it's the problem.

My point is that your current DNS provider can point to overloaded servers or has outdated entries. Try providing a host option (npm's IP).

is it got(url, { agent: false })?

Yep, it is.
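For completeness, a minimal got 9 sketch of what that could look like (illustrative only, using the URL from the error above, not the Renovate code):

const got = require('got'); // got@9

// agent: false makes Node open a fresh socket for this request instead of
// reusing a pooled/keep-alive one.
(async () => {
  const res = await got('https://registry.npmjs.org/@typescript-eslint%2Feslint-plugin', {
    json: true,
    agent: false,
  });
  console.log(res.body.name);
})();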

Any trick for this?

Normalized errorCodes is a Set in Got 9. Convert it to an array:

      logger.warn(
        {
          err,
          errorCodes: [...err.gotOptions?.retry?.errorCodes],
          statusCodes: [...err.gotOptions?.retry?.statusCodes],
          regUrl,
          depName: name,
        },
        'npm registry failure'
      );

How likely is it that we still end up with the same number of ECONNRESET errors?

Assuming the NPM uptime is ~100%, the chance is almost 0%.

This is why I am wondering if it's really retrying or not.

Add a beforeRetry hook and log something in there.
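A minimal sketch of that, assuming got 9's hook arguments of normalized options, error and retry count:

const got = require('got'); // got@9

// Log every retry attempt so it's visible in the logs whether got's
// built-in retry logic actually fires for these ECONNRESET failures.
const client = got.extend({
  retry: 5,
  hooks: {
    beforeRetry: [
      (options, error, retryCount) => {
        console.warn(`beforeRetry #${retryCount}`, error && error.code, options.href);
      },
    ],
  },
});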

I'll send you a PR to upgrade to Got 10 today. That should fix the problem with retries.

@rarkins
Contributor Author

rarkins commented Feb 12, 2020

Yup. If you believe that unlimited stack trace is bad, you can do Error.stackTraceLimit = 20;

Thanks, I'll do that.

This could be related to the network speed and/or the highWaterMark option.

I'm not using got.stream - do you know if it's still applicable?

Do they look like @scope/package? Or there are some global packages too?

It's mostly scoped packages and mostly large ones, but still a few counterexamples too. I think scoped packages have a different "backend" within npmjs so scoped vs non-scoped can definitely explain some differences at times. I saw the next package failing frequently too for example (another large one).

Well, I know from the experience that http modules are the most stable in the newest Node.js versions...

FYI updating to Node v12.15.0 did not appear to have any impact on the stats.

I worked something else out that may be making the problem worse. It appears that the custom got cache I added a long time ago blocked my own attempts at ECONNRESET retries. It's possible that I inserted it at a point in got's flow where it also blocks got's own "built-in" retries. FYI, the point of that caching is to catch when multiple queries to the same URL are all "in flight" at once - the goal/effect was to limit them to one request by caching the promise.
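In other words, the pattern is roughly the following (a sketch of the idea, not the actual Renovate code). The downside is that every caller shares one promise, so a single failed request rejects for all of them and nothing above the cache gets a fresh attempt:

const got = require('got');

// In-flight deduplication: the first caller stores the pending promise,
// later callers for the same URL get the same promise back.
const inFlight = new Map();

function cachedGet(url, options) {
  if (!inFlight.has(url)) {
    inFlight.set(url, got(url, options));
  }
  return inFlight.get(url);
}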

@rarkins
Contributor Author

rarkins commented Feb 12, 2020

After adding in "manual" retries (catching got errors and recursing) I was finally able to lower the error rate. Anecdotally it seems like most are recovering first-time, as I think we had both thought should happen.

My best guess is that Renovate's custom got caching behaviour prevents got's own retry mechanism from succeeding. I wonder if it's possible to add a catch here and delete the cache in case of failure?

@szmarczak
Collaborator

do you know if it's still applicable?

Yes, because http is based on streams.

After adding in "manual" retries (catching got errors and recursing) I was finally able to lower the error rate. Anecdotally it seems like most are recovering first-time, as I think we had both thought should happen.

So it seems that retries in Got 9 are buggy. I'd prefer to switch to Got 10.

My best guess is that Renovate's custom got caching behaviour prevents got's own retry mechanism from succeeding.

Dunno. Possibly.

I wonder if it's possible to add a catch here and delete the cache in case of failure?

Should be possible. Just remember to rethrow the error.
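Something along these lines, building on the in-flight cache sketched earlier (again a sketch, not the actual change):

const got = require('got');
const inFlight = new Map();

function cachedGet(url, options) {
  if (!inFlight.has(url)) {
    const promise = got(url, options).catch((err) => {
      // Evict the failed entry so the next caller triggers a fresh request,
      // then rethrow so callers (and any outer retry logic) still see the error.
      inFlight.delete(url);
      throw err;
    });
    inFlight.set(url, promise);
  }
  return inFlight.get(url);
}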

I'll send you a PR with Got 10 soon.

@szmarczak
Collaborator

I did some basic work here: renovatebot/renovate#5469 but you still need to fix the types and run the tests. If something fails please let me know straight away :)

@rarkins
Contributor Author

rarkins commented Feb 14, 2020

I discovered that unfortunately our ECONNRESET errors, once retried, then often become parseError errors on the next attempt. I would like to try increasing the highWaterMark as you suggested but am not sure how to do that as I can't find any code samples. Could you advise?

@szmarczak
Collaborator

I discovered that unfortunately our ECONNRESET errors, once retried, then often become parseError errors on the next attempt.

What does the ParseError look like?

I would like to try increasing the highWaterMark as you suggested but am not sure how to do that as I can't find any code samples. Could you advise?

Note that it may be missing in the TS types:

// 10MB
got(url, {...options, readableHighWaterMark: 1024 * 1024 * 10})

It just stores (max) 10MB of the upcoming response data instead of the default 16KB.

@rarkins
Contributor Author

rarkins commented Feb 15, 2020

What does the ParseError look like?

Unfortunately, it looks like a multi-megabyte dump in my log files whenever it happens! I checked a few and they appear to just be truncated JSON responses from npmjs, as if receiving data finished early.

It just stores (max) 10MB of the upcoming response data instead of the default 16KB.

Thanks. I did not find many references to this in my searches. Do you know of any references where people have implemented this type of increase in a client without it causing something else to break?

@szmarczak
Collaborator

Do you know if there are any references for people implementing this type of increase in a client and it not causing something else to break?

I did the research on my own.

  1. agent.addRequest(request, options)
  2. agent.createSocket(request, options, handler)
  3. agent.createConnection(options, handler)
  4. net.createConnection(options, handler)
  5. All sockets are Duplex streams
  6. Duplex streams accept readableHighWaterMark option

As you can see, the options are passed all along, so got(url, {...options, readableHighWaterMark: 1024 * 1024 * 10}) should work.

You don't need to set it manually on the IncomingMessage (response) instance, because Node.js already does that for you.

Unfortunately, it looks like a multi-megabyte dump in my log files whenever it happens! I checked a few and they appear to just be truncated JSON responses from npmjs, as if receiving data finished early.

That's what I expected. Please let me know if the readableHighWaterMark option does the trick.

@rarkins
Contributor Author

rarkins commented Feb 16, 2020

Assuming I inserted the code correctly, it unfortunately has had zero effect on the number of errors observed in production: renovatebot/renovate@8598c5e

The errors are still about 75% ECONNRESET and 25% ParseError. Retrying solves the problem in each case though.

@szmarczak
Collaborator

Weird. I'll try to reproduce the issue locally.

@szmarczak
Collaborator

I think you should upgrade to Got 10 and see if it's gone.

@rarkins
Contributor Author

rarkins commented Feb 16, 2020

Got 10 doesn’t seem stable/battle-tested enough yet to deploy to something the size of Renovate’s production (1m+ requests per hour). I think I’d be jumping out of the frying pan and into the fire..

@szmarczak
Collaborator

Well, if you have many instances of Renovate you can just deploy to one and roll back if something doesn't work as it should. If that's still a no, I'll try to reproduce the issue locally :)

@rarkins
Contributor Author

rarkins commented Feb 16, 2020

I would like to soon, but I need to think about how I can measure errors before and after, to try to detect anything that goes wrong.

@viceice

viceice commented Feb 17, 2020

got 10 does not accept readableHighWaterMark. Where should I put it now?

@rarkins
Contributor Author

rarkins commented Feb 17, 2020

I think we can remove it, as it seems to have no effect (assuming I configured it correctly)

@szmarczak
Collaborator

as it seems to have no effect (assuming I configured it correctly)

It has no effect because it didn't work. Seems like net sockets are immutable :(

One more question:

Do the requests fail in bulk? Like if request A fails, then request B fails even 1s later?

@rarkins
Contributor Author

rarkins commented Feb 17, 2020

There is some clustering in time, yes. Certainly the problems tend to be more likely to be together than separate, including when one repository depends on multiple of the problem libraries. I'll try to sort a snapshot of logs and see how pronounced the clustering is.

@rarkins
Contributor Author

rarkins commented Mar 12, 2020

Is there any way to determine if we have too many got instances at once, or too many sockets, or other resource problems like that?

The past two days have seen a big problem in our production app that I've tentatively traced down to got never returning from await calls, even though we're supplying a timeout. I'm wondering if it's some type of "silent" problem such as resource exhaustion. This is still referring to got@9.

@szmarczak
Collaborator

Is there any way to determine if we have too many got instances at once

By looking at your code I can say that it's impossible unless you're using https://github.com/sindresorhus/import-fresh

or too many sockets

If you're using a custom instance of Agent you can use https://runkit.com/szmarczak/5e5e64d2c38f7e0013896198 or if you use the global one you can do https://runkit.com/szmarczak/5e5e6a5010e7d500137cb9ae
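For the global agent case, a rough self-contained sketch of the same idea (not the linked RunKit code) is:

const http = require('http');
const https = require('https');

// Sum the active, free and queued sockets tracked by an agent. Agents expose
// these as maps of { "host:port": [sockets or requests...] }.
function countSockets(agent) {
  const count = (obj) =>
    Object.values(obj || {}).reduce((sum, arr) => sum + arr.length, 0);
  return {
    active: count(agent.sockets),
    free: count(agent.freeSockets),
    queued: count(agent.requests),
  };
}

setInterval(() => {
  console.log('http ', countSockets(http.globalAgent));
  console.log('https', countSockets(https.globalAgent));
}, 10000).unref();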

or other resource problems like that

Well, if there was a memory leak then the Node.js app would have crashed. If your CPU spins up to 100% then I'm pretty sure there's an infinite loop. It's best to use some resource monitoring tools like https://prometheus.io/

The past two days has seen a big problem in our production app that I've tentatively traced down to got never returning from await calls, even though we're supplying a timeout.

That's definitely a bug. As of now I cannot say whether it's in the Got core or the Got instance was configured improperly. When did this start to occur? Was it before renovatebot/renovate@18d2c52 or after?

@viceice

viceice commented Mar 12, 2020

Definitely before that commit; it started without any code change. Maybe we can try the custom agent pattern to log sockets.

@rarkins
Contributor Author

rarkins commented Mar 12, 2020

CPU sits at 0%. Adding copious logging messages appears to show that the last thing that happened is a got() call. This is part of a larger Promise.all, so if one call doesn't return/resolve, that seems to explain the hanging.
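As a debugging aid, racing each call against a timer would at least surface which call is stalling instead of silently hanging the whole Promise.all (hypothetical helper, not existing Renovate code):

// Wrap a promise so it rejects with a labelled error if it hasn't settled
// within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`no response after ${ms}ms: ${label}`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// e.g. await Promise.all(urls.map((url) => withTimeout(got(url), 120000, url)));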

Renovate does support the global proxy agent, although it's not in use for the hosted app as it has direct internet connectivity. I'd need to do some code tracing to work out if it's ever called/initialized or not.

@szmarczak
Collaborator

Renovate does support the global proxy agent, although it's not in use for the hosted app as it has direct internet connectivity. I'd need to do some code tracing to work out if it's ever called/initialized or not.

What are the Got options for these requests? Are they the same as in the first post?

@rarkins
Contributor Author

rarkins commented Aug 10, 2020

FYI I was finally able to verify got v11 in production and the npm errors immediately stopped. Thank you for the assistance.

@rarkins rarkins closed this as completed Aug 10, 2020
@szmarczak
Collaborator

@rarkins By the way, did you upgrade Node.js also when switching to Got v11?

@viceice

viceice commented Aug 11, 2020

Nope, we switched to v12 a long time ago.

@kirillgroshkov

I know posting comments on closed issues is not very helpful, but we're seeing many read ECONNRESET errors from the latest got@11 on Node 16.10.
Stacktrace:

RequestError: read ECONNRESET
    at ClientRequest.<anonymous> (node_modules/got/dist/source/core/index.js:962:111)
    at Object.onceWrapper (node:events:510:26)
    at ClientRequest.emit (node:events:402:35)
    at ClientRequest.emit (node:domain:475:12)
    at ClientRequest.origin.emit (node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)
    at TLSSocket.socketErrorListener (node:_http_client:447:9)
    at TLSSocket.emit (node:events:390:28)
    at TLSSocket.emit (node:domain:475:12)
    at emitErrorNT (node:internal/streams/destroy:157:8)
    at emitErrorCloseNT (node:internal/streams/destroy:122:3)
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:220:20)
    at TLSWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
  code: 'ECONNRESET',
  timings: {
    start: 1645706450837,
    socket: 1645706450839,
    lookup: 1645706450840,
    connect: 1645706450843,
    secureConnect: 1645706450858,
    upload: 1645706450858,
    response: 1645706451042,
    end: 1645706451088,
    error: 1645706451113,
    abort: undefined,
    phases: {
      wait: 2,
      dns: 1,
      tcp: 3,
      tls: 15,
      request: 0,
      firstByte: 184,
      download: 46,
      total: 276
    }
  }
}

@szmarczak
Collaborator

@kirillgroshkov It comes directly from the TCP socket, meaning either your network was disrupted or the end server is having issues.
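For reference, ECONNRESET is among got 11's default retryable error codes, so a sketch along these lines (illustrative, not from this thread) can be used to raise the retry limit and confirm that retries actually fire:

const got = require('got'); // got@11

// Bump the retry limit and log each attempt; in got 11 the beforeRetry hook
// receives the normalized options, the error and the retry count.
const client = got.extend({
  retry: { limit: 5 },
  hooks: {
    beforeRetry: [
      (options, error, retryCount) => {
        console.warn(`retry #${retryCount}:`, error && error.code);
      },
    ],
  },
});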
