
Consecutive HTTP requests cause socket hang up on node v20 #5765

Closed
nflaig opened this issue Jul 17, 2023 · 5 comments · Fixed by #5811
Labels
meta-bug: Issues that identify a bug and require a fix.
prio-high: Resolve issues as soon as possible.
Milestone

Comments

@nflaig
Member

nflaig commented Jul 17, 2023

Describe the bug

Consecutive HTTP requests cause socket hang up on node v20. This only happens with some servers; so far I could only reproduce it with Lighthouse. This issue causes the sim tests to fail and might cause other issues, e.g. when running a Lodestar VC with a Lighthouse BN.

Expected behavior

No socket hang up / request failures

Steps to reproduce

Running the following script causes the first request to pass, but the second one fails with a socket hang up error:

import { getClient } from "@lodestar/api";
import { config } from "@lodestar/config/default";

const lighthouse = "http://localhost:5052";
const api = getClient({ baseUrl: lighthouse }, { config });

await api.beacon.getBlockV2("head");
await api.beacon.getBlockV2("head");
> node index.js 

/home/nico/projects/test/debug-getStatev2/node_modules/node-fetch/lib/index.js:1505
                        reject(new FetchError(`request to ${request.url} failed, reason: ${err.message}`, 'system', err));
                               ^
FetchError: request to http://localhost:5052/eth/v2/beacon/blocks/head failed, reason: socket hang up
    at ClientRequest.<anonymous> (/home/nico/projects/test/debug-getStatev2/node_modules/node-fetch/lib/index.js:1505:11)
    at ClientRequest.emit (node:events:512:28)
    at Socket.socketOnEnd (node:_http_client:519:9)
    at Socket.emit (node:events:524:35)
    at endReadableNT (node:internal/streams/readable:1378:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  type: 'system',
  errno: 'ECONNRESET',
  code: 'ECONNRESET'
}

Node.js v20.4.0

Yielding to the macrotask queue in between the two requests does not produce an error:

// ...
await api.beacon.getBlockV2("head");
setTimeout(async () => await api.beacon.getBlockV2("head"), 0);

Additional context

The issue is already reported upstream

There are also some suggested workarounds, but those can't easily be applied in our case since we are using cross-fetch and require browser compatibility.

Operating system

Linux

Lodestar version or commit hash

6e01421

nflaig added the meta-bug label on Jul 17, 2023
philknows modified the milestone: v1.10.0 on Jul 18, 2023
@nflaig
Member Author

nflaig commented Jul 18, 2023

It would also be interesting to find out why this only happens with Lighthouse, as all other CLs don't have this issue. Maybe it's something related to the headers that are set / keep-alive / the Connection header.

@nflaig
Member Author

nflaig commented Jul 20, 2023

Maybe it's something related to the headers that are set / keep-alive / the Connection header

Does not seem to be related to HTTP response headers:

  • Lodestar and Nimbus set the Connection: keep-alive header
  • Lighthouse, Prysm, and Teku do not set the header

The socket hang up only happens with Lighthouse.
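
For reference, this is roughly how the response headers can be compared across clients. This is a minimal sketch (not part of the original comment) using node's built-in fetch and the standard /eth/v1/node/version endpoint; the ports are assumptions (Lighthouse on 5052 as in the repro above, Lodestar on its default 9596), so adjust them to your setup.

// Print the Connection response header returned by each beacon node.
const endpoints = {
  lodestar: "http://localhost:9596",
  lighthouse: "http://localhost:5052",
};

for (const [name, baseUrl] of Object.entries(endpoints)) {
  const res = await fetch(`${baseUrl}/eth/v1/node/version`);
  console.log(`${name} Connection: ${res.headers.get("connection") ?? "<not set>"}`);
}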

@matthewkeil
Member

After digging a bit, this is related to how the socket is being handled by the two sides below the HTTP layer. Why only Lighthouse sends the RST and the other clients do not, I am not sure, but it is a function of how they handle half-open sockets by default, similar to what was mentioned here in reference to the various implementations within node itself. I took a peek at the Rust net and http crates, but after a short dig I realized it would not be a great use of time to try to figure out how they handle post-FIN packets differently than the Nim and Go packages.
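
To illustrate node's default half-open handling mentioned above (this sketch is not part of the original comment): with allowHalfOpen left at its default of false, node ends its own side of a TCP connection as soon as the peer sends FIN, so the connection never stays half-open for later writes.

import net from "node:net";

// Default half-open handling: allowHalfOpen is false, so when the client
// sends FIN, node automatically ends the server's side of the socket too.
const server = net.createServer({ allowHalfOpen: false }, (socket) => {
  socket.on("end", () => {
    console.log("server: client sent FIN, node closes our side as well");
  });
});

server.listen(0, () => {
  const client = net.connect(server.address().port, "127.0.0.1", () => {
    client.end(); // client sends FIN
  });
  client.on("close", () => {
    console.log("client: connection fully closed");
    server.close();
  });
});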

As a tl;dr for the short term: it appears that using an Agent with keepAlive set will prevent node-fetch from including the Connection: close header and will keep the server side from sending the RST. I have not tested this yet to confirm, but @dhedey mentioned it works in his PR to node-fetch.

That PR, node-fetch/node-fetch#1736, is approved and waiting for merge as of Apr 12. It defaults to the node agent's keep-alive setting, which will keep the socket open for reuse without having to pass an agent (the PR removes the conditional check that required one). In the meantime, the workaround of passing a keep-alive agent suggested in that PR should work, roughly as sketched below.
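
A minimal sketch of that workaround at the plain node-fetch level (an illustration, not a tested fix for Lodestar): pass an http.Agent with keepAlive: true via node-fetch's agent option so the connection is reused instead of closed between requests.

import http from "node:http";
import fetch from "node-fetch";

// Keep-alive agent workaround: reuse the underlying TCP connection so
// node-fetch does not send `Connection: close` on each request.
const agent = new http.Agent({ keepAlive: true });
const url = "http://localhost:5052/eth/v2/beacon/blocks/head";

const res1 = await fetch(url, { agent });
console.log(res1.status);

// Without the keep-alive agent, this second request hit the socket hang up.
const res2 = await fetch(url, { agent });
console.log(res2.status);

Note this only covers the node path; in the browser, cross-fetch has no agent option, which is why it can't easily be applied in Lodestar as mentioned above.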

The core issue, though, is how the sockets are being recycled in node.

After reading the other thread linked in "additional context" I started my spelunking in the node repo.

mcollina mentioned in this comment that he agrees with bnoordhuis on how it's possible to pop an invalid socket, which may account for part of the issue. Adjusting that may help the situation and could provide a performance boost in line with this comment.

Although, if that were the only issue, it would not explain why an error gets thrown 100% of the time, as there should be some percentage of sockets that get shifted from the checked side of the array under load.

@DevasiaThomas mentioned just below that he is working on a fix as of Apr 21. I tend to agree with @DevasiaThomas, though, that it will take more than just updating that while loop, as there are a few places where emitter callbacks are required for resolution, which is why we see no error with a setTimeout(0).

I have pinged him here to see if he has made headway and whether he would like to collaborate on the solution. I will be happy to help, as this issue is causing problems with upgrading our project to node 20.

I dropped some hints for myself and others below as I dig. I will update them as I go with how to resolve the issue.

Some code notes for potential resolution:

There is a note in Socket.prototype._destroy about the error event being handled in the next-tick queue:
https://github.com/nodejs/node/blob/4166d40d0873b6d8a0c7291872c8d20dc680b1d7/lib/net.js#L781

I found this line in the socket code that appears to get called after the write side of the client socket is closed by the server and the socket is written to again. See onReadableStreamEnd, which throws the error during the microtask queue via process.nextTick.

philknows added the prio-high label on Jul 25, 2023
philknows added this to the v1.10.0 milestone on Jul 25, 2023
@nflaig
Member Author

nflaig commented Jul 25, 2023

node-fetch/node-fetch#1736 seems to fix the issue; if the sim tests are also passing on node 20, I think we are good to go.

@dhedey

dhedey commented Jul 27, 2023

@matthewkeil - just saw your above post.

With respect to this:

Although, if that were the only issue, it would not explain why an error gets thrown 100% of the time, as there should be some percentage of sockets that get shifted from the checked side of the array under load.

I don't know if you saw some of my musings on this in the post here: node-fetch/node-fetch#1735 - but yes, I came to the same conclusion as you (from my investigations and code digging): the issue, I believe, is that packets can still be sent to the socket before it is cleaned up in a following microtask. Any packets that get sent trigger an ECONNRESET.

I'm not sure if nodejs/node#47130 is really moving ahead, sadly. Perhaps it could do with another pair of eyes, but I don't have the time right now.
