ECONNRESET while doing http 1.1 keep alive requests and server closes the connections #47130

Open
DevasiaThomas opened this issue Mar 17, 2023 · 53 comments
Labels
help wanted Issues that need assistance from volunteers or PRs that need help to proceed. http Issues or PRs related to the http subsystem.

Comments

@DevasiaThomas

DevasiaThomas commented Mar 17, 2023

Version

v18.15.0

Platform

Running this in a Kubernetes cluster inside nodejs:lts container. Details for K8s node "nodeInfo": { "architecture": "amd64", "bootID": "a5847523-d870-4bc0-9f1a-7a4ed9885ca3", "containerRuntimeVersion": "docker://20.10.12", "kernelVersion": "3.10.0-1160.53.1.el7.x86_64", "kubeProxyVersion": "v1.21.12", "kubeletVersion": "v1.21.12", "machineID": "", "operatingSystem": "linux", "osImage": "Red Hat Enterprise Linux", "systemUUID": "290B2642-439A-2099-6F2B-7ED84D7C284B" },

Subsystem

http or net

What steps will reproduce the bug?

Run non-stop HTTP/1.1 requests over a keep-alive connection pool using node-fetch or axios against any server that closes the connection from its side after N seconds. In my case the server closed after 20 seconds. There might be multiple requests to the server over the connection, but once 20 seconds have elapsed it closes the connection after the last HTTP response is sent. I have a default Node.js client configuration (I haven't assigned it more threads or anything).

When configuring the custom http agent on the client side, I supplied {keepAlive: true, maxSockets: 50}.
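
For reference, a minimal sketch of how such an agent gets wired in (assuming axios here; node-fetch takes the agent per request instead, and the upstream URL is a placeholder):

import http from 'node:http';
import axios from 'axios';

// Same pool settings as described above; every request made through this
// axios instance reuses sockets from this one agent.
const agent = new http.Agent({ keepAlive: true, maxSockets: 50 });
const client = axios.create({
    baseURL: 'http://upstream.example:8080', // hypothetical upstream that closes idle connections after ~20s
    httpAgent: agent,
});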

How often does it reproduce? Is there a required condition?

When there are a lot of requests being sent constantly, things are fine, but if there is a slowdown (not much is happening, hence fewer requests go out), the next request usually ends up getting an ECONNRESET.

Based on my TCP dump, when there's a good load of requests over the connection pool and the server sends a [FIN, ACK], the client sends a [FIN, ACK] back, the server sends an ACK, and the connection closes successfully.

But when there is a "lull" later and there aren't enough requests over the pool, the server sends a [FIN, ACK] for an unused connection in the pool, the Node.js client responds with an [ACK], and the next request in the queue goes out on this socket, causing the server to respond with a RESET. (Rightly so, because the server wanted to close the connection.)

Now I believe the reason the next request goes out on the socket that just got the FIN probably has to do with the connection-choosing strategy. I think the default in both these frameworks is LIFO, and the ACK (without the FIN) that gets sent makes the connection go to the top of the pool for the next request.

What is the expected behavior? Why is that the expected behavior?

A socket closed from the server side (FIN, ACK sent by the server) must be removed from the connection pool instead of being kept in there, regardless of whether a FIN was sent back, and no future requests should go out on it.

What do you see instead?

The connection stays in the pool if the FIN wasn't sent back. The next request goes out on the connection, and the server forcibly closes the connection with a RESET.

Additional information

I tried a few other frameworks apart from node-fetch and axios (same issue, making me think it's a Node core problem), but I can't use them in my code so I'm not mentioning them.

When I reduced maxSockets from 50 to 20 the issue happened less frequently, which is why I think it is related to activity on those sockets. I switched to the keepaliveagent package, which has a socket TTL feature - it helps, but doesn't solve the problem; resets still happen (same issue). It seems this issue was reported there and they tried to handle it there (still a problem though). I'm assuming that issue has the same root cause I'm facing - they were using keepaliveagent as well.
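
For anyone wanting to try the same mitigation, a sketch of those TTL-style settings (assuming the package meant here is agentkeepalive; the option names are taken from its README, so treat them as an assumption):

import Agent from 'agentkeepalive';

// Recycle sockets well before the server's ~20s close so fewer server FINs
// race with new requests. This only narrows the window; resets can still happen.
const agent = new Agent({
    maxSockets: 50,
    freeSocketTimeout: 4000, // drop sockets that have sat idle for 4 seconds
    socketActiveTTL: 15000,  // retire sockets after 15 seconds even if they stay busy
});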

@bnoordhuis bnoordhuis added the http Issues or PRs related to the http subsystem. label Mar 17, 2023
@bnoordhuis
Member

I've read your description of the problem, but I don't think it's actionable without a way to reproduce it that doesn't depend on third-party modules like axios. Can you add a test case?

@DevasiaThomas
Author

@bnoordhuis I'll see if I can come up with something. Since this happened across multiple frameworks it should be reproducible using just the core stuff. Gimme this weekend :)
I was hoping you could check the TCP FIN handling part of the code and see how the socket is treated on the client side on receiving a FIN. If it doesn't remove the connection from the pool before doing anything else, that would be the problem.

@bnoordhuis
Member

Node handles it when ECONNRESET happens when the socket is inside the socket pool but there's a fundamental (and inevitable) race condition when that happens just when the socket is being pulled from the pool.

Node could theoretically use another socket if the current socket fails on first write but without a circuit breaker or other fail-safes that creates a DoS-like scenario when the server immediately closes incoming connections. Node would basically busy-loop trying to send the request.
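
A bounded application-level retry is one way to live with that race today (a sketch only, not something Node core does; the cap on attempts is what keeps it from becoming the busy-loop described above):

import http from 'node:http';

// Retry once if the pooled socket dies before any response data arrives.
function requestWithRetry(options, body, retriesLeft = 1) {
    return new Promise((resolve, reject) => {
        const req = http.request(options, (res) => {
            const chunks = [];
            res.on('data', (chunk) => chunks.push(chunk));
            res.on('end', () => resolve(Buffer.concat(chunks)));
        });
        req.on('error', (err) => {
            // Only retry the "reused a dead keep-alive socket" failures, and only a
            // bounded number of times, so a server that closes every connection
            // immediately cannot trap us in a retry loop.
            if (retriesLeft > 0 && (err.code === 'ECONNRESET' || err.code === 'EPIPE')) {
                resolve(requestWithRetry(options, body, retriesLeft - 1));
            } else {
                reject(err);
            }
        });
        req.end(body);
    });
}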

@DevasiaThomas
Author

DevasiaThomas commented Mar 18, 2023

I understand the inevitable race that can happen, and I agree you have to retry in such a scenario and it's best that the application takes care of that. But based on my TCP dump, the server sends a FIN, which causes the client to send an ACK (the socket on the client transitions to CLOSE_WAIT). Whatever happens after the ACK is sent causes the socket to get pushed to the top of the connection pool (which is why the very next request goes out on it). So there is time for the LIFO algorithm or underlying code to push the connection to the top of the pool - the same code could remove it from the pool instead of pushing it to the top when the ACK is for a FIN. This is what I wanted to highlight. I am working on a reproducible example. I hope I am successful with it.

@DevasiaThomas
Author

DevasiaThomas commented Mar 18, 2023

@bnoordhuis So, I don't have guaranteed replicability, but it shows up over time (in 5 of the runs I got it in the first 5 minutes; for 2 other runs I had to leave it overnight). I'm running them on different physical boxes. I couldn't get resetAndDestroy to actually send a TCP RST, but the prints happen and socket hang-ups get reported on the client end. If you can figure out how to send the RST, you'll get the ECONNRESET as reported above. The code tries to mimic what happened in my firm's environment.

Server code
import http from 'node:http';
//Time, in seconds, any socket can be alive for. If there is a request already being processed close it as soon as response is done
let maxSocketLifetime = 20;

if (maxSocketLifetime === 0) {
    process.exit(1);
} else {
    maxSocketLifetime = maxSocketLifetime * 1000;
}

/**
 *
 * @param socket {node:net.Socket}
 * @return {Promise<void>}
 */
async function initiateClose(socket) {
    //console.log('Initiate Close' + socket.remotePort);
    if (socket.writableEnded || socket.destroyed || socket.closed) {
        return;
    }
    if(socket.requestInFlight) {
        socket.closeAfterResponse = true;
        return;
    }
    //Send FIN
    //console.log('Timed Close' + socket.remotePort);
    socket.end();
}
async function handleClose(hasError) {
    console.log('Socket Closed' + this.remotePort);
    //when it was closed by client
    clearTimeout(this.intervalz);
}
/**
 *
 * @param socket {node:net.Socket}
 * @return {Promise<void>}
 */
async function handleConnection(socket) {
    console.log('New Connection' + socket.remotePort);
    socket.requestInFlight = false;
    socket.closeAfterResponse = false;
    socket.intervalz = setTimeout(initiateClose, maxSocketLifetime - 100, socket);
    socket.on('close', handleClose);
}

/**
 *
 * @param req {http.IncomingMessage}
 * @param res {http.ServerResponse}
 * @return {Promise<void>}
 */
async function handleRequest(req, res) {
    if(req.socket.writableEnded) {// If client sends data after we sent FIN
        console.log('Reset')
        req.socket.resetAndDestroy();
        return; // don't fall through and treat the reset socket as a live request
    }
    req.socket.requestInFlight = true;
    let body = [];
    req.on('data', (chunk) => {
        body.push(chunk);
    }).on('end', () => {
        body = Buffer.concat(body).toString();
        res.end(body);
        res.socket.requestInFlight = false;
        if(res.socket.closeAfterResponse) {
            //Send FIN
            //console.log('Close after response' + res.socket.remotePort);
            res.socket.end();
        }
    });
}


let serverOpts = {
    keepAliveTimeout: maxSocketLifetime,
    connectionsCheckingInterval: Math.floor(maxSocketLifetime/3)
}
let server = http.createServer(serverOpts);
server.on('connection', handleConnection);
server.on('request', handleRequest);
server.listen(6969);
console.log('Server Started');
Client code
import http from "node:http";
let load_arr = [5000, 4000, 3000, 1500, 1000, 600, 400, 300, 200, 100, 50, 20 , 10 , 1];
let interval_arr = [1, 3, 5, 8, 9, 10, 13, 15, 17];
const postData = JSON.stringify([{"_id":"6414532c438f86245e1a9fae","index":0,"guid":"a7bf9526-1972-418d-b251-f865a02131ae","isActive":false,"balance":"$2,510.55","picture":"http://placehold.it/32x32","age":31,"eyeColor":"green","name":"Pittman Green","gender":"male","company":"EVENTEX","email":"pittmangreen@eventex.com","phone":"+1 (812) 487-3784","address":"557 Johnson Avenue, Dotsero, South Dakota, 682","about":"Esse voluptate proident duis sunt excepteur eiusmod proident dolore. Ullamco nulla voluptate nostrud ullamco officia quis proident id pariatur Lorem quis culpa. Non ipsum voluptate exercitation reprehenderit ex duis nisi exercitation.\r\n","registered":"2014-05-19T06:47:19 +07:00","latitude":55.881952,"longitude":176.809812,"tags":["dolor","magna","aute","voluptate","ex","eiusmod","nisi"],"friends":[{"id":0,"name":"Tammi Odom"},{"id":1,"name":"Bush Ingram"},{"id":2,"name":"Pennington Sharp"}],"greeting":"Hello, Pittman Green! You have 1 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532ce611be5510ba2c21","index":1,"guid":"6350b383-bee6-48a1-b19e-00cc20d74db7","isActive":true,"balance":"$2,193.74","picture":"http://placehold.it/32x32","age":38,"eyeColor":"brown","name":"Rice Mccullough","gender":"male","company":"MAGNINA","email":"ricemccullough@magnina.com","phone":"+1 (915) 527-2484","address":"689 Hanover Place, Lumberton, Wisconsin, 4844","about":"Labore dolor nostrud anim veniam incididunt dolore fugiat proident. Sit fugiat incididunt culpa aute veniam. Sint labore non magna labore tempor amet. Id mollit ipsum eiusmod reprehenderit dolor consequat nostrud nisi.\r\n","registered":"2019-04-02T08:05:50 +07:00","latitude":34.247422,"longitude":-144.451109,"tags":["elit","adipisicing","voluptate","proident","consectetur","minim","duis"],"friends":[{"id":0,"name":"Abbott Compton"},{"id":1,"name":"Cathryn Wilkins"},{"id":2,"name":"Jeannette Robertson"}],"greeting":"Hello, Rice Mccullough! You have 7 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532c1896aa6394d16d18","index":2,"guid":"22596015-6d9b-447e-8fa4-9007e80b37bb","isActive":true,"balance":"$2,100.66","picture":"http://placehold.it/32x32","age":24,"eyeColor":"green","name":"Katharine Kaufman","gender":"female","company":"ELENTRIX","email":"katharinekaufman@elentrix.com","phone":"+1 (846) 453-2614","address":"271 Grand Avenue, Tonopah, New Hampshire, 9234","about":"Eiusmod officia amet sit dolor elit id labore ea. Aliquip elit veniam reprehenderit aliquip mollit. Tempor non ipsum ad exercitation sit sunt consectetur amet pariatur. Nisi in non incididunt quis ex laboris. Veniam voluptate proident do commodo laborum elit velit Lorem enim aliquip mollit quis tempor veniam. Anim ea velit fugiat sit ea voluptate esse in officia nostrud consequat amet magna.\r\n","registered":"2020-12-21T05:29:02 +08:00","latitude":-54.93951,"longitude":151.859727,"tags":["proident","sint","Lorem","duis","sint","sit","excepteur"],"friends":[{"id":0,"name":"Corrine Chavez"},{"id":1,"name":"Diana Pate"},{"id":2,"name":"Madden Yang"}],"greeting":"Hello, Katharine Kaufman! 
You have 2 unread messages.","favoriteFruit":"strawberry"},{"_id":"6414532c404e9ebdabd81650","index":3,"guid":"b8d8605f-a72a-4d8f-954d-2339434d3c5d","isActive":false,"balance":"$2,420.51","picture":"http://placehold.it/32x32","age":29,"eyeColor":"green","name":"Mattie Craft","gender":"female","company":"CINASTER","email":"mattiecraft@cinaster.com","phone":"+1 (982) 481-3054","address":"525 Church Lane, Statenville, Michigan, 8845","about":"Et irure velit esse reprehenderit id. Dolore excepteur labore ullamco eu est nostrud et. Ipsum consectetur exercitation ut elit id irure ullamco irure irure voluptate laborum id enim est. Ipsum elit voluptate ullamco culpa dolor esse mollit sint velit. Cupidatat exercitation ipsum commodo enim labore anim officia ut incididunt ea velit amet ut Lorem. Fugiat id occaecat anim aute ex elit irure pariatur commodo dolore magna pariatur.\r\n","registered":"2014-05-10T04:15:02 +07:00","latitude":-46.535255,"longitude":140.547752,"tags":["esse","ullamco","mollit","quis","proident","aliqua","pariatur"],"friends":[{"id":0,"name":"Irene Delgado"},{"id":1,"name":"Christensen Craig"},{"id":2,"name":"Vilma Mccarthy"}],"greeting":"Hello, Mattie Craft! You have 6 unread messages.","favoriteFruit":"banana"},{"_id":"6414532c1a5cfc33a11dc118","index":4,"guid":"d7dfa4ee-e4be-43c4-bd5a-2993debbbebc","isActive":false,"balance":"$2,646.29","picture":"http://placehold.it/32x32","age":23,"eyeColor":"brown","name":"Lopez Davenport","gender":"male","company":"MEDIFAX","email":"lopezdavenport@medifax.com","phone":"+1 (885) 442-2655","address":"401 Eastern Parkway, Loretto, Iowa, 7360","about":"Nulla laboris laborum ipsum voluptate sint deserunt eu. Esse velit sit sunt qui. Proident anim do ea do. Elit qui sit cupidatat dolor quis fugiat dolor quis voluptate in. Velit esse irure est duis nulla incididunt do ut. Tempor qui culpa labore eiusmod voluptate quis elit in cillum ea nostrud.\r\n","registered":"2020-04-25T03:40:35 +07:00","latitude":58.560169,"longitude":58.62533,"tags":["voluptate","esse","est","cupidatat","officia","elit","deserunt"],"friends":[{"id":0,"name":"Blanca Wilkinson"},{"id":1,"name":"Aguirre Schmidt"},{"id":2,"name":"Mandy Norris"}],"greeting":"Hello, Lopez Davenport! You have 1 unread messages.","favoriteFruit":"strawberry"}]);
let agent = new http.Agent({
    keepAlive:true,
    maxSockets: 50,
});
const options = {
    agent: agent,
    hostname: 'localhost',
    port: 6969,
    method: 'POST',
    headers: {
        'Content-Type': 'application/json',
    },
};

async function handleResponse(res) {
    let body = [];
    res.setEncoding('utf8');
    res.on('data', (chunk) => {
        body.push(chunk);
    });
    res.on('end', () => {
        body = body.join("");
    });
}
async function handleRequestError(error) {
    console.error(`problem with request: ${error.message}`)
}

while (true) {
    let load = load_arr[load_arr.length * Math.random() | 0];
    for (let i = 0; i < load; i++) {
        let req = http.request(options, handleResponse);
        req.on('error', handleRequestError);
        req.write(postData);
        req.end();
    }

    let interval = interval_arr[interval_arr.length * Math.random() | 0];
    await new Promise(r => setTimeout(r, interval * 1000));
}

@bnoordhuis
Member

@nodejs/http can you take a look?

@DevasiaThomas
Author

To surface the problem more frequently, I tried to randomly close ~1% of the connections after the response ended,
so this

        res.socket.requestInFlight = false;
        if(res.socket.closeAfterResponse) {

changed to this

        res.socket.requestInFlight = false;
        let random = Math.random();
        if(random > 0.98 || res.socket.closeAfterResponse) {

You can change the number to increase the percentage of connections closed by the server.
This is what the TCP dump looks like (if my Node.js server would actually send the reset).

[Screenshot: TCP dump capture, 2023-03-19 3:13:41 AM]

The interesting thing I saw in the dump was that these connections were idle until the FIN showed up. That's why I'm emphasizing that some logic is causing an outgoing ACK (not FIN, ACK) in response to an incoming FIN (on my Node.js client) to make the connection appear active, which in turn causes the LIFO scheduling (this happens with FIFO as well) to push that connection to the top of the pool instead of marking it as closed from the pool's perspective.
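
For anyone following along, the ordering being described corresponds to the agent's scheduling option (it defaults to 'lifo'). Flipping it is easy to experiment with, although, as noted, it only changes which idle socket gets picked; it does not evict the one that just received a FIN:

import http from 'node:http';

// 'lifo' (the default) hands out the most recently released socket; 'fifo' hands
// out the oldest idle one. Neither checks whether the peer has already sent a FIN.
const agent = new http.Agent({
    keepAlive: true,
    maxSockets: 50,
    scheduling: 'fifo',
});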

@ronag
Member

ronag commented Mar 19, 2023

Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.

Undici does that and I would recommend you switch to that.

@DevasiaThomas
Author

DevasiaThomas commented Mar 19, 2023

Haven't looked into this in depth but I think this is expected. There is a race condition in the http/1.1 spec for keep alive requests. This can be mostly avoided by applying the keep alive hints that the server provides.

I'm sorry what do you mean by hints? Is the hint part of the spec or implemented by nodejs servers? I don't know about this - so if you could point me to the info regarding this, I would greatly appreciate it.

Undici does that and I would recommend you switch to that.

Can't - it's not that simple. Our REST API interactions are all generated code based on things like OpenAPI and protobuf schemas. A quick Google search doesn't find me a generator that uses undici. If we had to switch, we'd probably choose another language at that point. This is because my colleagues have spent quite a bit of time on this and we don't believe it's the race mentioned in the spec regarding keep-alive. (We haven't faced this problem in the Rust and Python implementations of core HTTP libs.)

Based on the tcpdump timelines and the ordering of which connection gets the next request, it's pretty clear that Node.js has the time to apply the LIFO algorithm to place the socket that just sent the ACK to the incoming FIN at the top of the pool. Which also means you have the same amount of time to take that socket out of the connection pool. I believe this is fixable. If you can show me where the code that handles the connection pool management is, I can see if I can submit a PR.

@ronag
Member

ronag commented Mar 19, 2023

MDN calls it parameters. See the timeout parameter. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive

@ronag
Member

ronag commented Mar 19, 2023

The client should not make any request on a connection that has been idle more than timeout duration (minus some threshold).
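
A sketch of that idea from the client side, assuming the server actually sends a Keep-Alive: timeout=N response header (the deadline bookkeeping property here is made up for illustration):

import http from 'node:http';

const SAFETY_MARGIN_MS = 1000; // don't reuse a socket this close to the server's deadline

// Read the hint the server advertises, e.g. "Keep-Alive: timeout=5, max=1000".
function advertisedTimeoutMs(res) {
    const match = /timeout=(\d+)/.exec(res.headers['keep-alive'] || '');
    return match ? Number(match[1]) * 1000 : undefined;
}

// Stamp the socket when a response finishes...
function rememberDeadline(res) {
    const timeout = advertisedTimeoutMs(res);
    res.socket.keepAliveDeadline = timeout === undefined ? Infinity : Date.now() + timeout; // made-up property
}

// ...and before reusing the socket, check the stamp (destroy it if the deadline is too close).
function safeToReuse(socket) {
    return socket.keepAliveDeadline === undefined ||
        Date.now() + SAFETY_MARGIN_MS < socket.keepAliveDeadline;
}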

@DevasiaThomas
Author

DevasiaThomas commented Mar 19, 2023

The client should not make any request on a connection that has been idle more than timeout duration (minus some threshold).

@ronag , timeout is not the problem here. Neither is the concept of keep alive. Please have a deep look into my issue. Thanks :)

@ronag
Member

ronag commented Mar 19, 2023

@mcollina I think this is something we've resolved in undici. This kind of falls under the http core client maintenance issue...

@ywave620
Contributor

@DevasiaThomas I believe this is just the known issue mentioned by @ronag and @bnoordhuis. As for your curiosity about why

an outgoing ACK (not FIN,ACK) to an incoming FIN (on my nodejs client) to appear as an active connection which further causes the lifo scheduling (this happens in fifo as well) to push that connection to the top of the pool

your server code sends an HTTP response indicating keep-alive followed by a TCP FIN:

async function handleRequest(req, res) {
// ...
        res.socket.requestInFlight = false;
        if(res.socket.closeAfterResponse) {
            //Send FIN
            //console.log('Close after response' + res.socket.remotePort);
            res.socket.end();
        }
// ...

The Node.js client does not treat the TCP FIN as a signal about whether to reuse the connection, but it does treat the HTTP response with a keep-alive indication as one.

@ywave620
Contributor

FYI, the TCP FIN, in Node.js terms, is the 'end' event of a socket. I don't see any code in the http agent that makes use of this event to decide whether a socket can be reused.

@DevasiaThomas
Author

FYI, the TCP FIN, in nodejs spectrum, is the 'end' event of a socket. I don't see any code of http agent makes use of this event to decide whether a socket can be reused.

@ywave620
There are 2 behaviors on an incoming FIN, ACK from the server, based on my TCP dump.

  1. Node.js Client sends FIN, ACK --> This has no problem, works fine
  2. Node.js Client sends ACK --> Idk what code path makes this happen, but this also causes socket to go to top of connection pool. Because the very next request goes on this socket.

@ywave620
Contributor

ywave620 commented Mar 20, 2023

Node.js Client sends FIN, ACK --> This has no problem, works fine

This is the default behavior of a Node.js client/server, i.e. Node.js automatically closes any half-open connection. This behavior can be overridden by setting allowHalfOpen to true in the options passed to the http or net module when creating outbound requests/connections or when creating a server.

Node.js Client sends ACK --> Idk what code path makes this happen, but this also causes socket to go to top of connection pool. Because the very next request goes on this socket.

This is worth further digging. Did you spot the Node.js client sending a FIN later in this weird scenario?

@ywave620
Contributor

Also note that allowHalfOpen is not officially documented in the http module. Is that intentional, or did we miss it? @ronag

@DevasiaThomas
Author

Node.js Client sends ACK --> Idk what code path makes this happen, but this also causes socket to go to top of connection pool. Because the very next request goes on this socket.

This is worth further digging. Did you spot the Node.js client sending a FIN later in this weird scenario?

Nope - since the client sends the next request right away, the server sends a reset, so the client has no opportunity to send a FIN.
I haven't passed the allowHalfOpen option anywhere so I can't comment on that.

@DevasiaThomas
Author

DevasiaThomas commented Mar 20, 2023

@ywave620 I just checked the http.createServer call path real quick and it seems that the Server always creates a net.Server with allowHalfOpen set to true. Please confirm. But my issue isn't with the server but the client; I'll check that as well.

EDIT: I don't know if that applies only to the listening socket or to all incoming sockets for the server.

@ywave620
Contributor

allowHalfOpen applies to either client or server. It makes sense at the net abstraction level; however, it is copied into the options passed to net by http. E.g. net.createConnection inherits the option from http.request, which is why http.request({allowHalfOpen: true}) works. Although I don't know whether this is the desired usage - in other words, should it be considered a backdoor?
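
A minimal sketch of the half-open behaviour being discussed (localhost:6969 just matches the repro server earlier in this thread):

import net from 'node:net';
import http from 'node:http';

// At the net level, allowHalfOpen: false (the default) makes Node end its own
// side as soon as the peer's FIN arrives; with true, the writable side stays
// open until we call end() ourselves.
const raw = net.connect({ host: 'localhost', port: 6969, allowHalfOpen: true });
raw.on('end', () => {
    // Peer sent its FIN; we could still write here, then close our side explicitly.
    raw.end();
});

// And, as noted above, http.request() passes unrecognized options down to
// net.createConnection, so this reaches the socket too (undocumented for http,
// hence the "backdoor" question):
const req = http.request({ port: 6969, allowHalfOpen: true }, (res) => res.resume());
req.end();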

@ronag
Member

ronag commented Mar 20, 2023

I think your best bet is to either disable keep alive or switch to undici.

@DevasiaThomas
Author

I think your best bet is to either disable keep alive or switch to undici.

You've mentioned this already; I can do neither. However, I'm fairly certain there's a bug here - please let it get fixed.

@mcollina mcollina added the help wanted Issues that need assistance from volunteers or PRs that need help to proceed. label Mar 20, 2023
@mcollina
Member

Would you like to send a pull request to resolve this issue? Investigating this problem would likely require multiple days of work, and I'm not convinced either @ronag, myself or other members of @nodejs/http have the time to investigate.

@DevasiaThomas
Author

DevasiaThomas commented Mar 20, 2023

@ywave620 So the default behavior for a Node.js server is to allow half-open connections, whereas for clients it is not to - you can still pass it in the options to override this behavior (backdoor or not is up to the maintainers).

@DevasiaThomas
Author

@mcollina Sure I can try :)

@bnoordhuis
Member

there is some logic causing an outgoing ACK (not FIN,ACK) to an incoming FIN (on my nodejs client) to appear as an active connection which further causes the lifo scheduling (this happens in fifo as well) to push that connection to the top of the pool instead of marking it as closed from the pool's perspective

Just an observation from reading lib/_http_agent.js but I don't think that's possible. Sockets are removed from the FIFO/LIFO queue but sockets in the queue are never reordered.

(It's the .freeSockets property, keyed by host.)

What I think can happen is this: destroyed sockets aren't immediately removed from the queue, they're simply discarded (in [FL]IFO order) when the agent looks for a live socket, by inspecting socket.destroyed == true.

Assume for simplicity that all sockets expire at the same time, e.g., because the network went down. Node gets notified by the operating system on a per-socket basis. Crucially, node may not get those notifications all at the same time.

For the next request, the agent discards the ones it knows are dead. It picks the first one it believes is alive - except it isn't, we just haven't received the notification yet. It's basically the race condition I mentioned earlier.
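
Illustratively (this is not the real lib/_http_agent.js code, just the lazy-discard pattern described above):

// Dead sockets are skipped lazily when the next request asks for one,
// not evicted the instant the peer closes them.
function takeFreeSocket(freeSockets) {
    while (freeSockets.length) {
        const socket = freeSockets.pop(); // LIFO by default
        if (!socket.destroyed) {
            return socket; // believed alive, but a FIN may already be in flight
        }
        // destroyed sockets are simply dropped here
    }
    return undefined; // caller falls back to opening a fresh connection
}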

@DevasiaThomas
Author

DevasiaThomas commented Mar 20, 2023

@bnoordhuis Thank you for looking into it :). I just reached that part of the code.

Looking for your thoughts on the below:

  1. in addRequest, the code that clears stale free sockets does not follow the direction of the scheduling algorithm. Any particular reason for this? If it were LIFO, you would get a socket from the other end, which was never checked for staleness.

  2. in places like addRequest, on 'free' and removeSocket, instead of checking just destroyed or !writable, could we check for half-closed scenarios in case allowHalfOpen is not set? That's my case, where the server sent the FIN, so it's the readable end that was closed but not (yet) the writable end. (A sketch of this check follows.)
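
A sketch of the check proposed in (2), purely to illustrate the intent:

// Treat a half-closed socket (server FIN already seen, readable side ended)
// as unusable even though its writable side is technically still open.
function isReusable(socket) {
    return socket !== undefined &&
        !socket.destroyed &&
        socket.writable &&
        !socket.readableEnded; // readable side ended means the peer already sent its FIN
}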

@jiho-kr

jiho-kr commented Jan 6, 2024

Any updates on this one?
The issue is still present on Node 20.10.x.

@mcollina
Member

mcollina commented Jan 8, 2024

At this point I don't think much can be done in Node.js because our theory is that it is a race condition. It would be great to have a reproduction for this (cc @dhedey) with a client and server to prove or disprove this theory.

To mitigate, you can:

  • disable keep-alive
  • reduce the keep-alive timeout to a couple of seconds (both mitigations are sketched below)
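
A sketch of both mitigations (the values are illustrative, not recommendations):

import http from 'node:http';

// Client side: opt out of connection reuse entirely.
const agent = new http.Agent({ keepAlive: false });

// Server side: hold idle keep-alive connections for a couple of seconds instead
// of the 5-second default, so fewer idle sockets linger in client pools.
const server = http.createServer((req, res) => res.end('ok'));
server.keepAliveTimeout = 2000;
server.listen(0);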

@dhedey

dhedey commented Jan 21, 2024

@mcollina I have a (relatively) simple reproduction in this PR dhedey#1 where a server doesn't send back a Connection: Close header but closes the connection anyway (I believe similar to axum on receiving a Connection: Close header). Whilst I agree it's a race condition, I believe it's a race condition which should be fixable inside Node.js (although it might not be easy).

Basically whenever the server closes a TCP connection, the http agent with keep-alive: true is likely to cause a request error for one or more queued requests or requests submitted in the same event loop iteration as the previous response.

The repro seems to back up my understanding that the http agent is too slow to remove server-closed sockets from its pool, thereby re-using them, and immediately triggering a "socket hang up" (ECONNRESET) inside the http client the next time the socket is attempted to be used, via the socketCloseListener (which transfers a socket "close" event into a request "error").

It's even easier to repro if you use keepAlive: true and maxSockets: 1 on your http agent, send two requests, and include a Connection: close header in your http request. This will result in the server closing the connection immediately on you with a FIN packet. When the first response is returned, the http agent will immediately start on the queued second request, re-using the old socket which has been closed by the server, and trigger this bug.

Even without that maxSockets, this bug will still manifest if there is a little time between a response and next request, and then the http agent can re-use the socket anyway.
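
For concreteness, a minimal sketch of that maxSockets: 1 shape (not the dhedey#1 test itself; it assumes something like the repro server earlier in this thread listening on port 6969):

import http from 'node:http';

const agent = new http.Agent({ keepAlive: true, maxSockets: 1 });
const options = {
    agent,
    port: 6969,
    method: 'POST',
    headers: { Connection: 'close' }, // asks the server to FIN after the first response
};

// The second request is queued behind the first on the single pooled socket,
// so it tends to be written to the socket the server has already closed.
for (let i = 1; i <= 2; i++) {
    const req = http.request(options, (res) => res.resume());
    req.on('error', (err) => console.error(`request ${i} failed:`, err.code)); // typically ECONNRESET / socket hang up on request 2
    req.end('hello');
}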

Now that we have a repro in the node code base itself, I'll try to find some time soon to have a play with the implementation itself, and propose / implement some fixes to it.

@dhedey

dhedey commented Jan 22, 2024

I've spent most of the rest of the day on this, and I'm afraid with my newbie energy I was a little too optimistic, and I think for the greater good I'll have to admit defeat and stop working on this.

The earlier messages in this issue cover a lot of good ground, and are likely better places to get up to speed than this post.

To summarise my understanding

This issue crops up when a server closes the TCP connection (with a "FIN") while the http agent, in the meantime, believes the connection is suitable to be kept alive and routes another request to it.

To hit this issue requires:

  1. keep-alive: true to be set on the http agent. The default http agent now has this, which is why this issue got more attention in the last year or so.
  2. The server must not have sent a Connection: close HTTP header on a response before closing the socket with a FIN at the TCP layer (if the http agent sees such a response header, it does correctly ensure the socket is not 'kept alive' after the response completes).
  3. If the server sends a FIN after the previous response, the http agent must send another request before the socket "close" event is processed a few ticks later (so either it has a request queued or it receives a new request within a few ticks).

At its core, this is a race condition - however, the multiple ticks between the socket's "free" event firing as a result of parsing the http response and the socket "end" or "close" event firing mean that there's actually quite a large window for the condition to be hit, and the socket re-assigned to another request. Typically the socket has already received the FIN, and maybe even ACK'd it, whilst Node is still reusing the connection.

What I've looked into

I tried picking up part 2 as DevasiaThomas mentioned here: #47130 (comment) - essentially improving the handling of freeing to reduce the window/ticks where we've received the FIN but can re-assign the socket. But as detailed in the logs below, no flags are yet set to indicate a FIN has been received by the time the socket is freed and re-assigned.

Setting max sockets to 1 (which seems to give the more straight-forward timeline to reason about), we get a timeline (amended from the test in dhedey#1) as follows:

CLIENT_SOCKET_READY: Request 1
SERVER_SOCKET_OPEN: 1
SERVER_SOCKET_END: 1 (i.e. sends FIN)
CLIENT_RESPONSE_END: Request 1 (i.e. HTTP response parser completes)
CLIENT_SOCKET_FREE: Request 1 (which triggers CLIENT_AGENT_SOCKET_FREE)
CLIENT_SOCKET_REUSED: Request 1 for Request 2 (i.e. the "socket" event on request 2 uses the same socket) 
> With socket._readableState.ended: false
> With socket._readableState.endEmitted: false
> With socket._writableState.ending: false
> With socket._writableState.ended: false
> With socket._writableState.finished: false
CLIENT_SOCKET_END: Request 1
<CLIENT_SOCKET_CLOSED then fires which causes Request 2 to fail>

Sadly at the time free is called, none of the flags are set as part of end-processing yet; so a fix here I imagine would need to look at the code which handles the packet containing TCP data + FIN and ideally share the knowledge of the FIN before sharing the data (or as soon as possible afterwards).

I looked through:

  • Stream implementation (readable.js, writable.js, duplex.js)
  • Socket implementation (net.js, tcp_wrap.cc)
  • HTTP client + Agent implementation (_http_client.js, _http_agent.js)

But with my newbie knowledge I couldn't find where the FIN handling and the triggering of the "end" event occurred (unless it came from some read call / EOF in Readable.prototype.read, which gets triggered via afterConnect in net.js if the socket is no longer readable?).

I guess if the end/close event was only triggered by attempting to read from the socket, that might explain why this bug is so noticeable - but I'm probably wrong, and I think I'm far past my limits here in terms of code-base expertise, so will need to take a step back.

Someone with more knowledge of how this pipeline works may be able to point at how/whether it may be possible to capture / mark the received FIN earlier; before the socket close handlers are processed a few ticks later.

Possible future work

Note - I'm afraid this isn't something I can do, I'm just posting these in case they may be useful to others

  1. For my original use-case (i.e. old versions of node-fetch before my bug fix), we could avoid this issue if we simply set req.shouldKeepAlive to false if the user of http-client adds their own Connection: close header (and thereby overrides the Connection: keep-alive header which the agent uses). Note - as mentioned in part 2 of the summarisation, if the server replies with a Connection: close header, the issue also doesn't occur - but many servers such as axum don't send their own Connection: close header on the response when the client added it themselves.
  2. Someone with more knowledge could try to pick up the "shortening the window" investigation above. Essentially if we can read the TCP FIN that was sent immediately after the response, and mark the socket as "not to be reused" before we try to re-use it, we could avoid these issues.
  3. Solving Devasia's problem (where the server randomly times out connections, and doesn't use Keep-Alive timeout headers) may benefit from either improved handling of Keep-Alive hint headers; or a more general retry-on-failure approach - but this would be a lot of work and it sounds like the new replacement undici http client has already fixed this issue; so it's probably not on the cards?

In Summary

With lots of more knowledgeable people advising using undici (the new HTTP 1.1 client) or disabling keep-alive, there's not much more I can offer here at the moment.

Sorry I couldn't be of more use!

dhedey added a commit to dhedey/node that referenced this issue Jan 22, 2024
@timakhalaya

I would strongly advise doing the following:

  • Check the k8s network policies; check that you don't have any network restrictions
  • Worth checking the DNS resolution as well...
  • Check the timeout setting on your http module config
  • And only then should we check keepAlive and keepAliveMsecs; if you have downstream services, ensure that you don't have any restarts within the service -> check graceful termination

@lagog

lagog commented Mar 4, 2024

I believe I may have found another testcase that highlights this same issue. It clearly shows how the Node http agent will assign a dead socket to a new request, presumably because the close event handler is not given a chance to run; it executes too late in the event loop.

The issue is as follows:

  1. nodejs http client using default agent sends a request
  2. client remains busy for longer than the server keep-alive timeout (ie. processing the result of request 1 above)
  3. client makes another request to same remote - ERROR:
Error: read ECONNRESET
    at TCP.onStreamRead (node:internal/stream_base_commons:217:20) {
  errno: -4077,
  code: 'ECONNRESET',
  syscall: 'read'
}

Node.js v20.11.0

Note that if you defer the second request using setTimeout(0), it works fine.

Here's the test code:

////
//// SERVER
////

const server = require('node:http').createServer((req, res) => {
    console.log('SERVER: Received:', req.headers);
    res.end('OK');
}).listen(3000);

////
//// CLIENT
////

const request = async (url, options) => {
    return new Promise((resolve, reject) => {
        const req = require('node:http').request(url, options, (res) => {
            res.on('data', (data) => console.log('CLIENT: Received: ' + data.toString(), res.headers));
            res.on('end', resolve);
        });
        req.on('error', reject);
        req.end();
    });
};

(async () => {
    ////
    //// 1. Send a request and wait for response
    ////

    console.log('CLIENT: Send (1) ...');
    await request('http://localhost:3000');

    ////
    //// 2. Stay busy long enough for the server to close the connection.
    ////
    console.log('CLIENT: Busy....');
    const end = Date.now() + 10000;       // nodejs default http server keep-alive is 5 seconds
    while (end > Date.now());

    ////
    //// 3. Send a second request and await response
    ////

    // PROBLEM:    this request will be over the same socket that was created in the first request,
    //             which by now has timed out and was closed on the server side
    //
    // WORKAROUND: defer the request to another event loop iteration (by uncommenting below):
    //     await require('node:util').promisify(setTimeout)(0);
    console.log('CLIENT: Send (2) ...');
    await request('http://localhost:3000');

    server.close();
})();

There's probably not much that can be done with regards to handling the socket close in a more timely way in this situation, but is there anything that can be done in the agent to prevent assigning a dead socket?

@ShogunPanda
Contributor

The repro is quite interesting, thanks for that.
The only way to solve this would be to detect the ECONNRESET internally and eventually retry with a new socket.
But the fix would happen at the agent level I guess.

@dhedey

dhedey commented Mar 5, 2024

That's a nice clear (specific) repro; and this is a good thought experiment:

There's probably not much that can be done with regards to handling the socket close in a more timely way in this situation, but is there anything that can be done in the agent to prevent assigning a dead socket?

Indeed, it makes sense that there's not been a chance for node to action the received TCP close because of the while loop in this case.

As per @ShogunPanda 's thinking:

The only way to solve this would be to detect the ECONNRESET internally and eventually retry with a new socket.
But the fix would happen at the agent level I guess.

In this case, yeah adding a retry when sending down a closed socket sounds like the most fruitful approach (option 3 in the Future Work I laid out).

I'm slightly worried that retrying on ECONNRESET might not be quite specific enough, as I believe it covers a few different error cases, and we may wish the retry to be specific to this "attempted use after closed" case. More specifically, I imagine we may wish to avoid retrying if the socket closes midway through sending data (rather than being already closed), in case this breaks user expectations around e.g. not sending duplicate POST requests (even if such expectations aren't actually guaranteed in general).

But I really don't know enough about this to say whether that is actually a problem. Maybe catching all such ECONNRESETs is just the correct approach! The fact that a TCP reset ends up coming out of an http socket feels like a bit of an abstraction break. Although if we've already started sending data from the stream, we may be unable to just restart.

Perhaps we can check if the socket is still accessible before using it (via changes in node's C layer via some call into the OS TCP layer or saving a flag earlier when the socket changed - tcp_wrap.cc / uv_common.cc / tcp.c / stream.c); and throw a specific error at the socket/stream level; which can be used to close the http connection and retry on a new one at the http agent level - although this post RE Java suggests it's standard practice to attempt to read to detect that which I believe is what node is doing already when we get the ECONNRESET.

In any case, I am far out of my depth, so will leave all this thinking to people who know a lot more than I! I'm glad it's in such hands now.

@ziimakc

ziimakc commented Mar 14, 2024

I faced the same issue while just making 2 simple fetch requests one after another. While setting keepalive: false didn't help, await sleep(0) // setTimeout(0) solved the issue.

Not sure if this is just a coincidence, but if I make the request to the same API over HTTPS I don't get an error.

@Lomilar

Lomilar commented Mar 25, 2024

Having this problem specifically when the node process is doing a lot of processing and making HTTP-based requests to itself (analogous to the cases above), without a chance to "rest" and take care of all the new packets that came in.

We mitigated it by putting an await new Promise(resolve => setTimeout(resolve, 1)) at the beginning of each HTTP request -- setTimeout(0) wasn't enough for us.

For us, this was manifesting as an EPIPE error. We were using a containerized environment on a windows box.

For us, this was happening with undici 6.10.1, node 20, and express.

zanettea added a commit to zanettea/node that referenced this issue Apr 23, 2024
zanettea added a commit to zanettea/node that referenced this issue Apr 24, 2024
zanettea added a commit to zanettea/node that referenced this issue Apr 24, 2024
zanettea added a commit to zanettea/node that referenced this issue Apr 24, 2024
…lation between http1.1 servers and clients

reduced likelihood of race conditions on keep-alive timeout calculation between http1.1 servers and clients and honor server keep-alive timeout when agentTimeout is not set

Fixes: nodejs#47130
Fixes: nodejs#52649
zanettea added a commit to zanettea/node that referenced this issue Apr 24, 2024
zanettea added a commit to zanettea/node that referenced this issue Apr 24, 2024
zanettea added a commit to zanettea/node that referenced this issue Apr 30, 2024
reduce likelihood of race conditions on keep-alive timeout
calculation between http1.1 servers and clients and honor server
keep-alive timeout when agentTimeout is not set

Fixes: nodejs#47130
Fixes: nodejs#52649
@joelin-vyond

Hi there,
Since Node 22 was just released, it would be great to know whether the issue has been fixed in Node 22, or whether we still have to use Node v16 or the other http client, undici?

Thanks,
Joe Lin
