Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TLSWrap objects grow with HTTP keep-alive enabled until out of memory error occurs #33468

Closed
mgilbey opened this issue May 19, 2020 · 16 comments
Closed
Labels
http Issues or PRs related to the http subsystem. https Issues or PRs related to the https subsystem.

Comments

@mgilbey
Copy link

mgilbey commented May 19, 2020

  • Version:
$ node -p process.versions
{
  node: '14.2.0',
  v8: '8.1.307.31-node.33',
  uv: '1.37.0',
  zlib: '1.2.11',
  brotli: '1.0.7',
  ares: '1.16.0',
  modules: '83',
  nghttp2: '1.40.0',
  napi: '6',
  llhttp: '2.0.4',
  openssl: '1.1.1g',
  cldr: '36.1',
  icu: '66.1',
  tz: '2019c',
  unicode: '13.0'
}
  • Platform: Darwin <host> 19.4.0 Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64 x86_64 and node docker image node:13.13.0-alpine

  • Subsystem: https

What steps will reproduce the bug?

I have created a script to reproduce at https://github.com/mgilbey/node-tls-wrap

The script runs a TLS server and https Agent to be self contained, but only the Agent is needed to reproduce the problem with a HTTPS server to connect to. The classes are only used to help identify which objects are holding references in the heap snapshots.

The script will open 100 connections, send and receive two HTTP requests on each connection (demonstrating keep-alive functions as expected) and then lets the sockets close due to the keep-alive timeout. This repeats five times, leaving 501 TLSWrap objects in memory. The scripts forces a garbage collection, and then collects a heap dump before exiting.

The agent reports the expected number of active / free sockets (never more than 100).

How often does it reproduce? Is there a required condition?

The script will consistently reproduce the issue.

I believe the salient points for reproduction are:

  • https Agent with keepAlive enabled
  • The number of open sockets never exceeds the maxSockets Agent option.
  • Sockets are closed due to the keepAlive timeout specified by the Agent being reached.

What is the expected behavior?

TLSWrap and TLSSocket objects are garbage collected when the socket from the agent is closed due to the keep-alive timeout, and no further keep-alive socket re-use is possible.

What do you see instead?

TLSWrap and TLSSocket objects will grow in memory until the old-space heap limits are reached at e.g. 10 or 300 MB limits. The number of TLSWrap and TLSSocket objects in memory will exceed the maxSockets, maxFreeSockets or maxCachedSessions Agent options.

An allocation timeline and heap snapshot are also included in the repository.

image

Each TLSWrap object has four retainers, although none appear to be the https Agent.

image

Additional information

To demonstrate the process crashing due to heap limits, run with --max-old-space-size=10 and increase the number of iterations on line 128 to 50. You can increase heap and the number of iterations should you be more patient.

The issue does not reproduce with the http Agent.

The cause appears to be very similar to resolved issue #19859 that affected the http Agent, particularly the reference to the open sockets being less than the maxSockets option.

@mgilbey mgilbey changed the title TLSWrap objects grow with HTTP keep-alive enabled until of out memory errors occurs TLSWrap objects grow with HTTP keep-alive enabled until out memory error occurs May 19, 2020
@addaleax
Copy link
Member

@mgilbey Can you check whether this was introduced between v13.8.0 and v13.9.0? I’m kind of suspecting 9fdb6e6 here, based on the retainer output.

If yes, #31745 might fix this, although it requires some updating as indicated in the PR. I’ll get on that.

@mgilbey
Copy link
Author

mgilbey commented May 19, 2020

I've checked versions 13.0.0, 13.8.0 and 13.9.0 but all fail to close connections after the keep-alive timeout has elapsed, meaning that they bypass the test entirely, and leak unclosed sockets instead of closed ones.

I went back to 12.6.3 and it behaves as expected with sockets closed after the keep-alive timeout elapses and does not retain TLSWrap objects. When it completes the snapshot only references a single TLSWrap object, which would be expected when both the client and server are destroyed.

Conversely the 13.x versions above retain the TLSWrap objects even after client, server and all sockets are closed, which suggests that the retention was before 9fdb6e6.

@lpinca lpinca added http Issues or PRs related to the http subsystem. https Issues or PRs related to the https subsystem. labels May 20, 2020
@omsmith
Copy link
Contributor

omsmith commented Jun 2, 2020

Ran into what appears to be the same or similar issue when updating node.js to 14.3 (from 8.x) in one of our services. In our case we're also using keep-alive and to exacerbate the issue, we deliberately destroy sockets 10 minutes after first use to ensure we're responding to DNS changes of servers that may not necessarily close connections.

We've set up the service to take heapsnapshots every hour. Here's a comparison of one of the workers between hour 1 and hour 2 of operation, June 2 00:48Z and 01:48Z respectively. In this case I don't have an hour 3, because a graceful restart was necessary ahead of any OOM issues.

2020-06-02T12:46:10+00:00

I compiled and deployed a package of v14.x with 6f6bf01 cherry-picked on top. This appears to have resolved the issue, and the service thankfully made it through the night without intervention. Certainly some slightly different load profiles due to difference in time, but meaningful nonetheless. Here's a comparison of one of the workers between hour 1 and hour 9 of operation.

2020-06-02T12:51:35+00:00

Thank you @addaleax, and thanks @mgilbey for reporting the issue - hope it's fixed for you as well.

addaleax added a commit to addaleax/node that referenced this issue Jun 9, 2020
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots
through their associated `CleanupHookCallback`s which were stored on
the `Environment`; however, this is inaccurate, because:

- Edges in heap dumps imply a keeps-alive relationship, but cleanup
  hooks do not keep the `BaseObject`s that they point to alive.
- It loses information about whether `BaseObject` instances are
  GC roots: Even weak `BaseObject`s are now, practically speaking,
  showing up as hanging off a GC root when that isn’t actually the case
  (e.g. in the description of nodejs#33468).

Thus, this is a partial revert of f59ec2a.

Refs: nodejs#33468
Refs: nodejs#27018
addaleax added a commit that referenced this issue Jun 11, 2020
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots
through their associated `CleanupHookCallback`s which were stored on
the `Environment`; however, this is inaccurate, because:

- Edges in heap dumps imply a keeps-alive relationship, but cleanup
  hooks do not keep the `BaseObject`s that they point to alive.
- It loses information about whether `BaseObject` instances are
  GC roots: Even weak `BaseObject`s are now, practically speaking,
  showing up as hanging off a GC root when that isn’t actually the case
  (e.g. in the description of #33468).

Thus, this is a partial revert of f59ec2a.

Refs: #33468
Refs: #27018

PR-URL: #33809
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
@abartomeu
Copy link

abartomeu commented Jun 15, 2020

Since I upgraded from 12.16.3 to 12.17 I have had a memory leak in production that I am trying to figure out. I took a memory snapshot in the staging environment and I can see 898 TLSWrap objects with the same 4 retainers as mentioned by @mgilbey. I couldn't validate yet that is the leak because I was only able to take one snapshot but if it is, it was introduced in 12.17.

@mgilbey mgilbey changed the title TLSWrap objects grow with HTTP keep-alive enabled until out memory error occurs TLSWrap objects grow with HTTP keep-alive enabled until out of memory error occurs Jun 15, 2020
@christhegrand
Copy link

We also started seeing this after upgrading from 12.16 to 12.17.

@abartomeu
Copy link

I have done further testing on this issue. I have taken a memory snapshot in after 24 hours and I see 64000 TLSWrap objects using about 400MB of memory. In previous experiments I have seen that the number of these TLSWrap objects is always growing. I also tested that even 10 mins after I stopped all traffic to my service, and Http keepalive should have expired, the TLSWrap objects were still around.

Is there any chances we could get this fixed?

codebytere pushed a commit that referenced this issue Jun 18, 2020
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots
through their associated `CleanupHookCallback`s which were stored on
the `Environment`; however, this is inaccurate, because:

- Edges in heap dumps imply a keeps-alive relationship, but cleanup
  hooks do not keep the `BaseObject`s that they point to alive.
- It loses information about whether `BaseObject` instances are
  GC roots: Even weak `BaseObject`s are now, practically speaking,
  showing up as hanging off a GC root when that isn’t actually the case
  (e.g. in the description of #33468).

Thus, this is a partial revert of f59ec2a.

Refs: #33468
Refs: #27018

PR-URL: #33809
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
@addaleax
Copy link
Member

@christhegrand @abartomeu I think it’s a good sign that for @omsmith this has been fixed through e8cc269.

I’ve opened a backport PR for v12.x to expedite getting the fix in: #33962

It would be very helpful if you could build Node.js from source for that PR and confirm that it does solve the issue for you as well.

@abartomeu
Copy link

Hi @addaleax, we have tried the patched version for 12h and the memory trend seems similar to that in v12.16.3. That alone would justify promoting this fix. I will leave it running for at least one more day and will update it.

@abartomeu
Copy link

abartomeu commented Jun 24, 2020

About 30h running and memory continues to be similar to that in v12.16.3. I haven't noticed any other change in any other performance parameters (CPU usage, response time, etc)

@addaleax, could you kindly comment how long would this usually take to get to an official nodejs version?

@mortenbo
Copy link

I've been debugging a memory leak in our Node.js app after upgrading from Node 10 to Node 14. I was close to giving up when stumbling upon this issue. Downgrading from 14.4.0 to 12.6.0 fixed it for us.

@karicem
Copy link

karicem commented Jun 25, 2020

We recently upgraded from 12.16 to 12.18 and noticed the memory leak.
Will the fix for the memory leak be included in the next release of Node.js 12.x? When will that be available? We would like to stay on the most current version, rather then rolling back, to ensure we have all security fixes.

@addaleax
Copy link
Member

We recently upgraded from 12.16 to 12.18 and noticed the memory leak.

@karicem Just to make sure, are you confident that it is this one or is there a chance it’s a separate bug?

Will the fix for the memory leak be included in the next release of Node.js 12.x?

See #33962probably yes.

When will that be available?

According to nodejs/Release#494, that would be July 14th. However, as you can see in the PR, we’re discussing potentially releasing this earlier than that.

@karicem
Copy link

karicem commented Jun 27, 2020

We haven't rebuilt from source with the patch but we have some data that seems to indicate it will. We have one app running with 12.18 that does not use keep-alive and it does not show signs of growing memory usage. When we look at memory usage of other apps running 12.18 that are using keep-alive it keeps increasing, at different rates depending on the app.

@BethGriggs
Copy link
Member

Opened a v12.18.2 release proposal (#34077) which contains this patch. Aiming for it to be released next Tuesday, 30th June. Here's a release candidate build that you may be able to test with - https://nodejs.org/download/rc/v12.18.2-rc.0/

@karicem
Copy link

karicem commented Jun 27, 2020

Thanks, we will have a look.

codebytere pushed a commit that referenced this issue Jun 30, 2020
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots
through their associated `CleanupHookCallback`s which were stored on
the `Environment`; however, this is inaccurate, because:

- Edges in heap dumps imply a keeps-alive relationship, but cleanup
  hooks do not keep the `BaseObject`s that they point to alive.
- It loses information about whether `BaseObject` instances are
  GC roots: Even weak `BaseObject`s are now, practically speaking,
  showing up as hanging off a GC root when that isn’t actually the case
  (e.g. in the description of #33468).

Thus, this is a partial revert of f59ec2a.

Refs: #33468
Refs: #27018

PR-URL: #33809
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
@mgilbey
Copy link
Author

mgilbey commented Jul 2, 2020

Confirming that v12.18.2 and v14.5.0 no longer exhibit this behaviour. Thank you for the fix!

I've updated the reproduction repository with new certificates as the original expired if people want to test other versions.

@mgilbey mgilbey closed this as completed Jul 7, 2020
codebytere pushed a commit that referenced this issue Jul 10, 2020
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots
through their associated `CleanupHookCallback`s which were stored on
the `Environment`; however, this is inaccurate, because:

- Edges in heap dumps imply a keeps-alive relationship, but cleanup
  hooks do not keep the `BaseObject`s that they point to alive.
- It loses information about whether `BaseObject` instances are
  GC roots: Even weak `BaseObject`s are now, practically speaking,
  showing up as hanging off a GC root when that isn’t actually the case
  (e.g. in the description of #33468).

Thus, this is a partial revert of f59ec2a.

Refs: #33468
Refs: #27018

PR-URL: #33809
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
codebytere pushed a commit that referenced this issue Jul 14, 2020
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots
through their associated `CleanupHookCallback`s which were stored on
the `Environment`; however, this is inaccurate, because:

- Edges in heap dumps imply a keeps-alive relationship, but cleanup
  hooks do not keep the `BaseObject`s that they point to alive.
- It loses information about whether `BaseObject` instances are
  GC roots: Even weak `BaseObject`s are now, practically speaking,
  showing up as hanging off a GC root when that isn’t actually the case
  (e.g. in the description of #33468).

Thus, this is a partial revert of f59ec2a.

Refs: #33468
Refs: #27018

PR-URL: #33809
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem. https Issues or PRs related to the https subsystem.
Projects
None yet
Development

No branches or pull requests

9 participants