Memory leaks when using node-v10 and node-v12 #829

Closed
artemp opened this issue Oct 30, 2020 · 25 comments

artemp commented Oct 30, 2020

Memory leak(s)

To reproduce

https://github.com/nodejs/node-addon-examples/tree/master/napi-asyncworker-example/node-addon-api

then run the following script (MemoryLeak.js):


const runWorker = require('../build/Debug/napi-asyncworker-example-native');


function MemoryLeak(i, count) {
  if (i == count) return;
  let result = runWorker.runSimpleAsyncWorker(1, (err, result) => {
    if (err) {
      console.log("SimpleAsyncWorker returned an error: ", err);
    } else {
      if (i == count - 1) {
        //global.gc();
        //console.log(process.memoryUsage());
        //console.log(require('v8').getHeapStatistics());
        //console.log(require('v8').getHeapSpaceStatistics());
      }
      console.log("SimpleAsyncWorker returned '"+result+"'.");
    }
  });
  console.log("runSimpleAsyncWorker returned '"+result+"'.");
  MemoryLeak(++i, count);
}

//console.log(process.memoryUsage());
//console.log(require('v8').getHeapStatistics());
//console.log(require('v8').getHeapSpaceStatistics());
let i = 0;
MemoryLeak(i, 5);


node-v10.23.0 + node-addon-api-v3.0.2

valgrind --leak-check=yes  /opt/node-v10.23.0/bin/node test/MemoryLeak.js
==222152== 224 bytes in 4 blocks are definitely lost in loss record 35 of 41
==222152==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==222152==    by 0x75CFD9: napi_create_reference (in /opt/node-v10.23.0/bin/node)
==222152==    by 0x4863057: Napi::Reference<Napi::Function>::New(Napi::Function const&, unsigned int) (napi-inl.h:2153)
==222152==    by 0x4862292: Napi::Persistent(Napi::Function) (napi-inl.h:2317)
==222152==    by 0x4862683: Napi::AsyncWorker::AsyncWorker(Napi::Object const&, Napi::Function const&, char const*, Napi::Object const&) (napi-inl.h:3588)
==222152==    by 0x48625C1: Napi::AsyncWorker::AsyncWorker(Napi::Function const&, char const*, Napi::Object const&) (napi-inl.h:3564)
==222152==    by 0x4862534: Napi::AsyncWorker::AsyncWorker(Napi::Function const&, char const*) (napi-inl.h:3555)
==222152==    by 0x48624C5: Napi::AsyncWorker::AsyncWorker(Napi::Function const&) (napi-inl.h:3550)
==222152==    by 0x4861AA3: SimpleAsyncWorker::SimpleAsyncWorker(Napi::Function&, int) (SimpleAsyncWorker.cc:6)
==222152==    by 0x485E09A: runSimpleAsyncWorker(Napi::CallbackInfo const&) (RunSimpleAsyncWorker.cc:6)
==222152==    by 0x486139D: Napi::details::CallbackData<Napi::Value (*)(Napi::CallbackInfo const&), Napi::Value>::Wrapper(napi_env__*, napi_callback_info__*)::{lambda()#1}::operator()() const (napi-inl.h:82)
==222152==    by 0x4861753: napi_value__* Napi::details::WrapCallback<Napi::details::CallbackData<Napi::Value (*)(Napi::CallbackInfo const&), Napi::Value>::Wrapper(napi_env__*, napi_callback_info__*)::{lambda()#1}>(Napi::details::CallbackData<Napi::Value (*)(Napi::CallbackInfo const&), Napi::Value>::Wrapper(napi_env__*, napi_callback_info__*)::{lambda()#1}) (napi-inl.h:61)

looks interesting ^

==222152== LEAK SUMMARY:
==222152==    definitely lost: 544 bytes in 10 blocks
==222152==    indirectly lost: 16 bytes in 1 blocks
==222152==      possibly lost: 304 bytes in 1 blocks
==222152==    still reachable: 72,075 bytes in 1,076 blocks
==222152==         suppressed: 0 bytes in 0 blocks

The amount of definitely lost memory increases as the number of iterations increases, e.g. calling MemoryLeak(i, 100) ^^

node-v14.11.0 + node-addon-api-v3.0.2

valgrind --leak-check=yes  /opt/node-v14.11.0/bin/node test/MemoryLeak.js
==222253== LEAK SUMMARY:
==222253==    definitely lost: 64 bytes in 1 blocks
==222253==    indirectly lost: 0 bytes in 0 blocks
==222253==      possibly lost: 304 bytes in 1 blocks
==222253==    still reachable: 8,191 bytes in 30 blocks
==222253==                       of which reachable via heuristic:
==222253==                         multipleinheritance: 48 bytes in 1 blocks
==222253==         suppressed: 0 bytes in 0 blocks

With node >= 13 (tested with node-v13.12.0 and node-v14.11.0), memcheck reports constant results as above, no matter how many iterations.

It looks like there are memory leaks related to node-addon-api in node-10 and/or the corresponding N-API implementation (and also in node-v12.16.1).

Is this known behaviour? Has anyone come across similar issues?


mhdawson commented Nov 2, 2020

@artemp, @gabrielschulhof has done work to better handle cleanup, in particular for Worker threads. This change is in v13.x and higher, not 12.x and 10.x. nodejs/node@53ca0b9

I think that could explain what you are seeing. In the non-worker case the 'leak' would only be present when the Node.js process was shutting down. That would be reported by valgrind but not be a problem in real-world deployments. In addition to fixing it for workers where it could have been a real-world issue, the change above also fixed it on shutdown.


artemp commented Nov 3, 2020

@mhdawson - looks like nodejs/node@53ca0b9 is the one! I assume not, but are there any plans to backport these changes to node 10.x and/or 12.x? Cheers.


mhdawson commented Nov 4, 2020

@artemp, no plans to backport, as in most cases it's not a real issue since Node.js is shutting down anyway, and it's been like that for addons since the beginning of time rather than being a regression at some point.


artemp commented Nov 5, 2020

ok, thanks @mhdawson

artemp closed this as completed Nov 5, 2020

artemp commented Nov 6, 2020

I'm having second thoughts on this issue; I think I closed it prematurely :) I have done more testing and I'm seeing unbounded memory growth in long-running processes which use AsyncWorker with node 10.

in most cases it's not a real issue since Node.js is shutting down anyway

^ On the contrary, it looks like a defect. The memory doesn't get released by the GC over time, and it leads to an 'out-of-memory' condition in real-life deployments.

From my testing, it looks like node-addon-api AsyncWorker can't be used with node < 13 in production because of memory leaks. Both Node 10 and 12 are considered production-ready: https://nodejs.org/en/about/releases/

If backporting is not an option, perhaps this should be documented somewhere?
/cc @mhdawson @gabrielschulhof

artemp reopened this Nov 6, 2020

mhdawson commented Nov 6, 2020

@artemp thanks for re-opening. I'm now wondering if you are running into nodejs/node#35777.

Which versions of 12.x and 10.x are you seeing the issue with? If they are versions that have the regression related to 35777, can you try with an earlier version?


mhdawson commented Nov 6, 2020

Although I don't think that made it into 10.x, so maybe it's not related.

Do you have a test case that when run with a small heap quickly shows it running out of memory?


artemp commented Nov 8, 2020

@mhdawson thanks for getting back.

Which versions of 12.x and 10.x are you seeing the issue with?

I'm using node-v10.19.0, node-v10.23.0 and node-v12.16.1.

Do you have a test case that when run with a small heap quickly shows it running out of memory?

The leak is small and gets problematic only over time. Here is my attempt to show the issue using the unmodified napi-asyncworker-example-native from node-addon-examples:

const runWorker = require('../build/Release/napi-asyncworker-example-native');
const argv = process.argv.slice(2);

function MemoryLeak(i, count) {
  setTimeout(() => {
    if (i == count)
    {
      console.log(process.memoryUsage());
      console.log(require('v8').getHeapStatistics());
      return;
    }
    let result = runWorker.runSimpleAsyncWorker(0.001, (err, result) => {
      if (err) {
        console.log("SimpleAsyncWorker returned an error: ", err);
      } else {
        console.log(i + " SimpleAsyncWorker returned '"+result+"'.");
      }
      MemoryLeak(++i, count);
    });
  }, 0.1);
}

function main(args)
{
  console.log(args);
  var count = Number(args[0]);
  console.log("testing count=", count, "..");
  let i = 0;
  MemoryLeak(i, count);
  console.log("done.");
}

main(argv);

Run the same script with node-10 and node-15:

/opt/node-v10.23.0/bin/node test/MemoryLeak.js 1000000
/opt/node-v15.0.1/bin/node test/MemoryLeak.js 1000000

After some time:

[htop screenshot]

Not quite running out of heap memory, but it shows that node-10 accumulates some unreferenced memory over time compared to node-15.

If they are versions that have the regression related to 35777, can you try with an earlier version?

I'm going to try reverting nodejs/node#35777 to see if it makes any difference.


mhdawson commented Nov 9, 2020

@artemp thanks for the info and trying those things out.

I was asking whether we can get to an OOM because, if nodejs/node@53ca0b9 fixes the reported memory leaks at shutdown, it would seem that the issue was related to the finalizers not being run at shutdown (which nodejs/node@53ca0b9 fixed). That would mean there are objects keeping the memory alive, and if there is a bug such that they can never be collected, eventually we should be able to force an OOM with a smaller heap using --max-old-space-size and enough iterations (a sketch of what I mean is below).

I guess I should have also asked if the case where you see the leak is using workers. In that case it would be expected, and the right answer might be to document that the limitation applies only when using workers.
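To make that concrete, here is a minimal, untested sketch of the kind of test I have in mind, assuming the same napi-asyncworker-example-native addon as in the scripts above (the file name, require path and iteration count are illustrative):

// run with: node --max-old-space-size=4 test/ForceOom.js
// Chains AsyncWorker jobs one after another. If completed workers (or the JS
// objects they keep referenced) can never be collected, the deliberately tiny
// old space should eventually fill up and the process should abort with an OOM.
const runWorker = require('../build/Release/napi-asyncworker-example-native');

function chain(i, count) {
  if (i === count) return;
  runWorker.runSimpleAsyncWorker(0.001, (err) => {
    if (err) console.error("SimpleAsyncWorker returned an error:", err);
    chain(i + 1, count);
  });
}

chain(0, 1000000);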


artemp commented Nov 10, 2020

@mhdawson

Not having much luck triggering OOM with my simple script, yet.
Hmm.. I'll have to come up with something better to demo this issue..

I guess I should have also asked if the case where you see the leak is using workers.

yes ^.

@mhdawson (Member)

Without nodejs/node@53ca0b9, when using Node.js WorkerThreads, objects which have not been collected when the worker terminates would result in the leak you mentioned. So the situation I could understand is:

You repeatedly create a worker thread which:
- runs a number of node-addon-api AsyncWorker workers
- immediately terminates before a GC runs

If we confirm that is the case, then I think we should update the documentation to identify the potential issue for 10.x and 12.x and recommend that people use a pool of WorkerThreads (which is already advisable for other reasons) as opposed to repeatedly creating a new one; a rough sketch of the two patterns is below.
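This is untested and only illustrative: worker.js is a hypothetical file that would require the addon, run a few runSimpleAsyncWorker calls, and postMessage('done') when they finish (on 10.x, worker_threads also needs the --experimental-worker flag).

const { Worker } = require('worker_threads');

// Leaky pattern on 10.x/12.x: spawn a fresh Worker per batch of AsyncWorker jobs
// and terminate it as soon as the batch reports completion. Any per-AsyncWorker
// references not yet collected inside that worker when it terminates are abandoned.
function runBatchInFreshWorker() {
  const worker = new Worker('./worker.js');   // worker.js is hypothetical, see above
  worker.once('message', () => worker.terminate());
}

// Safer pattern: keep one long-lived worker (or a small fixed pool) and post
// batches to it, so references are released by normal GC inside the worker over
// its lifetime instead of being dropped at worker shutdown.
const pooledWorker = new Worker('./worker.js');
function runBatchInPool() {
  pooledWorker.postMessage('run-batch');
}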


artemp commented Nov 11, 2020

@mhdawson - I didn't realise you were referring to https://nodejs.org/api/worker_threads.html; if that is what you meant, I'm not using Node.js WorkerThreads, sorry for the confusion. I think there are multiple issues which got conflated.

I can now trigger OOMs reliably with node-v10 and node-addon-examples/napi-asyncworker-example/node-addon-api with the following change:

diff --git a/napi-asyncworker-example/node-addon-api/src/SimpleAsyncWorker.cc b/napi-asyncworker-example/node-addon-api/src/SimpleAsyncWorker.cc
index ff47d2f..e92b4e4 100644
--- a/napi-asyncworker-example/node-addon-api/src/SimpleAsyncWorker.cc
+++ b/napi-asyncworker-example/node-addon-api/src/SimpleAsyncWorker.cc
@@ -6,7 +6,7 @@ SimpleAsyncWorker::SimpleAsyncWorker(Function& callback, int runTime)
     : AsyncWorker(callback), runTime(runTime) {};
 
 void SimpleAsyncWorker::Execute() {
-    std::this_thread::sleep_for(std::chrono::seconds(runTime));
+    std::this_thread::sleep_for(std::chrono::milliseconds(runTime));
     if (runTime == 4) SetError ("Oops! Failed after 'working' 4 seconds.");
 };

and the script:

const runWorker = require('../build/Release/napi-asyncworker-example-native');
const argv = process.argv.slice(2);

function MemoryLeak(i, count) {
  if (i == count)
  {
    console.log(process.memoryUsage());
    console.log(require('v8').getHeapStatistics());
    return;
  }

  if (i%100 == 0)
    console.log("AsyncWork:",i);

  let result = runWorker.runSimpleAsyncWorker(10, (err, result) => {
    if (err) {
      console.log("SimpleAsyncWorker returned an error: ", err);
    } else {
      if (i%100 == 0) console.log(i + " SimpleAsyncWorker returned");
    }
    MemoryLeak(++i, count);
  });
}

function main(args)
{
  console.log(args);
  var count = Number(args[0]);
  console.log("testing count=", count, "..");
  let i = 0;
  MemoryLeak(i, count);
  console.log("done.");
}

main(argv);

node-v10.23.0

/opt/node-v10.23.0/bin/node --trace-gc --max-old-space-size=4 ./test/MemoryLeak.js 100000
[5903:0x5608fe9b0dd0]       23 ms: Mark-sweep 2.5 (4.2) -> 2.1 (5.2) MB, 2.6 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure scavenge might not succeed
[5903:0x5608fe9b0dd0]       33 ms: Mark-sweep 2.6 (5.2) -> 2.5 (5.7) MB, 1.9 / 0.0 ms  (average mu = 0.815, current mu = 0.815) allocation failure scavenge might not succeed
[5903:0x5608fe9b0dd0]       41 ms: Mark-sweep 3.1 (5.7) -> 2.9 (8.2) MB, 2.1 / 0.0 ms  (average mu = 0.784, current mu = 0.746) allocation failure GC in old space requested
[5903:0x5608fe9b0dd0]       55 ms: Mark-sweep 3.5 (8.2) -> 3.3 (8.7) MB, 3.1 / 0.0 ms  (average mu = 0.780, current mu = 0.777) allocation failure GC in old space requested
[ '100000' ]
testing count= 100000 ..
AsyncWork: 0
done.
0 SimpleAsyncWorker returned
AsyncWork: 100
100 SimpleAsyncWorker returned
AsyncWork: 200
200 SimpleAsyncWorker returned
AsyncWork: 300
300 SimpleAsyncWorker returned
AsyncWork: 400
400 SimpleAsyncWorker returned
AsyncWork: 500
500 SimpleAsyncWorker returned
AsyncWork: 600
600 SimpleAsyncWorker returned
AsyncWork: 700
700 SimpleAsyncWorker returned
[5903:0x5608fe9b0dd0]     8474 ms: Mark-sweep 3.7 (8.7) -> 3.4 (5.7) MB, 6.5 / 0.1 ms  (average mu = 0.999, current mu = 0.999) allocation failure GC in old space requested
[5903:0x5608fe9b0dd0]     8477 ms: Mark-sweep 3.4 (5.7) -> 3.4 (5.7) MB, 2.0 / 0.0 ms  (average mu = 0.998, current mu = 0.079) allocation failure GC in old space requested
[5903:0x5608fe9b0dd0]     8479 ms: Mark-sweep 3.4 (5.7) -> 3.4 (5.7) MB, 1.7 / 0.0 ms  (average mu = 0.998, current mu = 0.155) last resort GC in old space requested
[5903:0x5608fe9b0dd0]     8481 ms: Mark-sweep 3.4 (5.7) -> 3.4 (5.7) MB, 2.0 / 0.0 ms  (average mu = 0.996, current mu = 0.021) last resort GC in old space requested

<--- Last few GCs --->

[5903:0x5608fe9b0dd0]     8477 ms: Mark-sweep 3.4 (5.7) -> 3.4 (5.7) MB, 2.0 / 0.0 ms  (average mu = 0.998, current mu = 0.079) allocation failure GC in old space requested
[5903:0x5608fe9b0dd0]     8479 ms: Mark-sweep 3.4 (5.7) -> 3.4 (5.7) MB, 1.7 / 0.0 ms  (average mu = 0.998, current mu = 0.155) last resort GC in old space requested
[5903:0x5608fe9b0dd0]     8481 ms: Mark-sweep 3.4 (5.7) -> 3.4 (5.7) MB, 2.0 / 0.0 ms  (average mu = 0.996, current mu = 0.021) last resort GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

    0: ExitFrame [pc: 0x15df1d3dbf1d]
    1: StubFrame [pc: 0x15df1d394e11]
    2: StubFrame [pc: 0x15df1d408edf]
Security context: 0x0360fa99e6c1 <JSObject>
    3: formatWithOptions [0x1a9af85d5bf9] [util.js:162] [bytecode=0x1a9af85831b9 offset=804](this=0x0f5175d44ef9 <Object map = 0x1536356b6189>,inspectOptions=0x3c9dcf732b51 <Object map = 0x1536356c8879>,f=0x3c9dcf74fdc1 <String[10]: AsyncWork:>)
    4: arguments adaptor frame: 3->2
 ...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: 0x5608fc6eb594 node::Abort() [/opt/node-v10.23.0/bin/node]
 2: 0x5608fc6eb5e2  [/opt/node-v10.23.0/bin/node]
 3: 0x5608fc918cba v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/opt/node-v10.23.0/bin/node]
 4: 0x5608fc918f85 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/opt/node-v10.23.0/bin/node]
 5: 0x5608fcd46bf6  [/opt/node-v10.23.0/bin/node]
 6: 0x5608fcd58f19 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [/opt/node-v10.23.0/bin/node]
 7: 0x5608fcd1c5ab v8::internal::Factory::NewFixedArrayWithFiller(v8::internal::Heap::RootListIndex, int, v8::internal::Object*, v8::internal::PretenureFlag) [/opt/node-v10.23.0/bin/node]
 8: 0x5608fcd2fb11 v8::internal::Factory::SetNumberStringCache(v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::String>) [/opt/node-v10.23.0/bin/node]
 9: 0x5608fcd2fc01 v8::internal::Factory::NumberToString(v8::internal::Handle<v8::internal::Object>, bool) [/opt/node-v10.23.0/bin/node]
10: 0x5608fd01260c v8::internal::Runtime_NumberToStringSkipCache(int, v8::internal::Object**, v8::internal::Isolate*) [/opt/node-v10.23.0/bin/node]
11: 0x15df1d3dbf1d 
Aborted (core dumped)

The same script runs happily under node-v15.0.1

@mhdawson (Member)

@artemp, getting back to this has been on my TODO list, but other priorities have gotten in the way. Just an FYI that it is still on my list.


artemp commented Nov 18, 2020

@mhdawson - no probs, thanks.

@mhdawson (Member)

I recreated the OOMs with node-v10.23.0.

I then tried with node-v12.18.3 and it seems to run fine.

@mhdawson (Member)

I then tried increasing the heap from 4MB to 6MB when running with 10.x (4MB is pretty small):

node --trace-gc --max-old-space-size=6 ./test/MemoryLeak.js 100000

@mhdawson (Member)

With the larger heap (6MB) it seems to run OK with 10.x, and from the GC output it seems to stabilise at:

[18680:0x4290900]   123269 ms: Mark-sweep 4.9 (7.2) -> 3.9 (7.2) MB, 2.0 / 0.3 ms  (average mu = 1.000, current mu = 1.000) allocation failure scavenge might not succeed

where the heap goes up to 4.9MB and then, after each GC, goes back down to 3.9MB.


mhdawson commented Nov 20, 2020

I then looked at the Resident Set Size and Virtual Memory Size with:

ps -eo vsz,rss,comm |grep node

The Virtual memory size seems to stay consistent, while the RSS increases and then drops down after each GC:

598556 33896 node
user1@user1-VirtualBox:~/newpull/io.js$ ps -eo vsz,rss,comm |grep node
598556 33896 node
user1@user1-VirtualBox:~/newpull/io.js$ ps -eo vsz,rss,comm |grep node
598556 33112 node
user1@user1-VirtualBox:~/newpull/io.js$ ps -eo vsz,rss,comm |grep node
598556 33112 node

@mhdawson (Member)

I let it run for >15 mins and the GC/heap size still seems stable:

user1@user1-VirtualBox:~/newpull/io.js$ ps -p 18680 -o etime
    ELAPSED
      18:35
[18680:0x4290900]   985718 ms: Mark-sweep 4.9 (7.2) -> 3.9 (7.2) MB, 2.8 / 0.3 ms  (average mu = 1.000, current mu = 1.000) allocation failure scavenge might not succeed

as does the virtual memory and resident set size:

user1@user1-VirtualBox:~/newpull/io.js$ ps -eo vsz,rss,comm |grep node
598556 33860 node

which is still in the same range as after the first 5 mins.

And in fact while I was typing this the test actually finished.

@mhdawson (Member)

I'll start it up again and let it run for a bit longer to confirm the resident set size is not increasing, but I think the heap was just a bit too small for 10.x: with a slightly larger heap it seems stable, and I don't see the memory reported by the OS increasing.


mhdawson commented Nov 21, 2020

At 300,000 iterations I now see:

[8068:0x27b6900]  3207525 ms: Mark-sweep 5.0 (7.2) -> 4.0 (7.2) MB, 2.9 / 0.2 ms  (average mu = 1.000, current mu = 1.000) allocation failure scavenge might not succeed

for a roughly 100KB increase in heap size (and something similar in the RSS), but if you take 100KB over 300k iterations (since it was at 3.9MB near the start) that is about 0.33 bytes per iteration, so I don't think it's related to the iterations (you can't leak 0.33 bytes at a time :)); the memory used by the GC can just vary a bit.

So after running for an hour, I don't see anything that would point to a memory leak.

@mhdawson (Member)

And just to be sure, I confirmed I was running 10.x:

user1@user1-VirtualBox:~/napi-mem/node-addon-examples/napi-asyncworker-example/node-addon-api$ node --version
v10.23.0


artemp commented Nov 22, 2020

@mhdawson thanks for looking into this.
So, to summarise, this looks like a memory management issue rather than a straightforward memory leak. My understanding is that under heavy load and memory-intensive tasks, node-10/node-addon-api doesn't recover 'external' memory efficiently enough, which can result in OOMs. Changes in node-12/14 improve this situation because of lower memory requirements overall and perhaps due to changes to finalisers. I have a hunch that using NAN doesn't result in this behaviour, but I don't have proof handy. Is this how you see it? Do you know what changes in v8/node/node-addon-api are at play here? Cheers.
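One way to check the 'external' memory part of that summary: in a script like the ones above, periodically sample process.memoryUsage(), which includes V8's count of externally allocated memory, while the AsyncWorker jobs run. A minimal sketch (the interval is arbitrary):

// Periodically log RSS, JS heap usage and externally allocated memory while the
// AsyncWorker loop from the scripts above is running. A genuine leak should show
// 'external' or RSS climbing without recovering, whereas GC slack goes up and down.
const sampler = setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage();
  console.log('rss:', rss, 'heapUsed:', heapUsed, 'external:', external);
}, 5000);
sampler.unref();  // don't keep the process alive just for the sampler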


mhdawson commented Nov 23, 2020

@artemp, regarding "node-10/node-addon-api doesn't recover 'external' memory efficiently": I don't think that is the case. From my analysis 10.x seems to run just fine; it's just that if you set a very small heap it will run out of memory, whereas 12/14 improve things so that even with a 4MB (versus 6MB) heap it runs OK from the start.

For 10.x, provided you give it a 6MB or larger heap, it looks to me like it is running OK.

My summary is that, from what I can see with the test case provided, 10, 12 and 14 all seem to be OK, the only difference being if you set a heap size of less than 6MB, which is extremely small.


artemp commented Nov 25, 2020

@mhdawson OK, thanks again for investigating this issue, closing.

artemp closed this as completed Nov 25, 2020