Memory leak #64

Open
AnatoliiVorobiov opened this issue Aug 29, 2020 · 9 comments

@AnatoliiVorobiov

I parsed many different pages and tried to collect all the links from them. I push the acquired URLs into an array, but over time I get a 'heap out of memory' error. I made a heap dump and discovered that the library returns sliced strings in some cases: I store only small strings, but they stay linked to the large source strings they were sliced from, which eventually causes the out-of-memory error. I used the 'getAttribute' function. I recommend adding a note to the docs so users can avoid this situation, or providing an additional function that returns a deep copy.
Code that reproduces the issue (any HTML with links will do):

const fs = require('fs');
const HTMLParser = require('node-html-parser');

(async () => {
  const array = [];
  setInterval(() => {
    fs.readFile('tmp.html', (err, buf) => {
      if (err) throw err;
      const html = buf.toString();
      const root = HTMLParser.parse(html);
      for (const elem of root.querySelectorAll('a')) {
        // Every extracted href stays referenced in `array` for the life of the process.
        array.push(elem.getAttribute('href'));
      }
    });
  }, 2000);
})();
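
If the retained string slices described above are indeed the cause, one possible workaround on the caller's side (this is not part of node-html-parser's API; the helper name is an assumption for illustration) is to copy each attribute value into a fresh string before storing it, so that the short value no longer references the large source document:

// Hypothetical helper, not provided by node-html-parser: round-tripping
// through a Buffer allocates a new string, so the original document string
// can be garbage collected.
function detachString(str) {
  return Buffer.from(str, 'utf8').toString('utf8');
}

// Inside the loop above this would become:
//   array.push(detachString(elem.getAttribute('href')));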
@taoqf
Owner

taoqf commented Sep 22, 2020

Sorry, I could not reproduce this.

@Dj-Polyester

(Quoted the original issue report and reproduction code above.)

@AnatoliiVorobiov, could you provide your HTML file so that the source of the error can be pinned down more easily? I am also using this module in my project, and there are likely other people who will run into the same issue. Thanks.

@taoqf
Owner

taoqf commented Apr 15, 2021

This should be resolved as of v3.1.1.

@taoqf taoqf closed this as completed Apr 15, 2021
@mpcabete

I had a problem parsing 1000 HTML documents one at a time (removing the reference to the previous one); Node.js throws the error:
Allocation failed - JavaScript heap out of memory
I switched to jsdom and that solved the issue.
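
A minimal sketch of the pattern described here, assuming the documents are read from disk one by one (the directory name and loop structure are assumptions for illustration, not taken from the report):

const fs = require('fs');
const { parse } = require('node-html-parser');

const files = fs.readdirSync('./pages'); // assumed directory of HTML files
let root = null;                         // only the current document stays referenced
for (const file of files) {
  root = parse(fs.readFileSync(`./pages/${file}`, 'utf8'));
  // ...extract whatever is needed from `root`...
}
// Each previous `root` should become collectable once it is overwritten above.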

@taoqf taoqf reopened this Apr 27, 2022
@taoqf
Owner

taoqf commented Apr 27, 2022

In what case?

@nonara
Collaborator

nonara commented Apr 27, 2022

@mpcabete Sorry to hear you're having issues. I can't think of anywhere in the codebase that such a leak would be happening, but I'd be happy to take a look. Can you provide a snippet of code or repo that will reproduce the issue?

@ThibaudLopez

ThibaudLopez commented Feb 14, 2023

I also get JavaScript heap out of memory.

I don't know whether the problem is in my code, node-html-parser, v8, or Node.js. But it doesn't seem normal for node-html-parser to consume so much memory.

For instance, one million '<foo></foo>' strings take only 11 MB of memory, node-html-parser can parse them successfully, and I can store the result in memory.

But adding just one child element, so that one million '<foo><bar/></foo>' strings amount to only 17 MB, makes it crash. That seems dramatic.

If I increase the v8 heap size with --max-old-space-size=2500 then it works, but I shouldn't need 2.5 GB of memory to parse and store 17 MB of input.

Steps to reproduce:

// Linux 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
// node v16.15.0
// node v18.12.1
// npm 8.5.5
// node-html-parser@6.1.4
var { parse } = require('node-html-parser');

This succeeds:

var a = []; for (var i = 0; i < 10**6; i++) a.push(parse('<foo></foo>'));

This crashes:

var a = []; for (var i = 0; i < 10**6; i++) a.push(parse('<foo><bar/></foo>'));
[89570:0x5dfe540]    82635 ms: Scavenge (reduce) 2044.6 (2082.1) -> 2044.2 (2082.3) MB, 4.2 / 0.0 ms  (average mu = 0.159, current mu = 0.002) allocation failure 
[89570:0x5dfe540]    82641 ms: Scavenge (reduce) 2045.0 (2082.3) -> 2044.6 (2082.6) MB, 3.7 / 0.0 ms  (average mu = 0.159, current mu = 0.002) allocation failure 
[89570:0x5dfe540]    82648 ms: Scavenge (reduce) 2045.3 (2082.6) -> 2044.8 (2082.8) MB, 5.5 / 0.0 ms  (average mu = 0.159, current mu = 0.002) allocation failure 

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb09c10 node::Abort() [node]
 2: 0xa1c193 node::FatalError(char const*, char const*) [node]
 3: 0xcf8dbe v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xcf9137 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xeb09d5  [node]
 6: 0xeb14b6  [node]
 7: 0xebf9de  [node]
 8: 0xec0420 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 9: 0xec339e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
10: 0xe848da v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
11: 0x11fd626 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
12: 0x15f2099  [node]
Aborted (core dumped)
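
A rough way to quantify the overhead being described, assuming heapUsed measured before and after the loop is an acceptable proxy (a sketch only; the 10**5 iteration count is deliberately smaller to stay below the crash threshold):

// Run with: node --expose-gc measure.js   (the file name is an assumption)
const { parse } = require('node-html-parser');

function heapMB() {
  global.gc(); // force a collection so the reading is stable
  return process.memoryUsage().heapUsed / 2 ** 20;
}

const before = heapMB();
const a = [];
for (let i = 0; i < 10 ** 5; i++) a.push(parse('<foo><bar/></foo>'));
const after = heapMB();
console.log(`~${(((after - before) * 1024) / a.length).toFixed(1)} KB per parsed document`);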

@nonara
Collaborator

nonara commented Feb 14, 2023

A leak is where something which should be garbage collected is retained. In your case, you add each result to an array, so the memory would not be freed.

In terms of memory usage, you're effectively parsing one million HTML documents, which creates a full AST for each. Each AST consists of multiple class-instance nodes and will certainly take more memory than a single string.

Generally speaking, when you need to perform actions on millions of records, you're better off processing one item or a reasonable chunk of items at a time (per thread, if you parallelize) and moving on to the next item/chunk when finished, so you're not filling up memory needlessly (see the sketch after this comment).

It could be worth running a profiler to investigate memory leaks, but leaks are rare, and I haven't seen any indication of them.

As for bloat, that could be worth investigating. I'd need to see some numbers which showed memory size per node + base document AST size. If it seems off, I can check it out. Unfortunately, I'm insanely backlogged and I only volunteer on this when I'm able, so I'd need specific and compelling data.
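
As an illustration of the chunked approach suggested above (a sketch only, not part of the library; the directory name, chunk size, and the Buffer copy are assumptions):

const fs = require('fs');
const { parse } = require('node-html-parser');

const CHUNK_SIZE = 50;                   // assumed batch size
const files = fs.readdirSync('./pages'); // assumed directory of HTML files
const hrefs = [];

for (let i = 0; i < files.length; i += CHUNK_SIZE) {
  const chunk = files.slice(i, i + CHUNK_SIZE);
  for (const file of chunk) {
    const root = parse(fs.readFileSync(`./pages/${file}`, 'utf8'));
    for (const a of root.querySelectorAll('a')) {
      // Keep only the extracted strings, copied so they do not retain the AST
      // or the original document string.
      hrefs.push(Buffer.from(a.getAttribute('href') ?? '', 'utf8').toString('utf8'));
    }
    // `root` goes out of scope after each file, so its AST can be collected.
  }
  // A chunk boundary is a natural place to hand the next batch to another
  // worker or to flush `hrefs` somewhere, instead of growing it indefinitely.
}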
