Memory leak #64

Open
AnatoliiVorobiov opened this issue Aug 29, 2020 · 9 comments

@AnatoliiVorobiov

I parsed many different pages and tried to collect all the links from them. I push the acquired URLs into an array, but over time I get a 'heap out of memory' error. I made a heap dump and discovered that the library returns sliced strings in some cases: I store only small strings, but they stay linked to the large source strings they were sliced from, which eventually causes the out-of-memory error. I used the 'getAttribute' function. I recommend adding a note to the docs so users can avoid this situation, or providing an additional function that returns a deep copy.
Code that reproduces the issue (any HTML with links will do):

const fs = require('fs');
const HTMLParser = require('node-html-parser');

(async () => {
  const array = [];
  setInterval(() => {
    fs.readFile('tmp.html', (err, buf) => {
      if (err) throw err;
      const html = buf.toString();
      const root = HTMLParser.parse(html);
      for (const elem of root.querySelectorAll('a')) {
        // Every extracted href stays referenced in `array` for the life of the process.
        array.push(elem.getAttribute('href'));
      }
    });
  }, 2000);
})();
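
If the retained string slices described above are indeed the cause, one possible workaround on the caller's side (this is not part of node-html-parser's API; the helper name is an assumption for illustration) is to copy each attribute value into a fresh string before storing it, so that the short value no longer references the large source document:

// Hypothetical helper, not provided by node-html-parser: round-tripping
// through a Buffer allocates a new string, so the original document string
// can be garbage collected.
function detachString(str) {
  return Buffer.from(str, 'utf8').toString('utf8');
}

// Inside the loop above this would become:
//   array.push(detachString(elem.getAttribute('href')));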
@taoqf
Owner

taoqf commented Sep 22, 2020

Sorry, I could not reproduce this.

@Dj-Polyester

(Quoted the original issue report and reproduction code above.)

@AnatoliiVorobiov, could you provide your HTML file so that the source of the error can be pinned down more easily? I am also using this module in my project, and there are likely other people who will run into the same issue. Thanks.

@taoqf
Owner

taoqf commented Apr 15, 2021

This should be resolved as of v3.1.1.

@taoqf taoqf closed this as completed Apr 15, 2021
@mpcabete

I had a problem parsing 1000 HTML documents one at a time (removing the reference to the previous one); Node.js throws the error:
Allocation failed - JavaScript heap out of memory
I switched to jsdom and that solved the issue.
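
A minimal sketch of the pattern described here, assuming the documents are read from disk one by one (the directory name and loop structure are assumptions for illustration, not taken from the report):

const fs = require('fs');
const { parse } = require('node-html-parser');

const files = fs.readdirSync('./pages'); // assumed directory of HTML files
let root = null;                         // only the current document stays referenced
for (const file of files) {
  root = parse(fs.readFileSync(`./pages/${file}`, 'utf8'));
  // ...extract whatever is needed from `root`...
}
// Each previous `root` should become collectable once it is overwritten above.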

@taoqf taoqf reopened this Apr 27, 2022
@taoqf
Owner

taoqf commented Apr 27, 2022

In what case?

@nonara
Collaborator

nonara commented Apr 27, 2022

@mpcabete Sorry to hear you're having issues. I can't think of anywhere in the codebase that such a leak would be happening, but I'd be happy to take a look. Can you provide a snippet of code or repo that will reproduce the issue?

@ThibaudLopez

ThibaudLopez commented Feb 14, 2023

I also get JavaScript heap out of memory.

I don't know whether the problem is in my code, node-html-parser, v8, or Node.js. But it doesn't seem normal for node-html-parser to consume so much memory.

For instance, one million '<foo></foo>' strings take only 11 MB of memory, node-html-parser can parse them successfully, and I can store the result in memory.

But adding just one child element, so that one million '<foo><bar/></foo>' strings amount to only 17 MB, makes it crash. That seems dramatic.

If I increase the v8 heap size with --max-old-space-size=2500 then it works, but I shouldn't need 2.5 GB of memory to parse and store 17 MB of input.

Steps to reproduce:

// Linux 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
// node v16.15.0
// node v18.12.1
// npm 8.5.5
// node-html-parser@6.1.4
var { parse } = require('node-html-parser');

This succeeds:

var a = []; for (var i = 0; i < 10**6; i++) a.push(parse('<foo></foo>'));

This crashes:

var a = []; for (var i = 0; i < 10**6; i++) a.push(parse('<foo><bar/></foo>'));
[89570:0x5dfe540]    82635 ms: Scavenge (reduce) 2044.6 (2082.1) -> 2044.2 (2082.3) MB, 4.2 / 0.0 ms  (average mu = 0.159, current mu = 0.002) allocation failure 
[89570:0x5dfe540]    82641 ms: Scavenge (reduce) 2045.0 (2082.3) -> 2044.6 (2082.6) MB, 3.7 / 0.0 ms  (average mu = 0.159, current mu = 0.002) allocation failure 
[89570:0x5dfe540]    82648 ms: Scavenge (reduce) 2045.3 (2082.6) -> 2044.8 (2082.8) MB, 5.5 / 0.0 ms  (average mu = 0.159, current mu = 0.002) allocation failure 

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb09c10 node::Abort() [node]
 2: 0xa1c193 node::FatalError(char const*, char const*) [node]
 3: 0xcf8dbe v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xcf9137 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xeb09d5  [node]
 6: 0xeb14b6  [node]
 7: 0xebf9de  [node]
 8: 0xec0420 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 9: 0xec339e v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
10: 0xe848da v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
11: 0x11fd626 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
12: 0x15f2099  [node]
Aborted (core dumped)
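
A rough way to quantify the overhead being described, assuming heapUsed measured before and after the loop is an acceptable proxy (a sketch only; the 10**5 iteration count is deliberately smaller to stay below the crash threshold):

// Run with: node --expose-gc measure.js   (the file name is an assumption)
const { parse } = require('node-html-parser');

function heapMB() {
  global.gc(); // force a collection so the reading is stable
  return process.memoryUsage().heapUsed / 2 ** 20;
}

const before = heapMB();
const a = [];
for (let i = 0; i < 10 ** 5; i++) a.push(parse('<foo><bar/></foo>'));
const after = heapMB();
console.log(`~${(((after - before) * 1024) / a.length).toFixed(1)} KB per parsed document`);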

@nonara
Collaborator

nonara commented Feb 14, 2023

A leak is where something which should be garbage collected is retained. In your case, you add each result to an array, so the memory would not be freed.

In terms of memory usage, you're effectively parsing one million HTML documents, which creates a full AST for each. Each AST consists of multiple class-instance nodes and will certainly take more memory than a single string.

Generally speaking, when you need to perform actions on millions of records, you're better off processing one item or a reasonable chunk of items at a time (per thread, if you parallelize) and moving on to the next item/chunk when finished, so you're not filling up memory needlessly (see the sketch after this comment).

It could be worth running a profiler to investigate memory leaks, but leaks are rare, and I haven't seen any indication of them.

As for bloat, that could be worth investigating. I'd need to see some numbers which showed memory size per node + base document AST size. If it seems off, I can check it out. Unfortunately, I'm insanely backlogged and I only volunteer on this when I'm able, so I'd need specific and compelling data.
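
As an illustration of the chunked approach suggested above (a sketch only, not part of the library; the directory name, chunk size, and the Buffer copy are assumptions):

const fs = require('fs');
const { parse } = require('node-html-parser');

const CHUNK_SIZE = 50;                   // assumed batch size
const files = fs.readdirSync('./pages'); // assumed directory of HTML files
const hrefs = [];

for (let i = 0; i < files.length; i += CHUNK_SIZE) {
  const chunk = files.slice(i, i + CHUNK_SIZE);
  for (const file of chunk) {
    const root = parse(fs.readFileSync(`./pages/${file}`, 'utf8'));
    for (const a of root.querySelectorAll('a')) {
      // Keep only the extracted strings, copied so they do not retain the AST
      // or the original document string.
      hrefs.push(Buffer.from(a.getAttribute('href') ?? '', 'utf8').toString('utf8'));
    }
    // `root` goes out of scope after each file, so its AST can be collected.
  }
  // A chunk boundary is a natural place to hand the next batch to another
  // worker or to flush `hrefs` somewhere, instead of growing it indefinitely.
}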
