100% cpu while parsing document #260

Open
papirosko opened this issue Dec 12, 2023 · 6 comments

@papirosko

This simple code hangs, causing Kubernetes to kill the pod:

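import {parse} from 'node-html-parser';
// Load the page source of https://www.a1supplements.com/
// (assumes Node 18+ global fetch and an ES module, so top-level await works)
const html = await (await fetch('https://www.a1supplements.com/')).text();
const root = parse(html);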

I use:

    "node-html-parser": "^6.1.10",

host: https://www.a1supplements.com/
html size: 4620497 characters

This is what I get from node inspect:

break in node_modules/node-html-parser/dist/nodes/html.js:1192
 1190                     oneBefore.removeChild(last);
 1191                     last.childNodes.forEach(function (child) {
>1192                         oneBefore.appendChild(child);
 1193                     });
 1194                 }

The page contents (in case the website is updated):

a1supplements.com.html.txt

@papirosko
Author

Actually, it did finish parsing after 7 minutes on my MacBook Pro with an i7. Is that considered expected behavior?

@papirosko
Author

As a workaround I use this (I mostly need data only from ):

    const root = parse(html, {
        parseNoneClosedTags: false,
        fixNestedATags: false,
        blockTextElements: {
            'div': true,
            'p': true,
            'pre': true
        }
    });
    const title = option(root.querySelector('title'))
        .map(x => x.text)
        .filter(x => !!x && x.trim().length > 0);

@taoqf
Owner

taoqf commented Dec 25, 2023

I'm sorry, I could not find any clue about your use case. I could not even find a title element. I don't have a MacBook either, but I parsed the file you uploaded and it finished parsing immediately.

@papirosko
Author

Using:

  ...
  "dependencies": {
    "axios": "^1.6.2",
    "node-html-parser": "^6.1.10"
  }
  ...

This is how to reproduce it (using default options in parse):

import parse from 'node-html-parser';
import axios from 'axios';

async function runImpl() {
    const url = 'https://www.a1supplements.com/';
    const resp = await axios.get(url);
    const html = resp.data;

    const start = Date.now();
    parse(html);
    const duration = Date.now() - start;
    console.log(`Parsing took: ${duration.toLocaleString()}ms, document size: ${html.length.toLocaleString()} chars`);
}

runImpl().then(() => console.log('done'))

result:

Parsing took: 383,073ms, document size: 4,705,196 chars
done

Using custom options in parse:

    ...
    const start = Date.now();
    parse(html, {
        parseNoneClosedTags: false,
        fixNestedATags: false,
        blockTextElements: {
            'div': true,
            'p': true,
            'pre': true,
            script: true,
            noscript: true,
            style: true,
        }
    });
    const duration = Date.now() - start;
    ...

results:

Parsing took: 284ms, document size: 4,705,196 chars
done

I believe the div entry in blockTextElements is what mainly fixes the issue.
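
To make the comparison above easier to repeat, here is a minimal sketch of a timing harness that runs the same document through several option sets. The helper names timeParse and compareOptionSets are hypothetical (they are not part of node-html-parser), and parseFn is assumed to have the same shape as the library's parse(html, options). The option sets are the two configurations used in this comment.

```javascript
// Hypothetical helper: times a single call of a parse-like function.
// parseFn is assumed to be callable as parseFn(html) or parseFn(html, options).
function timeParse(parseFn, html, options) {
    const start = Date.now();
    const root = options === undefined ? parseFn(html) : parseFn(html, options);
    return { root, durationMs: Date.now() - start };
}

// Hypothetical helper: runs every named option set against the same document
// and returns one timing record per option set.
function compareOptionSets(parseFn, html, optionSets) {
    return Object.entries(optionSets).map(([name, options]) => {
        const { durationMs } = timeParse(parseFn, html, options);
        return { name, durationMs, chars: html.length };
    });
}

// The two configurations compared in this comment.
const optionSets = {
    defaults: undefined, // plain parse(html)
    custom: {
        parseNoneClosedTags: false,
        fixNestedATags: false,
        blockTextElements: {
            div: true,
            p: true,
            pre: true,
            script: true,
            noscript: true,
            style: true,
        },
    },
};

// Usage, assuming parse is imported from 'node-html-parser' and html holds the page source:
//   for (const r of compareOptionSets(parse, html, optionSets)) {
//       console.log(`${r.name}: ${r.durationMs.toLocaleString()}ms, ${r.chars.toLocaleString()} chars`);
//   }
```

Passing the parse function in as an argument keeps the harness independent of how the library is imported, so the same code can also time a different parser or a different version of this one.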

taoqf added a commit that referenced this issue Dec 26, 2023
@taoqf
Owner

taoqf commented Dec 26, 2023

I don't think we should treat div as a block text element. The HTML may be broken; setting the option parseNoneClosedTags: true should speed up parsing.

@sidpremkumar

parseNoneClosedTags: true

Not sure if it is related, but I had a large HTML page and adding this fixed it.

Curious what this option means, @taoqf. I couldn't find any documentation or issues about it.
