Parsing document with a lot of HTML tags is slow #181

Open
alecpl opened this issue Apr 16, 2020 · 12 comments

Comments

Contributor

alecpl commented Apr 16, 2020

I have a script that generates an HTML sample of about 1.5 MB, emulating a real-world document, and then parses it.

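require 'vendor/autoload.php'; // Composer autoloader (path assumed)

// Build about 1.5 MB of HTML: 20,000 identical paragraphs.
$html = '<HTML><BODY>';
$lines = 20000;
while ($lines--) {
    $html .= '<P DIR=LTR><SPAN LANG="en-gb"><FONT FACE="Consolas">&gt;&gt; </FONT></SPAN></P>';
}

$html5 = new Masterminds\HTML5();
$node  = $html5->loadHTML($html);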

And here's the result:

PHP Fatal error:  Maximum execution time of 120 seconds exceeded in vendor/masterminds/html5/src/HTML5/Parser/DOMTreeBuilder.php on line 433
PHP Stack trace:
PHP   1. {main}() test.php:0
PHP   2. Masterminds\HTML5->loadHTML() test.php:23
PHP   3. Masterminds\HTML5->parse() vendor/masterminds/html5/src/HTML5.php:98
PHP   4. Masterminds\HTML5\Parser\Tokenizer->parse() vendor/masterminds/html5/src/HTML5.php:174
PHP   5. Masterminds\HTML5\Parser\Tokenizer->consumeData() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:89
PHP   6. Masterminds\HTML5\Parser\Tokenizer->tagOpen() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:132
PHP   7. Masterminds\HTML5\Parser\Tokenizer->tagName() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:284
PHP   8. Masterminds\HTML5\Parser\DOMTreeBuilder->startTag() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:388

I tested this with 2.7.0 and some older versions, with the same result. A sample half that size does finish, but it takes 27 seconds, so parse time does not grow linearly with input size.

Cross-ref: roundcube/roundcubemail#7331

Member

goetas commented Apr 18, 2020

Have you tried profiling it with Blackfire or some other profiler?

Contributor Author

alecpl commented Apr 18, 2020

Not yet, but I can add that the specific content doesn't matter much; the number of tags does. So it looks like this library has a problem parsing large HTML pages. FYI, DOMDocument parses the sample in less than a second.

Contributor Author

alecpl commented Apr 18, 2020

I'm not sure how useful this is, but here's an xdebug profile of a smaller sample. Sorry for the Polish labels; forcing English in KCacheGrind didn't work.
[Image: xdebug profile]

alecpl changed the title from "Parsing specific (long) HTML content is too slow (infinite loop?)" to "Parsing document with a lot of HTML tags is slow" on Apr 19, 2020
Contributor Author

alecpl commented Apr 19, 2020

So it looks like DOMElement::appendChild() is the main bottleneck. Here are some performance stats showing how the number of tags affects parse time, measured on PHP 7.4.

Tags  |  Time
---------------
10k   |   1.3s
20k   |   3.3s
30k   |   7.9s
40k   |  16.4s
50k   |  28.3s
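
To put a rough number on how the time grows, one could fit a power law (time ≈ c · tags^k) through the first and last rows of the table above. This is only a sketch: `scalingExponent` is a made-up helper name, and using just two data points is a simplifying assumption.

```php
<?php
// Hypothetical helper (not part of any library): estimate the exponent k
// in time ≈ c * tags^k from two (tags, seconds) measurements.
function scalingExponent(float $n1, float $t1, float $n2, float $t2): float
{
    return log($t2 / $t1) / log($n2 / $n1);
}

// Measurements copied from the table above (tags => seconds).
$timings = [10000 => 1.3, 20000 => 3.3, 30000 => 7.9, 40000 => 16.4, 50000 => 28.3];

$tags  = array_keys($timings);
$first = $tags[0];
$last  = $tags[count($tags) - 1];

$k = scalingExponent($first, $timings[$first], $last, $timings[$last]);
printf("estimated exponent: %.2f\n", $k);
```

For these figures the estimate comes out at about 1.9, which is close to quadratic growth rather than linear.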

Member

goetas commented Apr 19, 2020

Can you try benchmarking appendChild() alone to see whether it slows down after a certain number of tags?

Contributor Author

alecpl commented Apr 19, 2020

Nope, it's the other way round (more tags, better time per tag). What's more, the following script is blazingly fast (under 1 second).

$doc = new DOMDocument;
$body = $doc->createElement("body");
$doc->appendChild($body);
$lines = 100000;
while ($lines--) {
    $p = $doc->createElement("p");
    $body->appendChild($p);
    $span = $doc->createElement("span");
    $p->appendChild($span);
    $font = $doc->createElement("font");
    $span->appendChild($font);
}

Member

goetas commented Apr 19, 2020

[Image]

Member

goetas commented Apr 19, 2020

Hmm, weird...

Member

goetas commented Jun 14, 2020

The bottleneck seems to be autoclose(); with that removed, the script completes in 3 seconds.
Never mind.

Member

goetas commented Jun 14, 2020

This turned out to be a PHP issue that can be worked around with:

$html5 = new Masterminds\HTML5([
    'disable_html_ns' => true
]);
$node  = $html5->loadHTML($html);

The perf issue was introduced by https://github.com/php/php-src/blob/35e0a91db717fe441a89ca9554d8843d8ee63112/ext/dom/php_dom.c and php/php-src@84b90f6
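
To check the effect of the workaround on a given setup, a small timing harness along these lines might help. It is a sketch under assumptions: `buildSample` and `timeParse` are hypothetical helper names, the Composer autoloader path is a guess, and the sample is kept small so the default configuration finishes within the execution time limit.

```php
<?php
// Hypothetical helpers for illustration; not part of masterminds/html5.

// Build the same kind of sample as in the original report.
function buildSample(int $lines): string
{
    $row = '<P DIR=LTR><SPAN LANG="en-gb"><FONT FACE="Consolas">&gt;&gt; </FONT></SPAN></P>';
    return '<HTML><BODY>' . str_repeat($row, $lines);
}

// Run $parse on $html and return the elapsed wall-clock time in seconds.
function timeParse(callable $parse, string $html): float
{
    $start = microtime(true);
    $parse($html);
    return microtime(true) - $start;
}

// Assumed Composer autoloader location; adjust to your setup.
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require __DIR__ . '/vendor/autoload.php';
}

// Only run the comparison when the library is actually available.
if (class_exists('Masterminds\HTML5')) {
    $html = buildSample(5000);
    foreach ([false, true] as $disableNs) {
        $html5 = new Masterminds\HTML5(['disable_html_ns' => $disableNs]);
        $seconds = timeParse(function (string $h) use ($html5) {
            $html5->loadHTML($h);
        }, $html);
        printf("disable_html_ns=%s: %.1fs\n", $disableNs ? 'true' : 'false', $seconds);
    }
}
```

Raising the argument to `buildSample` step by step shows how each configuration scales; absolute numbers will vary with the PHP version and hardware.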

Contributor Author

alecpl commented Jul 26, 2020

Thanks for the workaround. With it, my initial test script takes 8 seconds, which is not that bad. DOMDocument needs 0.3 seconds.

Have you already created a ticket in PHP's bug tracker?


steinmb commented Feb 23, 2024

This was listed by xhprof with PHP 8.3.2-1. Is this still an issue, or should I look elsewhere?

[Image: Screenshot 2024-02-23 at 12 29 38]
