Invalid parsing result when head/body tag is missing #166

alecpl · 2019-04-06T06:49:19Z

Consider this:

<html>Hello, This is a test.<br />Does it work this time?</html>

IMO, this is valid HTML, and DOMDocument parses it correctly. However, the HTML5 parser ignores the first line of text. We're using the loadHTML() method.

Even this one works with DOMDocument:

Hello, This is a test.<br />Does it work this time?

According to Mozilla documentation:

html: The start tag may be omitted if the first thing inside the element is not a comment.
body: The start tag may be omitted if the first thing inside it is not a space character, comment, <script> element or <style> element.

Reference: roundcube/roundcubemail#6713 (comment)

goetas · 2019-04-06T06:52:23Z

Can you post the references to the Mozilla documentation about this?

goetas · 2019-04-06T06:54:37Z

https://html5.validator.nu/ says it is not valid.

alecpl · 2019-04-06T07:05:44Z

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/html
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/body

I tried https://validator.w3.org and it also returns an error, but the error looks strange to me: "Element head is missing a required instance of child element title", while there's no head at all.

alecpl · 2019-04-06T07:12:06Z

Official HTML5.2 documentation says:

A head element’s start tag may be omitted if the element is empty, or if the first thing inside the head element is an element.
A body element’s start tag may be omitted if the element is empty, or if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a meta, link, script, style, or template element.
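For illustration, the body rule quoted above can be written as a small predicate. This is only a sketch: canOmitBodyStartTag() is a hypothetical helper, not part of DOMDocument or html5-php, and it inspects already-built DOM nodes rather than the token stream a real parser sees.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: may the <body> start tag be omitted, given the
 * first node that would end up inside the body element?
 */
function canOmitBodyStartTag(?DOMNode $first): bool
{
    if ($first === null) {
        return true; // The element is empty.
    }
    if ($first instanceof DOMComment) {
        return false; // First thing is a comment.
    }
    if ($first instanceof DOMText) {
        // HTML space characters: space, tab, LF, FF, CR.
        return strspn($first->data, " \t\n\f\r", 0, 1) === 0;
    }
    // Exception list from the quoted rule.
    $blocked = ['meta', 'link', 'script', 'style', 'template'];

    return !in_array(strtolower($first->nodeName), $blocked, true);
}

$doc = new DOMDocument();
// The text from the report starts with a letter, so the tag may be omitted.
var_dump(canOmitBodyStartTag($doc->createTextNode('Hello, This is a test.')));
// A leading <script> is one of the listed exceptions.
var_dump(canOmitBodyStartTag($doc->createElement('script')));
```

By this reading, the text-first input from the original report falls squarely under the "start tag may be omitted" case.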

goetas · 2019-04-06T07:14:19Z

Can you please post the exact references to the documentation instead of the main links? It is really hard to find the sentences you are referring to.

alecpl · 2019-04-06T07:15:42Z

Look for "Tag omission".

goetas · 2019-04-06T07:19:19Z

The document you posted refers to the latest HTML 5.2 spec. This library implements most of the 5.0 spec.
However, I see that starting to adopt some of the more recent specifications is a good idea, so if you wish to fix this behavior, PRs are welcome.

alecpl · 2019-04-06T07:22:15Z

The old HTML5 documentation is the same in this context:

https://www.w3.org/TR/2014/REC-html5-20141028/semantics.html#the-html-element
https://www.w3.org/TR/2014/REC-html5-20141028/sections.html#the-body-element
https://www.w3.org/TR/2014/REC-html5-20141028/dom.html#element-dfn-tag-omission

alecpl · 2019-04-06T07:22:58Z

Also, don't miss the fact that DOMDocument parses these correctly.

goetas · 2019-04-06T07:23:11Z

Good to know

goetas · 2019-04-06T07:24:18Z

Well, DOMDocument does not follow the HTML5 logic very closely; internally it is just a relaxed XML parser.
DOMDocument is not very aware of the HTML5 specs.

alecpl · 2019-04-06T07:28:06Z

Yeah, the main reason we switched from DOMDocument to this lib was to get better results. In many cases the result is better, but this case obviously looks like a bug. Such "dummy" HTML code is not that uncommon in the email world.

librevlad · 2020-02-06T15:43:22Z

Parsing such chunks of HTML would also be useful when scraping the web, for example when dealing with AJAX responses that contain partials.

ju1ius · 2020-02-24T16:23:23Z

Hi, adding another test case to this issue:

Using the native DOMDocument::loadHTML() implementation:

$doc = new DOMDocument();
$doc->loadHTML('<title>Foo');
echo $doc->saveHTML();

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Foo</title></head></html>

Using this library's implementation:

use Masterminds\HTML5;

// disable_html_ns keeps the elements out of the XHTML namespace
$parser = new HTML5(['disable_html_ns' => true]);
$doc = $parser->loadHTML('<title>Foo');
echo $doc->saveHTML();

<html><title>Foo</title></html>

ju1ius · 2020-02-24T16:45:56Z

Also, citing the spec for tag omission:

Omitting an element's start tag in the situations described below does not mean the element is not present; it is implied, but it is still there. For example, an HTML document always has a root html element, even if the string doesn't appear anywhere in the markup.

This implies that:

1. The result of evaluating `$document->documentElement->tagName` should always be the string `html`
2. The result of evaluating `(new DOMXPath($document))->query('/html/head')->item(0)->tagName` should always be the string `head`
3. The result of evaluating `(new DOMXPath($document))->query('/html/body')->item(0)->tagName` should always be the string `body`
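These three checks can be sketched as one small helper. checkImpliedElements() is a hypothetical name, and the sketch assumes the elements are not namespaced (as with DOMDocument::loadHTML(), or this library with disable_html_ns enabled). It tests the node list length instead of dereferencing item(0), since item(0) is null when nothing matches.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: report which of the three structural invariants
 * hold for a parsed document. Assumes non-namespaced elements.
 */
function checkImpliedElements(DOMDocument $document): array
{
    $xpath = new DOMXPath($document);
    $root = $document->documentElement;

    return [
        'html' => $root !== null && $root->tagName === 'html',
        // query() yields an empty node list when nothing matches,
        // so check its length rather than calling item(0)->tagName.
        'head' => $xpath->query('/html/head')->length > 0,
        'body' => $xpath->query('/html/body')->length > 0,
    ];
}

// Build by hand a tree that mirrors the output reported above:
// <html><title>Foo</title></html>
$doc = new DOMDocument();
$html = $doc->appendChild($doc->createElement('html'));
$html->appendChild($doc->createElement('title', 'Foo'));

print_r(checkImpliedElements($doc));
```

For that hand-built tree the html check passes, while the head and body checks fail, which is the gap described here.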

goetas · 2020-05-21T11:57:55Z

@ju1ius that is a valid point, see my comment #182 (comment) for a possible solution

bytestream · 2020-05-21T18:24:52Z

Using this library's implementation:

$parser = new HTML5(['disable_html_ns' => true]); // Masterminds\HTML5
$doc = $parser->loadHTML('<title>Foo');
echo $doc->saveHTML();

<html><title>Foo</title></html>

<html><title>Foo</title></html> is valid.

Also, citing the spec for tag omission:

Omitting an element's start tag in the situations described below does not mean the element is not present; it is implied, but it is still there. For example, an HTML document always has a root html element, even if the string doesn't appear anywhere in the markup.

This implies that:
1. The result of evaluating `$document->documentElement->tagName` should always be the string `html`
2. The result of evaluating `(new DOMXPath($document))->query('/html/head')->item(0)->tagName` should always be the string `head`
3. The result of evaluating `(new DOMXPath($document))->query('/html/body')->item(0)->tagName` should always be the string `body`

The changes to achieve this are difficult and break several existing tests. Adding those elements means that they will also be output - as far as I'm aware it's not possible to parse but not output them... For starters, the document ends after Foo so you have to handle it here https://github.com/Masterminds/html5-php/blob/master/src/HTML5/Parser/DOMTreeBuilder.php#L570
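One way to sidestep the tree-builder changes, offered purely as an assumption and not as what this library does or should do, is to normalize the DOM after parsing. The sketch below uses a hypothetical ensureHeadAndBody() helper and a deliberately incomplete HEAD_ELEMENTS list. It does not preserve the relative order of loose nodes and existing body content, and the added elements will show up in the serialized output, as noted above.

```php
<?php
declare(strict_types=1);

// Elements treated as metadata in this sketch (an assumption, not spec-complete).
const HEAD_ELEMENTS = ['title', 'meta', 'link', 'style', 'script', 'base'];

/**
 * Hypothetical post-parse fix-up: make sure <html> has <head> and <body>,
 * and move any other direct children of <html> into one of them.
 */
function ensureHeadAndBody(DOMDocument $document): void
{
    $root = $document->documentElement;
    if ($root === null || $root->nodeName !== 'html') {
        return; // Nothing to normalize without an html root.
    }

    $head = null;
    $body = null;
    $loose = [];
    // Snapshot the children first: the node list is live, and moving
    // nodes while iterating it would skip entries.
    foreach (iterator_to_array($root->childNodes) as $child) {
        if ($child->nodeName === 'head') {
            $head = $child;
        } elseif ($child->nodeName === 'body') {
            $body = $child;
        } else {
            $loose[] = $child;
        }
    }

    if ($head === null) {
        $head = $root->insertBefore($document->createElement('head'), $root->firstChild);
    }
    if ($body === null) {
        $body = $root->appendChild($document->createElement('body'));
    }

    foreach ($loose as $node) {
        $target = in_array($node->nodeName, HEAD_ELEMENTS, true) ? $head : $body;
        $target->appendChild($node); // Appending an attached node moves it.
    }
}

// Hand-built stand-in for a parse result with no head or body.
$doc = new DOMDocument();
$html = $doc->appendChild($doc->createElement('html'));
$html->appendChild($doc->createElement('title', 'Foo'));
$html->appendChild($doc->createTextNode('Hello'));

ensureHeadAndBody($doc);
echo $doc->saveHTML();
```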

alecpl mentioned this issue Apr 6, 2019

Email not displaying/being parsed properly roundcube/roundcubemail#6713

Closed

goetas added the bug label Apr 6, 2019

bytestream mentioned this issue Jan 16, 2020

Added masterminds/html5 MyIntervals/emogrifier#831

Closed

bytestream mentioned this issue May 12, 2020

Fixes #166 #182

Closed

bytestream added a commit to bytestream/html5-php that referenced this issue May 21, 2020

Fixed Masterminds#166

dbce3e8

alecpl mentioned this issue Jul 9, 2021

loadHTML drops first TextNode of HTML fragment string #208

Open

goetas pushed a commit to bytestream/html5-php that referenced this issue Jan 11, 2023

Fixed Masterminds#166

4d14e8c
