Invalid parsing result when head/body tag is missing #166

Open
alecpl opened this issue Apr 6, 2019 · 17 comments
@alecpl
Contributor

alecpl commented Apr 6, 2019

Consider this:

<html>Hello, This is a test.<br />Does it work this time?</html>

IMO, this is valid HTML, and DOMDocument parses it correctly. However, the HTML5 parser ignores the first line of text. We're using the loadHTML() method.

Even this one works with DOMDocument:

Hello, This is a test.<br />Does it work this time?

According to Mozilla documentation:

  • html: The start tag may be omitted if the first thing inside the element is not a comment.
  • body: The start tag may be omitted if the first thing inside it is not a space character, comment, <script> element or <style> element.

Reference: roundcube/roundcubemail#6713 (comment)
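A minimal reproduction sketch for the two inputs above. It compares the text recovered by the built-in DOMDocument with the text recovered by html5-php. It assumes html5-php is installed through Composer and that the autoloader has already been loaded; if the `Masterminds\HTML5` class cannot be found, the second column prints NULL. The function names are made up for this example.

```php
<?php
declare(strict_types=1);

// Inputs taken from this report: text directly inside <html>, and a bare fragment.
$samples = [
    '<html>Hello, This is a test.<br />Does it work this time?</html>',
    'Hello, This is a test.<br />Does it work this time?',
];

/** Parse with PHP's built-in libxml-based parser and return the text content. */
function textViaDomDocument(string $html): string
{
    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->documentElement?->textContent ?? '';
}

/** Parse with html5-php if it is autoloadable; return null otherwise. */
function textViaHtml5Php(string $html): ?string
{
    if (!class_exists('Masterminds\\HTML5')) {
        return null; // library not available in this environment
    }
    $parser = new Masterminds\HTML5(['disable_html_ns' => true]);
    $doc = $parser->loadHTML($html);

    return $doc->documentElement?->textContent ?? '';
}

foreach ($samples as $html) {
    printf("input:       %s\n", $html);
    printf("DOMDocument: %s\n", var_export(textViaDomDocument($html), true));
    printf("html5-php:   %s\n\n", var_export(textViaHtml5Php($html), true));
}
```

If the behaviour described in this report is present, the html5-php line for the first input would be missing "Hello, This is a test.".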

@goetas
Member

goetas commented Apr 6, 2019

Can you post the references to the Mozilla documentation about this?

@goetas
Member

goetas commented Apr 6, 2019

https://html5.validator.nu/ says it is not valid

@alecpl
Contributor Author

alecpl commented Apr 6, 2019

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/html
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/body

I tried https://validator.w3.org. It also returns an error, but the error looks strange to me: "Element head is missing a required instance of child element title", while there's no head at all.

@alecpl
Contributor Author

alecpl commented Apr 6, 2019

The official HTML 5.2 documentation says:

  • A head element’s start tag may be omitted if the element is empty, or if the first thing inside the head element is an element.
  • A body element’s start tag may be omitted if the element is empty, or if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a meta, link, script, style, or template element.

@goetas
Member

goetas commented Apr 6, 2019

Can you please post the exact references to the documentation instead of the main links? It is really hard to find the sentences you are referring to.

@alecpl
Contributor Author

alecpl commented Apr 6, 2019

Look for "Tag omission".

@goetas
Member

goetas commented Apr 6, 2019

The document you have posted refers to the latest HTML 5.2 specs. This library implements most of the 5.0 specs.
However, I see that starting to adopt some of the more recent specifications is a good idea, so if you wish to fix this behavior, PRs are welcome.

@alecpl
Contributor Author

alecpl commented Apr 6, 2019

Also, don't miss the fact that DOMDocument parses these correctly.

@goetas
Member

goetas commented Apr 6, 2019

Good to know

goetas added the bug label Apr 6, 2019
@goetas
Member

goetas commented Apr 6, 2019

Well, DOMDocument does not really follow the HTML5 logic; internally it is just a relaxed XML parser.
DOMDocument is not really aware of the HTML5 specs.

@alecpl
Contributor Author

alecpl commented Apr 6, 2019

Yeah, the main reason we switched from DOMDocument to this lib was to get better results. And in many cases the result is better, but this case obviously looks like a bug. Such "dummy" HTML code is not that uncommon in the email world.

@librevlad

librevlad commented Feb 6, 2020

Parsing such chunks of HTML would also be useful when scraping the web and dealing with AJAX responses that contain partials.

@ju1ius

ju1ius commented Feb 24, 2020

Hi, adding another test case to this issue:

Using the native DOMDocument::loadHTML() implementation:

$doc = new DOMDocument();
$doc->loadHTML('<title>Foo');
echo $doc->saveHTML();

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Foo</title></head></html>

Using this library's implementation:

$parser = new HTML(['disable_html_ns' => true]);
$doc = $parser->loadHTML('<title>Foo');
echo $doc->saveHTML();

Output:

<html><title>Foo</title></html>

@ju1ius

ju1ius commented Feb 24, 2020

Also, citing the spec for tag omission:

Omitting an element's start tag in the situations described below does not mean the element is not present; it is implied, but it is still there. For example, an HTML document always has a root html element, even if the string doesn't appear anywhere in the markup.

This implies that:

  1. The result of evaluating $document->documentElement->tagName should always be the string html
  2. The result of evaluating (new DOMXPath($document))->query('/html/head')->item(0)->tagName should always be the string head
  3. The result of evaluating (new DOMXPath($document))->query('/html/body')->item(0)->tagName should always be the string body
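The three expectations above can be checked without risking a fatal error on a missing node. A minimal sketch: `impliedElementsPresent()` is a made-up helper name, and the XPath queries assume the elements are not namespaced (as with the native parser, or with html5-php when 'disable_html_ns' is set).

```php
<?php
declare(strict_types=1);

/**
 * Report which of the implied elements are present in a parsed document.
 * Hypothetical helper for illustration; assumes non-namespaced elements.
 *
 * @return array{html: bool, head: bool, body: bool}
 */
function impliedElementsPresent(DOMDocument $doc): array
{
    $xpath = new DOMXPath($doc);
    $root = $doc->documentElement;

    return [
        'html' => $root !== null && strtolower($root->tagName) === 'html',
        // Use ->length instead of ->item(0)->tagName, which fails on null.
        'head' => $xpath->query('/html/head')->length === 1,
        'body' => $xpath->query('/html/body')->length === 1,
    ];
}

// Example with the native parser and the '<title>Foo' test case.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML('<title>Foo');
libxml_clear_errors();

var_export(impliedElementsPresent($doc));
```

Judging by the saveHTML() output quoted above, the native parser satisfies the first two expectations for this input but not the third, and this library satisfies only the first.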

bytestream mentioned this issue May 12, 2020
@goetas
Member

goetas commented May 21, 2020

@ju1ius that is a valid point, see my comment #182 (comment) for a possible solution

bytestream added a commit to bytestream/html5-php that referenced this issue May 21, 2020
@bytestream
Contributor

> Using this library's implementation:
>
> $parser = new HTML(['disable_html_ns' => true]);
> $doc = $parser->loadHTML('<title>Foo');
> echo $doc->saveHTML();
> <html><title>Foo</title></html>

<html><title>Foo</title></html> is valid.

> Also, citing the spec for tag omission:
>
> Omitting an element's start tag in the situations described below does not mean the element is not present; it is implied, but it is still there. For example, an HTML document always has a root html element, even if the string doesn't appear anywhere in the markup.
>
> This implies that:
>
> 1. The result of evaluating `$document->documentElement->tagName` should always be the string `html`
>
> 2. The result of evaluating `(new DOMXPath($document))->query('/html/head')->item(0)->tagName` should always be the string `head`
>
> 3. The result of evaluating `(new DOMXPath($document))->query('/html/body')->item(0)->tagName` should always be the string `body`

The changes to achieve this are difficult and break several existing tests. Adding those elements means that they will also be output - as far as I'm aware it's not possible to parse but not output them... For starters, the document ends after Foo so you have to handle it here https://github.com/Masterminds/html5-php/blob/master/src/HTML5/Parser/DOMTreeBuilder.php#L570
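Until the tree builder handles this, one possible workaround is to normalize the document after parsing. The sketch below is not part of html5-php and is not the fix discussed here: `ensureHeadAndBody()` is a hypothetical helper that only approximates the spec's insertion rules. Metadata elements seen before any other content go into head, and everything else goes into body.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of html5-php: make sure the root <html>
 * element has a <head> and a <body>, moving loose children into them.
 */
function ensureHeadAndBody(DOMDocument $doc): void
{
    $root = $doc->documentElement;
    if ($root === null || strtolower($root->localName) !== 'html') {
        return; // no html root, nothing to normalize
    }

    // Elements that belong in <head> when they appear before any other content.
    $metadata = ['base', 'link', 'meta', 'script', 'style', 'template', 'title'];

    $head = null;
    $body = null;
    $loose = [];
    // Copy the live node list first, because nodes are moved further down.
    foreach (iterator_to_array($root->childNodes) as $child) {
        $name = $child instanceof DOMElement ? strtolower($child->localName) : null;
        if ($name === 'head' && $head === null) {
            $head = $child;
        } elseif ($name === 'body' && $body === null) {
            $body = $child;
        } else {
            $loose[] = $child;
        }
    }

    // Create missing elements in the same namespace as the root (may be none).
    $create = static function (string $tag) use ($doc, $root): DOMElement {
        return $root->namespaceURI === null
            ? $doc->createElement($tag)
            : $doc->createElementNS($root->namespaceURI, $tag);
    };
    if ($head === null) {
        $head = $root->insertBefore($create('head'), $root->firstChild);
    }
    if ($body === null) {
        $body = $root->appendChild($create('body'));
    }

    $beforeContent = true;
    foreach ($loose as $node) {
        $isBlankText = $node instanceof DOMText && trim($node->data) === '';
        if ($beforeContent && $isBlankText) {
            continue; // leave leading whitespace where it is
        }
        $isMetadata = $node instanceof DOMElement
            && in_array(strtolower($node->localName), $metadata, true);
        if ($beforeContent && $isMetadata) {
            $head->appendChild($node);
        } else {
            $beforeContent = false; // from here on everything is body content
            $body->appendChild($node);
        }
    }
}
```

As noted above, elements added this way are also serialized by saveHTML(), so this only suits callers who want head and body in the output.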

goetas pushed a commit to bytestream/html5-php that referenced this issue Jan 11, 2023