Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #166 #182

Closed
wants to merge 1 commit into from
Closed

Fixes #166 #182

wants to merge 1 commit into from

Conversation

bytestream
Copy link
Contributor

This PR aims to resolve #166. It adds missing <html> <head> and <body> to the input HTML so fragments are parsed correctly.

There's several different solutions:

This class resolves cases that don't work correctly in the above.

It wouldn't mimic DOMDocument behaviour as written here; #166 (comment). That would require adding some context so it knows that <title> should go in <head>. I figure the input HTML is already invalid so it's an edge case that would require a lot more computation to handle.

What do you think?

@@ -152,6 +156,10 @@ public function hasErrors()
*/
public function parse($input, array $options = array())
{
if (isset($options['normalize']) && $options['normalize']) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (! empty($options['normalize']) { would be more concise.

@bytestream
Copy link
Contributor Author

@goetas any thoughts on this?

@goetas
Copy link
Member

goetas commented May 20, 2020

Hi and thanks for working on this.

I'm generally against the feature for the following reasons:

  • this library is not about repairing html, but about parsing decent html5 content into a dom document
  • there is already the possibility to parse a partial html5 document by using the loadHTMLFragment method

@goetas goetas closed this May 20, 2020
@goetas
Copy link
Member

goetas commented May 20, 2020

https://github.com/tgalopin/html-sanitizer is built on top of this library and is meant to sanitize external html

@bytestream
Copy link
Contributor Author

What's the status of the issue marked as bug? If you're against fixing it?

I also think there's so many people building there own implementations to work around this issue that it would benefit from a solution built into the library. That ensures collaboration and a higher quality of implementation to work around theb issue.

I don't think that html-santizer is relevant in the context of the bug.

@bytestream
Copy link
Contributor Author

@goetas below are copied from the issue, so I really think this should be reopened.

A head element’s start tag may be omitted if the element is empty, or if the first thing inside the head element is an element.

A body element’s start tag may be omitted if the element is empty, or if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a meta, link, script, style, or template element.

However I see that starting to adopt some of the more recent specifications is a good idea, so if you wish to fix this behavior, PR are welcome.

@goetas
Copy link
Member

goetas commented May 21, 2020

I understand.

  1. given that the following string is a valid html5 document
 <html>Hello, This is a test.<br />Does it work this time?</html>
  1. give this comment Invalid parsing result when head/body tag is missing #166 (comment)

What we need is to hook into some class in https://github.com/Masterminds/html5-php/tree/master/src/HTML5/Parser and add on the fly the body element if not present.

I'm not against solving this issue if is a valid html5 document.

@bytestream one of the reason for rejecting this PR is also that builds a parser/tokenizer within a library that has already a parser.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Invalid parsing result when head/body tag is missing
3 participants