Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDATA and JavaScript gets mangled when manipulating XHTML #139

Open
dac514 opened this issue May 1, 2018 · 7 comments
Open

CDATA and JavaScript gets mangled when manipulating XHTML #139

dac514 opened this issue May 1, 2018 · 7 comments

Comments

@dac514
Copy link

dac514 commented May 1, 2018

When loading a XHTML document with JavaScript into html5-php I expect the <script>//<![CDATA[ conversion to behave like it does in plain old \DOMDocument. Instead they get turned into htmlentities, breaking the JavaScript on our target platforms (ebook readers).

Expected:

<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
	console.log(i);
}
//]]>
</script>

Actual:

<script>
//&lt;![CDATA[
for ( var i = 0; i &lt; 10; i++ ) {
	console.log(i);
}
//]]&gt;
</script>

< becomes &lt;, > becomes &gt;, ...

Test case:

$body = '<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Test</title>
</head>
<body>
<p>Hi There!</p>
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
	console.log(i);
}
//]]>
</script>
</body>
</html>';

$d1 = new \DOMDocument();
$d1->loadXml( $body );
echo $d1->saveXML();

$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
echo $d2->saveXML();
@goetas goetas added the bug label May 1, 2018
@dac514
Copy link
Author

dac514 commented Aug 22, 2018

I would like to submit a PR but I need clarification on: \Masterminds\HTML5\Parser\EventHandler::text.

    /**
     * A unit of parsed character data.
     *
     * Entities in this text are *already decoded*.
     */
    public function text($cdata);

What does Entities in this text are *already decoded* mean? If it means that $cdata should not be manipulated then changing \Masterminds\HTML5\Parser\DOMTreeBuilder::text from:

// fprintf(STDOUT, "Appending text %s.", $data);
$node = $this->doc->createTextNode($data);
$this->current->appendChild($node);

To:

// fprintf(STDOUT, "Appending text %s.", $data);
$frag = $this->doc->createDocumentFragment();
$frag->appendXML($data);
$this->current->appendChild($frag);

Solves this issue.

DOMDocument->createTextNode() encodes entities. Is this the desired behaviour or not?

Source: https://stackoverflow.com/a/7725965

@goetas
Copy link
Member

goetas commented Aug 23, 2018

The thing is that "script" is a raw text tag, so Tokenizer::rawText() looks directly for </script> instead of looking for a possible cdata sequence.

When serializing back the dom, are you using this lib or the domdocument::save xml method?

@dac514
Copy link
Author

dac514 commented Aug 23, 2018

I am using this lib:

$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
echo $d2->saveXML();

Entities in this text are *already decoded, rawText, .... these methods and comments lead me to believe that if you pass then the result should be , not H&eacute;

My question, worded another way, is what is rawText? The code for rawText calls text and text creates the node using createTextNode which will transform entities.

If this is the case then my proposed fix is no good.

@dac514
Copy link
Author

dac514 commented Aug 23, 2018

I am using this lib:

Ok, duh. No I am not. My mistake. When changing to:

$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
$mm->saveHTML( $d2 );

I get:

<!DOCTYPE html>
<html xml:lang="en"><head>
<title>Test</title>
</head>
<body>
<p>1</p>
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {8
	console.log(i);
}
//]]>
</script>
<p>2</p>
</body>
</html>

Maybe this is #nobug? Thank you for your patience.

@goetas goetas removed the bug label Aug 23, 2018
@goetas
Copy link
Member

goetas commented Aug 23, 2018

Doing

$html = '
<!DOCTYPE html>
<html>
    <head>
        <script>ababc</script>
        <script>ab < abc</script>
        <script>
        // <![CDATA[
            foo <
        // ]]>
        bar
        </script>
    </head>
</html>';

$html5 = new HTML5();
$doc = $html5->loadHTML($html);
var_dump($html5->saveHTML($doc));

I get:

<!DOCTYPE html>
<html>
    <head>
        <script>ababc</script>
        <script>ab < abc</script>
        <script>
        // <![CDATA[
            foo <
        // ]]>
        bar
        </script>
    </head>
</html>

@goetas goetas closed this as completed Aug 23, 2018
@dac514
Copy link
Author

dac514 commented Aug 23, 2018

Back from vacation so excuse the all over the place tangent.

Elsewhere in our code, we have often used $mm->saveHTML( $d2 ) where $mm equals \Masterminds\HTML5

For this specific bug we need to save back as valid XML. We expect similar results to \Masterminds\HTML5\Tests\Parser\DOMTreeBuilderTest::testMoveNonInlineElements

I just tried the proposed fix and, for this specific bug, and it doesn' work for us because <br/> get turned into <br> which breaks our XSLT.

The issue remains. We need

<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
	console.log(i);
}
//]]>
</script>

Preserved in XML saves like \DOMDocument does it.

@goetas
Copy link
Member

goetas commented Aug 23, 2018

$html5->saveHTML is there exactly because the DOM version did not work as expected in many other cases...
Re opening as "feature request"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants