Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In scripts expected to have JSON, ensure the JSON is valid #4340

Merged
merged 20 commits into from
Mar 9, 2020

Conversation

kienstra
Copy link
Contributor

@kienstra kienstra commented Feb 28, 2020

Summary

  • For <script> tags with type="application/json" or
    type="application/ld+json", this ensures that the textContent is valid JSON, and produces a INVALID_JSON_CDATA error if it's not.
  • This doesn't help with Emoji in JSON causes syntax error and AMP validation error #4318. It doesn't prevent wp_staticize_emoji() from making the emojis invalid. But it now reports emojis that have become invalid due to that issue.
  • For example, adding a Custom HTML block with this will now cause an INVALID_JSON_CDATA error in the plugin:
<amp-analytics><script type="application/ld+json">{"anEmoji": "🍫"}</script></amp-analytics>

Testing

  1. Add this to a plugin:
add_action(
	'wp_head',
	static function() {
		?>
		<script type="application/ld+json">{"invalid": ⛷️}</script>
		<script type="application/ld+json">{"valid": "something"}</script>
		<script type="application/ld+json">{"anotherValid":" 🚫 🚫 🚫"}</script>
		<?php
	}
);
  1. Create a new post, and add a Custom HTML block with this:
<amp-analytics><script type="application/json">{"malformed": 🍫}</script>
  1. Go to the front-end, and validate the URL
  2. Expected: there are 2 validation errors:

expected-actual

One should be from the first <script> in the add_action( 'wp_head' ... call above, and one should be from the Custom HTML block.

The bottom two scripts in the add_action( 'wp_head' ... call should not cause a validation error:

<script type="application/ld+json">{"valid": "something"}</script>
<script type="application/ld+json">{"anotherValid":" 🚫 🚫 🚫"}</script>

Fixes #4320

Checklist

  • My pull request is addressing an open issue (please create one otherwise).
  • My code is tested and passes existing tests.
  • My code follows the Engineering Guidelines (updates are often made to the guidelines, check it out periodically).

For scripts with type="application/json" or
type="application/ld+json",
ensure that its content is valid JSON.
@googlebot googlebot added the cla: yes Signed the Google CLA label Feb 28, 2020
@@ -10,6 +10,7 @@
"ext-date": "*",
"ext-dom": "*",
"ext-iconv": "*",
"ext-json": "*",
Copy link
Contributor Author

@kienstra kienstra Feb 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PhpStorm reported the lack of this in composer.json.

It looks like json_decode(), which AMP_HTML_Utils::is_valid_json() calls, is from the json extension.

And Jetpack also uses this in composer.json.

@kienstra
Copy link
Contributor Author

Hi @westonruter,
No hurry, it being Friday afternoon. But could you please review this when you have a chance?

Thanks, Weston!

@@ -94,7 +90,7 @@ jobs:

- name: E2E tests
php: "7.3"
env: WP_VERSION=latest DEV_LIB_SKIP=phpcs,eslint,xmllint,phpsyntax,phpunit PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=
env: WP_VERSION=latest DEV_LIB_SKIP=phpcs,eslint,xmllint,phpsyntax,phpunit
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is like WordPress/gutenberg#20547. It's because of a failed E2E Travis job.

// When the CDATA is expected to be JSON, ensure it's valid JSON.
if ( 'script' === $element->tagName &&
in_array( $element->getAttribute( 'type' ), [ 'application/json', 'application/ld+json' ], true ) &&
null === json_decode( $element->textContent )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite right. What if the JSON being decoded is itself null? This needs to make use of json_last_error() to detect the error scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ea26570 uses json_last_error() instead.

in_array( $element->getAttribute( 'type' ), [ 'application/json', 'application/ld+json' ], true ) &&
null === json_decode( $element->textContent )
) {
return [ 'code' => self::INVALID_JSON_CDATA ];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could include the JSON error code as part of the error data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the new error message added in ea26570. How does that look?

new-error-message

Copy link
Contributor Author

@kienstra kienstra Feb 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most common message would probably be probably 'Syntax error'.

So far, these look to mainly
be 'Invalid syntax'
It looks like json_last_error() will not return
an error if passed a falsy value.
In the Changelog:
"7.0.0	An empty PHP string or value that after casting to string is an empty string (NULL, FALSE) results in JSON syntax error."
https://www.php.net/manual/en/function.json-decode.php
// When the CDATA is expected to be JSON, ensure it's valid JSON.
if ( 'script' === $element->tagName && in_array( $element->getAttribute( 'type' ), [ 'application/json', 'application/ld+json' ], true ) ) {
if ( empty( $element->textContent ) ) {
return [
Copy link
Contributor Author

@kienstra kienstra Feb 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like in PHP 5.6 and earlier, json_last_error( ) returns JSON_ERROR_NONE after json_decode( '' ).

In the Changelog:

7.0.0 | An empty PHP string or value that after casting to string is an empty string (NULL, FALSE) results in JSON syntax error.

These failed unit tests are from that different behavior, I think.

if ( empty( $element->textContent ) ) {
return [
'code' => self::INVALID_JSON_CDATA,
'message' => esc_html__( 'Expected JSON, got an empty value', 'amp' ),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what this looks when the expected JSON is empty():

ms-new-error


// When the CDATA is expected to be JSON, ensure it's valid JSON.
if ( 'script' === $element->tagName && in_array( $element->getAttribute( 'type' ), [ 'application/json', 'application/ld+json' ], true ) ) {
if ( empty( $element->textContent ) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As opposed to empty() should this check if '' === trim( $element->textContent )? Does that also result in JSON_ERROR_NONE?

Copy link
Contributor Author

@kienstra kienstra Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PHP 7.0+, json_decode( trim( ' ' ) ) results in json_last_error() returning JSON_ERROR_SYNTAX.

But your suggestion would probably give a more useful message to the user.

If $element->textContent is ' ', having a message that it's empty is probably more useful than a message about invalid syntax.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 81ea67d

if ( JSON_ERROR_NONE !== json_last_error() ) {
return [
'code' => self::INVALID_JSON_CDATA,
'message' => esc_html( json_last_error_msg() ),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using esc_html() isn't right here, as this is not being printed here. No escaping/sanitization should be necessary here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, applied in 81ea67d

[ AMP_Tag_And_Attribute_Sanitizer::INVALID_JSON_CDATA ],
],
'cdata_malformed_json_with_emojis' => [
'<html><head><meta charset="utf-8"><script type="application/ld+json">{"wrong": 🚧 🚧 }</script></head><body></body></html>',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test isn't quite capturing the original issue observed. This should do it:

Suggested change
'<html><head><meta charset="utf-8"><script type="application/ld+json">{"wrong": 🚧 🚧 }</script></head><body></body></html>',
'<html><head><meta charset="utf-8"><script type="application/ld+json">{"wrong": "<?php echo wp_staticize_emoji( "🚧 🚧" ); ?>"}</script></head><body></body></html>',

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, committed in 5b187d3

if ( empty( $element->textContent ) ) {
return [
'code' => self::INVALID_JSON_CDATA,
'message' => esc_html__( 'Expected JSON, got an empty value', 'amp' ),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'message' => esc_html__( 'Expected JSON, got an empty value', 'amp' ),
'message' => __( 'Expected JSON, got an empty value', 'amp' ),

Escaping should be done at printing not storage. The validation error may be sent back via JSON where HTML entities should not be present.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See note below about how having a more specific error code for this case could be preferable, as the error message could be translated for the user at display time.

json_decode( $element->textContent );
if ( JSON_ERROR_NONE !== json_last_error() ) {
return [
'code' => self::INVALID_JSON_CDATA,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also include the JSON error code (perhaps json_error). Ultimately, this may be preferable than using the message as it would mean the strings could be translated, e.g. using https://www.php.net/manual/en/function.json-last-error-msg.php#117393

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather, each of the error codes could have their own dedicated validation error codes. So instead of just INVALID_JSON_DATA there could be one for each of the constants in https://www.php.net/manual/en/function.json-last-error.php

Then the specific error message could be displayed at runtime rather than being stored in the actual error object, allowing for translations tailored to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, these error codes could be from the constants that json_last_error() returns. I'll apply that later tonight if that's alright.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the specific error message could be displayed at runtime rather than being stored in the actual error object, allowing for translations tailored to the user.

Sure, it'd be good to have actual translations of these error messages.

But do you have in mind passing each of the 10 error messages through __(), like:

private static function get_error_message() {
	...
	case AMP_Tag_And_Attribute_Sanitizer::JSON_ERROR_DEPTH:
		return __( 'The maximum stack depth has been exceeded', 'amp' );
	case AMP_Tag_And_Attribute_Sanitizer::JSON_ERROR_STATE_MISMATCH:
		return __( 'Invalid or malformed JSON', 'amp' );
	etc...

This could get pretty verbose, but maybe I'm not understanding this.

Thanks, Weston!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, something like that. Maybe let the error title be normalized to just “Invalid JSON” for each error code, but then have the message have more elaboration, for example:

diff --git a/includes/validation/class-amp-validation-error-taxonomy.php b/includes/validation/class-amp-validation-error-taxonomy.php
index 6a6158c61..ed6c49e40 100644
--- a/includes/validation/class-amp-validation-error-taxonomy.php
+++ b/includes/validation/class-amp-validation-error-taxonomy.php
@@ -1833,8 +1833,22 @@ class AMP_Validation_Error_Taxonomy {
 					$content .= '</p>';
 				}
 
-				if ( isset( $validation_error['message'] ) ) {
-					$content .= sprintf( '<p>%s</p>', esc_html( $validation_error['message'] ) );
+				$message = null;
+				switch ( $validation_error['code'] ) {
+					case AMP_Tag_And_Attribute_Sanitizer::JSON_ERROR_DEPTH:
+						$message = __( 'The maximum stack depth has been exceeded', 'amp' );
+						break;
+					case AMP_Tag_And_Attribute_Sanitizer::JSON_ERROR_STATE_MISMATCH:
+						$message = __( 'Invalid or malformed JSON', 'amp' );
+						break;
+					// ...
+					default:
+						if ( isset( $validation_error['message'] ) ) {
+							$message = $validation_error['message'];
+						}
+				}
+				if ( $message ) {
+					$content .= sprintf( '<p>%s</p>', esc_html( $message ) );
 				}
 
 				break;
@@ -3099,6 +3113,11 @@ class AMP_Validation_Error_Taxonomy {
 				}
 				return $title;
 
+			case AMP_Tag_And_Attribute_Sanitizer::JSON_ERROR_DEPTH:
+			case AMP_Tag_And_Attribute_Sanitizer::JSON_ERROR_STATE_MISMATCH:
+			// ...
+				return __( 'Invalid JSON', 'amp' );
+
 			default:
 				/* translators: %s error code */
 				return sprintf( __( 'Unknown error (%s)', 'amp' ), $validation_error['code'] );

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, how does 47565f1 look?

JSON_ERROR_EMPTY isn't one of the json_last_error() values, but I made it up for <script> textContent of '' or ' '.

If it's alright, later today I'll add a test for AMP_Validation_Error_Taxonomy::filter_manage_custom_columns() that covers that new switch statement for the message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

47b8e87 adds a test for the error message.

kienstra and others added 5 commits March 3, 2020 19:32
Commit Weston's suggestion for the unit test

Co-Authored-By: Weston Ruter <westonruter@google.com>
This should give a more useful message
than an invalid syntax error.
Pass these to __(),
and only get the message when they're about
to be displayed.
It should appear in the custom
column, based on whether it's
present in the validation error,
or whether it can be found from
the error code, like with JSON errors.
This can have null,
so this should be string|null
@westonruter westonruter added this to the v1.5 milestone Mar 5, 2020
Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test failure for Test_AMP_Validation_Manager::test_has_parameters_passed_by_reference is not caused by the code here, but by a change in core: WordPress/wordpress-develop@f40c6c9#diff-1f96a4b7aed0d895bbbcfc265fcb4fc6

This can be fixed in a separate PR.

$filtered_content = AMP_Validation_Error_Taxonomy::filter_manage_custom_columns( $initial_content, 'error_code', $term_id );

$this->assertStringStartsWith( $initial_content . '<button type="button" aria-label="Toggle error details"', $filtered_content );
$this->assertStringContainsString( $expected_error_message, $filtered_content );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is causing test failures:

PHP Fatal error: Call to undefined method Test_AMP_Validation_Error_Taxonomy::assertStringContainsString() in /tmp/wordpress/src/wp-content/plugins/amp/tests/php/validation/test-class-amp-validation-error-taxonomy.php on line 1199

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for missing that. ae5df63 fixes it.

@westonruter westonruter dismissed their stale review March 5, 2020 05:00

Requested changes have been implemented.

@westonruter
Copy link
Member

The test failure for Test_AMP_Validation_Manager::test_has_parameters_passed_by_reference is not caused by the code here, but by a change in core: WordPress/wordpress-develop@f40c6c9#diff-1f96a4b7aed0d895bbbcfc265fcb4fc6

This is being fixed in #4357.

kienstra and others added 3 commits March 4, 2020 23:50
It looks like assertStringContainsString()
caused a failure,
as Weston mentioned.
…validation-error

* 'develop' of github.com:ampproject/amp-wp: (151 commits)
  Remove need to iterate over nodesToRemove a second time
  Fix static analysis of expiry param
  Improve static analysis of Document::normalizeHtmlAttributes()
  Improve phpdoc, formatting, and return value
  Add curl_getinfo() and curl_errno() to list of required cURL functions
  Remove obsolete Amp\AmpWP\Filter missed in 179b49b
  Expose AMP optimizer errors in HTML comment at end of head
  Omit AmpRuntimeCss transformer if SSR is not enabled
  Add angle brackets to ElementDump output
  Remove dead boilerplate code from AMP_Meta_Sanitizer
  Improve robustness of getting error code from class after 938bbf4
  Fix removal of namespace when deriving error code from class name
  Revert deprecation expectation in helper tests for stylesheet functions
  Capitalize AMP
  Fix tests for meta sanitizer and bug in Dom/Document
  Undeprecate boilerplate stylesheet functions
  Remove node deduplication in meta sanitizer
  Fix validation errors when optimizer is disabled
  Add cache-control support
  Introduce Response object to pass headers together with body for remote requests
  ...
…validation-error

* 'develop' of github.com:ampproject/amp-wp: (58 commits)
  Remove vendor dirs for library if they exist
  Remove dotenv config
  Update package-lock.json
  Update dependency babel-eslint to v10.1.0 (#4344)
  Update dependency @wordpress/e2e-test-utils to v4.3.1 (#4360)
  Update dependency react-dom to v16.13.0 (#4348)
  Update dependency uuid to v7 (#4325)
  Update dependency eslint-plugin-jest to v23.8.2 (#4346)
  Update dependency cross-env to v7.0.2 (#4359)
  Remove now unneeded dotenv dependency
  Remove extraneous closing brace from add_nav_menu_styles
  Used fixed-height layout for iframe
  Add missing allow-popups sandbox for iframe
  Re-add image assets for tests
  Improve phpdoc, method names, and remove unnecessary global namespace qualifier
  Throw Exception instead of Error
  Wait for server to start
  Improve static analysis for passing ownerDocument
  Do capital_K_dangit for TikTok
  Update namespace reference after #4364
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes Signed the Google CLA
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add sanitization of JSON to prevent AMP validation errors
3 participants