Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML schema conforms to v1.1 of the schema spec but not to v1.0 #113

Open
pbiron opened this issue Sep 23, 2022 · 25 comments
Open

XML schema conforms to v1.1 of the schema spec but not to v1.0 #113

pbiron opened this issue Sep 23, 2022 · 25 comments

Comments

@pbiron
Copy link

pbiron commented Sep 23, 2022

The phpcsdoc.xsd schema does not conform to v1.0 of the XML schema spec because of the wildcard in the rulegroup Model Group.

Since the schema has no targetNamespace, the element wildcard in that model group (<xs:any minOccurs="0"/>), as written, violates the Unique Particle Attribution constraint (since the wildcard could potentially match any unqualified element, including those specifically named in the model group).

XML Schema 1.1 loosened UPA so that wildcard is OK as specified provided that a v1.1 schema processor is used.

Since there are not that many v1.1 schema processors out there (especially in PHP, and most IDEs), it would probably be a good idea to modify the schema to conform to v1.0 of the schema spec.

I have some other suggestions for improvements to the schema, but it would be a good idea to have some general discussions around the "goals" of the schema (and validation in general) before jumping into opening a PR.

Note: I came across the schema when I saw WordPress/WordPress-Coding-Standards#2084.

@pbiron
Copy link
Author

pbiron commented Sep 23, 2022

For example, when trying to validate Security/SafeRedirectStandard.xml with Xerces2-j (in schema 1.0 mode), the following error message is output:

[Error] phpcsdocs.xsd:3:17: cos-nonambig: standard and WC[##any] (or elements from their substitution group) violate "Unique Particle Attribution". During validation against this schema, ambiguity would be created for those two particles.

(even when there is no element in the XML instance that could potentially match the wildcard)

@dingo-d
Copy link
Collaborator

dingo-d commented Sep 23, 2022

Hi @pbiron!

Thanks for the comments, I wasn't aware we violated anything (I never got any notice about the schema version when working on it from my IDE). I guess I should have made better research, as I'm not really that knowledgeable about XML schemas 😬

The original discussion was started in this issue. The original PR was also made in the WPCS repo.

What I used as a basis of the XSD file is the default phpcs.xsd schema.

Is there a tool we can use to verify the validity of the schema against a certain XML version spec?

I think @jrfnl won't be against a PR that will make the schema even better 🙂

@jrfnl
Copy link
Member

jrfnl commented Sep 23, 2022

@pbiron Thanks for opening this issue to discuss your concerns.

The use of wildcards in select places is intentional as we didn't want to make the XSD more restrictive than is strictly necessary.

Since there are not that many v1.1 schema processors out there (especially in PHP, and most IDEs), it would probably be a good idea to modify the schema to conform to v1.0 of the schema spec.

This XSD is not intended for processing via PHP. It's intended use is with XMLlint. See the usage instructions in the README.

And we do actually validate the XSD against the W3C specs during CI and the XSD validates against those:
5d491b6

I have some other suggestions for improvements to the schema, but it would be a good idea to have some general discussions around the "goals" of the schema (and validation in general) before jumping into opening a PR.

The target public for this XSD file is very limited: it is a tool which is only interesting for maintainer of sniff libraries.

Suggestions to improve the XSD are definitely welcome, but please don't take offense if any are rejected as, as I state above, the goal is for it to be a helpful tool for sniff developers and not to restrict them more than necessary.

@jrfnl
Copy link
Member

jrfnl commented Sep 23, 2022

As a side-note: the 1.1 version of the specs, which we used during our research when creating the XSD, was published in 2012, so any tooling not compatible with version 1.1 has had 10 years to update already.
Not sure we should accommodate tools which are not actively maintained.

@GaryJones
Copy link
Contributor

Xerces2-j (in schema 1.0 mode)

I've never used Xerces2-j, but according to the docs, it claims to support XML 1.1 processing (other than normalization), so why would you run it in "schema 1.0 mode" for an XML 1.1-specced XSD?

@GaryJones
Copy link
Contributor

Could:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

...be changed to:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema/v1.1" elementFormDefault="qualified">

...as per https://www.w3.org/TR/xmlschema11-1/#langids to explictely indicate that it's written with the XSD 1.1 spec?

@jrfnl
Copy link
Member

jrfnl commented Sep 24, 2022

@GaryJones Sounds like a good improvement to make.

@pbiron
Copy link
Author

pbiron commented Sep 25, 2022

Thanx for all the replies everyone.

My only goal in opening this issue is to make sure that the XML Schema is usable in the widest array of XML schema processors...just as WP tries to be available on the widest array of PHP/webserver environments. Since the XML instances are very simple, I believe there is no need to write a schema that doesn't conform to v1.0 of the schema spec.

As one of the editors of v1.0 of the XML Schema spec (https://www.w3.org/TR/xmlschema-2/) and a member of the XML Schema Working Group that produced v1.1 of the schema spec, XML Schema is something that I am well versed in and am just trying to contribute that knowledge to this project.

Julliette, I agree that using wildcards is a good thing...and I'm not suggesting removing the wildcard.

Generally, when allowing "extension" elements in XML instances governed by an XML schema, it's a good idea to require those extension elements to be in an XML namespace different from the targetNamespace of the schema govering the instances. Since this XML schema does not have a targetNamespace (i.e., the elements are not in any namespace), this means that it would be a good practice to require extension elements to explictly be in a namespace (any namespace). This can easily be acheived by changing the element wildcard from:

<xs:any minOccurs='0'/>

to

<xs:any namespace='##other' minOccurs='0'/>

Not only would this change make it perfectly clear which elements in the XML instances are part of the "standard" and which are "extension" elements, it also means that the element wildcard no longer violates v1.0 UPA...and hence, an XML Schema v1.0 processor is able to validate the instances.

Similarly, I would suggest that the namespace='##other' attribute be added to all the attribute wildcards (that is, requiring that all "extension" attributes also have to explictly be in an XML namespace).

The next change I would make to that element wildcard is to make be "lax" as the attribute wildcards already are, as in:

<xs:any namespace='##other' minOccurs='0' processContents='lax'/>

What does "lax" mean? It means that if the XML schema processor can find a schema for those extension elements, then it will try to validate them against the extension schema, if not then it will silently ignore them (essentially, treat them "as if" they were valid). As is, the element wildcard is strict, meaning that the schema processor must be able to find a schema declaration for it and will try to validate it against that schema...and if no schema can be found or no element declaration can be found for the extension element, then the entire XML instance is marked as invalid.

For example, if someone were to add a <this_is_an_extension_element/> element after a <code_comparison> element, as in:

<?xml version="1.0"?>
<documentation xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="https://phpcsstandards.github.io/PHPCSDevTools/phpcsdocs.xsd"
    title="Safe Redirect"
    >
    <standard>
    <![CDATA[
    wp_safe_redirect() should be used whenever possible to prevent open redirect vulnerabilities. One of the main uses of an open redirect vulnerability is to make phishing attacks more credible. In this case the user sees your (trusted) domain and might get redirected to an attacker controlled website aimed at stealing private information.
    ]]>
    </standard>
    <code_comparison>
        <code title="Valid: Redirect can only go to allowed domains.">
        <![CDATA[
<em>wp_safe_redirect</em>( $location );
        ]]>
        </code>
        <code title="Invalid: Unsafe redirect, can be abused.">
        <![CDATA[
<em>wp_redirect</em>( $location );
        ]]>
        </code>
    </code_comparison>
    <this_is_an_extension_element/>
</documentation>

even a v1.1 schema processor (such as xmllint), would produce a validation error such as:

SafeRedirectStandard.xml:24: element this_is_an_extension_element: Schemas validity error : Element 'this_is_an_extension_element': No matching global element declaration available, but demanded by the strict wildcard.
SafeRedirectStandard.xml fails to validate

With the processContents='lax' attribute added to the element wildcard, then the instance would still validate.

I'll address several of the other comments in separate comments of my own.

@pbiron
Copy link
Author

pbiron commented Sep 25, 2022

Xerces2-j (in schema 1.0 mode)

I've never used Xerces2-j, but according to the docs, it claims to support XML 1.1 processing (other than normalization), so why would you run it in "schema 1.0 mode" for an XML 1.1-specced XSD?

Xerces2-j is the most widely used XML Schema processor out there. In fact, the vast majority of XML tools that are written in Java (which is most of them) use Xerces2-j "under the hood". For instance, PhpStorm does (see https://www.jetbrains.com/help/phpstorm/working-with-xml.html), altho I suspect that that page is out-of-date, since it mentions Xerces 2.11 which did not include support for XML Schema v1.1. Xerces didn't support schema 1.1 until version 2.12.0 (release in late 2018).

Many codebases that use Xerces "under the hood" still don't support schema v1.1 (such as Eclipse, the IDE I use...in some senses, PhpStorm is basically a fork of Eclipse). Why? Because the schema v1.1 support in Xerces 2.12 is only available through it's JAXP interface and it's not always easy for tools to switch from using Xerces' DOM or SAX interfaces.

@pbiron
Copy link
Author

pbiron commented Sep 25, 2022

Could:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

...be changed to:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema/v1.1" elementFormDefault="qualified">

...as per https://www.w3.org/TR/xmlschema11-1/#langids to explictely indicate that it's written with the XSD 1.1 spec?

Unfortunately, the Schema Language Identifiers section of the schema v1.1 spec is unclear when it says:

Sometimes other specifications or Application Programming Interfaces (APIs) need to refer to the XML Schema Definition Language in general, sometimes they need to refer to a specific version of the language, possibly even to a version defined in a superseded draft. To make such references easy and enable consistent identifiers to be used, we provide the following URIs to identify these concepts.

Essentially, the use of the term refer in that section means "when writing documentation you can use this URLs to refer to a specific version of the schema spec" (such as RDF, which uses URLs [actually, IRIs] to uniquely identify a "resource").

As the v1.1 schema spec says in The Schema Namespace

The XML representation of schema components uses a vocabulary identified by the namespace name http://www.w3.org/2001/XMLSchema. For brevity, the text and examples in this specification use the prefix xs: to stand for this namespace; in practice, any prefix can be used.

Note: The namespace for schema documents is unchanged from version 1.0 of this specification [XSD 1.0 2E], because any schema document valid under the rules of version 1.0 has essentially the same ·assessment· semantics under this specification as it did under version 1.0 (Second Edition). There are a few exceptions to this rule, involving errors in version 1.0 of this specification which were not reparable by errata and which have therefore been fixed only in this version of this specification, not in version 1.0.

If you were to change the namespace URI in the schema to http://www.w3.org/XML/XMLSchema/v1.1 and try to validate an instance (e.g., using xmllint) you'd get an error such as:

Schemas parser error : The XML document 'phpcsdocs.xsd' is not a schema document.
WXS schema phpcsdocs.xsd failed to compile

@jrfnl
Copy link
Member

jrfnl commented Sep 25, 2022

@pbiron Sounds like you have some useful improvements in mind. I look forward to a PR.

When working on this, please be aware that the most important aspect of this XSD is that PHP_CodeSniffer should be able to interpret and display the information correctly - this XSD is specifically for the XML files which PHPCS displays when using the --generator=[Text|HTML|Markdown] option.

The current XSD is safeguarded by tests to ensure this and those tests would still need to pass for the PR (and if any additions are made, new tests may need to be added).

My only goal in opening this issue is to make sure that the XML Schema is usable in the widest array of XML schema processors...just as WP tries to be available on the widest array of PHP/webserver environments.

Considering that this toolset is only interesting for a very small group of people, that is not really a big concern for this project.
Having said that, I'm perfectly happy to accept a PR which will make it more widely usable.

@pbiron
Copy link
Author

pbiron commented Sep 25, 2022

I'll have to go through the existing tests to see if any of them will need to be changed. I suspect that at least 2 new tests will need to add to check that any elements/attributes that match the wildcards are explicitly in an XML namespace (one for extension elements, another for extension attributes).

Since I've never used phpcs --generator=[Text|HTML|Markdown], how does that use the these XML instances in producing it's output (so that I can verify that requiring extension elements/attributes be explicitly namespace qualified doesn't cause any problems)?

I just did an informal test of adding some extension elements/attributes to a few of the documentation files included with the installation of WPCS 2.3.0 for one of my plugins and ran phpcs --generator=HTML and as far as I can tell, the HTML output produced is the same as without the extension elements/attributes. Is there a "more formal" way I should verify there will be no problems?

@jrfnl
Copy link
Member

jrfnl commented Sep 25, 2022

I'll have to go through the existing tests to see if any of them will need to be changed. I suspect that at least 2 new tests will need to add to check that any elements/attributes that match the wildcards are explicitly in an XML namespace (one for extension elements, another for extension attributes).

👍🏻

Since I've never used phpcs --generator=[Text|HTML|Markdown], how does that use the these XML instances in producing it's output (so that I can verify that requiring extension elements/attributes be explicitly namespace qualified doesn't cause any problems)?

The code for the generators can be found here: https://github.com/squizlabs/PHP_CodeSniffer/tree/master/src/Generators

To see it in action, try the below commands on the command line:

# For sniff documentation on the command line:
> phpcs --standard=PSR12 --generator=Text

# To generate the same documentation as an HTML page to add to a website:
> phpcs --standard=PSR12 --generator=HTML > psr12.html

# Similar, to generate the same documentation to be published as markdown (like in a GH wiki or a GHPages website):
> phpcs --standard=PSR12 --generator=Markdown > psr12.md

I just did an informal test of adding some extension elements/attributes to a few of the documentation files included with the installation of WPCS 2.3.0 for one of my plugins and ran phpcs --generator=HTML and as far as I can tell, the HTML output produced is the same as without the extension elements/attributes.

For WPCS, there is currently no intention to add extra elements/attributes.

For the PHPCompatibility standard, there is, though IIRC, the limited set of currently available documentation doesn't do so yet.

A typical attribute PHPCompatibility may start using would be minPHP="7.1" (for instance), probably added to the standard element.
A typical extra element PHPCompatibility may start using would be something along the lines of <references> with within that a set of <url>...link here...</url> elements which would allow for linking back to the PHP RFC and PHP manual documentation related to the change the sniff detects.

Does that help ?

Is there a "more formal" way I should verify there will be no problems?

I suppose we could set something up in the test suite to verify that the docs are generated by PHPCS as expected.

This package contains one sniff with (very minimal) documentation which could possibly be used for that. Then again, it would probably be better if PHPCS itself would test the generators they provide (currently not the case, well, only done manually).

@pbiron
Copy link
Author

pbiron commented Sep 26, 2022

The code for the generators can be found here: https://github.com/squizlabs/PHP_CodeSniffer/tree/master/src/Generators

Thanx. Looking at that code, I don't foresee any problems...since it is looking for specific named elements/attributes via DOMDocument::getElementsByTagName() and DOMDocument::getAttribute(), so any additional elements/attributes won't interfere.

And I know you previously said that doing schema validation in PHP was a non-goal, but it might be helpful for PHP_CodeSniffer\Generators\Generator::generate() (and that method in the sub-classes, e.g., HTML::generate(), etc) to do schema validation, since none of that code is written defensively, e.g.,

    public function generate()
    {
        foreach ($this->docFiles as $file) {
            $doc = new \DOMDocument();
            $doc->load($file);
            $documentation = $doc->getElementsByTagName('documentation')->item(0);
            $this->processSniff($documentation);
        }

    }//end generate()

will result in a PHP fatal error in any of $this->docFiles do not contain a <documentation> element, e.g.

PHP Fatal error: Uncaught TypeError: Argument 1 passed to PHP_CodeSniffer\Generators\HTML::processSniff() must be an instance of DOMNode, null given, called in vendor\squizlabs\php_codesniffer\src\Generators\HTML.php on line 38 and defined in vendor\squizlabs\php_codesniffer\src\Generators\HTML.php:191

And DOMDocument::schemaValidate() only supports XML Schema 1.0, so one more reason it will be good if the XML Schema conforms to v1.0 :-) Of course, if the CI is done correctly, then there should be no invalid XML instances...but still, always a good idea to either code defensively OR validate before using the XML.

@jrfnl
Copy link
Member

jrfnl commented Sep 26, 2022

will result in a PHP fatal error in any of $this->docFiles do not contain a <documentation> element

And that is as it should be.

In contrast to WP, in most PHP projects, it is not customary to hide PHP errors.
I'm a huge fan of defensive coding, but a fatal error when the doc file is invalid is the correct result and the doc file should be fixed in that case. That can either be the PHP native fatal or a custom RuntimeException, but it should still throw an error.

@jrfnl
Copy link
Member

jrfnl commented Sep 26, 2022

@pbiron Sorry, I realize you are using WP as a frame of reference, but WP is, in all honesty, one of the worst pieces of code out there and really not a good example of good coding practices, nor of a good coding culture.

@pbiron
Copy link
Author

pbiron commented Sep 26, 2022

A typical attribute PHPCompatibility may start using would be minPHP="7.1" (for instance), probably added to the standard element.
A typical extra element PHPCompatibility may start using would be something along the lines of with within that a set of ...link here... elements which would allow for linking back to the PHP RFC and PHP manual documentation related to the change the sniff detects.

Does that help ?

We'll have to discuss this a little more. Because if I understand what you're saying, I would NOT consider the minPHP attribute nor the reference and url elements to be extension elements. They seem more like optional elements/attributes that will (might be?) added to standard XML schema at some point in the future. (and the generators modified to process them if they exist?).

By extension elements/attributes, I think of elements/attributes added to the XML instances that are intended to be processed by a tool other than phpcs, itself. For example, suppose a company writes a Standard/ruleset that is specific to their company (i.e., not intended for "public" consumption) and they write XML Docs for their sniffs. I can imagine them adding extension elements/attributes that are used by other tools in their build stack.

And it is those "company-specific" elements/attributes that I'm suggesting should be in a company-specific namespace.

@pbiron
Copy link
Author

pbiron commented Sep 26, 2022

will result in a PHP fatal error in any of $this->docFiles do not contain a <documentation> element

And that is as it should be.

In contrast to WP, in most PHP projects, it is not customary to hide PHP errors. I'm a huge fan of defensive coding, but a fatal error when the doc file is invalid is the correct result and the doc file should be fixed in that case. That can either be the PHP native fatal or a custom RuntimeException, but it should still throw an error.

I'm not suggesting hiding PHP errors!! Trust me, I won't be offended if you don't take any of my advice, but I would much rather the generate() method do either:

foreach ($this->docFiles as $file) {
    $doc = new \DOMDocument();
    $doc->load($file);
    if ( ! $doc->schemaValidate( 'phpcsdocs.xsd' ) {
        // output error message that `$file` is not valid and is being skipped.
       continue;
    }

    $documentation = $doc->getElementsByTagName('documentation')->item(0);
    $this->processSniff($documentation);
}

or, at least:

foreach ($this->docFiles as $file) {
    $doc = new \DOMDocument();
    $doc->load($file);
    $documentation = $doc->getElementsByTagName('documentation');
    if ( $documentation->count() < 1 ) {
        // output error message that `$file` does not contain a `documentation` element and is being skipped.
       continue;
    }

    $this->processSniff($documentation->item(0);
}

That is, process what you can and tell the user what you can't process...graceful degradation.

@jrfnl
Copy link
Member

jrfnl commented Sep 26, 2022

We'll have to discuss this a little more.

PHPCS will not add support for additional attributes/elements to the doc generators. This has been discussed in that repo before and Greg is leaning more towards changing the whole docs setup completely, but it is completely unknown to "what" it will be changed.

If and when that happens, I intend to add a tool to convert existing XML based PHPCS docs to whatever the new format is to this tooling repo.

As for external tooling using the docs files... there is a WIP/Proof of concept repo here: https://github.com/PHPCSStandards/phpcs-docs

On the one hand, this PoC may be used to start auto-generating a docs website for PHPCompatibility, combining information in the sniff docblocks with the information/code samples in the XML docs.
There is also an idea to have a generic PHPCS info website, including a searchable index of available sniffs from common and well-maintained external standards and it may be used as a basis for that.

For now, all of that is still up in the air as there's only so many hours in a day,

For reference, here are some links to previous discussions about the docs in the PHPCS repo:

@pbiron
Copy link
Author

pbiron commented Sep 26, 2022

@pbiron Sorry, I realize you are using WP as a frame of reference, but WP is, in all honesty, one of the worst pieces of code out there and really not a good example of good coding practices, nor of a good coding culture.

And, no, I'm not necessarily using WP as a frame of reference. While it's true that for the last 12+ years, 90% of what is do is WP-related, I've been a software engineer for ~30 years. And in fact, long before I'd ever heard of WP, I spent 10+ years writing XML related standards (not only through the W3C, but also for Health Level Seven, an ANSI-accredited SDO that defines clinical messaging standards used throughout the world) and code to process XML for the healthcare industry (including teaching other developers how to write XML processing code).

@jrfnl
Copy link
Member

jrfnl commented Sep 26, 2022

That is, process what you can and tell the user what you can't process...graceful degradation.

Ah, I think we were talking sideways. Yes, I agree your suggestion would be a nice improvement for PHPCS. however there are two caveats:

  1. I don't maintain PHPCS, so the discussion on if and if so, how the generators could be improved should be moved to the PHPCS repo.
  2. PHPCS does not currently have a dependency on this repo and I know from previous discussions that Greg is not keen on adding external dependencies (not that I haven't suggested it), so validating against the XSD in this repo may be a reach and make a PR unlikely to be accepted.

In case you are not aware: the proposal to add an XSD was first proposed to be added to PHPCS itself, but that was rejected (twice), which is why we decided to add it here as we still felt it would be useful to have the XSD available.

@jrfnl
Copy link
Member

jrfnl commented Sep 26, 2022

And, no, I'm not necessarily using WP as a frame of reference.

Apologies, I just got that impression based on some of the references and suggestions you made. No offense intended Paul!

@jrfnl
Copy link
Member

jrfnl commented Oct 14, 2022

👋 Hiya, just checking in whether this is still being worked on ?

@jrfnl
Copy link
Member

jrfnl commented Nov 12, 2023

Some useful improvement suggestions were made in this ticket, but I've yet to see a PR. As it's been over a year, I'm tempted to close this ticket.

@pbiron
Copy link
Author

pbiron commented Nov 12, 2023

thanx for the reminder Juliette...I'd forgotten that I opened this.

I think I might have some time in the next week to put together a preliminary PR.

Tomorrow I'll refresh my memory about everything that's been said on this issue and that'll give me a better idea of when I'm likely to have a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants