Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve WHATWG Url Spec Conformance #802

Open
hyspace opened this issue Jun 11, 2019 · 15 comments
Open

Improve WHATWG Url Spec Conformance #802

hyspace opened this issue Jun 11, 2019 · 15 comments
Milestone

Comments

@hyspace
Copy link
Contributor

hyspace commented Jun 11, 2019

New Feature Proposal

Description

This Proposal contains 2 parts:

  1. Making Url to catch up with latest WHATWG URL Living Standard
  2. Keep tracking failing [web-platform-tests]

Background

After reading the source code, I found that the Url is not aligned with lastest WHATWG URL Living Standard
The difference is trivial, but it is blocking me from using Url directly in my project.

The main difference is that Url is not returning parse failure for some invalid input. In current implementation, it is reflected by the wrong value of IsInvalid property after parsing.

For example,
Port number validation is not returning failure when port number is larger than 65535.

(new Url("http://example.com:65536")).IsInvalid // expect true, but got false

I heard from @FlorianRappl in another thread about this:

I don't see any error here. Ports are currently 16-bit, however, who knows when (or if) this will change. It should be up to the specific requester to then block a request (let's say the port is 99999 - what should we do about it? Just drop the port? note that invalid URLs will result to a default URL -> it will have potential negative side-effects) if its doing something invalid. It's a potential enhancement, but I would regard it as highly optional and I'm fine with omitting it.

I agree it is highly optional for most of the use cases, but for me it is not. I'm trying to get all links from web pages using AngleSharp.Html and trying to log down all invalid links for security related research. I mean "invalid" by not able to be open in browser. If you try the example in browser's address bar, it will redirect you to search engine with the invalid URL as keyword. If you have such link in a HTML file like this:

<a href='http://example.con:65536'>link</a>

you will see it links to empty page.

I don't see any negative side-effects of make this example URL returning parse error here, but I do see that I'm getting wrong result with current Url and I see "potential negative side-effects" of not following the standard. Different than your statement, I found current Browsers is following standard pretty good, and even they fail some test cases, I can find information in Issues like this

I understand that it may not worth it to implement some part of the standard for performance. But instead, developers need to understand which part we are currently not following the Standard and why, so they can make decision easier. Today, I have to find out which part we are not following standard by myself case by case.

My solution will be catch up with "latest living standard as much as possible", and provide the information about which part we are not following the standard by providing information of "failing test cases" in web-platform-tests, and have documentation of why we are not doing it.

Some of the changes will be trivial, like the example of port number validation, and the change will not affect performance at all. Some of the change may be bigger and will have impact on performance, we can do profiling and make decision.

This change may be painful since AngleSharp is focusing on HTML parsing, not URL. But actually URL is highly related and similar libraries in other languages also have their own URL implementation, like jsdom.

@FlorianRappl mentioned "what if port number can have more space in future". If it happens, it will be reflected in the Standard first. WHATWG Standard's principle is provide backward compatibility (so it won't directly increase the range, some extra mark to declare the new range for sure), so we should still be fine for existing implementation.

Today, I believe AngleSharp.Url is already the closest implementation of WHATWG URL Standard, why not making it even better so that more C# developer can use it? More efforts may be needed to maintain the Url Class, but I think it worth it since this is the only option in C# community today.

I talked to .Net community about the need of a WHATWG Standard Uri implementation. They will not consider it in next milestone but maybe in the future. I would like to push them internally within Microsoft as well, so I wish we can have them to maintain the URL implementation in the future.

Specification

I plan to do following changes:

  1. Add test case that takes WPT URL test json file as input and run all test defined there.
    • currently we already have most of them in tests, but not invalid ones.
    • this will make it easier to update the test cases in the future.
    • will mark failing test cases that we are not going to fix and reason why not fixing it.
  2. Fix failing test cases that we want to cover
    • work items already known: Ipv4 / Ipv6, port, Unicode in hash
@FlorianRappl
Copy link
Contributor

FlorianRappl commented Jun 11, 2019

Sounds all fine, but a word of caution regarding the use of "standard": WHATWG is not a standard, but rather browser vendors working on what could become a standard. For URL they are trying to replace the previous RFCs (some details: https://en.wikipedia.org/wiki/WHATWG).

The reason AngleSharp follows WHATWG is because AngleSharp is interested to unlock the same potential of the web that is usually just available to web browsers. As noted URLs play an important role.

The other thing is that I'm glad "maintenance" was mentioned. I don't think it has been fully realized at this point, but WHATWG provides a living standard. Many things change (some just slightly, but even the smallest change may have some impact, e.g., unlocking a new (edge) case). Hence when Url was created several years ago the WHATWG URL spec looked quite a bit different in parts; especially the validation. That does not make AngleSharp "wrong", but rather just following an older version of the released spec. This is very important to keep in mind - and important for anything that follows (e.g., time-frozen URLs should be used to refer to parts of the spec).

Besides proper IDN mapping (which we now have) the only thing missing from the previous spec was IPv4/IPv6 parsing. Again, that does not exclude making improvements or being up-to-date - that was just to put what is there in historical perspective and use it as an example for expectations with this issue (whatever will be done now will also be invalid / outdated in weeks / months / years if its not maintained continuously).

@FlorianRappl FlorianRappl changed the title New feature of [Url]: Better implementation of WHATWG URL Standard and keep tracking of failing web-platform-tests Improve WHATWG Url Spec Conformance Jun 11, 2019
@hyspace
Copy link
Contributor Author

hyspace commented Jun 12, 2019

a word of caution regarding the use of "standard": WHATWG is not a standard, but rather browser vendors working on what could become a standard.

I would like to use term Living Standard to describe WHATWG. Even it is not actual standard, it is factual standard today.

The reason AngleSharp follows WHATWG is because AngleSharp is interested to unlock the same potential of the web that is usually just available to web browsers.

We are looking for a solution follows WHATWG for similar reason. To serve my customer, who is using modern browser today, I need to pay the cost for catching up with the browsers.

when Url was created several years ago the WHATWG URL spec looked quite a bit different in parts; especially the validation. That does not make AngleSharp "wrong", but rather just following an older version of the released spec.

I did not realize this point at beginning. That's why I used bugfix prefix of my PRs. I would like to apologize for it. After working on some changes and looking at the changelog of wpt-tests, also by discussing with you, now I'm aware of it.
I think it is a good chance to catch up with the latest standard, and I'm happy to contribute.

(whatever will be done now will also be invalid / outdated in weeks / months / years if its not maintained continuously).

Being aware of this, I would like to do some change to make it easier to catch up with the living standard in the future. (the test change I proposed)

This is very important to keep in mind - and important for anything that follows (e.g., time-frozen URLs should be used to refer to parts of the spec).

I have this need as well, bu I did not find any time-frozen URLs about the living standard. Do you have any solution about it?

Thanks!

@FlorianRappl
Copy link
Contributor

I have this need as well, bu I did not find any time-frozen URLs about the living standard. Do you have any solution about it?

Not really - there is the Internet Archive and the possibility to link against the Git repo (e.g., here https://github.com/whatwg/url - GitHub allows links also with specific commit hashes).

Maybe we should build up our own mirror of the published spec (and others / specs we use) which could be then linked (i.e., having something like https://anglesharp.github.io/whatwg/url -> snapshot of the URL spec we follow; would make it easy to compare against the real / current thing).

Best thing would be an annotated mirror with references / entry points of uses within AngleSharp / aux libraries.

I feel this will be some effort and we should align / have a good strategy first. I could cover the HTML DOM parts and maybe you can cover URL?

@hyspace
Copy link
Contributor Author

hyspace commented Jun 12, 2019

Not really - there is the Internet Archive and the possibility to link against the Git repo (e.g., here https://github.com/whatwg/url - GitHub allows links also with specific commit hashes).

If we only want to link to the source file, we don't need to mirror the spec. We can use link of source file with commit hash, however it will be inconvenient to read.

We can not get link by mirroring the spec since it seems not using Github pages but deployed somewhere else.

If we want to have a link with particular version, I think we need to change the build file and publish the compiled pages to gh-pages branch after mirroring it.

I feel this will be some effort and we should align / have a good strategy first. I could cover the HTML DOM parts and maybe you can cover URL?

Yes, after we decide the strategy I will cover Url

@hyspace
Copy link
Contributor Author

hyspace commented Jun 14, 2019

Preview of my test snapshot of URL spec:
http://hyspace.moe/url/

Steps I used after fork:

  1. make
  2. rename url.html to index.html
  3. Create clean gh-pages branch following this guide and push:
$ git checkout --orphan gh-pages 
$ git rm -rf --dry-run . # preview files to be deleted
$ git rm -rf . # actually delete the files
$ git add .
$ git commit -m'init'
$ git push origin gh-pages

Problem:

  • make will create a page with "Living Standard — Last Updated 13 June 2019", the date is when build happen

Any idea?

@FlorianRappl
Copy link
Contributor

When you fork (or clone) the existing repository you can also go to any point in time (you have also cloned the full history). Hence running make after a git checkout ... would make sense, e.g.,

git checkout af93f92a7b937ddacfaa7ce8c158a18a83c9c9d7
make
mv url.html index.html
# ...

would check out right when some IPv6 clarifications have been done (Feb 2017). Does that help?

@hyspace
Copy link
Contributor Author

hyspace commented Jun 14, 2019

I implemented the latest IPv6, which should be the current version. Why should we worry about the previous version now?

When you fork (or clone) the existing repository you can also go to any point in time (you have also cloned the full history). Hence running make after a git checkout ... would make sense, e.g.,

I know how to checkout history version. The problem is, when you generate url.html, the "Last Updated" date in that HTML will be the time when you run make, not the time last change is made to the source file. If we want to show correct "Last Updated" date, we need to manually change that date in HTML file.

If you are fine with my solution, please fork the URL standard to AngleSharp/url-standard-snapshot (or the name you preferred) and give me write permission to the forked repo, I will setup the snapshot.

This can not be done with PR since there is no gh-pages branch at the beginning.

I can rename the html file with date so that the the time-frozen URL will like https://anglesharp.github.io/url-standard-snapshot/2019-06.html

After this is done I will refer to this snapshot when implementing ipv4 or update any logic in AngleSharp.Url to catch up with new standard. Also I will refer to the snapshot in comments of code.

@FlorianRappl
Copy link
Contributor

I know how to checkout history version. The problem is, when you generate url.html, the "Last Updated" date in that HTML will be the time when you run make, not the time last change is made to the source file. If we want to show correct "Last Updated" date, we need to manually change that date in HTML file.

Okay now I understand - well I would just run a sed command afterwards to change that date. Since we know it will be set on the current date we can easily use the one from git log at the point we've checked out. You think that may work?

If you are fine with my solution, please fork the URL standard to AngleSharp/url-standard-snapshot (or the name you preferred) and give me write permission to the forked repo, I will setup the snapshot.

Sounds great - I'll do it right away. Thanks for your contribution and ongoing efforts! Much appreciated! 🍻

@FlorianRappl
Copy link
Contributor

You should have received an invite - the new repo can be found at https://github.com/AngleSharp/Specification-Url.

Hope I've set all permissions correctly. Just let me know if there are any problems!

@hyspace
Copy link
Contributor Author

hyspace commented Jun 15, 2019

Setup finished, please enable github pages for that repo:
image

@FlorianRappl
Copy link
Contributor

Your site is ready to be published at http://anglesharp.github.io/Specification-Url/.

It's active and already published! Looks great!

@hyspace
Copy link
Contributor Author

hyspace commented Jun 15, 2019

Although I have finished Ipv6 parsing but I will hold it. I will try to improve test case first, so that we know how many test cases are failing now.

To fix some of the test cases, some structure change may be introduced.
For example, today, WHATWG says we should remove TAB, LF, and CR at beginning of parsing. Current implementation is doing this step inside each of the parsing step, we may move them to the beginning.

Another example is, Ipv4 and Ipv6 serializing need extra property to store the binary format of IP, so we need an Enum to store the host type, like System.Uri did.

I plan to keep track the progress in my fork.

@hyspace
Copy link
Contributor Author

hyspace commented Jun 16, 2019

An early test implementation
hyspace@ca47150

And current result
https://gist.github.com/hyspace/5965e4db6b2b84cc19c40c8440a57bb9

Trying to have better output format

@annevk
Copy link

annevk commented Feb 19, 2020

Stumbled upon this thread via a reference from wpt: https://url.spec.whatwg.org/commit-snapshots/ exists.

@FlorianRappl
Copy link
Contributor

Great URL thanks @annevk !

@FlorianRappl FlorianRappl added this to the v1.0 milestone Apr 8, 2020
@FlorianRappl FlorianRappl modified the milestones: v0.15, v1.0 Apr 22, 2021
@FlorianRappl FlorianRappl modified the milestones: v1.0, vNext Jan 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants