Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opaque Host and domain encoding #220

Open
rubycon opened this issue Feb 21, 2024 · 6 comments
Open

Opaque Host and domain encoding #220

rubycon opened this issue Feb 21, 2024 · 6 comments

Comments

@rubycon
Copy link
Contributor

rubycon commented Feb 21, 2024

What is the issue with the URL Pattern Standard?

Here's a test case from the WPT test suite, it's expected to succeed (Chrome 123 pass).

const up1 = new URLPattern({ hostname : "xn--caf-dma.com" });
up1.exec({ hostname: "café.com" });

The test expect the hostname "café.com" to be converted to "xn--caf-dma.com" and to match the pattern.

But as per the spec, without a special scheme the host is considered an opaque host and not a domain name, therefore it should be percent-encoded to "caf%C3%A9.com" and not IDNA encoded to "xn--caf-dma.com". The pattern won't match.

Is it an oversight in the test suite/Chrome implementation or should hostname always be considered domain name ?

In some case the specs already allows opaque URL to be treated as special (cf. process pathname for init) and default the URL processing to the most common (most expected??) behaviour.

Should it be the case here too ? Or maybe only in some cases... Like forcing the URL to a special scheme when the host is not an address and domain labels are detected ?

@rubycon
Copy link
Contributor Author

rubycon commented Mar 13, 2024

After some more digging, most of the fail tests in this issue #206 are related to this opaque host issue. I think the idea was that these test should (at least) fail when checking forbidden domain code point in the host parser. This checking is only done in the URL parser for non opaque host.

As for the domain encoding issue, all these cases produce URL encoded hostname:

new URLPattern({ hostname : "café.com" }); // hostname pattern string: "caf%C3%A9.com"
new URLPattern({ protocol: "https", hostname : "café.com" }); // hostname pattern string: "caf%C3%A9.com"
new URLPattern("https://café.com" }); // hostname pattern string: "caf%C3%A9.com"

The first one seems logic (except if we consider that every non-adress hostname with unknown protocol should be considered domain) but the other two should be IDNA encoded domain name since they are non opaque/special URLs. It seems everything that goes into the pattern compiler is parsed separately, without context, so the hostname is always considered opaque. We can see the right behaviour when the domain is passed in the base URL :

new URLPattern("", "https://café.com" }); // hostname pattern string: "xn--caf-dma.com"

@jeremyroman Maybe we should pass necessary URL context (the protocol) to the canonicalize a hostname method, for it to be properly interpreted by the URL parser.
And should input without unspecified protocol be force a special protocol ?

@annevk
Copy link
Member

annevk commented Mar 13, 2024

I guess ideally you can get both.

@rubycon
Copy link
Contributor Author

rubycon commented Mar 14, 2024

In fact, it's exactly what canonicalize a port is already doing : passing the protocol as context to normalize default port values.

And as pointed out above process pathname for init already fallback on "special URL" processing when protocol is unspecified.

So it would only make sense hostname are treated the same ;)

@rubycon
Copy link
Contributor Author

rubycon commented Apr 17, 2024

hi @annevk, I finally tried to look into this. It's pretty simple to fix the hostname parsing for regular processing but there's no easy solution to fix it for pattern processing.

The canonicalization functions are passed as callback parameter to the pattern parser. Internally, the pattern parser have no context for what it's currently parsing (scheme, hostname, port...) or what kind of protocol is used for the current URL pattern. So every hostname is always URL encoded.

Without adding URL context to the pattern parser, the easiest (but not so elegant) way would be to URL decode the hostname pattern after pattern parsing and if an empty or special scheme re-encoded it with the host parser. Would that be OK ?

@annevk
Copy link
Member

annevk commented Apr 17, 2024

I wonder if @domenic @wanderview or @jeremyroman have some ideas. I don't currently have the bandwidth to look into this.

@rubycon
Copy link
Contributor Author

rubycon commented Apr 24, 2024

@jeremyroman @domenic I found a decent way to fix this issue :

  • in canonicalize hostname: add a protocolValue param, and set dummyURL.scheme if supplied.
  • add a canonicalize domain name method: a simple wrapper around canonicalize hostname with param protocolValue always set to 'https', to force the host parser in non-opaque mode.
  • in process hostname for init: add a protocolValue param and a condition if special scheme or empty protocol return the result of calling the canonicalize domain name method with hostnameValue.
  • in process URLPatternInit: pass result.protocol as protocolValue to process hostname for init.
  • in URLPattern initialize steps: add a condition for hostname if protocol component matches special scheme or protocolComponent pattern is '*', use the CanonicalizeDomainName as encoding callback.

It encodes properly hostname and hostname patterns. If it's OK for you I can make a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants