Incubation Update - December 2020

Overview

URLPattern is a new web API for matching URLs. It's intended both to provide a convenient API for web developers and to be usable in other web APIs that need to match URLs, such as service workers. The explainer discusses the motivating use cases. There is also a design document that goes into more detail.

Currently, URLPattern is in development in Chromium-based browsers. A specification has not been written yet. Once the initial prototype is complete, we will gather feedback and iterate. When we believe the API is stable, we will codify it in a spec.

URLPattern's integration with service workers is not being worked on yet, as we want to get URLPattern in a solid shippable state first.

As discussed in the explainer and design document, URLPattern is heavily based on Blake Embrey's path-to-regexp library. Various versions of this library are used in such projects as express.js, koa.js, and react-router. By using a popular matching library with existing traction we hope to follow a well lit path and deliver a useful API to web developers.

To that end we have implemented a large portion of path-to-regexp in C++ in an MIT licensed library called liburlpattern. It is currently being developed in the Chromium source tree here. This implementation is based on version 6.2.0 of path-to-regexp. In addition, we plan to add back support for the * wildcard character that was previously in path-to-regexp up to version 1.7.0.

Currently there is a partial implementation of URLPattern available in Chrome Canary. To access it, you must enable the "Experimental Web Platform features" flag. The following discusses the implementation as of version 89.0.4358.0.

Kenneth Rohde Christiansen is also working on a URLPattern polyfill. Note that this polyfill currently implements some features that are not yet implemented in Chromium. There may also be slight differences until the API stabilizes.

If you have questions or feedback, please start a discussion on the WICG/urlpattern repo.

API

The API currently consists of the URLPattern() constructor and two matching methods: test() and exec(). The related Web IDL can be found here.

Constructor

The URLPattern() constructor accepts a single object argument. This dictionary can contain separate patterns for each URL component. For example (see path-to-regexp for pattern syntax):

const p = new URLPattern({
  protocol: 'https',
  username: '',
  password: '',
  hostname: 'example.com',
  port: '',
  pathname: '/foo/:image.jpg',
  search: '(.*)',
  hash: '(.*)',
});

This example is very verbose. Fortunately, it can be simplified. First, if a component is not specified, it defaults to the (.*) wildcard pattern. So we can write:

const p = new URLPattern({
  protocol: 'https',
  username: '',
  password: '',
  hostname: 'example.com',
  port: '',
  pathname: '/foo/:image.jpg',
});

This is still pretty long, though. To further simplify we can provide the fixed origin patterns via a baseURL property.

const p = new URLPattern({
  pathname: '/foo/:image.jpg',
  baseURL: 'https://example.com',
});

So far all of the above code snippets have been equivalent. If it turns out we only really care about finding image file names and don't need strict origin checking, we can even leave the baseURL off completely. This will be a more lenient, different pattern, but it's quite brief:

const p = new URLPattern({ pathname: '/foo/:image.jpg' });
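
The defaulting rules described above can be sketched in plain JavaScript. This is only an illustration of the behavior as this post describes it, not the Chromium implementation: `expandInit` and `COMPONENTS` are hypothetical names, the sketch assumes that baseURL supplies only the origin-related components, and it does not handle relative pathnames.

```javascript
// Component names, as used in the constructor examples above.
const COMPONENTS = [
  'protocol', 'username', 'password', 'hostname',
  'port', 'pathname', 'search', 'hash',
];

// Hypothetical helper (not part of URLPattern): expands a short
// constructor dictionary into the fully spelled-out form.
function expandInit(init) {
  const out = {};

  // Assumption: baseURL contributes fixed patterns for the origin parts.
  if (init.baseURL !== undefined) {
    const base = new URL(init.baseURL);
    out.protocol = base.protocol.replace(/:$/, ''); // 'https:' -> 'https'
    out.username = base.username;
    out.password = base.password;
    out.hostname = base.hostname;
    out.port = base.port;
  }

  for (const name of COMPONENTS) {
    if (init[name] !== undefined) {
      out[name] = init[name];   // explicitly provided pattern wins
    } else if (out[name] === undefined) {
      out[name] = '(.*)';       // unspecified components become wildcards
    }
  }
  return out;
}

const expanded = expandInit({
  pathname: '/foo/:image.jpg',
  baseURL: 'https://example.com',
});
console.log(expanded);
```

Under these assumptions, the expanded dictionary has the same eight component patterns as the first, fully verbose snippet, while `expandInit({ pathname: '/foo/:image.jpg' })` leaves every other component as the `(.*)` wildcard.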

The constructor also supports relative pathnames. For example, the following is equivalent to the first snippet above.

const p = new URLPattern({
  pathname: ':image.jpg',
  baseURL: 'https://example.com/foo/',
});

Currently URLPattern does not perform any encoding or normalization of the patterns. So a developer would need to URL encode Unicode characters before passing the pattern into the constructor. Similarly, the constructor does not do things like flattening pathnames such as /foo/../bar to /bar. For now, the developer must manually write the pattern so that it targets canonical URL output. There is an open question below whether the constructor should try to help more here or not.
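
One way to see what "canonical URL output" looks like for a literal pathname is to run it through the standard URL class. The following is a minimal sketch; `canonicalPathname` is a hypothetical helper and the `https://example.com` base is only a placeholder needed by the parser.

```javascript
// Hypothetical helper: returns the pathname as the URL parser would
// canonicalize it. Only suitable for literal text, not pattern syntax.
function canonicalPathname(path) {
  return new URL(path, 'https://example.com').pathname;
}

// Dot segments are flattened.
console.log(canonicalPathname('/foo/../bar'));     // '/bar'

// Non-ASCII characters are percent-encoded as UTF-8.
console.log(canonicalPathname('/caf\u00e9/menu')); // '/caf%C3%A9/menu'
```

This should only be applied to the literal portions of a pattern. Characters that are also pattern syntax, such as `{`, `}`, and `?`, are treated specially by the URL parser and would be altered.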

Matching

Both the test() and exec() methods take the same input and perform the same matching algorithm. They only differ in their return value. So let's discuss the common parts first.

The matching methods take a single argument consisting of either a string or a dictionary object like the one used in the constructor.

In the case of the string it is parsed as a URL. Each component of the URL is then compared to the URLPattern's component pattern one by one. If they all match, then the entire matching operation succeeds. Otherwise the match fails.
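
The component-by-component comparison can be sketched with the standard URL class and hand-written regular expressions. The regular expressions below are manual approximations of the component patterns from the first constructor example, not output from liburlpattern, and `matchesAll` is a hypothetical name.

```javascript
// Hand-written stand-ins for the compiled component patterns of:
//   protocol 'https', hostname 'example.com', pathname '/foo/:image.jpg',
//   empty username/password/port, wildcard search and hash.
const componentRegExps = {
  protocol: /^https$/,
  username: /^$/,
  password: /^$/,
  hostname: /^example\.com$/,
  port: /^$/,
  pathname: /^\/foo\/[^\/#?]+?\.jpg$/,
  search: /^.*$/,
  hash: /^.*$/,
};

// Hypothetical helper: parse the string as a URL, then require every
// component to match its corresponding pattern.
function matchesAll(input) {
  const url = new URL(input);
  const components = {
    protocol: url.protocol.replace(/:$/, ''),
    username: url.username,
    password: url.password,
    hostname: url.hostname,
    port: url.port,
    pathname: url.pathname,
    search: url.search.replace(/^\?/, ''),
    hash: url.hash.replace(/^#/, ''),
  };
  return Object.keys(componentRegExps).every(
    (name) => componentRegExps[name].test(components[name]),
  );
}

console.log(matchesAll('https://example.com/foo/cat.jpg?x=1'));  // true
console.log(matchesAll('http://example.com/foo/cat.jpg'));       // false
console.log(matchesAll('https://example.com:8080/foo/cat.jpg')); // false
```

The second call fails on the protocol component and the third on the port component, so a single mismatching component is enough to fail the whole match.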

When a dictionary object is provided, the initial URL parsing step is skipped. Instead, the components are individually encoded and normalized per normal URL rules. In addition, baseURL resolution is supported. Any missing components are assumed to be the empty string. The components are then compared against the URLPattern's component patterns one by one. Again, if they all match, then the overall operation succeeds.

To make that a bit more concrete, these two method calls are equivalent:

// string input
p.test('https://example.com/foo/bar');

// dictionary input
p.test({
  pathname: '/foo/bar',
  baseURL: 'https://example.com',
});
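
To illustrate why these two calls are equivalent, here is a rough sketch of how both input forms could be reduced to the same set of components using the standard URL class. The helper names `componentsOf` and `toComponents` are hypothetical, per-component encoding of dictionary values is omitted, and stripping the `:`, `?`, and `#` delimiters is an assumption made to match the style of the constructor examples.

```javascript
// Hypothetical helper: split a parsed URL into bare component strings.
function componentsOf(url) {
  return {
    protocol: url.protocol.replace(/:$/, ''),
    username: url.username,
    password: url.password,
    hostname: url.hostname,
    port: url.port,
    pathname: url.pathname,
    search: url.search.replace(/^\?/, ''),
    hash: url.hash.replace(/^#/, ''),
  };
}

const EMPTY = {
  protocol: '', username: '', password: '', hostname: '',
  port: '', pathname: '', search: '', hash: '',
};

// Hypothetical helper: reduce either input form to components.
function toComponents(input) {
  if (typeof input === 'string') {
    return componentsOf(new URL(input));
  }
  const { baseURL, pathname = '', ...others } = input;
  if (baseURL === undefined) {
    // Missing components are assumed to be the empty string.
    return { ...EMPTY, ...others, pathname };
  }
  // Resolve the pathname against baseURL, then apply explicit components.
  return { ...componentsOf(new URL(pathname, baseURL)), ...others };
}

const fromString = toComponents('https://example.com/foo/bar');
const fromDict = toComponents({
  pathname: '/foo/bar',
  baseURL: 'https://example.com',
});
console.log(JSON.stringify(fromString) === JSON.stringify(fromDict)); // true
```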

The dictionary input, though, is really intended as a convenience for when you only have part of a URL available, such as a pathname without the rest of the URL:

// we only care about the pathname
const p = new URLPattern({ pathname: '/foo/:name' });

// we only provide the pathname
p.test({ pathname: '/foo/bar' });

If the match input is invalid, test() and exec() simply return false instead of throwing. So p.test({ port: 'bad' }) will always return false.

The main difference between test() and exec() is the return value. The test() method returns true or false to indicate if the input matched the pattern.

The exec() method, however, returns a full result object. It contains a property with the input to exec() and then a separate sub-result for each component. Each component result then contains the canonicalized input value for that component and a dictionary of group results. For example:

const p = new URLPattern({ pathname: '(.*)/:image.jpg' });
const r = p.exec('https://example.com/foo/bar/cat.jpg?q=v');

// outputs 'https://example.com/foo/bar/cat.jpg?q=v'
console.log(r.input);

// outputs '/foo/bar/cat.jpg'
console.log(r.pathname.input);

// outputs 'cat'
console.log(r.pathname.groups.image);
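
For readers who want to experiment without the Canary flag, the pathname portion of this example can be approximated with a hand-written regular expression. This is a sketch only: the expression below is a manual approximation of the '(.*)/:image.jpg' pattern, not what liburlpattern generates, and it covers only the pathname component.

```javascript
// Manual approximation of the pathname pattern '(.*)/:image.jpg'.
// The named group stands in for the ':image' parameter.
const pathnameRegExp = /^(.*)\/(?<image>[^\/#?]+?)\.jpg$/;

const url = new URL('https://example.com/foo/bar/cat.jpg?q=v');
const m = pathnameRegExp.exec(url.pathname);

console.log(url.pathname);   // '/foo/bar/cat.jpg'
console.log(m[1]);           // '/foo/bar' (the unnamed (.*) group)
console.log(m.groups.image); // 'cat'
```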

What's Still Left To Do

While the current URLPattern implementation is usable, there are a number of additional features planned.

  • Support for a * wildcard character in patterns. This character would be equivalent to the (.*) regular expression group. This was previously available in path-to-regexp 1.7.0, but was later removed. We would like to support it in URLPattern since many existing URL matching systems in the web platform use * wildcards and this would offer us some compatibility with them. Discussions with the path-to-regexp author suggest it might be possible to upstream this change back to path-to-regexp.
  • Possible support for a "short form" argument to the constructor that supports patterns inline in a URL string. For example, new URLPattern('https://:cdnserver.foo.com/(.*)/:image.jpg'). It's not clear yet how easily we can support this since certain characters, such as :, are both pattern syntax characters and used in URL parsing.
  • Getters to access per-component patterns. For example, p.pathname returning '(.*)/:image.jpg'.
  • Getters to access per-component regular expressions equivalent to matching the component pattern; e.g. p.pathnameRegExp.
  • A toString() method that generates useful output.
  • Better error reporting. For example, if you pass a bad regular expression embedded within a pattern we do not identify which regexp group is the problem, etc.
  • Support a relative URL string argument to test() and exec() with a second argument providing the base URL.
  • A URLPatternList object that would aggregate multiple URLPattern objects.
  • Serialization support so that URLPattern objects can be passed through postMessage() and stored in indexedDB.
  • Additional features identified by answering the open questions below...

Open Questions

There are also a number of open questions to be answered.

  • [Discussion 37] - Currently URLPattern does no automatic encoding or URL canonicalization for patterns. It does, however, perform these operations for test() and exec() input. This means developers must write patterns to target canonical URL output which may mean manually encoding characters, etc. This may be non-ergonomic, particularly for developers working with non-Latin character sets. We could provide some encoding and canonicalization, but it would be uneven. For example, we probably could not automatically encode characters within a custom regular expression grouping. How much, if any, canonicalization to do is an open question.
  • [Discussion 38] - Currently URLPattern uses case-sensitive matching whereas path-to-regexp uses case-insensitive matching by default. Will this cause problems, and do we need to provide an option to control the behavior?
  • [Discussion 39] - Currently URLPattern sets the path-to-regexp "strict" mode flag to true. By default path-to-regexp sets strict mode to false. When strict mode is disabled, the pattern will implicitly allow a trailing slash on a pathname. We have chosen to set strict to true since it's possible to achieve the same effect by appending {/}? to the end of a pattern. Conversely, it's not possible to avoid the automatic slash behavior if strict mode is disabled. Will this cause problems for developers used to path-to-regexp behavior?
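
The case-sensitivity and strict-mode questions above can be made concrete with plain regular expressions. These are hand-written approximations for the pathname pattern '/foo/bar', intended only to show the behavioral difference, not the expressions either library actually generates.

```javascript
// Strict, case-sensitive: the behavior this post describes for URLPattern.
const strict = /^\/foo\/bar$/;

// Approximation of '/foo/bar{/}?': an explicit optional trailing slash.
const optionalSlash = /^\/foo\/bar(?:\/)?$/;

// Approximation of case-insensitive matching (the path-to-regexp default).
const caseInsensitive = /^\/foo\/bar$/i;

console.log(strict.test('/foo/bar'));          // true
console.log(strict.test('/foo/bar/'));         // false
console.log(optionalSlash.test('/foo/bar/'));  // true
console.log(strict.test('/FOO/bar'));          // false
console.log(caseInsensitive.test('/FOO/bar')); // true
```

The optional trailing slash can be opted into from a strict pattern, but a pattern that always allows the slash cannot be made stricter from the outside.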

Questions? Feedback?

If you have questions or feedback, please start a discussion on the WICG/urlpattern repo.

Acknowledgements

Special thanks to Blake Embrey and the other path-to-regexp contributors for building an excellent open source library that so many have found useful.

Also, special thanks to Kenneth Rohde Christiansen for his work on the polyfill. He put in extensive work to adapt to the changing URLPattern API.

URLPattern is the culmination of extensive input and review from many people. Thanks to:

  • Łukasz Anforowicz
  • Jake Archibald
  • Kenji Baheux
  • L. David Baron
  • Joshua Bell
  • Ralph Chelala
  • Kenneth Rohde Christiansen
  • Victor Costan
  • Devlin Cronin
  • Domenic Denicola
  • Blake Embrey
  • Youenn Fablet
  • Matt Falkenhagen
  • Matt Giuca
  • Joe Gregorio
  • Darwin Huang
  • Kenichi Ishibashi
  • Rajesh Jagannathan
  • Cyrus Kasaaian
  • Anne van Kesteren
  • R. Samuel Klatchko
  • Marijn Kruisselbrink
  • Asa Kusuma
  • Michael Landry
  • Sangwhan Moon
  • Daniel Murphy
  • Dominick Ng
  • Kingsley Ngan
  • Jeffrey Posnick
  • Jeremy Roman
  • Alex Russell
  • Jimmy Shen
  • Makoto Shimazu
  • Kinuko Yasuda

Finally, thank you to Domenic Denicola, Kenneth Rohde Christiansen, and Jeremy Roman for reviewing this post.