Add consistent wildcard support #270

HansBrende · 2022-02-01T23:02:50Z

I realize that there is already a PR to add wildcard support. However, that PR is going to add some pretty big inconsistencies between named and anonymous wildcard groups, in a very confusing way. See: whatwg/urlpattern#163

The response over there so far has been "too bad, so sad, who cares."

So I'm taking the initiative to create this PR to at least demonstrate that there is a way forward with wildcard support that actually makes sense.

Why do I care? Because once URLPattern comes out of experimental status, we will be stuck with the choices we make today, FOREVER. If the path-to-regexp library itself chooses the right way forward, then maybe URLPattern will follow suit, since it seems to care a lot about compatibility with this library.

How did I go about making the two behaviors compatible with each other? Simply by not disallowing empty path segments in named wildcard groups, and not requiring a trailing slash in anonymous wildcard groups.

Hopefully my efforts here at least give some people food for thought.

P.S. the pre-commit hook that added 600 lines to this PR is really annoying.

P.P.S. Fixes #214. Also fixes or re-fixes the previously closed issues #228, #212, #196, #194 (for the case where it makes sense), #103, #87, and #37.

Detailed breakdown:

Issue	Pattern	Expectation	This PR	WICG PR	v6.2	v2 - v5	v0.1.7
#214	`/files/:path\.:ext`	/files/my/photo.jpg/gif	✅	❌	❌	✅	💀
#228	`/:everything*`	/	✅	❌	❌	❌	❌
#228	`abc/:everything*`	abc	✅	✅	✅	✅	❌
#212	`#/*`	#/	✅	✅	💀	❌	✅
#196	`*`	/any	✅	✅	💀	❌	✅
#194	`/foo/:bar*`	/foo/test1//test2	✅	❌	❌	❌	✅
#103	`/entity/:id/*`	/entity/foo	✅	❌	💀	❌	❌
#103	`/entity/:id/*`	/entity/foo/	✅	✅	💀	❌	✅
#87	`/test/*`	/test	✅	❌	💀	❌	❌
#37	`/some-path/*`	/some-path/anything	✅	✅	💀	❌	✅

(✅ = match; ❌ = no match; 💀 = parse error; all tests run in strict mode)

…ards to match paths with empty path segments

blakeembrey · 2022-02-11T02:48:57Z

Let me fix the codebase, it's my bad for merging a dependabot PR (which I also had to disable due to sheer volume of noise) which bumped prettier without the concurrent re-formatting. I'll fix that now and you can rebase the changes.

Aside from that, I'm less sure about the actual behavior of /files/:path*\.:ext* and /:everything* passing. Especially the first will be a regression for some users who might be using this in a non-path positions. Same with the ones allowing //, the implication there is that we're allowing empty segments. Does that mean /:test should work with just /?

blakeembrey · 2022-02-11T02:55:12Z

The other thing to keep in mind with * support is backward compatibility with existing scripts. The behavior of /test/* is actually incompatible with old path-to-regexp and incompatible with people migrating from other related syntaxes, such as older specs in web that might just treat * as a simple wildcard. Making it behave as a segment makes it harder for customers to properly migrate. An example in hostname search would be http://*.example.com which clearly does subdomain matching today, but in a world where it acts like a segment the subdomain becomes optional (currently path-to-regexp is only prefix segments, but the example still works if you do foo.*.example.com instead).

Overall the most unfortunate decision in consistent behavior has been the inclusion of magic prefixes which makes all this behavior harder to follow. If it was explicit the confusion would not exist, e.g. foo{.*}.example.com would be an optional domain segment while foo.*.example.com is explicit. The same concept for paths too, e.g. foo/* vs foo{/*}.

blakeembrey · 2022-02-11T04:08:40Z

Also forgot to mention in my initial review but the summary table you created is really awesome regardless of the actual output of the PR 💯

HansBrende · 2022-02-11T05:02:58Z

@blakeembrey thanks for the responses and follow-up questions.

Does that mean /:test should work with just /?

No, I would say definitely not. In the absence of a wildcard character, a variable should always have a non-empty value, IMO. The reason it makes sense to me to allow empty path segments when a wildcard is specified is that in this case, the intention is clear: devs want to match for anything. So any valid path should match... regardless of whether or not it contains empty path segments. (This is particularly useful, fyi, in routing libraries. See, e.g., react-router or wouter for examples of how they are using the same syntax. It is quite unexpected behavior to specify a 404 for anything that doesn't match the previous routes, only to have a blank page instead of your 404 page show up because the actual url typed in by the user happened to contain an empty path segment.)

Especially the first [/files/:path*.:ext*] will be a regression for some users

But in this case my PR is consistent with v2 thru v5, right? So wouldn't that be fixing a regression?

...in a world where it acts like a segment the subdomain becomes optional (currently path-to-regexp is only prefix segments, but the example still works if you do foo.*.example.com instead)

True, however... the behavior of the WICG PR will do a similar thing by allowing foo..example.com to match, which isn't even a valid hostname. If users really want correct behavior in this case they should probably be doing foo.(.+).example.com, or just specify that the only prefix in the config should be "/" and not both "." and "/". Alternatively, I could also simply amend this PR to only do this for forward slashes and not dots, since that's really the use case anyone cares about, and paths & hostnames obviously have different semantics. I agree that abc.* makes slightly less sense to have an optional trailing dot than abc/* with the optional trailing forward slash.

EDIT: and honestly I'm not sure the subdomain issue is much of a problem in the first place, because even though foo.*.example.com "matches" foo.example.com with this PR, the regexp exec result[1] will be undefined rather than empty string. So if that's really important to the developer to have it not match a url missing the subdomain portion (which... I would argue... is probably less likely than the alternative use-case), they could instead check if the result of the match is not undefined; i.e., they could simply switch the logic from pattern.exec(path) != null to pattern.exec(path)?.[1] != null.

EDIT #2: OR simply do foo.{*}.example.com (which basically aligns with existing syntax/behavior anyways).

Thoughts?

blakeembrey · 2022-02-11T06:09:35Z

The reason it makes sense to me to allow empty path segments when a wildcard is specified is that in this case, the intention is clear: devs want to match for anything.

Isn’t that why you’d want the asterisk or (.*) group to make it clear? I don’t completely disagree, but it does make for some changes in other places. E.g. does path matching that makes arrays today leave in the empty holes or remove them? Does the optional delimiter also allow empty?

blakeembrey · 2022-02-11T06:12:17Z

Also something worth keeping in mind is whether we actually care about all these edge cases. In theory if you’re allowing repeated delimiters in one place shouldn’t we just allow it everywhere? But that makes for a more complex regex. And I’ve been of the opinion that the simple solution there is to fix it by normalizing paths at the router level before path matching. Arguably even the trailing slash should be at the router level to normalize it first.

HansBrende · 2022-02-11T14:15:09Z

@blakeembrey

Isn’t that why you’d want the asterisk or (.*) group to make it clear?

This could be done, however, there are a couple reasons why this would pose a major burden:

(1) Most front-end routing libraries (e.g. wouter, react-router) only expose a subset of the path-to-regexp syntax, and that subset does not include custom regexps at all! 99.999% of the routing architecture use-case can be accomplished without custom regexps, so it's a bit awkward to require one solely for the most common use-case in routing.
(2) Subrouting/404-page handling is such a common use-case that I'd guess probably 50% of patterns being matched for in the routing use-case want the equivalent of a trailing wildcard. The cost of making one specify a regexp for this use-case every. single. time. is going to add up.
(3) The wildcard syntax already exists... so devs WILL use it and prefer it over a manual regexp. And they will do so expecting it to mean one thing ("give me everything else!"). And in 99.999% of cases, it will mean that thing, so they won't change it. Except for the one rare edge case where the user mistypes an extra slash and the entire application breaks. In short: the wildcard syntax is SO handy that it would be a tremendous shame for it to fail on 1 odd-ball edge case in literally THE most common use-case for it.

does path matching that makes arrays today leave in the empty holes or remove them?

This is the part I don't really care about too much since I'm doing my own splitting logic when necessary... and only using the path-to-regexp part of the library (usually splitting is not necessary at all since the remaining path represented by the wildcard is either delegated to a subrouter, or sent to a 404 page). I do believe that the question of "how regexp match results are split" should be an entirely separate & independent question from the standardization of the actual generated regexp. That being said, it might make sense to leave in the holes here. Or take them out, either way. My gut tells me to leave them in, that way the original path can be reconstructed by joining the match with the "/" delimiter, rather than being lossy. Since the dev did not specify an overriding pattern, we're simply making the default pattern be (.*) in this case to unify behavior with the unnamed wildcard. Which allows empty string. So empty strings should be valid in the array. (I would think.)

Does the optional delimiter also allow empty?

No. The use-case for the optional quantifier is pretty much 100% across the board either an optional path segment or an optional affix of a single path segment. Therefore, the optional quantifier should not imply a different default pattern than that of a mandatory path segment variable. The use-cases where a dev would want to match a variable or empty string in the former case are basically non-existent, since nobody writes their routes that way (for a very good reason). And if they did, they could simply use 2 patterns: 1 for the non-empty case (e.g. foo/:var/bar), and one for the empty case (e.g. foo//bar). (Alternatively: foo/:var([^/]*)/bar if they didn't want 2 patterns). That way it's explicit that they are expecting a very weird use-case to be handled. With the wildcard, however, the main use-case is that you really don't care about the internal structure of the match at all, if you haven't specified a pattern. Which means that the default pattern for wildcards should be "anything".

And I’ve been of the opinion that the simple solution there is to fix it by normalizing paths at the router level before path matching.

That's not a robust solution though, because sometimes empty path segments actually mean something. (E.g. https://en.wikipedia.org/wiki///). If you normalize prior to putting the URL through the router, then you are arbitrarily deleting information from the URL that might indeed be used by subrouters. (If only to simply display the correct error message to the user, e.g. "404 path suffix foo////bar was not found in the directory!") This poses a particular problem for generic routing libraries, who cannot make the assumption that the user's routes can be normalized without changing the semantics.

HansBrende · 2022-02-12T23:15:43Z

@blakeembrey P.S. -- if you get a chance and haven't already, please do check out the original issue I filed in URLPattern: whatwg/urlpattern#163.

In that issue, I go a lot more in-depth into all the pros & cons that I see here, as well as listing out tables of possible different behaviors. (The behavior I've implemented in this PR is equivalent to "Proposed Behavior 1" in that issue.)

consistent behavior between named & anonymous wildcards + allow wildc…

b107909

…ards to match paths with empty path segments

HansBrende mentioned this pull request Feb 1, 2022

Better asterisk functionality #115

Closed

keep it simple

f8e79ad

HansBrende mentioned this pull request Feb 2, 2022

Multiple parameters suffixed with an asterisk does not match - works with v5/v3 #214

Open

HansBrende added 2 commits February 1, 2022 21:05

simplify further by adding helper function

c6afde4

maximize backwards compatibility w/URLPattern

c2d6c5a

HansBrende mentioned this pull request Feb 3, 2022

Named vs. Anonymous wildcard groups have inconsistent and confusing behavior whatwg/urlpattern#163

Closed

Merge branch 'master' into master

32c2e3c

fix recently added test

60e71e2

Fix 'Package size limit has exceeded by 27 B' error

a0637b7

import-brain added the enhancement label Feb 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add consistent wildcard support #270

Add consistent wildcard support #270

HansBrende commented Feb 1, 2022 •

edited

blakeembrey commented Feb 11, 2022

blakeembrey commented Feb 11, 2022 •

edited

blakeembrey commented Feb 11, 2022

HansBrende commented Feb 11, 2022 •

edited

blakeembrey commented Feb 11, 2022

blakeembrey commented Feb 11, 2022

HansBrende commented Feb 11, 2022 •

edited

HansBrende commented Feb 12, 2022

Add *consistent* wildcard support #270

Are you sure you want to change the base?

Add *consistent* wildcard support #270

Conversation

HansBrende commented Feb 1, 2022 • edited

blakeembrey commented Feb 11, 2022

blakeembrey commented Feb 11, 2022 • edited

blakeembrey commented Feb 11, 2022

HansBrende commented Feb 11, 2022 • edited

blakeembrey commented Feb 11, 2022

blakeembrey commented Feb 11, 2022

HansBrende commented Feb 11, 2022 • edited

HansBrende commented Feb 12, 2022

Add consistent wildcard support #270

Add consistent wildcard support #270

HansBrende commented Feb 1, 2022 •

edited

blakeembrey commented Feb 11, 2022 •

edited

HansBrende commented Feb 11, 2022 •

edited

HansBrende commented Feb 11, 2022 •

edited