Deprecate url.parse API in favor of URL constructor #11

david-luna · 2023-02-16T22:08:39Z

Replace the legacy parse API from url module and use the URL constructor instead. Reasons for this change are:

API is deprecated and it use discouraged in the documentation
Consumers of the package have discovered issues related to the parsing algorithm (example)
- URLs starting with double slash //some/path are not parsed correctly
- URLs with auth like form //user@foo have the same problem

The solution proposed in this PR considers the raw URL as partial like the ones seek on the headers and adding a fake protocol if necessary (to remove later). The function iterates over the URL instance properties to fill a result object since some properties of URL do not accept empty values like protocol.

const url = new URL('https://localhost:8080/path')
console.log(url.protocol) // prints "https:"
url.protocol = '';
console.log(url.protocol) // keeps printing "https:"

Tests have been added in http/https requests to test these specific scenarios.

chore: remove deprecated code

david-luna · 2023-02-16T22:19:08Z

index.js

-  if (!containsProtocol) result.protocol = ''
+  const urlInstance = new URL(containsProtocol ? url : 'invalid://' + url)
+  const result = {}
+  for (const prop in urlInstance) {


Preferred to use for..in loop to not have to maintain a list of URL properties.

I haven't run tests, but I believe you can just pass back the urlInstance here. No need to copy properties over to another object.

Ah, I see that the URL instance is resistant to having the .protocol and .href changed.

> result.href = result.href.replace('invalid://', '') Uncaught TypeError [ERR_INVALID_URL]: Invalid URL at __node_internal_captureLargerStackTrace (node:internal/errors:478:5) at new NodeError (node:internal/errors:387:5) at URL.onParseError (node:internal/url:565:9) at URL.set href [as href] (node:internal/url:755:5) { input: '//my/path', code: 'ERR_INVALID_URL' }

What about pulling the invalid:// string out to a const INVALID_PROTO, and then in the calling code change this:

if (url.protocol) result.protocol = url.protocol

to this:

if (url.protocol !== INVALID_PROTO) result.protocol = url.protocol

Then one should be able to avoid copying the URL properties to a new object.
This could be improved upon a little bit to not be susceptible to user code actually using "invalid://" as a proto.

I just realised other parts of the logic rely on the mutability of the URL object properties to resolve some values like the protocol.

So either we complicate the main logic or we keep the property copy. Performance wise I guess 1st option would be the one to pick.

This could be improved upon a little bit to not be susceptible to user code actually using "invalid://" as a proto.

Do you mean to use a value which is longer and more unlikely to be used by consumers? Something like original-url-protocol: or something of the sorts?

Did some changes following your advice @trentm

changed logic to detect have a placeholder proto and detect it into the main logic. No need to copy properties

add a try/catch logic to fallback to legacy parse if URL throws

Something I've noticed in the tests is that parse method return more properties in the result object like slashes and auth. IMO this is a breaking change and needs to be reflected in the version with a major bump.

trentm · 2023-02-21T23:18:55Z

index.js

-  if (!containsProtocol) result.protocol = ''
+  const urlInstance = new URL(containsProtocol ? url : 'invalid://' + url)
+  const result = {}
+  for (const prop in urlInstance) {


I haven't run tests, but I believe you can just pass back the urlInstance here. No need to copy properties over to another object.

trentm · 2023-02-21T23:21:56Z

index.js

@@ -92,7 +92,14 @@ function getFirstHeader (req, header) {

 function parsePartialURL (url) {
  const containsProtocol = url.indexOf('://') !== -1
-  const result = parseUrl(containsProtocol ? url : 'invalid://' + url)
-  if (!containsProtocol) result.protocol = ''
+  const urlInstance = new URL(containsProtocol ? url : 'invalid://' + url)


This could throw where url.parse would not throw before, e.g.:

> new URL('invalid://@@') Uncaught TypeError [ERR_INVALID_URL]: Invalid URL at __node_internal_captureLargerStackTrace (node:internal/errors:478:5) at new NodeError (node:internal/errors:387:5) at URL.onParseError (node:internal/url:565:9) at new URL (node:internal/url:641:5) { input: 'invalid://@@', code: 'ERR_INVALID_URL' } > url.parse('invalid://@@') Url { protocol: 'invalid:', slashes: true, auth: '@', host: '', port: null, hostname: '', hash: null, search: null, query: null, pathname: null, path: null, href: 'invalid://' }

Good catch!
I think the legacy parse method is much more permissive since it consists on a loop trying to figure out as much as possible from the input string. You can look at the implementation

This means there will be more cases where the WHATWG URL API would simply throw since the input does not conform to its specs.

I guess the best approach to do not drop support is to catch the error and fallback to the legacy so we ensure support to package consumers which make use of this edge cases.

trentm · 2023-02-21T23:47:08Z

index.js

-  if (!containsProtocol) result.protocol = ''
+  const urlInstance = new URL(containsProtocol ? url : 'invalid://' + url)
+  const result = {}
+  for (const prop in urlInstance) {


Ah, I see that the URL instance is resistant to having the .protocol and .href changed.

> result.href = result.href.replace('invalid://', '') Uncaught TypeError [ERR_INVALID_URL]: Invalid URL at __node_internal_captureLargerStackTrace (node:internal/errors:478:5) at new NodeError (node:internal/errors:387:5) at URL.onParseError (node:internal/url:565:9) at URL.set href [as href] (node:internal/url:755:5) { input: '//my/path', code: 'ERR_INVALID_URL' }

What about pulling the invalid:// string out to a const INVALID_PROTO, and then in the calling code change this:

if (url.protocol) result.protocol = url.protocol

to this:

if (url.protocol !== INVALID_PROTO) result.protocol = url.protocol

Then one should be able to avoid copying the URL properties to a new object.
This could be improved upon a little bit to not be susceptible to user code actually using "invalid://" as a proto.

david-luna · 2023-03-02T18:57:40Z

For the record I did run the benchmarks and compare between current implementation and the new one. It is noticeable that WHATWG URL is slower in parsing.

Current implementation with url.parse

Run 0: Benchmark with no headers took: 1.962s
Run 1: Benchmark with no headers took: 1.942s
Run 2: Benchmark with no headers took: 1.929s
Run 3: Benchmark with no headers took: 1.909s
Run 0: Benchmark with Forwarded header took: 2.859s
Run 1: Benchmark with Forwarded header took: 2.761s
Run 2: Benchmark with Forwarded header took: 2.766s
Run 3: Benchmark with Forwarded header took: 2.786s
Run 0: Benchmark with multiple X-Forwarded-Host headers took: 2.297s
Run 1: Benchmark with multiple X-Forwarded-Host headers took: 2.286s
Run 2: Benchmark with multiple X-Forwarded-Host headers took: 2.290s
Run 3: Benchmark with multiple X-Forwarded-Host headers took: 2.504s

New implementation with new URL(...)

Run 0: Benchmark with no headers took: 3.285s
Run 1: Benchmark with no headers took: 2.978s
Run 2: Benchmark with no headers took: 3.060s
Run 3: Benchmark with no headers took: 2.993s
Run 0: Benchmark with Forwarded header took: 4.258s
Run 1: Benchmark with Forwarded header took: 4.225s
Run 2: Benchmark with Forwarded header took: 4.334s
Run 3: Benchmark with Forwarded header took: 4.519s
Run 0: Benchmark with multiple X-Forwarded-Host headers took: 3.610s
Run 1: Benchmark with multiple X-Forwarded-Host headers took: 3.538s
Run 2: Benchmark with multiple X-Forwarded-Host headers took: 3.438s
Run 3: Benchmark with multiple X-Forwarded-Host headers took: 3.469s

styfle · 2023-03-02T19:04:49Z

Which version of Node.js did you benchmark?

There were some recent performance improvements that landed 3 weeks ago in v19.7.0:

david-luna · 2023-03-02T22:33:38Z

Which version of Node.js did you benchmark?

There were some recent performance improvements that landed 3 weeks ago in v19.7.0:
* [Investigate slow `new URL` nodejs/performance#33](https://github.com/nodejs/performance/issues/33)

* [deps: replace url parser with Ada nodejs/node#46410](https://github.com/nodejs/node/pull/46410)

Node version used for benchmarks was v16.13.1. It's clear the improvements in v19.7.0 but still a bit higher than url.parse

Run 0: Benchmark with no headers took: 2.705s
Run 1: Benchmark with no headers took: 2.647s
Run 2: Benchmark with no headers took: 2.647s
Run 3: Benchmark with no headers took: 2.615s
Run 0: Benchmark with Forwarded header took: 3.305s
Run 1: Benchmark with Forwarded header took: 3.369s
Run 2: Benchmark with Forwarded header took: 3.460s
Run 3: Benchmark with Forwarded header took: 3.586s
Run 0: Benchmark with multiple X-Forwarded-Host headers took: 2.953s
Run 1: Benchmark with multiple X-Forwarded-Host headers took: 2.884s
Run 2: Benchmark with multiple X-Forwarded-Host headers took: 2.794s
Run 3: Benchmark with multiple X-Forwarded-Host headers took: 2.817s

trentm · 2023-03-02T22:35:07Z

[I'm pulling this out of the review comment so the conversation doesn't get broken up and more easily lost.]

Do you mean to use a value which is longer and more unlikely to be used by consumers? Something like original-url-protocol: or something of the sorts?

Yes, that is what I had meant. It isn't very satisfying, however. :)

In the nodejs-land issue that @styfle referred to, there was mention of a separate https://www.npmjs.com/package/request-target package for handling parsing a req.url. Perhaps it would be preferable to use that (or logic similar to it, but not url.parse()) for parsing the input req.url? I haven't dug into it.

The parsing of the other headers (Host, Forwarded, et al) could still perhaps use new URL with a try-catch guard, assuming those headers are specified to have, or typically have complete URLs.

... IMO this is a breaking change and needs to be reflected in the version with a major bump.

Yes, agreed. I'm not opposed to a new major rev for this package. It would then be nice to then get specific about what fields of the returned object are promised -- as opposed to the currently looser "The result object will also contain any other property normally returned by the Node.js core url.parse function."

david-luna · 2023-03-13T18:46:58Z

I've checked https://www.npmjs.com/package/request-target package and it's using a set of RegExp to match the string extracted from req.url. The expressions are derived from the syntax specified in RFC3986. I think we can add it as a dependency.

WATHWG URL and request-target are stricter than url.parse(). When the URL string is invalid the former throws and the later returns null. I think we should unify this but not sure which option is best. With this set of changes we will do breaking changes anyway but I wonder what would be better for package consumers.

david-luna · 2023-03-16T23:12:32Z

Created a gist to compare different scenarios and how different parsers (url.parse, new URL and request-target) behave. The most restrictive is request-target and url.parse seems to accept anything.

Parse results for /

method	protocol	host	pathname
request-target	http:		/
new-url	N/A	N/A	N/A
url-parse			/

Parse results for /some/path

method	protocol	host	pathname
request-target	http:		/some/path
new-url	http:	some	/path
url-parse			/some/path

Parse results for //some/path

method	protocol	host	pathname
request-target	N/A	N/A	N/A
new-url	http:	some	/path
url-parse			//some/path

Parse results for /user@pass

method	protocol	host	pathname
request-target	http:		/user@pass
new-url	http:	pass	/
url-parse			/user@pass

Parse results for //user@pass

method	protocol	host	pathname
request-target	N/A	N/A	N/A
new-url	http:	pass	/
url-parse			//user@pass

Parse results for /some/path?key=value

method	protocol	host	pathname
request-target	http:		/some/path
new-url	http:	some	/path
url-parse			/some/path

Parse results for @@

method	protocol	host	pathname
request-target	N/A	N/A	N/A
new-url	N/A	N/A	N/A
url-parse			@@

Parse results for ///

method	protocol	host	pathname
request-target	N/A	N/A	N/A
new-url	N/A	N/A	N/A
url-parse			///

Urls parsed by method

request-target	new-url	url-parse	total
4	5	8	8

david-luna · 2023-03-21T23:33:56Z

Using a non standard protocol in new URL() constructor changes the parsing result to a better detection of paths (starting with / and //) increasing the number of URLs being parsed without throwing. New results, in this batch we used protocol: as a prefix for URL constructor parameter.

Parse results for /

method	protocol	pathname
request-target	http:	/
new-url	protocol:	/
url-parse		/

Parse results for /some/path

method	protocol	pathname
request-target	http:	/some/path
new-url	protocol:	/some/path
url-parse		/some/path

Parse results for //some/path

method	protocol	host	pathname
request-target	N/A	N/A	N/A
new-url	protocol:		//some/path
url-parse			//some/path

Parse results for /user@pass

method	protocol	pathname
request-target	http:	/user@pass
new-url	protocol:	/user@pass
url-parse		/user@pass

Parse results for //user@pass

method	protocol	host	pathname
request-target	N/A	N/A	N/A
new-url	protocol:		//user@pass
url-parse			//user@pass

Parse results for /some/path?key=value

method	protocol	pathname
request-target	http:	/some/path
new-url	protocol:	/some/path
url-parse		/some/path

Parse results for @@

method	protocol	host	pathname
request-target	N/A	N/A	N/A
new-url	N/A	N/A	N/A
url-parse			@@

Parse results for ///

method	protocol	host	pathname
request-target	N/A	N/A	N/A
new-url	protocol:		///
url-parse			///

Urls parsed by method

request-target	new-url	url-parse	total
4	7	8	8

david-luna · 2023-03-21T23:39:11Z

The batches show that there are edge cases where url.parse still gives a result but we can cover most of them with request-target and new URL(...) being the former faster than the later. IMHO a good approach would be

use request-target
fallback to new URL(...)
return null if both failed to parse

david-luna · 2023-03-28T21:49:20Z

@watson since this a breaking change I would like to have your feedback on this.

As a summary url.parse is very permissive and other parsing implementations like new URL or request-target do not allow certain url strings if they do not conform to whatwg spec or RFC 3986

This means if we replace url.parse there will be some edge cases we might get a parsing error where it was working before. So the question would be if its okay to be stricter on the parsing and, if so, which is the preferred behavior when an invalid URL string is given, to throw or return null/void.

david-luna added 4 commits February 16, 2023 22:13

chore: use URL constructor to parse url strings

603fd51

chore: remove deprecated code

chore: add tests for double slash and auth like urls

363dda7

chore: use for..in instead of list of properties

279a5fa

chohre: remove log

1fbc6b3

david-luna commented Feb 16, 2023

View reviewed changes

trentm reviewed Feb 21, 2023

View reviewed changes

styfle mentioned this pull request Feb 23, 2023

URL's on incoming requests nodejs/web-server-frameworks#71

Open

fix: fix edge case which makes URL throw

fdad8c3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deprecate url.parse API in favor of URL constructor #11

Deprecate url.parse API in favor of URL constructor #11

david-luna commented Feb 16, 2023 •

edited

david-luna Feb 16, 2023

trentm Feb 21, 2023

trentm Feb 21, 2023

david-luna Feb 28, 2023

david-luna Mar 2, 2023

trentm Feb 21, 2023

trentm Feb 21, 2023

david-luna Feb 27, 2023

trentm Feb 21, 2023

david-luna commented Mar 2, 2023

styfle commented Mar 2, 2023 •

edited

david-luna commented Mar 2, 2023

trentm commented Mar 2, 2023

david-luna commented Mar 13, 2023

david-luna commented Mar 16, 2023 •

edited

david-luna commented Mar 21, 2023 •

edited

david-luna commented Mar 21, 2023

david-luna commented Mar 28, 2023

Deprecate url.parse API in favor of URL constructor #11

Are you sure you want to change the base?

Deprecate url.parse API in favor of URL constructor #11

Conversation

david-luna commented Feb 16, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

david-luna commented Mar 2, 2023

styfle commented Mar 2, 2023 • edited

david-luna commented Mar 2, 2023

trentm commented Mar 2, 2023

david-luna commented Mar 13, 2023

david-luna commented Mar 16, 2023 • edited

david-luna commented Mar 21, 2023 • edited

david-luna commented Mar 21, 2023

david-luna commented Mar 28, 2023

david-luna commented Feb 16, 2023 •

edited

styfle commented Mar 2, 2023 •

edited

david-luna commented Mar 16, 2023 •

edited

david-luna commented Mar 21, 2023 •

edited