Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate url.parse API in favor of URL constructor #11

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

david-luna
Copy link

@david-luna david-luna commented Feb 16, 2023

Replace the legacy parse API from url module and use the URL constructor instead. Reasons for this change are:

  • API is deprecated and it use discouraged in the documentation
  • Consumers of the package have discovered issues related to the parsing algorithm (example)
    • URLs starting with double slash //some/path are not parsed correctly
    • URLs with auth like form //user@foo have the same problem

The solution proposed in this PR considers the raw URL as partial like the ones seek on the headers and adding a fake protocol if necessary (to remove later). The function iterates over the URL instance properties to fill a result object since some properties of URL do not accept empty values like protocol.

const url = new URL('https://localhost:8080/path')
console.log(url.protocol) // prints "https:"
url.protocol = '';
console.log(url.protocol) // keeps printing "https:"

Tests have been added in http/https requests to test these specific scenarios.

index.js Outdated
if (!containsProtocol) result.protocol = ''
const urlInstance = new URL(containsProtocol ? url : 'invalid://' + url)
const result = {}
for (const prop in urlInstance) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preferred to use for..in loop to not have to maintain a list of URL properties.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't run tests, but I believe you can just pass back the urlInstance here. No need to copy properties over to another object.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see that the URL instance is resistant to having the .protocol and .href changed.

> result.href = result.href.replace('invalid://', '')
Uncaught TypeError [ERR_INVALID_URL]: Invalid URL
    at __node_internal_captureLargerStackTrace (node:internal/errors:478:5)
    at new NodeError (node:internal/errors:387:5)
    at URL.onParseError (node:internal/url:565:9)
    at URL.set href [as href] (node:internal/url:755:5) {
  input: '//my/path',
  code: 'ERR_INVALID_URL'
}

What about pulling the invalid:// string out to a const INVALID_PROTO, and then in the calling code change this:

if (url.protocol) result.protocol = url.protocol

to this:

if (url.protocol !== INVALID_PROTO) result.protocol = url.protocol

Then one should be able to avoid copying the URL properties to a new object.
This could be improved upon a little bit to not be susceptible to user code actually using "invalid://" as a proto.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised other parts of the logic rely on the mutability of the URL object properties to resolve some values like the protocol.

So either we complicate the main logic or we keep the property copy. Performance wise I guess 1st option would be the one to pick.

This could be improved upon a little bit to not be susceptible to user code actually using "invalid://" as a proto.

Do you mean to use a value which is longer and more unlikely to be used by consumers? Something like original-url-protocol: or something of the sorts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some changes following your advice @trentm

  • changed logic to detect have a placeholder proto and detect it into the main logic. No need to copy properties
  • add a try/catch logic to fallback to legacy parse if URL throws

Something I've noticed in the tests is that parse method return more properties in the result object like slashes and auth. IMO this is a breaking change and needs to be reflected in the version with a major bump.

index.js Outdated
if (!containsProtocol) result.protocol = ''
const urlInstance = new URL(containsProtocol ? url : 'invalid://' + url)
const result = {}
for (const prop in urlInstance) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't run tests, but I believe you can just pass back the urlInstance here. No need to copy properties over to another object.

index.js Outdated
@@ -92,7 +92,14 @@ function getFirstHeader (req, header) {

function parsePartialURL (url) {
const containsProtocol = url.indexOf('://') !== -1
const result = parseUrl(containsProtocol ? url : 'invalid://' + url)
if (!containsProtocol) result.protocol = ''
const urlInstance = new URL(containsProtocol ? url : 'invalid://' + url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could throw where url.parse would not throw before, e.g.:

> new URL('invalid://@@')
Uncaught TypeError [ERR_INVALID_URL]: Invalid URL
    at __node_internal_captureLargerStackTrace (node:internal/errors:478:5)
    at new NodeError (node:internal/errors:387:5)
    at URL.onParseError (node:internal/url:565:9)
    at new URL (node:internal/url:641:5) {
  input: 'invalid://@@',
  code: 'ERR_INVALID_URL'
}
> url.parse('invalid://@@')
Url {
  protocol: 'invalid:',
  slashes: true,
  auth: '@',
  host: '',
  port: null,
  hostname: '',
  hash: null,
  search: null,
  query: null,
  pathname: null,
  path: null,
  href: 'invalid://'
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
I think the legacy parse method is much more permissive since it consists on a loop trying to figure out as much as possible from the input string. You can look at the implementation

This means there will be more cases where the WHATWG URL API would simply throw since the input does not conform to its specs.

I guess the best approach to do not drop support is to catch the error and fallback to the legacy so we ensure support to package consumers which make use of this edge cases.

index.js Outdated
if (!containsProtocol) result.protocol = ''
const urlInstance = new URL(containsProtocol ? url : 'invalid://' + url)
const result = {}
for (const prop in urlInstance) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see that the URL instance is resistant to having the .protocol and .href changed.

> result.href = result.href.replace('invalid://', '')
Uncaught TypeError [ERR_INVALID_URL]: Invalid URL
    at __node_internal_captureLargerStackTrace (node:internal/errors:478:5)
    at new NodeError (node:internal/errors:387:5)
    at URL.onParseError (node:internal/url:565:9)
    at URL.set href [as href] (node:internal/url:755:5) {
  input: '//my/path',
  code: 'ERR_INVALID_URL'
}

What about pulling the invalid:// string out to a const INVALID_PROTO, and then in the calling code change this:

if (url.protocol) result.protocol = url.protocol

to this:

if (url.protocol !== INVALID_PROTO) result.protocol = url.protocol

Then one should be able to avoid copying the URL properties to a new object.
This could be improved upon a little bit to not be susceptible to user code actually using "invalid://" as a proto.

@david-luna
Copy link
Author

For the record I did run the benchmarks and compare between current implementation and the new one. It is noticeable that WHATWG URL is slower in parsing.

Current implementation with url.parse

Run 0: Benchmark with no headers took: 1.962s
Run 1: Benchmark with no headers took: 1.942s
Run 2: Benchmark with no headers took: 1.929s
Run 3: Benchmark with no headers took: 1.909s
Run 0: Benchmark with Forwarded header took: 2.859s
Run 1: Benchmark with Forwarded header took: 2.761s
Run 2: Benchmark with Forwarded header took: 2.766s
Run 3: Benchmark with Forwarded header took: 2.786s
Run 0: Benchmark with multiple X-Forwarded-Host headers took: 2.297s
Run 1: Benchmark with multiple X-Forwarded-Host headers took: 2.286s
Run 2: Benchmark with multiple X-Forwarded-Host headers took: 2.290s
Run 3: Benchmark with multiple X-Forwarded-Host headers took: 2.504s

New implementation with new URL(...)

Run 0: Benchmark with no headers took: 3.285s
Run 1: Benchmark with no headers took: 2.978s
Run 2: Benchmark with no headers took: 3.060s
Run 3: Benchmark with no headers took: 2.993s
Run 0: Benchmark with Forwarded header took: 4.258s
Run 1: Benchmark with Forwarded header took: 4.225s
Run 2: Benchmark with Forwarded header took: 4.334s
Run 3: Benchmark with Forwarded header took: 4.519s
Run 0: Benchmark with multiple X-Forwarded-Host headers took: 3.610s
Run 1: Benchmark with multiple X-Forwarded-Host headers took: 3.538s
Run 2: Benchmark with multiple X-Forwarded-Host headers took: 3.438s
Run 3: Benchmark with multiple X-Forwarded-Host headers took: 3.469s

@styfle
Copy link

styfle commented Mar 2, 2023

Which version of Node.js did you benchmark?

There were some recent performance improvements that landed 3 weeks ago in v19.7.0:

@david-luna
Copy link
Author

Which version of Node.js did you benchmark?

There were some recent performance improvements that landed 3 weeks ago in v19.7.0:

* [Investigate slow `new URL` nodejs/performance#33](https://github.com/nodejs/performance/issues/33)

* [deps: replace url parser with Ada nodejs/node#46410](https://github.com/nodejs/node/pull/46410)

Node version used for benchmarks was v16.13.1. It's clear the improvements in v19.7.0 but still a bit higher than url.parse

Run 0: Benchmark with no headers took: 2.705s
Run 1: Benchmark with no headers took: 2.647s
Run 2: Benchmark with no headers took: 2.647s
Run 3: Benchmark with no headers took: 2.615s
Run 0: Benchmark with Forwarded header took: 3.305s
Run 1: Benchmark with Forwarded header took: 3.369s
Run 2: Benchmark with Forwarded header took: 3.460s
Run 3: Benchmark with Forwarded header took: 3.586s
Run 0: Benchmark with multiple X-Forwarded-Host headers took: 2.953s
Run 1: Benchmark with multiple X-Forwarded-Host headers took: 2.884s
Run 2: Benchmark with multiple X-Forwarded-Host headers took: 2.794s
Run 3: Benchmark with multiple X-Forwarded-Host headers took: 2.817s

@trentm
Copy link
Collaborator

trentm commented Mar 2, 2023

[I'm pulling this out of the review comment so the conversation doesn't get broken up and more easily lost.]

Do you mean to use a value which is longer and more unlikely to be used by consumers? Something like original-url-protocol: or something of the sorts?

Yes, that is what I had meant. It isn't very satisfying, however. :)

In the nodejs-land issue that @styfle referred to, there was mention of a separate https://www.npmjs.com/package/request-target package for handling parsing a req.url. Perhaps it would be preferable to use that (or logic similar to it, but not url.parse()) for parsing the input req.url? I haven't dug into it.

The parsing of the other headers (Host, Forwarded, et al) could still perhaps use new URL with a try-catch guard, assuming those headers are specified to have, or typically have complete URLs.

... IMO this is a breaking change and needs to be reflected in the version with a major bump.

Yes, agreed. I'm not opposed to a new major rev for this package. It would then be nice to then get specific about what fields of the returned object are promised -- as opposed to the currently looser "The result object will also contain any other property normally returned by the Node.js core url.parse function."

@david-luna
Copy link
Author

I've checked https://www.npmjs.com/package/request-target package and it's using a set of RegExp to match the string extracted from req.url. The expressions are derived from the syntax specified in RFC3986. I think we can add it as a dependency.

WATHWG URL and request-target are stricter than url.parse(). When the URL string is invalid the former throws and the later returns null. I think we should unify this but not sure which option is best. With this set of changes we will do breaking changes anyway but I wonder what would be better for package consumers.

@david-luna
Copy link
Author

david-luna commented Mar 16, 2023

Created a gist to compare different scenarios and how different parsers (url.parse, new URL and request-target) behave. The most restrictive is request-target and url.parse seems to accept anything.

Parse results for /

method protocol host pathname
request-target http: /
new-url N/A N/A N/A
url-parse /

Parse results for /some/path

method protocol host pathname
request-target http: /some/path
new-url http: some /path
url-parse /some/path

Parse results for //some/path

method protocol host pathname
request-target N/A N/A N/A
new-url http: some /path
url-parse //some/path

Parse results for /user@pass

method protocol host pathname
request-target http: /user@pass
new-url http: pass /
url-parse /user@pass

Parse results for //user@pass

method protocol host pathname
request-target N/A N/A N/A
new-url http: pass /
url-parse //user@pass

Parse results for /some/path?key=value

method protocol host pathname
request-target http: /some/path
new-url http: some /path
url-parse /some/path

Parse results for @@

method protocol host pathname
request-target N/A N/A N/A
new-url N/A N/A N/A
url-parse @@

Parse results for ///

method protocol host pathname
request-target N/A N/A N/A
new-url N/A N/A N/A
url-parse ///

Urls parsed by method

request-target new-url url-parse total
4 5 8 8

@david-luna
Copy link
Author

david-luna commented Mar 21, 2023

Using a non standard protocol in new URL() constructor changes the parsing result to a better detection of paths (starting with / and //) increasing the number of URLs being parsed without throwing. New results, in this batch we used protocol: as a prefix for URL constructor parameter.

Parse results for /

method protocol host pathname
request-target http: /
new-url protocol: /
url-parse /

Parse results for /some/path

method protocol host pathname
request-target http: /some/path
new-url protocol: /some/path
url-parse /some/path

Parse results for //some/path

method protocol host pathname
request-target N/A N/A N/A
new-url protocol: //some/path
url-parse //some/path

Parse results for /user@pass

method protocol host pathname
request-target http: /user@pass
new-url protocol: /user@pass
url-parse /user@pass

Parse results for //user@pass

method protocol host pathname
request-target N/A N/A N/A
new-url protocol: //user@pass
url-parse //user@pass

Parse results for /some/path?key=value

method protocol host pathname
request-target http: /some/path
new-url protocol: /some/path
url-parse /some/path

Parse results for @@

method protocol host pathname
request-target N/A N/A N/A
new-url N/A N/A N/A
url-parse @@

Parse results for ///

method protocol host pathname
request-target N/A N/A N/A
new-url protocol: ///
url-parse ///

Urls parsed by method

request-target new-url url-parse total
4 7 8 8

@david-luna
Copy link
Author

The batches show that there are edge cases where url.parse still gives a result but we can cover most of them with request-target and new URL(...) being the former faster than the later. IMHO a good approach would be

  • use request-target
  • fallback to new URL(...)
  • return null if both failed to parse

@david-luna
Copy link
Author

@watson since this a breaking change I would like to have your feedback on this.

As a summary url.parse is very permissive and other parsing implementations like new URL or request-target do not allow certain url strings if they do not conform to whatwg spec or RFC 3986

This means if we replace url.parse there will be some edge cases we might get a parsing error where it was working before. So the question would be if its okay to be stricter on the parsing and, if so, which is the preferred behavior when an invalid URL string is given, to throw or return null/void.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants