Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL's on incoming requests #71

Open
wesleytodd opened this issue Oct 2, 2020 · 38 comments
Open

URL's on incoming requests #71

wesleytodd opened this issue Oct 2, 2020 · 38 comments

Comments

@wesleytodd
Copy link
Member

Initial context: nodejs/node#12682

I think the new api's should use URL as the basis for req.url (or whatever equivalent we expose). As you can see from the thread above, there is some question about how today these are relative urls and the new spec does not support that. My thought, and what I think we need to discuss, is that the "base" should be derived from the incoming request headers and fall back to the server address.

If we come to an agreement on this I am happy to write up a more detailed description. Agenda added so we can discuss in the next meeting as well.

@mcollina
Copy link
Member

mcollina commented Oct 2, 2020

I'm highly -1 on using new URL() for req.url. It's extremely slow and inefficient due to the host and protocol parsing.

@ronag
Copy link
Member

ronag commented Oct 2, 2020

I usually use, https://www.npmjs.com/package/request-target. Maybe worth looking into?

@ronag
Copy link
Member

ronag commented Oct 2, 2020

https://github.com/nodejs/next-10/blob/master/VALUES_AND_PRIORITIZAION.md might also be worth considering, i.e. "Performance" vs "Web API compatibility".

@Ethan-Arrowood
Copy link
Contributor

could we do something like req.url returns the faster solution and then req.url() would return the new URL() version? This way users have to opt-in to the performance decrease if they want it?

Additionally, could we look into improving host and protocol parsing? Or is has that already been attempted?

@wesleytodd
Copy link
Member Author

wesleytodd commented Oct 2, 2020

This is why I was strongly questioning adding that (nodejs/node#35323) in yesterdays meeting.

The perf is a big concern I was not aware of (do we have benchmarks on this, or details on why?), but is also a good reason to push back on the web api compat technical value issue.

@ronag
Copy link
Member

ronag commented Oct 2, 2020

do we have benchmarks on this, or details on why and what parts are slow?

From https://www.npmjs.com/package/request-target

$ npm run benchmark
legacy url.parse() x 371,681 ops/sec ±0.88% (297996 samples)
whatwg new URL() x 58,766 ops/sec ±0.3% (118234 samples)
request-target x 552,748 ops/sec ±0.54% (344809 samples)

@wesleytodd
Copy link
Member Author

I hadn't clicked your link, but yeah this looks good. I will dig into what it is doing, thanks for the link!

@wesleytodd
Copy link
Member Author

Well I very quickly ended on this line, and I am fairly confident we will be able to find a better (and even better perf) way than a big regex.

But this leads me to my next question: do we really want to deviate from the existing URL apis?

Seems like this would add "yet another url" to node core, right? Maybe we could get fast parsing but also expose an api which looks like a URL instance, but this gets us into a similar situation of "its a stream but not". Anyway, just thinking out loud on this.

@styfle
Copy link
Member

styfle commented Oct 2, 2020

It's extremely slow and inefficient due to the host and protocol parsing.

Is the slow performance of new URL() inherent to the way its spec'ed or just the Node.js implementation?

If its the latter, then this would be a good excuse to re-evaluate the implementation to squeeze out better performance.

@styfle
Copy link
Member

styfle commented Oct 2, 2020

Overall, I think providing a way to retrieve req.url as a URL would be a great addition to Node core because the URL object is the future, url.parse() is a legacy of the past, before there was a standard.

Using standard JS URL objects allows code to be shared between the browser, Node.js, etc.

One of the big foot-guns today is parsing url query string params. Consider this code:

const { parse } = require('url');
const handler = (req, res) => {
  let { pathname = '/', query = {} } = parse(req.url || '', true);
  const { user = '' } = query;
  console.log(user.toLowerCase());
}

This works fine with https://example.com/api?user=foo but throws when two parameters with the same name are detected https://example.com/api?user=foo&user=bar because user suddenly becomes an array instead of a string.

Consider this similar code using URL:

const handler = (req, res) => {
  const url = new URL(req.url, 'https://example.com');
  const user = url.searchParams.get('user') || '';
  console.log(user.toLowerCase());
}

This bug doesn't happen with URL because it separates getAll() vs get(). Now we won't crash in production.

My workaround for the time being looks like this:

export function getURL(req) {
  const proto = req.headers['x-forwarded-proto'] || 'https';
  const host = req.headers['x-forwarded-host'] || req.headers.host || 'example.com';
  return new URL(req.url || '/', `${proto}://${host}`);
}

@ljharb
Copy link
Member

ljharb commented Oct 2, 2020

It makes sense to me to offer both (obviously under different names) - ie, req.url as the string, and req.getURL() as the lazily-computed-but-memoized URL object.

@wesleytodd
Copy link
Member Author

It makes sense to me to offer both

I think before we make any decisions on this, we need to know if we can help fix the perf issues of URL. If we cannot do that, then I think we should consider alternate api's to expose both. If not, I think one api is best.

My workaround for the time being looks like this:

@styfle There are security concerns here, but if we "do it right" I think it is a reasonable approach (if you know those headers can only be set by trusted proxies, you should be fine). The added support for the meta headers would also change this a bit, but not in a bad way.

Also, we would default them to the values computed from server.address() I think.

@stevenvachon
Copy link

It makes sense to me to offer both (obviously under different names) - ie, req.url as the string, and req.getURL() as the lazily-computed-but-memoized URL object.

Why both when there's URL::href?

@styfle
Copy link
Member

styfle commented Oct 2, 2020

Modifying req.url to return a URL object instead of a string would be a breaking change and it doesn't seem worth the churn to the ecosystem.

@ghermeto
Copy link
Member

ghermeto commented Oct 2, 2020

Modifying req.url to return a URL object instead of a string would be a breaking change and it doesn't seem worth the churn to the ecosystem.

I don't think this is an issue if we are discussing a whole new API

@styfle
Copy link
Member

styfle commented Oct 2, 2020

Maybe I misunderstood the intent of this discussion 🤔

I thought req in the original post was referring to the IncomingMessage class, the first parameter to createServer listener.

If you modify the url property on IncomingMessage, then it would be a breaking change, right?

@awwright
Copy link

awwright commented Oct 3, 2020

Possibly a tangent, but if I were to pick the naming, I would use req.target to mean "what the client provided literally" and req.uri to mean "Resolved (full, absolute) URI"

And if I were modifying the Node.js API, I would deprecate req.url or alias it to req.target instead.

@wesleytodd
Copy link
Member Author

Maybe I misunderstood the intent of this discussion

This discussion is related to this: #55

Because the entire api will be new, breaking is fine.

I would use req.target to mean "what the client provided literally" and req.uri to mean "Resolved (full, absolute) URI"

While I see the goal of this naming, I am strongly against bikeshedding names (at least at this point). I also do not think this would be a meaningful change, nor one which would make it easier/better for any of our constituencies. Once we nail down the particulars of what we need to offer on this api, we can discuss again if name changes will be something to pursue, but lets keep on focus for now.

@wesleytodd
Copy link
Member Author

wesleytodd commented Oct 15, 2020

To recap our discussion on the meeting today:

What would folks think of providing an api at server creation time to pass your url implementation. By default we would create a URL instance, but if you passed a function to the createServer call your function would be called instead. This would be similar to the existing api where you can pass IncomingMessage and ServerResponse. So you could pass URL, for example. Thoughts?

@mcollina
Copy link
Member

I'm still generically -1 to the idea of using

new URL(`${protocol}://${host}${path}`) // simplified

in a hot path by default.
It creates unnecessary computation by generating a string that is then parsed. It also creates a few objects which are slow to create as well (UrlSearchParams).

A similar discussion can be made for the Headers object of fetch and the relative standards.

None of this was defined with throughput in mind. If we want Node.js to shine, it needs to be faster, not slower.

@stevenvachon
Copy link

@mcollina "better" is better than "faster" too.

@ghermeto
Copy link
Member

Agree with @mcollina. Anything that affects performance should be opt-in. In this particular case, if I don't need to parse the URL in most of my requests, why would I pay the price on all of them?

@wesleytodd
Copy link
Member Author

wesleytodd commented Oct 19, 2020

The plan which @ronag and I discussed in this gist involves a multi-layer api. As a compromise, if we kept the "core" api with the string req.url, would it be acceptable to have the higher level api use the URL constructor? Since the higher level api is intended to usable directly (as in not via a framework) it would be preferable to me so we don't introduce "yet another" url implementation.

@ghermeto
Copy link
Member

Can you elaborate on:

Since the higher level api is intended to usable directly

and

we don't introduce "yet another" url implementation

@wesleytodd
Copy link
Member Author

Since the higher level api is intended to usable directly

I think we should probably agree on which APIs have which intended constituencies. @ronag outlined a bit of what the goals of the different undici apis are, and the layering which he does enables really low level usage, but also has "direct end user api's" which are high level. The goal was to take a similar approach (see the table in the comment linked above). What we don't yet have is a clear statement on the target usage of the layers.

we don't introduce "yet another" url implementation

One of the proposals above was to utilize a specialty parser like request-target for the url. My concern with this is we already have 2 implementations, this would add a third.

@ghermeto
Copy link
Member

I think we should probably agree on which APIs have which intended constituencies. @ronag outlined a bit of what the goals of the different undici apis are, and the layering which he does enables really low level usage, but also has "direct end user api's" which are high level. The goal was to take a similar approach (see the table in the comment linked above). What we don't yet have is a clear statement on the target usage of the layers.

And what is wrong with using:

const url = new URL(req.url);

Seems simple enough if a user wants to parse the URL and can also be provided by the frameworks if they must...

One of the proposals above was to utilize a specialty parser like request-target for the url. My concern with this is we already have 2 implementations, this would add a third.

agree with you and I believe this gives an opportunity to let users (or frameworks) chose how they want to parse a URL. As you suggested, we can provide a function to createServer that tells the server how to parse the URL. If not provided, just returns a string by default:

http.createServer({
  allowHTTP1: true,
  parseUrl(stringUrl) {
    return new URL(stringUrl);
  }
})

@styfle
Copy link
Member

styfle commented Jan 17, 2023

One of the proposals above was to utilize a specialty parser like request-target for the url. My concern with this is we already have 2 implementations, this would add a third.

Agreed, that will likely lead to more confusion.

I think the more this conversation goes on we need more clarity on what the use case are. I think that most cases will be that only the pathname is used and that might lean more toward just telling users to parse it themselves if necessary.

@wesleytodd In addition to pathname, I think query string is quite common too.

I think what'd need to make this work is: URL.from({ host, path, protocol }). In this way we can skip the non-needed parts of the algorithm that are really expensive.

@mcollina I'm not sure that solves the problem of parsing req.url into some object. Would you be able to URL.from(req.url) or even URL.from(req)?

@sheplu
Copy link
Member

sheplu commented Jan 18, 2023

thanks for your message @styfle ! But this group isn't really active anymore (last two years I would say)

@styfle
Copy link
Member

styfle commented Jan 19, 2023

I see. I've been bouncing around different issues and all seem to be dead ends.

@mcollina
Copy link
Member

@mcollina I'm not sure that solves the problem of parsing req.url into some object. Would you be able to URL.from(req.url) or even URL.from(req)?

I think URL.from(req) would be fantastic. We probably can't shipt that API, but possibly http.getURL(req), which would be ok. Also req.getURL() might work.

@styfle
Copy link
Member

styfle commented Feb 21, 2023

possibly http.getURL(req), which would be ok. Also req.getURL() might work.

Either option sounds great!

We could use original-url as a starting point (or even vendor it into node core)

@mcollina
Copy link
Member

original-url uses the unsafe require('url').parse method. I don't think anybody should use that implementation as there are multiple security problems with it. The best approach would be to pass via new URL() or analog.

@mcollina
Copy link
Member

would you like to send a PR for this?

@awwright
Copy link

How is url.parse unsafe? If the incoming request conforms to the generic URI syntax (which obviously should be verified, as you always do for untrusted user input, right?), there's no reason there should be problems.

@styfle
Copy link
Member

styfle commented Feb 23, 2023

would you like to send a PR for this?

Looks like someone already did: watson/original-url#11

@marco-ippolito
Copy link
Member

issue related: nodejs/node#51311

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests