Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: file: URL separator deduplication #552

Open
guybedford opened this issue Oct 8, 2020 · 15 comments
Open

Proposal: file: URL separator deduplication #552

guybedford opened this issue Oct 8, 2020 · 15 comments
Labels
addition/proposal New features or enhancements topic: api topic: file Aren't file: URLs the best?

Comments

@guybedford
Copy link

guybedford commented Oct 8, 2020

I would like to propose that file URLs include separator deduplication in the spec as a rule to match the file system identifiers themselves more closely.

The reason for this is that Node.js, Deno, and other platforms, build tools and tooling use file: URLs as their key in the module registry when loading ES modules. Whenever there are duplicate separators, this causes a double entry in the registry which can lead to duplicate module execution and a lack of code sharing.

These platforms and build tools are now the primary consumers of file: URLs since browsers no longer encourage the use case at all, therefore it is important that this spec can aim to meet their needs as primary consumers of this aspect of the spec going forward.

URL resolution already converts \ into /. This leaves only casing, percent encoding, query parameters and hashes as the comprehensive list of remaining non-unique components of the file URL representation's associated path representation.

I understand the main consideration for such a change is considering the potentially major change in browsers, but perhaps it would not be so difficult since browsers so carefully discourage file URLs and the issue is rather that until now there hasn't been pressure on this use case from these new file modules platform directions.

This would smooth a number of environment bugs that otherwise need custom code to handle.

In Node.js we effectively add this renormalization step to all module resolution usage of file URLs. Other cases of fileURL resolution would benefit.

References:

@annevk
Copy link
Member

annevk commented Oct 8, 2021

That was #232 which we ended up overturning to some extent with #405. I'd rather not revisit this again.

cc @alwinb

@alwinb
Copy link
Contributor

alwinb commented Oct 8, 2021

To be fair, It is a bit different, as #232 collapsed a leading sequence of slashes only, somewhat mimicking the ignoring of multiple slashes before the host in other special URLs. This seems to call for collapsing all slashes in the path.

There also were interactions with the host that made things rather complex, especially to find a good trade-off between matching the different browser behaviours.

We recently considered in #606 to add a percent-encoding-equivalence / normalisation to the URL equivalence section. The same could be done for (some necessarily limited form of) file-path normalisation. Then the parser (ao.) would not become more complex.

Whether such a thing should be applied by default, is a question that I don't feel qualified to answer though. Browsers in any case, don't do this.

@karwa has done a lot of research on file paths, posted in #621 (note #618 as well). I think this would be part of that project. It's some amount of work though!

@guybedford
Copy link
Author

I think the proposal would be to just dedupe pathname in the file: scheme URLs and not any other URL parts (so leaving the Windows hostname etc). @alwinb do you know if there were any remaining issues with that kind of approach offhand? It's a lot of information to digest to pick up it seems....

@alwinb
Copy link
Contributor

alwinb commented Oct 8, 2021

If you don't consider the hostname or windows drive letters? I don't see how there would be interactions with other parts of the URL. But it would equate more file URLs.

I'd be surprised to find systems that allow empty directory names, but I would want to be careful about normalising too much. File path normalisation is quite a thing on its own.

@guybedford
Copy link
Author

Certainly agreed, I think for the most part we've accepted the state of percent encoding at this point. This issue was specifically about the / separator deduping only in the pathname.

@annevk
Copy link
Member

annevk commented Oct 8, 2021

Given browser behavior I would tend toward this being something you do after parsing if you need it.

@alwinb
Copy link
Contributor

alwinb commented Oct 9, 2021

I don't think it is an unreasonable proposal. But it requires answering a universal question about file-paths and that makes it a difficult issue.

Specifically: does this normalisation agree with all file-paths, for all systems, and are we willing to impose it on future systems as well?

The normalisation itself isn't complicated and quite natural, so if that question can be answered with a convincing "yes" then I wouldn't be opposed, at least.

Related: RFC8089

@rmisev
Copy link
Member

rmisev commented Oct 9, 2021

RFC 8089 lists two cases, which doesn't play well which separator deduplication, see: https://datatracker.ietf.org/doc/html/rfc8089#appendix-E.3.2

E.3.2. URI with UNC Path
...
Non-local files:

  • The representation of a non-local file with an empty authority and a complete (transformed) UNC string in the path. For example:
    • "file:////host.example.com/path/to/file"
  • As above, with an extra slash between the empty authority and the transformed UNC string, as per the syntax defined in [RFC1738]. For example:
    • "file://///host.example.com/path/to/file"
      This representation is notably used by the Firefox web browser. See Bugzilla#107540 [Bug107540].

If I understand this proposal correctly, then separator deduplication will change their meaning, as they become:
file:///host.example.com/path/to/file

@alwinb
Copy link
Contributor

alwinb commented Oct 10, 2021

Thanks

@guybedford
Copy link
Author

@rmisev the UNC string in a file URL, as far as I'm aware, is always of the form file://host/pathname and not file:///host/pathname. As a result new URL() treats the UNC component as a hostname in all cases I've seen and Node.js relies on this behaviour as well (//cc @bmeck if I'm missing something). Where did you get the file:////host.example.com example from? I've never seen a case like that...

@rmisev
Copy link
Member

rmisev commented Oct 10, 2021

The quote with these examples is taken from RFC 8089: https://datatracker.ietf.org/doc/html/rfc8089#appendix-E.3.2
For example Firefox uses 5 slashes before host, when converts from UNC path:
\\host.example.com\path\to\file -> file://///host.example.com/path/to/file

@guybedford
Copy link
Author

@rmisev in Node.js we explicitly create file://host.example.com/... in these cases so I wasn't aware of this more verbose form. That seems like a good counter example.

@guybedford
Copy link
Author

I guess this isn't a blocker though - it just complicates the story a little:

  1. Is the file URL a UNC form (file://// detection being non-ambiguous as the indicator I believe), and if so, only dedupe separators after the UNC prefix.
  2. If not, dedupe separators as part of the pathname

It still seems like it wouldn't be that complex?

@TimothyGu
Copy link
Member

Right now, with the exception of drive letters, pathnames are pretty consistent between different schemes. I think it'd be unfortunate to have the slash-collapsing behavior just for file URLs.

Additionally, parsing file:///etc//passwd gives consistent results on a wide variety of parsers across different languages:

image

I would really want to preserve this level of interop, which is quite rare in file URLs.

@annevk annevk added the topic: file Aren't file: URLs the best? label Oct 20, 2021
@annevk annevk added topic: api addition/proposal New features or enhancements and removed topic: parser labels Mar 6, 2023
@annevk
Copy link
Member

annevk commented Mar 6, 2023

Reclassifying this as something we could potentially offer on top of #729, but won't change the parser for. Hope that's acceptable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
addition/proposal New features or enhancements topic: api topic: file Aren't file: URLs the best?
Development

No branches or pull requests

5 participants