Cargo packages duplicate files on case-insensitive file systems #13722

Open
kornelski opened this issue Apr 7, 2024 · 7 comments · May be fixed by #13885
Labels
A-filesystem Area: issues with filesystems C-bug Category: bug Command-package O-macos OS: macOS O-windows OS: Windows S-triage Status: This issue is waiting on initial triage.

Comments

@kornelski
Contributor

kornelski commented Apr 7, 2024

Problem

It seems that Cargo excludes already-packaged files using an exact name comparison, which doesn't always match how the file system determines name equality.

   Archiving Cargo.lock
   Archiving Cargo.toml
   Archiving Cargo.toml.orig
   Archiving README.md
   Archiving readme.Md
   Archiving src/main.rs

Example crate:

https://docs.rs/crate/rosu/0.6.0/source/

Steps

[package]

readme = "README.md"
echo case > readme.Md
cargo package

The same applies to license-file and to cargo.lock.

Possible Solution(s)

Theoretically there could be other gotchas of this kind, e.g. the HFS+ file system on macOS forces file names into NFD Unicode form, while most text is in NFC form, so codepoint-by-codepoint comparisons of the same name can fail. However, HFS+ is on its way out, so perhaps a simple case-insensitive comparison will suffice.
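
A minimal sketch of the "simple case-insensitive comparison" idea, assuming the package file list is available as plain strings. The function name `find_case_collisions` is hypothetical and this is not how Cargo is implemented; it also ignores the NFC/NFD issue, which would need a Unicode normalization library outside the standard library.

```rust
use std::collections::HashMap;

/// Groups paths whose lowercased forms are identical and returns only
/// the groups with more than one member (hypothetical helper, not Cargo code).
fn find_case_collisions<'a>(paths: &[&'a str]) -> Vec<Vec<&'a str>> {
    let mut groups: HashMap<String, Vec<&'a str>> = HashMap::new();
    for &path in paths {
        // `to_lowercase` is locale-independent Unicode lowercasing.
        groups.entry(path.to_lowercase()).or_default().push(path);
    }
    let mut collisions: Vec<Vec<&'a str>> = groups
        .into_values()
        .filter(|group| group.len() > 1)
        .collect();
    // HashMap iteration order is unspecified, so sort for stable output.
    collisions.sort();
    collisions
}

fn main() {
    // The file list from the `cargo package` output in this report.
    let files = [
        "Cargo.lock",
        "Cargo.toml",
        "Cargo.toml.orig",
        "README.md",
        "readme.Md",
        "src/main.rs",
    ];
    for group in find_case_collisions(&files) {
        println!("collision: {:?}", group);
    }
}
```

With the file list above, the only reported group is `README.md` and `readme.Md`.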

Notes

No response

Version

cargo 1.79.0-nightly (499a61ce7 2024-03-26)
@kornelski kornelski added C-bug Category: bug S-triage Status: This issue is waiting on initial triage. labels Apr 7, 2024
@kornelski kornelski changed the title Cargo packages duplicate README on case-insensitive file systems Cargo packages duplicate files on case-insensitive file systems Apr 7, 2024
@kornelski
Contributor Author

In the same vein, if there's a TARGET/ directory, it doesn't get excluded when packaging.

Caused by:
  Source directory was modified by build.rs during cargo publish. Build scripts should not modify anything outside of OUT_DIR.
  Added: /private/tmp/bla/target/package/testx-0.0.0/TARGET/.rustc_info.json
  	/private/tmp/bla/target/package/testx-0.0.0/TARGET/debug
  	/private/tmp/bla/target/package/testx-0.0.0/TARGET/debug/.cargo-lock
  	/private/tmp/bla/target/package/testx-0.0.0/TARGET/debug/.fingerprint
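
As an illustration of what a case-insensitive exclusion check could look like, here is a small sketch. It assumes paths relative to the package root and a build directory named `target`; the helper `starts_with_target_dir` is hypothetical and does not reflect Cargo's actual exclusion logic.

```rust
use std::path::{Component, Path};

/// Returns true if the first component of a relative path names the
/// build directory, ignoring ASCII case (hypothetical helper).
fn starts_with_target_dir(rel_path: &Path, target_dir_name: &str) -> bool {
    match rel_path.components().next() {
        Some(Component::Normal(first)) => first
            .to_str()
            // Non-UTF-8 names are conservatively treated as "not the target dir".
            .map_or(false, |s| s.eq_ignore_ascii_case(target_dir_name)),
        _ => false,
    }
}

fn main() {
    for p in ["TARGET/debug/.cargo-lock", "target/debug", "src/main.rs"] {
        println!("{p}: excluded = {}", starts_with_target_dir(Path::new(p), "target"));
    }
}
```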

@heisen-li
Contributor

@rustbot label Command-package

@epage epage added O-windows OS: Windows O-macos OS: macOS A-filesystem Area: issues with filesystems labels Apr 9, 2024
@VorpalBlade

VorpalBlade commented Apr 23, 2024

> so perhaps a simple case-insensitive comparison will suffice

While that sounds lovely, in what locale? For the languages I speak it is relatively straightforward, but my understanding is that case handling is lossy in some languages, such as German (ß becomes SS in upper case, I think?) and Turkish (I believe they have the letter "i" both with and without a dot, and the upper/lower case mapping there isn't straightforward, but don't ask me how exactly).

As a Swedish/English speaker this is all hearsay though, and I don't know how e.g. Windows or macOS handle these, though I think I heard that NTFS stores a case normalisation table at file system creation time based on the locale set at that point?
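
For reference, a short sketch of how Rust's standard library treats these cases. Its `to_uppercase` and `to_lowercase` are locale-independent Unicode mappings, so they say nothing about what a given file system does; the helper `survives_case_round_trip` is a hypothetical name used only for illustration.

```rust
/// True if upper-casing and then lower-casing gives back the input
/// (hypothetical helper to show where case mapping is lossy).
fn survives_case_round_trip(s: &str) -> bool {
    s.to_uppercase().to_lowercase() == s
}

fn main() {
    // German sharp s expands to two letters when upper-cased.
    assert_eq!("ß".to_uppercase(), "SS");
    // The capital sharp s (U+1E9E) lower-cases to ß.
    assert_eq!("ẞ".to_lowercase(), "ß");
    // Turkish dotless ı upper-cases to a plain ASCII I.
    assert_eq!("ı".to_uppercase(), "I");
    // Dotted capital İ lower-cases to i plus a combining dot above (U+0307).
    assert_eq!("İ".to_lowercase(), "i\u{307}");

    // Both of these lose information on a round trip; plain ASCII does not.
    assert!(!survives_case_round_trip("ß"));
    assert!(!survives_case_round_trip("ı"));
    assert!(survives_case_round_trip("readme.md"));
}
```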

@ChrisDenton
Contributor

On Windows, the NTFS up case table is initialized when the drive is first formatted. So it'll depend on the Windows version that did that. It is however language neutral and only acts on the Basic Multilingual Plane.

Also, depending on the configuration, NTFS can be case sensitive. In Windows this can even be set differently for each directory.

@VorpalBlade

> On Windows, the NTFS up case table is initialized when the drive is first formatted. So it'll depend on the Windows version that did that. It is however language neutral and only acts on the Basic Multilingual Plane.

Hm, maybe I'm thinking of FAT and Windows 9x then? Pretty sure things differed depending on code pages and such there. Not sure how modern OSes interacting with FAT32/exFAT deal with that. Hopefully it is somewhat sane on any Windows version Rust still supports.

@ChrisDenton
Contributor

Ah yes, FAT32 is indeed a mess. But then I'm also not sure how well Cargo and rustc support it, as it lacks a lot of filesystem features that may be expected. Probably it does at least work if it's only read from (e.g. the target directory is on another drive).

@kornelski
Contributor Author

kornelski commented Apr 24, 2024

> While that sounds lovely, in what locale?

It is a messy problem, but fortunately the detection algorithm doesn't need to produce user-facing text, so it doesn't need to be perfect from a linguistic perspective. It only needs to detect potential collisions between file names. Crates that work with only a specific combination of Windows locale and NTFS vintage are not generally useful, so the detection can also err on the side of over-normalizing (e.g. normalize all dotless ı's to i, forbid all control characters, check against both lower and upper case, treat codepoints with multiple transliterations/decompositions as a wildcard, etc.).

But for a start, even a simple .to_lowercase() will handle more than enough of the accidental variations of Readme.Md.
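
A rough sketch of what such an over-normalizing collision key could look like, using only the Rust standard library. The name `collision_key` and the exact folding steps are assumptions made for illustration, not a proposed Cargo implementation; in particular, dropping every combining dot above (U+0307) is deliberately aggressive, and the "wildcard" idea for ambiguous decompositions is not attempted here.

```rust
/// Builds a key such that two file names with equal keys are treated as
/// a potential collision (hypothetical helper, deliberately over-normalizing).
fn collision_key(name: &str) -> Result<String, String> {
    // Forbid control characters outright.
    if let Some(c) = name.chars().find(|c| c.is_control()) {
        return Err(format!(
            "control character U+{:04X} in {:?}",
            c as u32, name
        ));
    }
    // Apply both case directions: lower, then upper, then lower again.
    // This maps dotless ı to i (ı -> I -> i) and ß to ss (ß -> SS -> ss).
    let folded = name.to_lowercase().to_uppercase().to_lowercase();
    // İ lower-cases to i + U+0307; drop the combining dot so it matches i.
    Ok(folded.chars().filter(|&c| c != '\u{307}').collect())
}

fn main() {
    let a = collision_key("README.md").unwrap();
    let b = collision_key("readme.Md").unwrap();
    println!("{a:?} vs {b:?}: collide = {}", a == b);
}
```

Comparing keys instead of raw names errs on the side of reporting too many collisions, which matches the goal of rejecting names that might clash on some file system.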

6 participants