bsdtar included with Windows does not handle paths with special characters #2092
Comments
@DHowett as FYI Note that this only seems to be true with arguments from the command line. If you specify a directory that contains a filename with Unicode characters, that seems to work correctly. This might be as simple as updating to |
@dunhor It does on my system too, although the For example, here's what I see in powershell 7: (note the
|
@dennisvang I'm guessing that's likely because it's trying to print the strings as Changing the output path to use WCS strings for console output would also work, however I'm not sure how invasive such a change might be. I'm also not 100% sure that the output would appear correct without something like a |
@dunhor I've recently fixed a similar issue for a base64 CLI by adding an application manifest which forces the Active CodePage to UTF-8 (aklomp/base64#139). Note that this approach only works/takes effect on Windows 10 Version 1903 and later. |
I thought about that as well, however libarchive uses Side note: there's an additional related problem of how libarchive interprets filenames in the archives themselves when not explicitly specified by either the archive or the format. It's probably too breaking of a change to default to always setting |
Libarchive arguably does more than it should in these cases. For most formats, the filenames in the archive are either:
(Yes, there are a few exceptions to the above; most notably some Windows-originated formats that use UTF-16 filenames.) For libarchive 4, I really want to explore having libarchive's readers perform no filename translation at all. That would defer any attempt to translate filenames to the point where they are written -- either to disk or to another archive. The question is how much this would complicate life for users of libarchive. I have a few ideas and hope someday soon to have enough time to start working through them. |
Sorry, I worded that statement poorly 🫣. I meant that it's a problem of ambiguity, not a problem unique to libarchive that libarchive has introduced. When the format or the archive itself does not specify how names are encoded, heuristics will always be necessary, particularly when the machine/OS that created the archive may not be the one reading the archive. I actually think the facilities libarchive provides are pretty good in this case (though I do wish it didn't use locale to convert strings to MBS, but that's easily solvable, at least on Windows, by always using the
From my perspective, that would depend a lot on how this is accomplished. For example, if names are exposed as something like a |
What I had in mind was something along the following lines:
Hmmm..... writing all the above out, it seems doable. But it does look like it would end up exposing a lot of different filename APIs, which is certainly awkward. And sophisticated consumers will pretty much always have to deal with the filename issue -- Windows clients can't always be given UTF16 without risk that the result is simply mangled. |
I think that's pretty much in line with what I would be expecting.
I think non-Windows - and even some Windows - folks would like that, however at least on Windows, that's an operation that could fail (and at least for UTF-16, a conversion that many folks may not want). So as long as it's an optional conversion that ideally wouldn't happen unless asked for, I think it's fine. This could either be an opt-in or opt-out.

I also like the idea of having helper conversion functions. That would also help folks upgrade across the major version boundary. I'm not sure it necessarily needs to blow up the number of filename APIs if they are standalone helpers. E.g. path/symlink/hardlink/etc. could all return the raw bytes+indicator, which can then be passed to a conversion function.

IMO, having these be fail-able is fine. That's just a natural consequence of string conversions and developers should be prepared for that as a possibility.
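To make the "raw bytes+indicator passed to a fail-able conversion helper" idea concrete, here is a minimal sketch. Every name in it (`raw_name`, `name_encoding`, `raw_name_to_utf8`) is hypothetical and invented for illustration; none of it is existing or planned libarchive API. It only handles the UTF-16LE to UTF-8 case and reports failure instead of substituting characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not libarchive API. */
enum name_encoding { NAME_ENC_UNKNOWN, NAME_ENC_UTF8, NAME_ENC_UTF16LE };

struct raw_name {
    const unsigned char *bytes;  /* name exactly as stored in the archive */
    size_t len;                  /* length in bytes */
    enum name_encoding enc;      /* what the archive or format declared */
};

/* Encode one code point as UTF-8. Returns bytes written, or 0 if no room. */
static size_t put_utf8(uint32_t cp, char *out, size_t room)
{
    if (cp < 0x80) {
        if (room < 1) return 0;
        out[0] = (char)cp;
        return 1;
    } else if (cp < 0x800) {
        if (room < 2) return 0;
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        if (room < 3) return 0;
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (room < 4) return 0;
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Convert a UTF-16LE raw name to NUL-terminated UTF-8.
 * Returns 0 on success, -1 on failure (other encoding, odd byte length,
 * unpaired surrogate, or output buffer too small). Nothing is replaced
 * or guessed: the caller decides what to do when this fails.
 */
static int raw_name_to_utf8(const struct raw_name *n, char *out, size_t outsz)
{
    size_t i = 0, o = 0;

    if (n->enc != NAME_ENC_UTF16LE || (n->len % 2) != 0 || outsz == 0)
        return -1;
    while (i < n->len) {
        uint32_t u = (uint32_t)n->bytes[i] | ((uint32_t)n->bytes[i + 1] << 8);
        size_t w;

        i += 2;
        if (u >= 0xD800 && u <= 0xDBFF) {        /* high surrogate */
            uint32_t lo;
            if (i + 2 > n->len) return -1;       /* nothing follows it */
            lo = (uint32_t)n->bytes[i] | ((uint32_t)n->bytes[i + 1] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF) return -1;
            i += 2;
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return -1;                           /* stray low surrogate */
        }
        w = put_utf8(u, out + o, outsz - 1 - o); /* keep room for the NUL */
        if (w == 0) return -1;
        o += w;
    }
    out[o] = '\0';
    return 0;
}
```

With a helper shaped like this, the accessors for path, symlink, and hardlink could all return the same `raw_name`-style value, and only callers that actually want UTF-8 would pay for (and have to handle failure of) the conversion.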
Paths on Windows will, at some point, need to be translated to UTF-16, whether that's done by the application or by the system. Some paths being non-representable in Windows is just a fact of life that applications will need to account for. With the current state of the library, always interfacing with libarchive using UTF-16 strings is the most reliable, especially when you don't have control over the process locale.

For the future, as I've been suggesting, I'm okay with names being forwarded on "as they appear" in the archive, so I don't think it's necessary to provide Windows consumers with a way to always get strings as UTF-16 under that model. That's actually probably better in some ways, as more advanced applications can choose how to handle such situations (e.g. replacing invalid characters with placeholders).

The one thing that sounds maybe tricky to me is how the archive creation functions work. Would they be permissive and accept strings in many forms? E.g. could I call
|
Optional conversions are a challenge to handle well. We either need to make them modal -- where you say in advance what conversions you want and the library gains a lot of extra logic and storage to track those -- or we make them on-demand, which might incur extra allocations in order to preserve the original. As with any software: More options means more paths to test and validate.

My idea here is to push any potentially-lossy translations as late as possible. So if you read from one archive and write to another, the reader would produce entries with (mostly) unmodified contents from the original archive and the writer would be responsible for using and/or translating as necessary. So I would expect a small family of

As you point out, there are conceivably cases where this requires complex transformations and it's not yet obvious to me how to provide that in a way that balances accuracy and convenience. Fortunately (??) most formats fall back on "just dump the bytes". So when writing Zip format, for example, we could ask the entry if it can easily provide UTF8 (which will be possible if the original name was UTF8 or UTF16) and if not we just ask for raw bytes and dump those out. The only cases where we potentially end up with truly arbitrary transformations would be when reading formats like CAB that store pathname encoding details while writing to disk on legacy POSIX systems that expect local filename encodings to match the user's locale. But that's a sufficiently borderline situation that I'm tempted to not try to be clever in such cases.

I think that means that practically speaking, we might only need UTF8 <-> UTF16 conversions in the core library and selective support for Unicode <-> arbitrary MBS when writing to disk. The former we could implement from scratch; the latter would require iconv or similar for full support.

This would be a big project -- I don't have enough time to even do the initial explorations for it right now. But it is the one piece of libarchive's API that I'm actively dissatisfied with, which is why I've been thinking about alternatives for a while now, hoping to find time someday to actually implement it. |
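The Zip case described above ("ask for UTF8, otherwise dump the raw bytes") could look roughly like the sketch below. The `raw_name` type and `zip_name_flags` function are hypothetical names made up for this illustration, not libarchive API, and the UTF-16 source case is left out for brevity. The one external fact relied on is that bit 11 (0x0800) of the Zip general-purpose flags marks the filename as UTF-8.

```c
#include <stddef.h>

/* Hypothetical types for illustration only; not libarchive API. */
enum name_encoding { NAME_ENC_UNKNOWN, NAME_ENC_UTF8 };

struct raw_name {
    const unsigned char *bytes;  /* name exactly as read from the source */
    size_t len;                  /* length in bytes */
    enum name_encoding enc;      /* what the source archive declared */
};

#define ZIP_FLAG_UTF8 0x0800u    /* Zip general-purpose bit 11 */

/* Returns 1 if s[0..n) is well-formed UTF-8 (no overlongs, no surrogates). */
static int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                      /* invalid lead byte */

        if (n - i <= extra) return 0;       /* sequence is cut short */
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += extra + 1;
    }
    return 1;
}

/*
 * Decide the flag bits for a Zip entry name. The name bytes themselves
 * are written unchanged in both cases; only the flag differs.
 */
static unsigned zip_name_flags(const struct raw_name *n)
{
    if (n->enc == NAME_ENC_UTF8 && is_valid_utf8(n->bytes, n->len))
        return ZIP_FLAG_UTF8;
    return 0;  /* unknown or malformed: "just dump the bytes" */
}
```

The point of the sketch is that no lossy translation happens in the reader: the writer makes one late decision based on what the entry can cheaply provide.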
Please find a summary below. A more detailed description is provided here.
Actually I'm not sure if this is a Windows issue or a libarchive issue, because the issue does not occur with bsdtar from libarchive-tools on Ubuntu.
bsdtar 3.5.2 - libarchive 3.5.2
included in Windows 10
Windows 10 Home, Version 22H2, OS build 19045.4170
n/a
n/a
n/a
Tried to make a tar archive with a non-Latin character in the archive path, e.g.
tar -c -f Ā.tar foo
Expected this to create a file called Ā.tar (note Unicode code point U+0100).
A file called A.tar was created (note Unicode code point U+0041).
none
Run the above tar command on Windows 10, using cmd, PowerShell, or Git Bash (calling c:\Windows\System32\tar.exe explicitly, because Git Bash comes bundled with GNU tar).
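One way to narrow down where the Ā becomes an A is to look at the exact bytes a program receives for its argument. The helper below is a hypothetical diagnostic written for this purpose, not part of bsdtar or libarchive; it just renders a narrow string as hex. If a small test program calling it on `argv[1]` already shows `41` instead of the UTF-8 bytes `C4 80`, that would suggest the name was narrowed before the program's own code ever saw it, though that is an assumption to verify rather than a confirmed diagnosis.

```c
#include <stdio.h>
#include <string.h>

/*
 * Render the bytes of s as space-separated uppercase hex into out.
 * Returns 0 on success, -1 if out is too small.
 */
static int hex_bytes(const char *s, char *out, size_t outsz)
{
    size_t n = strlen(s), o = 0, i;

    if (outsz == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        int w = snprintf(out + o, outsz - o, i ? " %02X" : "%02X",
                         (unsigned)(unsigned char)s[i]);
        if (w < 0 || (size_t)w >= outsz - o)
            return -1;  /* output was truncated */
        o += (size_t)w;
    }
    return 0;
}
```

A test program would call `hex_bytes(argv[1], buf, sizeof buf)` from `main` and print `buf`, then be run with `Ā.tar` as its argument from the same shells listed above.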
Any file, e.g. an empty file.
n/a