
bsdtar included with Windows does not handle paths with special characters #2092

Open
dennisvang opened this issue Mar 15, 2024 · 10 comments

dennisvang commented Mar 15, 2024

Please find a summary below. A more detailed description is provided here.

I'm actually not sure whether this is a Windows issue or a libarchive issue, because the problem does not occur with bsdtar from libarchive-tools on Ubuntu.

Basic Information
Version of libarchive:

bsdtar 3.5.2 - libarchive 3.5.2

How you obtained it:

included in Windows 10

Operating system and version:

Windows 10 Home, Version 22H2, OS build 19045.4170

What compiler and/or IDE you are using (include version):

n/a

If you are using a pre-packaged binary
Exact package name and version:

n/a

Repository you obtained it from:

n/a

Description of the problem you are seeing:
What did you do?

Tried to make a tar archive with a non-Latin character in the archive path, e.g. tar -c -f Ā.tar foo

What did you expect to happen?

Expected this to create a file called Ā.tar (note: Unicode code point U+0100).

What actually happened?

A file called A.tar was created instead (note: Unicode code point U+0041).

What log files or error messages were produced?

none

How the libarchive developers can reproduce your problem:
What other software was involved?

Run the above tar command on Windows 10 from cmd, PowerShell, or Git Bash (calling C:\Windows\System32\tar.exe explicitly in the latter case, because Git Bash bundles GNU tar).

What other files were involved?

Any file, e.g. an empty file.

How can we obtain any of the above?

n/a


dunhor commented Mar 15, 2024

@DHowett as FYI

Note that this only seems to be true with arguments from the command line. If you specify a directory that contains a filename with Unicode characters, that seems to work correctly. This might be as simple as updating to wmain for Windows and fixing whatever fallout comes from that.
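
A rough, untested sketch of what I mean: receive the arguments as UTF-16 and convert them up front. real_main here is just a stand-in for the existing char*-based entry point, and the rest of the program would then have to treat char* strings as UTF-8 rather than ANSI (the "fallout" mentioned above):

```c
#include <windows.h>
#include <stdlib.h>

extern int real_main(int argc, char *argv[]); /* stand-in for the existing entry point */

static char *utf16_to_utf8(const wchar_t *ws)
{
    /* First call sizes the buffer, second performs the conversion. */
    int n = WideCharToMultiByte(CP_UTF8, 0, ws, -1, NULL, 0, NULL, NULL);
    char *s = malloc(n);
    if (s != NULL)
        WideCharToMultiByte(CP_UTF8, 0, ws, -1, s, n, NULL, NULL);
    return s;
}

int wmain(int argc, wchar_t *wargv[])
{
    char **argv = calloc(argc + 1, sizeof(char *));
    if (argv == NULL)
        return 1;
    for (int i = 0; i < argc; i++)
        argv[i] = utf16_to_utf8(wargv[i]);
    return real_main(argc, argv);
}
```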

dennisvang commented

> ... If you specify a directory that contains a filename with Unicode characters, that seems to work correctly. ...

@dunhor It does on my system too, although tar's command-line output suggests otherwise.

For example, here's what I see in PowerShell 7 (note how the Δ becomes ? in tar's output):

```
PowerShell 7.4.1
PS C:\Users\Dennis\temp> chcp
Active code page: 437
PS C:\Users\Dennis\temp> mkdir content
...
PS C:\Users\Dennis\temp> new-item content\Δ.txt

    Directory: C:\Users\Dennis\temp\content

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---          18/03/2024    09:35              0 Δ.txt

PS C:\Users\Dennis\temp> tar -cvf my.tar -C content *
a ?.txt
PS C:\Users\Dennis\temp> tar -tf my.tar
?.txt
PS C:\Users\Dennis\temp> tar -xvf my.tar
x ?.txt
PS C:\Users\Dennis\temp> dir

    Directory: C:\Users\Dennis\temp

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----          18/03/2024    09:35                content
-a---          18/03/2024    09:35           2560 my.tar
-a---          18/03/2024    09:35              0 Δ.txt
```


dunhor commented Mar 18, 2024

@dennisvang I'm guessing that's likely because it's trying to print the strings as char* and replacing any character > 127 with ?. I'm guessing that a call to setlocale with en_US.utf-8 would yield proper string conversions; however, you'd likely then run into conflicts with the console's code page and get garbage output. Running chcp 65001 would fix the garbage output, but that's probably not a very intuitive thing to ask of users. I'm also not sure how guaranteed en_US.utf-8 is to succeed.

Changing the output path to use WCS strings for console output would also work; however, I'm not sure how invasive such a change might be. I'm also not 100% sure that the output would appear correct without something like a chcp call, but @DHowett is the console expert and would probably have a better idea.
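
For example, something along these lines (untested sketch). The GetConsoleMode check is needed because WriteConsoleW only works on a real console, so redirected output still needs a byte-oriented fallback:

```c
#include <windows.h>
#include <wchar.h>

/* Write through the wide console API, which bypasses the console
 * code page entirely. */
static void print_wide(const wchar_t *ws)
{
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode, written;

    if (GetConsoleMode(h, &mode)) {
        /* Real console: characters display correctly without chcp. */
        WriteConsoleW(h, ws, (DWORD)wcslen(ws), &written, NULL);
    } else {
        /* Redirected to a file or pipe: emit UTF-8 bytes instead. */
        char buf[4096];
        int n = WideCharToMultiByte(CP_UTF8, 0, ws, -1,
                                    buf, sizeof(buf), NULL, NULL);
        if (n > 1)
            WriteFile(h, buf, (DWORD)(n - 1), &written, NULL);
    }
}
```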

BurningEnlightenment commented

@dunhor I've recently fixed a similar issue for a base64 CLI by adding an application manifest that forces the active code page to UTF-8 (aklomp/base64#139). Note that this approach only takes effect on Windows 10 version 1903 and later.
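
For reference, the relevant fragment of such a manifest looks roughly like this (the activeCodePage setting is documented by Microsoft; it still has to be embedded with the usual manifest tooling):

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <!-- Forces GetACP() to 65001 (UTF-8) on Windows 10 1903+ -->
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```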


dunhor commented Apr 3, 2024

> @dunhor I've recently fixed a similar issue for a base64 CLI by adding an application manifest that forces the active code page to UTF-8 (aklomp/base64#139). Note that this approach only takes effect on Windows 10 version 1903 and later.

I thought about that as well; however, libarchive uses setlocale to determine the active code page of the process, which AFAIK will always default to "C" and therefore try to convert all char* strings to ANSI, regardless of the process's ACP. Unless the manifest also affects the CRT, which I don't think it does, but I could easily be wrong 🙂. Ideally, all code paths on Windows would use wchar_t* strings throughout - which is more efficient anyway - however I'm not sure how invasive such a change might be.
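
An easy way to check would be something like this (sketch). If the second line doesn't report a UTF-8 code page when the manifest is present, then the CRT isn't following the ACP override:

```c
#include <windows.h>
#include <locale.h>
#include <stdio.h>

/* Compare the Win32 ACP with whatever locale the CRT selects. */
int main(void)
{
    printf("GetACP()  = %u\n", GetACP());               /* 65001 with the manifest */
    printf("setlocale = %s\n", setlocale(LC_ALL, ""));  /* does the CRT follow? */
    return 0;
}
```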

Side note: there's an additional related problem of how libarchive interprets filenames in the archives themselves when the encoding is not explicitly specified by either the archive or the format. It's probably too breaking a change to default to always setting hdrcharset=UTF-8; however, it might be reasonable to add a command-line argument for it (IIRC one did not exist last time I checked).
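
For completeness, at the library level the reader already accepts an hdrcharset option through archive_read_set_options; whether bsdtar should expose that on the command line is the open question. A minimal sketch (error handling omitted):

```c
#include <archive.h>

/* Ask the reader to interpret unlabeled filenames in the archive as UTF-8. */
static struct archive *open_as_utf8(const char *path)
{
    struct archive *a = archive_read_new();
    archive_read_support_filter_all(a);
    archive_read_support_format_all(a);
    archive_read_set_options(a, "hdrcharset=UTF-8");
    archive_read_open_filename(a, path, 10240);
    return a;
}
```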


kientzle commented Apr 4, 2024

> Side note: there's an additional related problem of how libarchive interprets filenames in the archives themselves when the encoding is not explicitly specified by either the archive or the format

Libarchive arguably does more than it should in these cases. For most formats, the filenames in the archive are either:

  • UTF-8. For example, Zip and Pax can store UTF-8 filenames.
  • Sequence of non-null bytes. Tar, cpio, zip, and many others typically store the bytes of the filename without interpretation. Generally, this means the reader of such an archive has literally no information at all about what character set the filenames might use or indeed whether they use any character set at all. (POSIX literally does not require a filename to be in any character set at all.)

(Yes, there are a few exceptions to the above; most notably some Windows-originated formats that use UTF-16 filenames.)

For libarchive 4, I really want to explore having libarchive's readers perform no filename translation at all. That would defer any attempt to translate filenames to the point where they are written -- either to disk or to another archive. The question is how much this would complicate life for users of libarchive. I have a few ideas and hope someday soon to have enough time to start working through them.


dunhor commented Apr 4, 2024

> Libarchive arguably does more than it should in these cases

Sorry, I worded that statement poorly 🫣. I meant that it's a problem of ambiguity, not a problem that libarchive has introduced. When the format or the archive itself does not specify how names are encoded, heuristics will always be necessary, particularly when the machine/OS that created the archive may not be the one reading it. I actually think the facilities libarchive provides are pretty good in this case (though I do wish it didn't use the locale to convert strings to MBS, but that's easily solvable, at least on Windows, by always using the _w functions).

> The question is how much this would complicate life for users of libarchive

From my perspective, that would depend a lot on how this is accomplished. For example, if names are exposed as something like a void*, byte count, and some encoding enum (e.g. "UTF-8", "UTF-16", "unspecified stream of bytes", etc.), that would complicate consumers, but would ultimately work out fine I'd think. It's easy enough to wrap filename handling/conversion ourselves. If names are always exposed as char* then I'd be curious how UTF-16 strings are handled. E.g. if they are always converted using locale with no way to get the original string, that would be problematic, at least for us.


kientzle commented Apr 4, 2024

What I had in mind was something along the following lines:

  • When libarchive reads a filename from an archive, it stores those exact bytes in the archive_entry struct along with a length and some indicator of how the name is encoded. (E.g., POSIX byte sequence, UTF8, UTF16, RAR-variant UTF16, Windows code page, etc.) Q: should any massaging be done at all? For example, should this proactively translate UTF16 to native byte order?
  • Low-level convenience APIs can request the filename in various formats. Those APIs do only basic conversions (e.g., UTF8 <-> UTF16) and/or validations and are all failable. For example, you can request UTF8, and UTF16 will be converted, a byte sequence will be returned if it validates properly, etc.
  • Higher-level APIs might use iconv-based conversions. Note that I don't want to do anything non-trivial without that being clear in the API. If I could go back 20 years 😁 I would probably not implement such conversions at all, leaving that to the consumer. But to help current users, we'd probably need to expose something that is clearly documented to perform heavyweight conversions. I'm not clear whether we can expose such APIs in a way that they never fail. That's certainly desirable for some basic use cases.

Hmmm..... writing all the above out, it seems doable. But it does look like it would end up exposing a lot of different filename APIs, which is certainly awkward. And sophisticated consumers will pretty much always have to deal with the filename issue -- Windows clients can't always be given UTF16 without risk that the result is simply mangled.
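
To make the above concrete, here's a purely hypothetical sketch of the representation; none of these names exist in libarchive today:

```c
#include <stddef.h>

/* A filename is carried as the exact bytes read from the archive,
 * tagged with how the format says (or doesn't say) they were encoded. */
enum archive_name_encoding {
    AE_NAME_BYTES,      /* POSIX-style uninterpreted byte sequence */
    AE_NAME_UTF8,
    AE_NAME_UTF16,      /* possibly normalized to native byte order */
    AE_NAME_UTF16_RAR,  /* RAR-variant UTF16 */
    AE_NAME_WINDOWS_CP  /* a Windows code page recorded by the format */
};

struct archive_name {
    const void *bytes;              /* exact bytes from the archive */
    size_t length;                  /* in bytes; no terminator implied */
    enum archive_name_encoding encoding;
};

/* Failable low-level conversion: returns 0 on success, nonzero if the
 * name can't be represented in the requested encoding (e.g. a raw byte
 * sequence that doesn't validate as UTF-8). The caller frees *out. */
int archive_name_convert(const struct archive_name *name,
                         enum archive_name_encoding want,
                         void **out, size_t *out_length);
```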


dunhor commented Apr 5, 2024

I think that's pretty much in line with what I would be expecting.

> Q: should any massaging be done at all? For example, should this proactively translate UTF16 to native byte order?

I think non-Windows - and even some Windows - folks would like that; however, at least on Windows that's an operation that could fail (and, at least for UTF-16, a conversion that many folks may not want). So as long as it's an optional conversion that ideally wouldn't happen unless asked for, I think it's fine. This could be either opt-in or opt-out.

I also like the idea of having helper conversion functions. That would also help folks upgrading across the major version boundary. I'm not sure it necessarily needs to blow up the number of filename APIs if they are standalone helpers, e.g. path/symlink/hardlink/etc. could all return the raw bytes + indicator, which can then be passed to a conversion function. IMO, having these be failable is fine. That's just a natural consequence of string conversions, and developers should be prepared for that possibility.

> Windows clients can't always be given UTF16 without risk that the result is simply mangled.

Paths on Windows will, at some point, need to be translated to UTF-16, whether that's done by the application or by the system. Some paths being non-representable on Windows is just a fact of life that applications will need to account for. With the current state of the library, always interfacing with libarchive using UTF-16 strings is the most reliable approach, especially when you don't have control over the process locale. For the future, as I've been suggesting, I'm okay with names being forwarded on "as they appear" in the archive, so I don't think it's necessary to provide Windows consumers with a way to always get strings as UTF-16 under that model. That's actually probably better in some ways, as more advanced applications can choose how to handle such situations (e.g. replacing invalid characters with placeholders, etc.).
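
For illustration, the _w variants that exist today already allow staying in UTF-16 end to end when creating an archive. A minimal sketch with error handling omitted; archive_write_open_filename_w and archive_entry_copy_pathname_w are existing libarchive functions:

```c
#include <archive.h>
#include <archive_entry.h>

static void write_one_entry(void)
{
    static const char data[] = "hello";
    struct archive *a = archive_write_new();
    struct archive_entry *e = archive_entry_new();

    archive_write_set_format_pax_restricted(a);
    archive_write_open_filename_w(a, L"\u0100.tar");  /* Ā.tar */

    archive_entry_copy_pathname_w(e, L"\u0394.txt");  /* Δ.txt */
    archive_entry_set_filetype(e, AE_IFREG);
    archive_entry_set_perm(e, 0644);
    archive_entry_set_size(e, sizeof(data) - 1);
    archive_write_header(a, e);
    archive_write_data(a, data, sizeof(data) - 1);

    archive_entry_free(e);
    archive_write_close(a);
    archive_write_free(a);
}
```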


The one thing that sounds maybe tricky to me is how the archive creation functions work. Would they be permissive and accept strings in many forms? E.g. could I call archive_entry_set_pathname and pass in a pointer with a descriptor saying that it's a Windows MBS string encoded using CP_XXX? And if I did so, would the library do necessary translations for me (e.g. if the archive format has a well-defined character encoding, or if it supports setting hdrcharset, etc.)? Or would the set of allowable encodings be limited? E.g. in the worst case I'd have to know or check what the format is expecting and convert strings to that format before calling any archive_entry_set* function. Or somewhere in between.

archive_read_disk/archive_write_disk may also be a little interesting, however I'd guess that their implementations would be forked based on target OS similar to how they are today, so ultimately maybe not that interesting.


kientzle commented Apr 5, 2024

Optional conversions are a challenge to handle well. We either need to make them modal -- where you say in advance what conversions you want and the library gains a lot of extra logic and storage to track those -- or we make them on-demand, which might incur extra allocations in order to preserve the original.

As with any software: More options means more paths to test and validate.

My idea here is to push any potentially-lossy translations as late as possible. So if you read from one archive and write to another, the reader would produce entries with (mostly) unmodified contents from the original archive and the writer would be responsible for using and/or translating as necessary. So I would expect a small family of archive_entry_set_pathname_utf8, etc, that would accept and store filenames, and a comparable family of functions that would provide the pathname back in other formats.

As you point out, there are conceivably cases where this requires complex transformations, and it's not yet obvious to me how to provide that in a way that balances accuracy and convenience.

Fortunately (??) most formats fall back on "just dump the bytes". So when writing Zip format, for example, we could ask the entry if it can easily provide UTF8 (which will be possible if the original name was UTF8 or UTF16) and if not we just ask for raw bytes and dump those out. The only cases where we potentially end up with truly arbitrary transformations would be when reading formats like CAB that store pathname encoding details while writing to disk on legacy POSIX systems that expect local filename encodings to match the user's locale. But that's a sufficiently borderline situation that I'm tempted to not try to be clever in such cases. I think that means that, practically speaking, we might only need UTF8 <-> UTF16 conversions in the core library and selective support for Unicode <-> arbitrary MBS when writing to disk. The former we could implement from scratch; the latter would require iconv or similar for full support.
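
Building on the hypothetical sketch earlier in the thread, the Zip writer's decision might look something like this (again, none of this is current libarchive code; types redeclared from the earlier sketch for completeness):

```c
#include <stddef.h>

/* Try the cheap lossless UTF-8 conversion first; otherwise dump the
 * raw bytes and leave Zip's "names are UTF-8" flag (general-purpose
 * bit 11) unset. */
enum archive_name_encoding { AE_NAME_BYTES, AE_NAME_UTF8 /* ... */ };
struct archive_name {
    const void *bytes;
    size_t length;
    enum archive_name_encoding encoding;
};
int archive_name_convert(const struct archive_name *name,
                         enum archive_name_encoding want,
                         void **out, size_t *out_length);

static void choose_zip_name(const struct archive_name *name,
                            const void **out, size_t *out_len, int *utf8_flag)
{
    void *converted;
    if (archive_name_convert(name, AE_NAME_UTF8, &converted, out_len) == 0) {
        *out = converted;       /* representable: advertise UTF-8 */
        *utf8_flag = 1;
    } else {
        *out = name->bytes;     /* otherwise emit the bytes untouched */
        *out_len = name->length;
        *utf8_flag = 0;
    }
}
```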

This would be a big project -- I don't have enough time to even do the initial explorations for it right now. But it is the one piece of libarchive's API that I'm actively dissatisfied with, which is why I've been thinking about alternatives for a while now, hoping to find time someday to actually implement it.
