New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Windows] Fix a bunch of Unicode related failures #2095
base: master
Are you sure you want to change the base?
Conversation
archive_entry_copy_symlink(entry, | ||
(const char *)symname); | ||
|
||
/* Symbolic links are embedded as UTF-8 strings */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finding good documentation on this is tricky. AFAICT the "official" documentation is from the LZMA SDK, however I didn't see any references to symbolic links there. I found different documentation here which states that link paths are UTF-8 encoded, however IDK how trustworthy that source is. That said, the 7-Zip application on Windows and WinRAR all behave correctly when this is a UTF-8 encoded string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably UTF-8 encoded for archives created on Windows, but I'd be curious about POSIX platforms that don't use Unicode filename encodings. (In POSIX, a filename is a sequences of non-zero bytes, with no guarantee that the sequence can be interpreted as characters at all. That can make things ... interesting ... for any code that assumes Unicode semantics.) I'm not sure it's possible to get this completely correct given libarchive's current architecture, though. I'm sure this is an improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair. I can give these the ifdef
treatment if you think that would be best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm honestly not sure what would be best here. The state of the art right now is pretty awkward: Major platforms (Windows, macOS) have standardized on Unicode filenames, Linux is moving that way, but there's still a sizable base of POSIX systems for which filenames can not be reliably handled as text. So different archive formats are going to do different things depending on what platforms they are mostly targeting. And frankly, libarchive's current system of proactive character set conversions is far from ideal. I have some vague ideas how it might be improved, but nothing concrete yet.
So I would go ahead with what you have here, but be prepared to reconsider some aspects as we get feedback from POSIX folks.
@kinentzle maybe we could finally go towards enabling MSVC tests with this patch |
… archive_read here...)
I'm looking forward to have the compiler fixes (for Visual Studio Express 2022 on Windows 11 amd64 to be very specific) merged in, but there's a lot going on in this PR... @dunhor, would you mind if I (or we) split this PR into multiple ones with specific topics covered, i.e. at least one which is solely there to make libarchive compile without errors with Visual Studio 2022? |
I wouldn't mind at all. I was actually contemplating breaking this PR into several separate changes - that being one of them - but I'm currently on paternity leave at the moment and haven't really had the time yet. |
This mostly fixes issues with archive creation, typically where improper accesses assume that MBS strings are valid/correct. I've added tests for every bug I've found in every format that we care about at the moment. There may be additional bugs lingering elsewhere, however I don't currently have the means for testing such scenarios out easily at the moment.
Note that this also includes the changes for #2091 because I wanted to be extra sure I wasn't breaking anything.
This also fixes a few inefficiencies where it's assumed that UTF8 <-> MBS conversions are direct/most efficient, which is not the case on Windows. On Windows, WCS strings are needed as intermediate representations, so it's wasteful to throw them away, especially if we'll need them later (or for the immediate operation).
I also updated the CMake files & some sources to allow compiling as Debug on Windows as well as to compile with Clang on Windows in "MSVC mode" (i.e.
clang-cl
)