[Windows] Fix a bunch of Unicode related failures #2095

dunhor · 2024-03-19T17:23:23Z

This mostly fixes issues with archive creation, typically where code accesses an MBS string on the assumption that it is valid/correct. I've added tests for every bug I've found in every format that we care about at the moment. There may be additional bugs lingering elsewhere, but I don't currently have an easy way to test those scenarios.

Note that this also includes the changes for #2091 because I wanted to be extra sure I wasn't breaking anything.

This also fixes a few inefficiencies where it's assumed that UTF8 <-> MBS conversions are direct/most efficient, which is not the case on Windows. On Windows, WCS strings are needed as intermediate representations, so it's wasteful to throw them away, especially if we'll need them later (or for the immediate operation).
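To illustrate the caching point, here is a minimal sketch in portable C. It is not the code in this PR and not libarchive's `archive_mstring`: `struct cached_name`, `utf8_to_wcs`, and `cached_name_wcs` are hypothetical names, and the decoder is deliberately simplified (1- to 3-byte sequences only, no validation of continuation bytes). The idea is only that once the wide form has been produced, later callers reuse it instead of converting again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical structure for illustration only. */
struct cached_name {
	const char *utf8;	/* borrowed UTF-8 input */
	wchar_t *wcs;		/* cached wide form, NULL until first requested */
	int conversions;	/* number of UTF-8 -> WCS conversions attempted */
};

/*
 * Simplified UTF-8 -> WCS decoder (assumption: input is limited to
 * 1- to 3-byte sequences; continuation bytes are not validated).
 * Returns a malloc'd string, or NULL on failure.
 */
static wchar_t *
utf8_to_wcs(const char *s)
{
	size_t n = strlen(s), i = 0, o = 0;
	wchar_t *w = malloc((n + 1) * sizeof(*w));

	if (w == NULL)
		return NULL;
	while (i < n) {
		unsigned char c = (unsigned char)s[i];

		if (c < 0x80) {
			w[o++] = (wchar_t)c;
			i += 1;
		} else if ((c & 0xE0) == 0xC0 && i + 1 < n) {
			w[o++] = (wchar_t)(((c & 0x1F) << 6) |
			    ((unsigned char)s[i + 1] & 0x3F));
			i += 2;
		} else if ((c & 0xF0) == 0xE0 && i + 2 < n) {
			w[o++] = (wchar_t)(((c & 0x0F) << 12) |
			    (((unsigned char)s[i + 1] & 0x3F) << 6) |
			    ((unsigned char)s[i + 2] & 0x3F));
			i += 3;
		} else {
			/* Unsupported or truncated sequence. */
			free(w);
			return NULL;
		}
	}
	w[o] = L'\0';
	return w;
}

/* Return the wide form, converting only on the first request. */
static const wchar_t *
cached_name_wcs(struct cached_name *cn)
{
	assert(cn != NULL && cn->utf8 != NULL);
	if (cn->wcs == NULL) {
		cn->wcs = utf8_to_wcs(cn->utf8);
		cn->conversions++;
	}
	return cn->wcs;
}
```

Calling `cached_name_wcs()` twice on the same object performs one conversion; the caller frees `wcs` when the name is discarded. On Windows the conversion itself would go through the platform's own routines rather than a hand-written decoder.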

I also updated the CMake files & some sources so that libarchive compiles as Debug on Windows and also compiles with Clang on Windows in "MSVC mode" (i.e. clang-cl).

libarchive/archive_entry_link_resolver.c

dunhor · 2024-03-19T19:57:49Z

libarchive/archive_read_support_format_7zip.c

-			archive_entry_copy_symlink(entry,
-			    (const char *)symname);
+
+			/* Symbolic links are embedded as UTF-8 strings */


Finding good documentation on this is tricky. AFAICT the "official" documentation is from the LZMA SDK, however I didn't see any references to symbolic links there. I found different documentation here which states that link paths are UTF-8 encoded, however IDK how trustworthy that source is. That said, the 7-Zip application on Windows and WinRAR both behave correctly when this is a UTF-8 encoded string.

This is probably UTF-8 encoded for archives created on Windows, but I'd be curious about POSIX platforms that don't use Unicode filename encodings. (In POSIX, a filename is a sequence of non-zero bytes, with no guarantee that the sequence can be interpreted as characters at all. That can make things ... interesting ... for any code that assumes Unicode semantics.) I'm not sure it's possible to get this completely correct given libarchive's current architecture, though. I'm sure this is an improvement.
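To make the concern concrete, here is a small sketch of a well-formedness check that a reader could, in principle, run on a stored link target before treating it as UTF-8. `is_well_formed_utf8` is a hypothetical helper, not a libarchive function, and nothing in this PR is claimed to work this way. It rejects truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of libarchive): returns 1 if the first
 * `len` bytes of `s` are well-formed UTF-8, 0 otherwise.
 */
static int
is_well_formed_utf8(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		size_t need, k;		/* continuation bytes expected */
		unsigned long cp, min;	/* code point, smallest legal value */

		if (c < 0x80) {
			i++;
			continue;
		} else if ((c & 0xE0) == 0xC0) {
			need = 1; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			need = 2; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			need = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0;	/* stray continuation or invalid lead byte */
		}

		if (len - i - 1 < need)
			return 0;	/* sequence is truncated */
		for (k = 1; k <= need; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}
		if (cp < min)
			return 0;	/* overlong encoding */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* UTF-16 surrogate */
		if (cp > 0x10FFFF)
			return 0;	/* beyond the Unicode range */
		i += need + 1;
	}
	return 1;
}
```

A name such as `caf\xe9` (Latin-1 "café") fails this check, which is the kind of byte sequence a POSIX system with a non-Unicode locale could legitimately have stored. Passing the check does not prove the bytes were meant as UTF-8; it only rules out sequences that cannot be.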

Fair. I can give these the ifdef treatment if you think that would be best.

I'm honestly not sure what would be best here. The state of the art right now is pretty awkward: Major platforms (Windows, macOS) have standardized on Unicode filenames, Linux is moving that way, but there's still a sizable base of POSIX systems for which filenames cannot be reliably handled as text. So different archive formats are going to do different things depending on what platforms they are mostly targeting. And frankly, libarchive's current system of proactive character set conversions is far from ideal. I have some vague ideas how it might be improved, but nothing concrete yet.

So I would go ahead with what you have here, but be prepared to reconsider some aspects as we get feedback from POSIX folks.

libarchive/archive_write_set_format_pax.c

libarchive/archive_write_set_format_gnutar.c

libarchive/archive_write_set_format_pax.c

mmatuska · 2024-04-23T06:42:18Z

@kientzle maybe we could finally go towards enabling MSVC tests with this patch

mmatuska · 2024-04-23T22:30:51Z

@dunhor could you please rebase? "test_read_format_rar5_unicode" conflicts with a test with the same name introduced in PR #1978 (aafb078)


dunhor · 2024-04-27T02:37:13Z

> @dunhor could you please rebase? "test_read_format_rar5_unicode" conflicts with a test with the same name introduced in PR #1978 (aafb078)

Whoops, didn't realize I used the same name. Resolved, and deleted the old test since the new one is more comprehensive and they otherwise test the same thing.

stoeckmann · 2024-05-08T20:02:44Z

I'm looking forward to having the compiler fixes (for Visual Studio Express 2022 on Windows 11 amd64 to be very specific) merged in, but there's a lot going on in this PR...

@dunhor, would you mind if I (or we) split this PR into multiple ones with specific topics covered, i.e. at least one which is solely there to make libarchive compile without errors with Visual Studio 2022?

dunhor · 2024-05-10T05:41:38Z

> I'm looking forward to having the compiler fixes (for Visual Studio Express 2022 on Windows 11 amd64 to be very specific) merged in, but there's a lot going on in this PR...
>
> @dunhor, would you mind if I (or we) split this PR into multiple ones with specific topics covered, i.e. at least one which is solely there to make libarchive compile without errors with Visual Studio 2022?

I wouldn't mind at all. I was actually contemplating breaking this PR into several separate changes - that being one of them - but I'm on paternity leave at the moment and haven't really had the time yet.


Fixes all test-related compiler warnings with Visual Studio 2022 on Windows 11. Contains some changes from #2095. CC: @dunhor --------- Co-authored-by: Duncan Horn <dunhor@microsoft.com>

dunhor commented Mar 19, 2024

libarchive/archive_entry_link_resolver.c

dunhor commented Mar 19, 2024

libarchive/archive_write_set_format_pax.c (outdated)

kientzle reviewed Mar 20, 2024

libarchive/archive_write_set_format_gnutar.c (outdated)

dunhor commented Mar 20, 2024

libarchive/archive_write_set_format_pax.c

mmatuska requested a review from kientzle April 23, 2024 06:41

dunhor added 16 commits April 26, 2024 19:00

Fix compilation with Clang and when compiling as Debug

b6b7fda

Fix various archive creation failures on Windows with Unicode names

2c6f1a7

Change 7zip test to use archive reader instead of inspecting memory &…

2dbde6a

… fixing

Fix & test link_resolver

8b54397

Fix PAX link encoding and tests

97a962b

Unicode fix reading RAR files

480904c

Use the macro

db169e1

Revamp and test archive_mstring_update_utf8

97f2f81

Unused var

b02d90f

Update C locale test to be Windows only

877515c

Update makefile.am to see if this is why things are breaking

dc340ed

Apparently I don't know how WCS strings behave

d164c07

Try and get a more useful error message

48e9625

Hopefully fix issue and add hardlink_is_set function

af67faf

I think tab size is supposed to be 8 for this file

5a25f79

Update zip test after merge with master (though perhaps we should use…

b7d6625

… archive_read here...)

dunhor force-pushed the unicode-tests branch from 0852380 to b7d6625 Compare April 27, 2024 02:03

De-duplicate test and fix compilation errors

8cd66f9

dunhor added 3 commits April 28, 2024 23:41

Test to see if setting locale gets entry names to be non-null

bf768cc

Missing include

c25b463

Always forget the makefile...

e630b07

stoeckmann mentioned this pull request May 10, 2024

[Windows] Fix test compilation warnings with Visual Studio #2178

Merged
