Fix wchar array encoding in structs #1348

Emperor-RE · 2023-05-16T13:05:44Z

Checklist

Which kind of PR do you create?

This PR only contains minor fixes.
This PR contains major feature update.
This PR introduces a new function/api for Qiling Framework.

Coding convention?

The new code conforms to Qiling Framework naming convention.
The imports are arranged properly.
Essential comments are added.
The reference of the new code is pointed out.

Extra tests?

No extra tests are needed for this PR.
I have added enough tests for this PR.
[] Tests will be added after some discussion and review.

Changelog?

This PR doesn't need to update Changelog.
Changelog will be updated after some proper review.
Changelog has been updated in my PR.

Target branch?

The target branch is dev branch.

One last thing

I have read the contribution guide

Structs that use the ctypes.c_wchar_array type were being double encoded, causing the bytes to be padded with '\x00\x00\x00' instead of just '\x00'. I added a few lines to struct.py that fix this.
Example code that shows the problem by reading KUSER_SHARED_DATA->NtSystemRoot:

from qiling import *

ql = Qiling(["x86_windows/bin/x86_hello.exe"], "x86_windows")
ql.log.info(f"NtSystemRoot bytes: {ql.mem.read(0x7ffe0030, 0x20).hex()}")
ql.log.info(f"NtSystemRoot wstring: {ql.os.utils.read_wstring(0x7ffe0030)}")

elicn · 2023-05-17T11:05:34Z

Thanks for reporting that and taking the time to write a patch.

As the one who wrote that code, I was a bit surprised I haven't taken this scenario into account and checked that myelf. Your repro script showed it quite well, but testing it on another environment actually worked well. After more digging, it turned out that this is a misbehavior of Python 3.8 where Python 3.9 works fine the way it is.

Is it possible you were testing this using Python 3.8?
If so, we'll need to come up with a workaround approach that could be easily removed when we proceed to a newer Python version.

Python 3.8:

Same script executed on Python 3.9 (Win version, but I don't think it matters):

A few more experiments showed that ctypes.create_unicode_buffer behaves differently on those versions. I guess this is related.

ubuff = ctypes.create_unicode_buffer("hello", 8)
print(bytes(ubuff).hex(" "))

Emperor-RE · 2023-05-17T11:58:28Z

Thanks for the quick reaction!
I am running Python version 3.10.8 on MacOS, with that version it produces the results mentioned in my pull request.
It might also be caused by the operating system, could you check if it produces the same results for you on version 3.10 on Windows or Linux?

Emperor-RE · 2023-05-17T12:36:57Z

As a follow-up, this problem also seems to be ocurring with struct.ref assignments, example of my own project where the same problem occurs (in assigning fdata_obj.cFileName), even after my patch:

fdata_struct = make_win32_find_data(ql.arch.bits, wide=True)
with fdata_struct.ref(ql.mem, pointer) as fdata_obj:
    fdata_obj.cFileName          = filename

I'll await your reply to see if this also needs its own patch or if we need to find another solution with regards to the python versions.

elicn · 2023-05-17T12:38:01Z

I tested that on a Python 3.9 running on top of WSL, and it behaves like 3.8 (that is, use excessive padding) so it is not a matter of versions after all. I beleive that could be a result of a default encoding, where Python over POSIX set it to UTF-8 by default [means that any string you provide it is already implemented as a UTF-8 buffer under the hood, which gets re-padded by ctypes].

My Windows Python encoding is set to cp1252 (Western Europe), so it might be the reason it works OK there.
However, I tried forcing the default encoding on WSL and it did not change anything:

$ python3.9.exe -X utf8=1 :

>>> import locale ; locale.getpreferredencoding()
'UTF-8'
>>> import ctypes ; bytes(ctypes.create_unicode_buffer("hello", 8))
b'h\x00e\x00l\x00l\x00o\x00\x00\x00\x00\x00\x00\x00'
>>>

$ python3.9.exe -X utf8=0:

>>> import locale ; locale.getpreferredencoding()
'cp1252'
>>> import ctypes ; bytes(ctypes.create_unicode_buffer("hello", 8))
b'h\x00e\x00l\x00l\x00o\x00\x00\x00\x00\x00\x00\x00'

So I am a bit confued here.. I am pretty much convienced this is an encoding / locale issue, but I cannot figure out how to determine that in order to work around it.

Emperor-RE · 2023-05-17T12:41:44Z

Yeah i'm pretty sure it has to do with encoding of arrays. When i tested it with some dummy code that called bytes() seperately for each ctypes.c_wchar, that does result in proper encoding (with proper \x00 padding) Calling bytes() on a ctypes_c_wchar_array object results in double encoding (with \x00\x00\x00 padding).

Emperor-RE · 2023-05-17T17:41:33Z

I think it is dependent on if your python build is compiled using UCS2 or UCS4, can you check which type you are running?
I tested it in a windows VM which was compiled using UCS2, that had no issue with creating a unicode buffer, whilst a MacOS UCS4 build does produce the problem.
https://stackoverflow.com/questions/1446347/how-to-find-out-if-python-is-compiled-with-ucs-2-or-ucs-4

elicn · 2023-05-17T17:55:56Z

Both show a result of 1114111, but behave differently with regards to the c_wchar type.
When running:

ctypes.sizeof(ctypes.c_wchar)

One returns 4 ("extra padding"), while the other returns 2 ("normal padding"). I do believe this has something to do with the locale, but I can't figure it out.

Emperor-RE · 2023-05-17T19:09:07Z

I've spent quite some time following this up, but the most logical conclusion i can think of is that ctypes follows the "internal" python representation of the OS, and it is still related to Linux/Mac representing a wchar_t as 4 bytes while windows represents it as 2 bytes.

…Mac/Linux

elicn · 2023-05-18T10:21:34Z

No doubt about that. ctypes follows the hosting system specs for everything else, and this one is not different.
However, I designed Qiling's struct module in a way that one could get the structures layed out according to the emulated system, and not the hosting one. The question now is how to use c_wchar in a way that it can be guaranteed to be UCS2 (as expected in Windows structures) regardless of the host it is emualted on.

Emperor-RE · 2023-05-18T10:26:19Z

I just pushed a fix that adds a wrapper class to substitue c_wchar arrays and represent them as an array of c_ubytes under the hood. This should ensure that they are always handled as UCS2 regardless of the host os. I tested it and it passes the struct tests of qiling & it works with the binary i'm emulating 👍

…entation

elicn · 2023-05-21T17:28:15Z

Why did you add +2 for null terminator? I think you are confusing it with LPCWSTR or something similar.
Adding those extra bytes will change the layout of the structure, causing all following fields to be unaligned.

Emperor-RE · 2023-05-23T09:35:42Z

My bad, that indeed causes misalignment. For my use-case that didn't matter, but i'll change it to the proper length of bytes in a new commit.

I have a different question about some of the other windows struct types, could you elaborate on why you internally represent a lot of the string pointers types as the STRING type in qiling/os/windows/api.py? For example PANSI_STRING = STRING. I'm passing an executable through qiling that uses the LdrGetProcedureAddress API call, but the function hook for that does not work properly because it is trying to use a string instead of a pointer to an ANSI string (Code causing this). I have created a fix for it locally that dereferences the variable and fetches the correct struct values, but i'm assuming there is a reason why you did it that way, so i'd prefer to fix it using your intended approach.

elicn · 2023-05-23T10:42:31Z

Tagging an argument as STRING lets Qiling treat it as a pointer to a null-terminating ASCII string. Qiling safely dereferences the pointer and reads the string from memory, so the user doesn't have to do it every time a C string is passed in to a function.

However, ANSI_STRING and UNICODE_STRING are not ordinary ASCII and wide strings, they are structures. I did create a dedicated structure for UNICODE_STRING (see here) but not for ANSI_STRING. Mapping them both to STRING and WSTRING (respectively) must be a leftover from older code.

I will add this to my fix backlog [probably through OS resolver handlers].
Thanks for pointing this out.

Fix wchar array encoding in structs

3f02c07

elicn self-assigned this May 17, 2023

NoelNW added 2 commits May 18, 2023 12:05

Add wchar wrapper struct to fix UCS-4 encoding of unicode strings on …

cf82260

…Mac/Linux

Move Wchar array wrapper classes to Windows/structs.py

4b95742

NoelNW and others added 2 commits May 18, 2023 12:35

Fix bug that would not resolve value of c_char_array type correctly

6032813

Add null terminators to wchar wrapper classes as per Microsoft implem…

82d69d8

…entation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix wchar array encoding in structs #1348

Fix wchar array encoding in structs #1348

Emperor-RE commented May 16, 2023 •

edited

elicn commented May 17, 2023 •

edited

Emperor-RE commented May 17, 2023 •

edited

Emperor-RE commented May 17, 2023 •

edited

elicn commented May 17, 2023

Emperor-RE commented May 17, 2023 •

edited

Emperor-RE commented May 17, 2023 •

edited

elicn commented May 17, 2023 •

edited

Emperor-RE commented May 17, 2023 •

edited

elicn commented May 18, 2023

Emperor-RE commented May 18, 2023

elicn commented May 21, 2023

Emperor-RE commented May 23, 2023

elicn commented May 23, 2023

Fix wchar array encoding in structs #1348

Are you sure you want to change the base?

Fix wchar array encoding in structs #1348

Conversation

Emperor-RE commented May 16, 2023 • edited

Checklist

Which kind of PR do you create?

Coding convention?

Extra tests?

Changelog?

Target branch?

One last thing

elicn commented May 17, 2023 • edited

Emperor-RE commented May 17, 2023 • edited

Emperor-RE commented May 17, 2023 • edited

elicn commented May 17, 2023

Emperor-RE commented May 17, 2023 • edited

Emperor-RE commented May 17, 2023 • edited

elicn commented May 17, 2023 • edited

Emperor-RE commented May 17, 2023 • edited

elicn commented May 18, 2023

Emperor-RE commented May 18, 2023

elicn commented May 21, 2023

Emperor-RE commented May 23, 2023

elicn commented May 23, 2023

Emperor-RE commented May 16, 2023 •

edited

elicn commented May 17, 2023 •

edited

Emperor-RE commented May 17, 2023 •

edited

Emperor-RE commented May 17, 2023 •

edited

Emperor-RE commented May 17, 2023 •

edited

Emperor-RE commented May 17, 2023 •

edited

elicn commented May 17, 2023 •

edited

Emperor-RE commented May 17, 2023 •

edited