Added UTF8 tests #146

Krinkle · 2023-06-02T20:38:14Z

Upstreaming additional test cases written by @jbboehr. These were helpful in developing https://github.com/jbboehr/libmustache.

Original location: https://github.com/jbboehr/mustache-spec/commits/b96be9fd4c6d6984828d93169fe7e86d8a8aec2f

jgonggrijp

Thanks for contributing @Krinkle!

At first sight, these tests seem totally reasonable. Of course, a Mustache implementation (or any software, for that matter) should ideally support UTF-8 encoded text.

The more I think about it, however, the more I believe that these tests lack a clear rationale. Why exactly do these tests need to exist? What does it mean if an implementation does not pass them?

If an implementation fails, does that actually say anything about the implementation, or perhaps rather about the library that is responsible for reading the file from disk and/or for parsing the YAML/JSON?

Putting it differently: suppose that the standard library for my programming language is capable of reading UTF-8 encoded text from a file and loading it into a standard text datastructure for the programming language (that will be string, String or str in most cases). Furthermore, suppose that my Mustache implementation reads templates from that standard datastructure and that my implementation is already known to support ASCII-encoded templates. Is it even possible for my implementation to fail in that case? After all, UTF-8 is a superset of ASCII. I don't think the octet 0x7b can mean anything other than { in UTF-8. Likewise for } and all the sigils.

Has libmustache ever failed these tests, @jbboehr? If it did, what did you need to do in order to pass them?

agentgt · 2023-11-05T17:59:12Z

Is it even possible for my implementation to fail in that case? After all, UTF-8 is a superset of ASCII. I don't think the octet 0x7b can mean anything other than { in UTF-8. Likewise for } and all the sigils.

I don't think it needs to be in the spec or have tests but I believe this is possible in some languages. I'm not going to reference where I discuss this as I got rather embarrassingly unhinged but apparently if you write say a PHP lambda code in latin1 (what encoding the file is saved in is what the literals become) and then take a section that is UTF-8 because the template is and return a new template and there is a chance of corruption.

While the probability of that happening in 2023 is pretty darn low back in the 2000s when I started my career it happened often (albeit usually with XML...).

Also if an implementation naively assumes ASCII and parses as UTF-8 template you will get corruption but that case is pretty obvious (you can concat ASCII bytes to UTF-8 but you cannot split UTF-8 strings as though they were ASCII).

That is you cannot just naively parse byte by byte looking for the { (0x7b) as the second bytes in a multi-byte character very well could be 0x7b. (I think. I'm fairly sure that UTF-8 does not say multi-bytes after the first have to be 127 but maybe I'm mistaken).

I guess it is not possible. I thought there were fringe cases with BOM so yes I guess your safe parsing byte by byte.

However if someone sets a delimiter above 127 then there will be problems. Maybe that is something that could be specified. Delimiters must be below 127. But again probably not worth it.

Added UTF8 tests

c5724af

jgonggrijp reviewed Jun 3, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added UTF8 tests #146

Added UTF8 tests #146

Krinkle commented Jun 2, 2023

jgonggrijp left a comment

agentgt commented Nov 5, 2023 •

edited

Added UTF8 tests #146

Are you sure you want to change the base?

Added UTF8 tests #146

Conversation

Krinkle commented Jun 2, 2023

jgonggrijp left a comment

Choose a reason for hiding this comment

agentgt commented Nov 5, 2023 • edited

agentgt commented Nov 5, 2023 •

edited