Support UTF-16 as an additional encoding #136

Open · dcodeIO opened this issue Jun 15, 2021 · 9 comments

dcodeIO commented Jun 15, 2021

From WebAssembly/design#1419 (comment):

@lukewagner: I think it would make sense to talk about supporting UTF-16 as an additional encoding in the canonical ABI of string. But that's a whole separate topic with a few options, so I don't want to mix that up with the abstract string semantics which need to be understood first.

I am very interested in making this happen, as it would already be a considerable improvement for languages using a 16-bit Unicode representation. What I could imagine currently is having either separate instructions, an immediate (but then it may as well be separate instructions I guess) or a parameter. For example:

list.lift_utf8 [...]
list.lift_utf16 [...]

list.is_utf8 [...]
list.is_utf16 [...]

list.lower_utf8 [...]
list.lower_utf16 [...]

Is that what you had in mind? If not, I am of course very interested in the other options :)

It may also be worthwhile to consider list.lift_latin1, which corresponds to narrow UTF-16 (with the high zero bytes left out), as it is a common optimization strategy in UTF-16 languages (to save memory and better utilize the CPU cache when possible). I do not feel strongly about whether or not we need the latter in an MVP already, though.
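
To illustrate the "narrow UTF-16" point, here's a tiny C sketch (the helper name is just for illustration, not part of any proposal): every code point ≤ 0xFF is its own UTF-16 code unit, so inflating Latin-1 is just zero-extending each byte, and deflating is dropping the high byte when all high bytes are zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Latin-1 is "narrow UTF-16": every code point <= 0xFF maps to the
 * identical UTF-16 code unit, so inflation is a pure zero-extension. */
static void inflate_latin1_to_utf16(const uint8_t *src, size_t len, uint16_t *dst) {
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];  /* zero-extend: the high byte is always 0x00 */
}
```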

@lukewagner (Member)

First, a general comment on a recent realization of why it's a good idea to have >1 string encodings in the MVP's canonical ABI:

The current plan of record is to start with only the canonical ABI and add custom ABI support via adapter functions after that. Once we get to the latter, instructions like list.lift don't let us know the encoded length of the destination string up front, so the canonical ABI for string lowering can't simply malloc the destination buffer ahead of time. If we rule out multi-pass solutions (which might not be possible, e.g., when list.lifting from a forward iterator, and in general may lead to bugs that only arise in corner cases), that means we have to realloc the destination buffer while performing a single-pass iteration. To anticipate this future case, the current canonical ABI requires the core module to export realloc (which is used in place of malloc(sz) via realloc(NULL, sz)) so that the future length-unknown-upfront case can call realloc(old, newsz).

The worry, though, is that, since the actually-reallocating path isn't exercised by the MVP, it will be broken in practice. E.g., in the MVP, you could get away with void* realloc(void *old, size_t sz) { return malloc(sz); }, because old will always be NULL, and this bug would only become evident when the component was used with a custom adapter function. However, if the canonical ABI contains >1 string encodings, then even MVP components will exercise realloc when transcoding and so adapter functions won't be hitting new code paths.
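
To make that concrete, here is a hedged C sketch of the single-pass lowering path described above; encode_utf8, the growth policy, and error handling (omitted) are illustrative assumptions, not the actual canonical ABI code. The point is that transcoding with an unknown output length exercises the growing realloc(old, newsz) path that a malloc-only stub would break.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Minimal UTF-8 encoder for BMP code points; surrogate-pair handling is
 * elided for brevity in this sketch. */
static size_t encode_utf8(uint16_t cp, uint8_t out[4]) {
    if (cp < 0x80)  { out[0] = (uint8_t)cp; return 1; }
    if (cp < 0x800) { out[0] = (uint8_t)(0xC0 | (cp >> 6));
                      out[1] = (uint8_t)(0x80 | (cp & 0x3F)); return 2; }
    out[0] = (uint8_t)(0xE0 | (cp >> 12));
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
}

/* Single-pass UTF-16 -> UTF-8 lowering where the output length is only
 * discovered as we go: the buffer starts with an optimistic guess and
 * grows via realloc. A stub like `realloc(old, sz) { return malloc(sz); }`
 * would silently drop the already-written prefix on the first growth. */
uint8_t *lower_utf16_to_utf8(const uint16_t *src, size_t src_len, size_t *out_len) {
    size_t cap = src_len ? src_len : 1;  /* optimistic guess: mostly ASCII */
    uint8_t *buf = realloc(NULL, cap);   /* canonical ABI uses realloc(NULL, sz) as malloc */
    size_t n = 0;
    for (size_t i = 0; i < src_len; i++) {
        uint8_t tmp[4];
        size_t k = encode_utf8(src[i], tmp);
        while (n + k > cap) {
            cap *= 2;
            buf = realloc(buf, cap);     /* the path MVP transcoding would exercise */
        }
        memcpy(buf + n, tmp, k);
        n += k;
    }
    *out_len = n;
    return buf;
}
```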


Regarding the actual design of supporting multiple encodings: it's already the case that the canonical ABI is intended to be parameterized by linear-memory-vs-gc-memory. This WASI presentation slide gives a concrete example of the same Interface Types signature with the options memory=linear and memory=gc, and later slides show another hypothetical additional parameter for how async calls work. Thus, you could imagine a "how are strings encoded?" parameter with string=utf8 and string=utf16 as options.

Next, the canonical ABI shows up in two ways:

  1. In the MVP, it would show up as whole canonical adapter functions.
  2. When custom adapter functions are added, these parameters would show up as immediates on list.lift_canon and list.lower_canon (so, e.g., you could have list.lift_canon string=utf16 as an adapter instruction).

Thus, it's the same (parameterized) canonical ABI, just in two different contexts.

Lastly, there's the question of which encodings to actually support. If our goal here is to optimize UTF-16 languages, then I think it's important to realize that most production VMs use a dual UTF-16/Latin-1 representation (sometimes called compact strings). Moreover, the UTF-16/Latin-1 choice is made on a per-string basis, so even adding a latin1 option wouldn't be sufficient to avoid inflating/deflating in one of the cases. And while latin1 decodes just fine into a list of USVs, there's not a clear answer for how to encode a list of USVs into it (code points > 255 simply don't fit). But I think we can solve both at the same time with a compact-utf16 (actual name open for bikeshedding) option that says:

  • on the canonical lifting side, in addition to the ptr and length arguments, there would be a third compact bool i32 indicating whether ptr points to Latin-1 or UTF-16 bytes
  • on the canonical lowering side, the engine would attempt to allocate the incoming list-of-USVs as Latin-1, falling back (and reallocing, if necessary) to UTF-16 for code points > 255, passing the final (ptr length encoding) tuple to core wasm.

Thus, altogether, I think 3 canonical ABI parameter values make sense for string: utf8, utf16 and compact-utf16.
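
As a purely illustrative reading of the lowering rule above, here's a C sketch: the lowered_string type and field names are hypothetical, and the real canonical ABI would call the component's exported realloc rather than the host's.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical shape of the (ptr length encoding) tuple passed to core
 * wasm; the field names are illustrative, not spec. */
typedef struct {
    uint8_t *ptr;
    size_t   len;       /* number of code units written */
    int      is_utf16;  /* 0 = Latin-1, 1 = UTF-16 (the "compact" bool) */
} lowered_string;

/* Lower a list of USVs: start optimistically as Latin-1 and, on the first
 * code point > 0xFF, realloc to UTF-16 and widen the prefix in place. */
lowered_string lower_compact(const uint32_t *usvs, size_t n) {
    lowered_string s = { realloc(NULL, n ? n : 1), 0, 0 };
    for (size_t i = 0; i < n; i++) {
        if (!s.is_utf16 && usvs[i] > 0xFF) {
            s.ptr = realloc(s.ptr, n * 4);  /* worst case: 2 units/USV, 2 bytes/unit */
            uint16_t *w = (uint16_t *)s.ptr;
            for (size_t j = i; j-- > 0; )   /* widen backwards so bytes aren't clobbered */
                w[j] = s.ptr[j];
            s.is_utf16 = 1;
        }
        if (s.is_utf16) {
            uint16_t *w = (uint16_t *)s.ptr;
            uint32_t cp = usvs[i];
            if (cp >= 0x10000) {            /* supplementary plane: surrogate pair */
                cp -= 0x10000;
                w[s.len++] = (uint16_t)(0xD800 | (cp >> 10));
                w[s.len++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
            } else {
                w[s.len++] = (uint16_t)cp;
            }
        } else {
            s.ptr[s.len++] = (uint8_t)usvs[i];
        }
    }
    return s;
}
```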

It is worth asking whether all this additional complexity in the canonical ABI is worth it, and I've had different opinions on this over time, but I think the "test realloc" argument strongly motivates having >1, and if we're going to have >1, we might as well optimize the most common non-UTF-8 case. Also, unlike supporting multiple abstract string semantics, the additional complexity here could be pretty well encapsulated by the toolchain and ultimately not that much extra code. But I'd be interested to hear more thoughts on this.

conrad-watt commented Jun 15, 2021

First, +1 to UTF-16 support!

I am slightly worried that a WTF-16-encoded binary string might get inadvertently corrupted if toolchains default to passing it around via a UTF-16 interface type that performs silent replacement of unpaired surrogates. Does it make sense to have the canonical lift function explicitly trap if an unpaired surrogate is encountered, or would that be too unfriendly? I remember this was already discussed for UTF-8, but the balance seems slightly different here since WTF-16 binary strings are more of a thing.

I admit this is a fringe concern, since a careful toolchain setup can expose such strings as list u16, so if silent replacement is the direction of travel, I'd prefer UTF-16 with silent replacement to derailing the discussion.
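
For concreteness, here is a small C sketch of the check such a trapping lift would perform (purely illustrative; a silent-replacement lift would substitute U+FFFD instead of reporting):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scan WTF-16 code units and report whether an unpaired surrogate exists;
 * a trapping lift would trap on `true`, while silent replacement would
 * emit U+FFFD for each offending unit and continue. */
bool has_unpaired_surrogate(const uint16_t *units, size_t len) {
    for (size_t i = 0; i < len; i++) {
        uint16_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF) {   /* high (leading) surrogate */
            if (i + 1 == len || units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF)
                return true;                 /* not followed by a low surrogate */
            i++;                             /* skip the paired low surrogate */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return true;                     /* low surrogate with no preceding high */
        }
    }
    return false;
}
```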

EDIT: is the plan to allow (implicit?) conversion between at least utf16 and compact-utf16 when composing components using the canonical ABI?

dcodeIO (Author) commented Jun 15, 2021

+1 from me as well, sounds very good!

If a lossless alternative cannot find consensus, at AssemblyScript we would very likely fall back to canonical UTF-16 while documenting what can go wrong at Interface Types boundaries. We would prefer silent replacement so whole applications don't accidentally break. Compact strings are a nice addition that I could imagine exploring as well.

@lukewagner (Member)
I can definitely understand the desire to catch bugs, but implicit replacement of surrogates in WTF-16 content already appears to be the default experience when externalizing a WTF-16 string, especially on the Web, but also in a number of other cases I noticed while investigating other languages' transcoding paths.

@rossberg (Member)
A couple of comments:

  • "Canonical" to me implies that there is one representation (at least per storage kind, i.e., we can argue that something like a future memory attribute still makes sense). With the addition of utf16 modes, the central feature of the MVP simplification seems out the window.

  • VMs are not just using dual representations but something much more sophisticated. For example, V8's string representation includes ropes, slices, and other things. How would ropes be handled by this? Moreover, ropes in fact imply the possibility of heterogeneous encodings even within a single logical string. AFAICS, no amount of variation on utf16 modes is going to be able to handle any of that efficiently without copying, except in a few lucky cases.

  • In practice, the vast majority of strings in a JS heap are likely 8-bit encoded strings or ropes over 8-bit encoded fragments.

So I question that utf16 modes have the practical benefit that some folks here seem to assume. Their addition would be biased towards specific assumptions about JS implementations that do not reflect the common case. And as such, they only raise completely wrong expectations.

(I understand the realloc argument, but it seems a bit odd to argue for the addition of a feature B on the sole basis that it enforces debugging of a feature A. (I believe there is a phrase for this kind of rationale, but I can't remember it right now :) )

conrad-watt commented Jun 16, 2021

@rossberg I'd argue that the "canonical" nature of the string ABI comes from the fact that at the boundary between components, the string type is viewed as a list of USVs. As @lukewagner pointed out here, this is orthogonal to the question of what the "at rest" string representation is within a component, or how this representation is lifted/lowered to/from a list of USVs at the boundary. So I'd agree that adding a utf16 mode somewhat acts against "simplicity" as a design principle, but not that it compromises the canonical nature of the ABI.

I'm somewhat more lukewarm on compact-utf16 because, as you point out, it seems a little ad hoc. However, I think that the primary beneficiary of the compact-utf16 mode isn't a JS VM that's already doing something more complicated, but a runtime for a UTF-16 language targeting Wasm that wants to match its internal string representation to something that is transferable through the canonical ABI without significant re-encoding (AssemblyScript now, maybe Java tomorrow?).

dcodeIO (Author) commented Jun 16, 2021

I appreciate the excursion into complex string representations and non-standard optimizations that some VMs do. I think it's a little one-sided / too early to look only at the complex VMs of today, though, in that languages we want to compile to Wasm have independent requirements, such as avoiding their half of the re-encoding on the core module side before feeding into an adapter (which is unnecessary code and work), or generally keeping code bloat and runtime overhead low. Modules are frequently shipped over the wire, while VMs are not.

As such, what is suggested here does help Wasm languages, while VMs can of course still optimize and adapt however they find appropriate; this can freely change anyhow. And who knows, perhaps one day someone will ask for ropes or slices (wouldn't slices already work?); could well be, but I haven't seen anyone ask for it yet. Apart from that, I think what's basically encoders/decoders for UTF-8/16/Latin-1 is an obvious start regardless.

So I question that utf16 modes have the practical benefit that some folks here seem to assume.

Btw, I would have preferred a clarifying question instead :)

@rossberg (Member)

@conrad-watt:

@rossberg I'd argue that the "canonical" nature of the string ABI comes from the fact that at the boundary between components, the string type is viewed as a list of USVs.

I think that's a misunderstanding. Luke's presentation clearly talked about limiting the MVP to "canonical adapter functions", which implicitly define a "canonical ABI". The set of types did not change.

I'm somewhat more lukewarm on compact-utf16 because, as you point out, it seems a little ad hoc. However, I think that the primary beneficiary of the compact-utf16 mode isn't a JS VM that's already doing something more complicated, but a runtime for a UTF-16 language targeting Wasm that wants to match its internal string representation to something that is transferable through the canonical ABI without significant re-encoding (AssemblyScript now, maybe Java tomorrow?).

This would likely only help inside JS embeddings of Wasm, as most other host environments, including browsers themselves, predominantly use UTF-8.

And for JS embeddings it would only help in one direction, going from X-compiled-to-Wasm to JS; for the inverse direction, it won't buy much in contemporary JS engines.

And even when going to JS, it is only significant assuming X does not itself have a smarter string representation; e.g., Java VMs like HotSpot also default to single-byte string representations, so UTF-16 will be rare.

Additionally, this all assumes that UTF re-encoding is significantly more expensive than UTF validation + copying. Do we have evidence for that?

So the use case seems rather narrow, and the benefit unclear.

@dcodeIO:

I appreciate the excursion into complex string representations and non-standard optimizations that some VMs do.

Things like ropes are fairly established implementation techniques nowadays, not just in JS, as O(1) string concatenation is typically expected in scripting and other high-level languages.

Apart from that, I think what's basically encoders/decoders for UTF-8/16/Latin-1 is an obvious start regardless.

Don't forget that we are talking about an "MVP". This obviously isn't minimal, and ITs are perfectly viable without it. For the MVP, I think it's good advice to be extra wary of scope creep, bias, and premature optimisation.

lukewagner (Member) commented Jun 16, 2021

On the meaning of "canonical ABI": we can discuss whether the word "canonical" is the right one, but the defining characteristic here is that the lifting/lowering scheme (between core wasm and abstract interface-typed values) is baked into the engine, not programmable via adapter functions. This both avoids all the novel problems of how to do adapter functions and also simplifies the way this whole thing looks from the POV of a traditional toolchain.

With the addition of utf16 modes, the central simplification of the MVP seems to go out the window.

From multiple points of view (spec, engine impl, toolchain impl): the addition of UTF-16 to the canonical ABI still maintains all the high-order-bit simplifications and should be a fairly modest delta in effort. (We can have a more concrete experience report on this in a few months.)

VMs are not just using dual representations but something much more sophisticated.

Yes, but on:

  • ingress, the produced strings do in practice start out as a contiguous Latin-1/UTF-16 array
  • egress, the first step in practice is to "flatten" the string (converting ropes into dependent strings and delazifying all the other lazy representations) into a contiguous Latin-1/UTF-16 array

and that's the only place where Interface Types exist, so I think compact-utf16 is exactly the right fit, even for a JS engine. We're not talking about the permanent in-memory representation here.

Their addition would be biased towards specific assumptions about JS implementations that do not reflect the common case.

IIUC, compact-utf16 is also what most JVMs and .NET VMs do; it's the obvious optimization of WTF-16 strings because it saves a ton of memory while preserving trivial random access. Note: we're not expecting or assuming any regularity in the string header (which of course is highly engine-dependent); headers will have to be manually serialized/deserialized in nested (e.g., array-of-strings) scenarios, but that's fine, as it's the leaves (the string buffers) that hold 99% of the bytes in most cases.
