Pick a developer-friendly name (avoid “USV”) #12

Closed · mathiasbynens opened this issue Aug 27, 2021 · 28 comments

@mathiasbynens (Member) commented Aug 27, 2021

I understand that USV refers to “Unicode Scalar Value” and that isUSVString is a technically accurate name. However, I suspect that it may not be as intuitive or understandable as other options. (#6 is an example of this.)

I don’t have any hard data to back this up, but here’s an anecdote: we championed an entire proposal ensuring JSON.stringify produces USV strings without ever mentioning the term “USV” (in the repo, slides, TC39 discussions). The term we used instead was “well-formed”.

TL;DR How about the name isWellFormed?
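For concreteness, the "well-formed" framing is already observable in how JSON.stringify handles lone surrogates since that proposal landed (ES2019):

```js
// JSON.stringify after proposal-well-formed-stringify: well-formed input passes
// through, while a lone surrogate is emitted as an escape sequence so the JSON
// output itself stays well-formed.
JSON.stringify("\uD83D\uDCA9"); // '"💩"'      — a valid surrogate pair
JSON.stringify("\uD83D");       // '"\\ud83d"' — a lone surrogate, escaped
```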

@dcodeIO (Contributor) commented Aug 29, 2021

Over the years I have become somewhat unhappy with the terms well-formed and ill-formed, or valid and invalid respectively. Given that DOMStrings are idiomatic JS strings, these terms not only lack the context needed to understand what they refer to, but are also unnecessarily suggestive, easily inducing prejudice in what is supposed to be constructive technical discussion.

Perhaps is(WellFormed)Unicode/UTF16?

@theScottyJam

isUnicode/isUTF16() make it feel like the string is somehow carrying information about how it was encoded, which it isn't. I really like isWellFormedUTF16 (or isValidUTF16) - I think that conveys the purpose of this function very clearly. And I agree - most people reading "isUSVString" won't have the foggiest idea what that means without looking it up.

Someone can correct me if I'm wrong here, but I assume "UTF16" would be better than "Unicode", because it might be possible to provide a valid Unicode string that still fails the isValidUnicode test, because you gave it UTF-32 instead of UTF-16, or something like that. (I don't really know what I'm talking about though.)

@guybedford (Collaborator)

Putting together the list so far it seems we have:

  • isWellFormed
  • isUnicode
  • isUTF16
  • isWellFormedUTF16

Does anyone want to suggest any others for the list? Shall we strike the options that have objections and then vote on the remainder?

@domenic (Member) commented Aug 31, 2021

-1 for UTF16-related ones. That's a term that applies to byte sequences, not to strings.

@ljharb (Member) commented Aug 31, 2021

Is having lone surrogates the only characteristic that would make a string return false here?

If so, what about hasLoneSurrogates?
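For readers following along, a hand-rolled version of that check might look like the sketch below (hasLoneSurrogates here is a hypothetical helper, not the proposed API):

```js
// Hypothetical helper, not the proposed API: a string is ill-formed exactly when
// it contains a lone surrogate — a high surrogate (0xD800–0xDBFF) not followed by
// a low surrogate (0xDC00–0xDFFF), or a low surrogate not preceded by a high one.
function hasLoneSurrogates(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = str.charCodeAt(i + 1); // NaN past the end, so the test below fails
      if (next >= 0xDC00 && next <= 0xDFFF) { i++; continue; } // well-formed pair
      return true; // unpaired high surrogate
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) return true; // unpaired low surrogate
  }
  return false;
}

hasLoneSurrogates("ab\uD800c");    // true
hasLoneSurrogates("\uD83D\uDCA9"); // false
```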

@guybedford (Collaborator)

I assume "UTF16" would be better than "unicode", because it might be possible that you provide it a valid unicode string, but it fails the isValidUnicode test

It's worth noting that even though the lone surrogates being tested are UTF-16 lone surrogate code units, they are still invalid Unicode in general, as this code point range is fully reserved by the spec across all UTF encodings.
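(One place engines already enforce this today, for illustration: percent-encoding goes through UTF-8, so it rejects lone surrogates.)

```js
// Existing behaviour: encodeURIComponent encodes via UTF-8 and throws on input
// that no UTF encoding can represent.
encodeURIComponent("\uD83D\uDCA9"); // "%F0%9F%92%A9" — the surrogate pair encodes fine
encodeURIComponent("\uD800");       // throws URIError — a lone surrogate cannot be encoded
```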

@mathiasbynens (Member, Author)

Big +1 to not incorrectly referring to any particular encodings in the name.

If so, what about hasLoneSurrogates?

It's accurate and descriptive, but it makes the name of its toWellFormed counterpart (if we decide to go with #13) less clear. replaceLoneSurrogates?

From the suggestions so far IMHO isWellFormed / toWellFormed remain the simplest, clearest names.

@dcodeIO (Contributor) commented Sep 1, 2021

There is nothing incorrect; Unicode is the context. Any positively or negatively connoted term in isolation primarily serves (Google) ideology that has turned many similar topics into literal warzones for a decade or so already, and this issue already exhibits the typical signs.

@ljharb (Member) commented Sep 1, 2021

@dcodeIO That comment sounds like it violates our code of conduct, and it also just makes no sense whatsoever. It'd be great if you rephrased it in a way that isn't attacking an individual company and is a bit clearer.

@dcodeIO (Contributor) commented Sep 1, 2021

I am not sure if this is a cultural difference, but where I come from we answer gaslighting with honesty. I can only condemn this style of CoCing me for pointing out sabotage. I don't have anything else to add.

@ljharb (Member) commented Sep 1, 2021

I don’t see any gaslighting here. Please feel free to contact me directly (let’s not continue in this thread) if you’re interested in explaining further.

@bakkot (Collaborator) commented Sep 1, 2021

I would lean towards isWellFormedUnicode over isWellFormed.

This is the usual tradeoff between being explicit and being terse. For relatively niche APIs like this one, where most developers won't know about it, I would generally put more weight on explicitness, and less on terseness.

@MaxGraey commented Sep 1, 2021

Another option: isUnicodeAware. I think it is better to avoid the wording "Well Formed". It can be confusing for users. And the term "Unicode Aware" is quite common.

@theScottyJam

"Unicode Aware" is a property of a function that operates on a unicode string. It means the function is aware of unicode rules and won't do invalid transformations against a unicode string. Asking if the string itself is unicode aware is sort of nonsense.

@MaxGraey commented Sep 1, 2021

Well, in that case it could be isUnicodeValid.

@Pauan commented Sep 1, 2021

@domenic However, JS strings are not Unicode strings; they always use a specific encoding (WTF-16).

The same is true for DOMString (which is also defined as WTF-16), and USVString (which is defined as UTF-16).

But, using UTF16 in the name does tie us to a specific encoding (which could be a problem if JS ever gets more encodings in the future).

@domenic (Member) commented Sep 1, 2021

That's not true; JS strings are not a particular "encoding". Recall that an encoding refers to how Unicode code point sequences (or, equivalently, code unit sequences) are encoded in bytes.

On one level, saying "JS strings are a specific encoding" is a nonsensical statement; strings are not byte sequences but code unit sequences, so talking about encoding makes no sense. On another level, maybe you are talking about how implementations represent strings in memory. But that's not WTF-16 either: usually it's some complicated amalgam of length + bytes + is-latin-1. Maybe even with compression!

What is more accurate to say is that if you want to transform JS strings into bytes, you can use the WTF-16 encoding for lossless transformation, or any Unicode encoding (such as UTF-16, UTF-32, or UTF-8) for a "lossy" encoding that censors (or throws on) lone surrogates.

And, all of this is irrelevant to the properties of the actual JS string, which again, is not bytes.
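(As an aside, the lossless/lossy distinction described here is visible with the existing TextEncoder, which always produces UTF-8 and replaces lone surrogates:)

```js
// TextEncoder always emits UTF-8; lone surrogates are replaced with U+FFFD,
// which is the "lossy" path described above.
new TextEncoder().encode("\uD83D\uDCA9"); // Uint8Array [0xF0, 0x9F, 0x92, 0xA9]
new TextEncoder().encode("a\uD800b");     // Uint8Array [0x61, 0xEF, 0xBF, 0xBD, 0x62] — U+FFFD in the middle
```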

@dcodeIO (Contributor) commented Sep 1, 2021

UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair.

UTF-16 is a Unicode encoding form.

In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units

UTF-16 defines how each code point is expressed as a sequence of 16-bit code units. Here we get to surrogates, which closely matches what the proposed API is about.

A character encoding scheme consists of a specified character encoding form plus a specification of how the code units are serialized into bytes.

UTF-16LE and UTF-16BE are Unicode encoding schemes. Only here do we get to bytes.

As such, isWellFormedUTF16 is just more precise than isWellFormedUnicode. Both are fine.
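(For reference, the surrogate-pair mechanics from the quoted definitions are directly observable on a JS string:)

```js
// One code point above U+FFFF occupies two 16-bit code units (a surrogate pair).
const s = "\u{1F4A9}";
s.length;                      // 2 — two code units
s.charCodeAt(0).toString(16);  // "d83d" — high surrogate
s.charCodeAt(1).toString(16);  // "dca9" — low surrogate
s.codePointAt(0).toString(16); // "1f4a9" — the code point the pair encodes
```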

@MaxGraey commented Sep 1, 2021

Please 🙏 at least let's not argue here about what JavaScript strings are and are not.

Honestly, I personally don't even care what this validation method is called; "isUSVString" also works for me. This method has a fairly specific domain of application, and I don't think anyone will have a problem diving into the documentation, especially since users will still have to go deeper and figure out what USVs and the unpaired-surrogates problem are anyway.

@dcodeIO (Contributor) commented Sep 1, 2021

I guess the elephant in the room will always remain why this breakage exists in the first place. It's so unnecessary, breaks at least one Wasm CG member entirely, and telling users that they have to check manually after learning about the ins and outs of string encodings isn't really helping.

The primary reason all of this exists and has become so frustrating is that some people in Web standards are somehow convinced that JS strings are broken, even though they are defined exactly this way in the ECMAScript standard, in Java, in C# and others, and can't be changed due to how the respective String APIs are designed. Yet now they additionally have a motivation to promote C++ concepts in Wasm, driving nonsense like encouraging UTF-8 well beyond the point where there will be concrete breakage, ultimately rendering websites entirely in canvases.

As such, all I can recommend in context of this issue is to give the API a half-way neutral name to encourage positive discussion where, say, the facts quoted from the Unicode standard aren't immediately downvoted and not always the same people jump in to do whatever that is above.

@domenic (Member) commented Sep 1, 2021

This discussion is making me think that maybe this proposal is best not advancing at all, since it appears to be ideologically motivated.

@bakkot (Collaborator) commented Sep 1, 2021

I want this so that I can answer the question "if I give this to an API which does not know how to handle lone surrogates, is it going to choke?", which I think is a common thing to want, and which does not require or imply any particular ideological commitments. For example, many JSON deserializers choke on lone surrogates, both before and after proposal-well-formed-stringify, so I very often want to be able to make a JSON serializer which throws on or sanitizes lone surrogates while serializing.

I have no idea where the claim that this is ideologically motivated is coming from. I have not been involved in the interface types proposal and want this for reasons completely unrelated to WASM or C++.
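A sketch of that use case, assuming the method lands as String.prototype.isWellFormed (the name is still being decided in this thread):

```js
// Hypothetical strict serializer: reject any string value containing a lone
// surrogate before it reaches a consumer that would choke on it.
function strictStringify(value) {
  return JSON.stringify(value, (key, val) => {
    if (typeof val === "string" && !val.isWellFormed()) {
      throw new TypeError(`lone surrogate in property ${JSON.stringify(key)}`);
    }
    return val;
  });
}

strictStringify({ ok: "\uD83D\uDCA9" }); // '{"ok":"💩"}'
strictStringify({ bad: "\uD800" });      // throws TypeError
```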

@dcodeIO (Contributor) commented Sep 2, 2021

Agree that the proposal can be useful in certain cases that sadly exist in practice. Yet I see it as my duty to emphasize that we have too many of these cases now, way beyond the point of failure, which is what motivates this proposal at this point in time. And instead of tackling the actual problem, people are once again lied to, harassed and threatened with not advancing a perfectly fine proposal over the simple suggestion to pick a neutral name. All of that should raise eyebrows and is an indicator that papering the problem over is not working, once again, and never did. In fact, a decade's worth of discussion dominated by hostility could be the motivation we need to finally make Unicode symmetric in practice and not just in theory, as it always should have been. But that's a topic for another place that doesn't exist.

Just for reference, here is what it could take if we did our job properly and put our influence, and what we've been taught, into practice:

  1. Imagine a world where the problem would not exist at all, no sanitization is necessary, nothing chokes
  2. Propose a new version of Unicode that undisallows encoding surrogates as three bytes in UTF-8
  3. Give it some time until most existing encoders and concatenation functions are updated, make a concentrated effort
  4. New version of Unicode where neither proposal-is-usv-string, proposal-well-formed-stringify, nor USVString are necessary
  5. Happy people constructively working together when the topic is JS strings, where no language has to take a hit

Another alternative would of course have been to realize that the Web is inherently a mixed system that must not be broken, so the most sensible choice would have been to use WTF-8/16 which exists exactly for this purpose. Sadly, the Wasm CG decided otherwise for no good reason.

@andreubotella (Member) commented Sep 2, 2021

Propose a new version of Unicode that undisallows encoding surrogates as three bytes in UTF-8

That would imply that some UTF-8 byte sequences would not be losslessly decodable to JS strings: 0xED 0xA0 0xBD 0xED 0xB2 0xA9 would be interpreted as the two code points U+D83D U+DCA9, but the JS string "\uD83D\uDCA9" would be interpreted as the single code point U+1F4A9. IMO that seems much less intuitive and much messier than the other way around, i.e. letting JS strings not be losslessly encodable to UTF-8.
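(With today's standard UTF-8 machinery, the asymmetry looks like this:)

```js
// Standard UTF-8 today: the pair round-trips as a single code point, and the
// surrogate byte sequence above is simply not valid UTF-8.
new TextEncoder().encode("\uD83D\uDCA9");
// Uint8Array [0xF0, 0x9F, 0x92, 0xA9] — U+1F4A9, one code point

new TextDecoder().decode(Uint8Array.of(0xED, 0xA0, 0xBD, 0xED, 0xB2, 0xA9));
// replacement characters (U+FFFD), not "\uD83D\uDCA9"
```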

@dcodeIO (Contributor) commented Sep 2, 2021

The general concept isn't new and is what some decoders do in practice when UCS-2-evolved-UTF-16 languages are involved. There, the given example is an invalid surrogate pair byte sequence that must not be produced; that's why I mentioned concatenation. The trade-off is of course that instead of taking something away from one category of languages we would add something to another, which is generally easier. Neither is great, but one is a heroic effort for the better while the other is, see above. I know of course how unlikely it is that we go the heroic route ;)

@michaelficarra (Member)

I've gone with isWellFormed for now. I feel that the USV initialism is just too opaque for your average user, even your average user in need of this API. Since mentioning the encoding appears controversial, that rules out putting UTF16 in the name. I'm still open to considering isWellFormedUnicode if anyone feels strongly and there are no objections. I personally tend to lean toward longer, more descriptive names.
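For reference, here is how the chosen names read in use, per the proposal as it currently stands (toWellFormed being the companion from #13):

```js
// Proposed String.prototype methods (names per this issue and #13):
"ab\uD83D\uDCA9".isWellFormed(); // true  — no lone surrogates
"ab\uD800".isWellFormed();       // false — contains a lone surrogate
"ab\uD800".toWellFormed();       // "ab\uFFFD" — lone surrogate replaced with U+FFFD
```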

@bakkot (Collaborator) commented Aug 23, 2022

I'd personally prefer isWellFormedUnicode, as mentioned above.

@dcodeIO (Contributor) commented Sep 15, 2022

I'd also prefer isWellFormedUnicode, to avoid the misconception that some ECMAScript strings somehow are not well-formed ECMAScript strings. Having "Unicode" in the name, OTOH, indicates that some ECMAScript strings are really not well-formed Unicode strings, which is perfectly in line with the proposal's title: "Well-Formed Unicode Strings".

@dcodeIO mentioned this issue Sep 15, 2022