Pick a developer-friendly name (avoid “USV”) #12
Comments
Over the years I have become somewhat unhappy with the terms well-formed and ill-formed, or valid and invalid respectively. Given that DOMStrings are idiomatic JS strings, these terms are not only missing the necessary context to understand what they refer to, but are also unnecessarily suggestive, easily inducing prejudice in what is supposed to be constructive technical discussion. Perhaps
isUnicode/isUTF16() makes it feel like the string is somehow carrying information about how it was encoded, which it is not. I really like isWellFormedUTF16 (or isValidUTF16); I think that conveys the purpose of this function very clearly. And I agree: most people reading "isUSVString" won't have the foggiest idea what it means without looking it up. Someone can correct me if I'm wrong here, but I assume "UTF16" would be better than "unicode", because it might be possible to provide a valid Unicode string that nevertheless fails an isValidUnicode test, because you gave it UTF-32 instead of UTF-16, or something like that. (I don't really know what I'm talking about though.)
Putting together the list so far it seems we have:
Does anyone want to suggest any others for the list? Shall we just remove objections, then vote on the remainder from there?
-1 for the UTF16-related ones. That's a term that applies to byte sequences, not to strings.
Is the only characteristic of a string that returns If so, what about
It's worth noting that even though the lone surrogates being tested are UTF-16 lone surrogate code units, they are still invalid Unicode in general, as this code point range is fully reserved by the spec for all UTF encodings.
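For concreteness, the check being debated can be sketched in a few lines. The name isWellFormedUTF16 below is just one of the candidates floated in this thread, not a shipped API:

```javascript
// Hypothetical name from this thread: returns true iff every code unit
// in the surrogate range (U+D800..U+DFFF) is part of a proper pair.
function isWellFormedUTF16(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be immediately followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (!(next >= 0xdc00 && next <= 0xdfff)) return false;
      i++; // skip the paired low surrogate
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return false; // low surrogate with no preceding high surrogate
    }
  }
  return true;
}
```

For example, isWellFormedUTF16("\u{1F4A9}") is true (properly paired surrogates), while isWellFormedUTF16("\uD83D") is false (a lone high surrogate).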
Big +1 to not incorrectly referring to any particular encodings in the name.
It's accurate and descriptive, but it makes the name of its From the suggestions so far IMHO
There is nothing incorrect; Unicode is the context. Any positively or negatively connoted term in isolation primarily serves (Google) ideology that has turned many similar topics into literal warzones for a decade or so already, and this issue already exhibits the typical signs.
@dcodeIO that comment both sounds like it violates our code of conduct and just makes no sense whatsoever. It'd be great if you rephrased it in a way that isn't attacking an individual company, and that is a bit clearer.
I am not sure if this is a cultural difference, but where I come from we answer gaslighting with honesty. I can only condemn this style of CoCing me for pointing out sabotage. I don't have anything else to add.
I don’t see any gaslighting here. Please feel free to contact me directly (let’s not continue in this thread) if you’re interested in explaining further.
I would lean towards This is the usual tradeoff between being explicit and being terse. For relatively niche APIs like this one, where most developers won't know about it, I would generally put more weight on explicitness and less on terseness.
Another option:
"Unicode aware" is a property of a function that operates on a Unicode string. It means the function is aware of Unicode rules and won't perform invalid transformations on a Unicode string. Asking whether the string itself is Unicode aware is sort of nonsense.
Well, in this case it could be
@domenic However, JS strings are not Unicode strings; they are always in a specific encoding (WTF-16). The same is true for DOMString (which is also defined as WTF-16) and USVString (which is defined as UTF-16). But using UTF16 in the name does tie us to a specific encoding (which could be a problem if JS ever gets more encodings in the future).
That's not true; JS strings are not in a particular "encoding". Recall that an encoding refers to how Unicode code point sequences (or, equivalently, code unit sequences) are encoded as bytes.

On one level, saying "JS strings are a specific encoding" is a nonsensical statement: strings are not byte sequences but code unit sequences, so talking about encoding makes no sense. On another level, maybe you are talking about how implementations represent strings in memory. But that's not WTF-16 either: usually it's some complicated amalgam of length + bytes + is-latin-1. Maybe even with compression!

What is more accurate to say is that if you want to transform JS strings into bytes, you can use the WTF-16 encoding for a lossless transformation, or any Unicode encoding (such as UTF-16, UTF-32, or UTF-8) for a "lossy" encoding that censors (or throws on) lone surrogates. And all of this is irrelevant to the properties of the actual JS string, which, again, is not bytes.
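The lossless-vs-lossy distinction above can be observed directly with the standard TextEncoder/TextDecoder, which implement a Unicode encoding (UTF-8) and therefore censor lone surrogates:

```javascript
// A string containing an unpaired high surrogate in the middle.
const lone = "a\uD800b";

// TextEncoder performs a lossy UTF-8 encoding: the lone surrogate
// has no UTF-8 representation, so it is replaced with U+FFFD.
const bytes = new TextEncoder().encode(lone);
const back = new TextDecoder("utf-8").decode(bytes);
// back is "a\uFFFDb": the original string does not round-trip.
```

A lossless round trip would require a non-Unicode encoding such as WTF-16 (or WTF-8), which the platform encoders deliberately do not expose.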
UTF-16 is a Unicode encoding form.
UTF-16 defines how each code point is expressed as a sequence of 16-bit code units. Here we get to surrogates, which closely matches what the proposed API is about.
UTF-16LE and UTF-16BE are Unicode encoding schemes. Only here do we get to bytes. As such,
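A small sketch of the encoding-form arithmetic this comment refers to: expressing a supplementary code point as two 16-bit code units involves no bytes at all (byte order only enters the picture with the UTF-16LE/BE encoding schemes). The helper name toSurrogatePair is made up for illustration:

```javascript
// Encoding form: map a code point above U+FFFF to its surrogate pair
// of 16-bit code units, per the Unicode standard's UTF-16 definition.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

// U+1F4A9 is expressed as the code unit pair 0xD83D, 0xDCA9.
```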
Please 🙏 let's at least not argue here about what JavaScript strings are and are not. Honestly, I personally don't even care what this validation method is called; "isUSVString" also works as a name for me. Given that this method has a fairly specific domain of application, I don't think anyone will have a problem diving into the documentation, especially since users will still have to go deeper and figure out what USV and the unpaired-surrogate problem are anyway.
I guess the elephant in the room will always remain why this breakage exists in the first place. It's so unnecessary, breaks at least one Wasm CG member entirely, and telling users that they have to check manually after learning about the ins and outs of string encodings isn't really helping. The primary reason all of this exists and has become so frustrating is that some people in Web standards are somehow convinced that JS strings are broken, even though they are specified exactly like that in the ECMAScript standard, in Java, in C# and others, and can't be changed due to how the respective String APIs are designed. Yet now they additionally have a motivation to promote C++ concepts in Wasm, driving nonsense like encouraging UTF-8 well beyond the point where there will be concrete breakage, to ultimately render websites entirely in canvases. As such, all I can recommend in the context of this issue is to give the API a halfway neutral name to encourage positive discussion where, say, facts quoted from the Unicode standard aren't immediately downvoted and it isn't always the same people jumping in to do whatever that is above.
This discussion is making me think that maybe this proposal is best not advancing at all, since it appears to be ideologically motivated.
I want this so that I can answer the question "if I give this to an API which does not know how to handle lone surrogates, is it going to choke?", which I think is a common thing to want, and which does not require or imply any particular ideological commitments. For example, many JSON deserializers choke on lone surrogates, both before and after proposal-well-formed-stringify, so I very often want to be able to make a JSON serializer which throws on or sanitizes lone surrogates while serializing. I have no idea where the claim that this is ideologically motivated is coming from. I have not been involved in the interface types proposal and want this for reasons completely unrelated to Wasm or C++.
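As an illustration of the sanitizing use case described here, this is a minimal sketch; the helper names toWellFormedSketch and safeStringify are made up for the example (the former mimics what later shipped as String.prototype.toWellFormed):

```javascript
// Replace every lone surrogate with U+FFFD so the result is a USV string.
// Iterating with for...of yields whole code points, so a lone surrogate
// shows up as a single "character" whose code point is in U+D800..U+DFFF.
function toWellFormedSketch(str) {
  let out = "";
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    out += cp >= 0xd800 && cp <= 0xdfff ? "\uFFFD" : ch;
  }
  return out;
}

// A sketch of a serializer that sanitizes string values before handing
// them to JSON (top-level strings only, to keep the example short).
function safeStringify(value) {
  return typeof value === "string"
    ? JSON.stringify(toWellFormedSketch(value))
    : JSON.stringify(value);
}
```

Note that properly paired surrogates survive: toWellFormedSketch("\u{1F4A9}") returns the string unchanged, because the pair iterates as one code point above U+FFFF.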
Agreed that the proposal can be useful in certain cases that sadly exist in practice. Yet I see it as my duty to emphasize that we have too many of these cases now, way beyond the point of failure, which is what motivates this proposal at this point in time. And instead of tackling the actual problem, people are once again lied to, harassed and threatened not to advance a perfectly fine proposal over the simple suggestion to pick a neutral name. All of that should raise eyebrows and is an indicator that papering the problem over is not working, once again, and never did. In fact, a decade's worth of discussion dominated by hostility could be the motivation we need to finally make Unicode symmetric in practice and not just in theory, as it always should have been. But that's a topic for another place that doesn't exist. Just for reference, here is what it could take if we did our job properly and put our influence into what we've been taught to practice:
Another alternative would of course have been to realize that the Web is inherently a mixed system that must not be broken, so the most sensible choice would have been to use WTF-8/16 which exists exactly for this purpose. Sadly, the Wasm CG decided otherwise for no good reason. |
That would imply that some UTF-8 byte sequences would not be losslessly decodable to JS strings: 0xED 0xA0 0xBD 0xED 0xB2 0xA9 would be interpreted as the two code points U+D83D U+DCA9, but the JS string
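The point can be checked against a standards-compliant UTF-8 decoder: the byte sequence above (the WTF-8/CESU-8-style encoding of the two surrogates) is rejected rather than decoded to U+1F4A9:

```javascript
// 0xED 0xA0 0xBD 0xED 0xB2 0xA9 encodes the surrogates U+D83D U+DCA9
// in WTF-8/CESU-8 style; valid UTF-8 must not contain encoded surrogates.
const cesu = new Uint8Array([0xed, 0xa0, 0xbd, 0xed, 0xb2, 0xa9]);
const decoded = new TextDecoder("utf-8").decode(cesu);
// decoded consists of U+FFFD replacement characters, not "\u{1F4A9}".
```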
The general concept isn't new and is what some decoders do in practice when UCS-2-evolved-UTF-16 languages are involved. There, the given example is an invalid surrogate-pair byte sequence that must not be produced; that's why I mentioned concatenation. The trade-off is of course that instead of taking something away from one category of languages, we would add something to another, which is generally easier. Neither is great, but one is a heroic effort for the better while the other is, well, see above. I know of course how unlikely it is that we go the heroic route ;)
I've gone with
I'd personally prefer
I'd also prefer |
I understand that USV refers to “Unicode Scalar Value” and that `isUSVString` is a technically accurate name. However, I suspect that it may not be as intuitive or understandable as other options. (#6 is an example of this.)

I don’t have any hard data to back this up, but here’s an anecdote: we championed an entire proposal ensuring `JSON.stringify` produces USV strings without ever mentioning the term “USV” (in the repo, slides, or TC39 discussions). The term we used instead was “well-formed”.

TL;DR: How about the name `isWellFormed`?
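For context on where the naming question ended up: ES2024 added String.prototype.isWellFormed (alongside toWellFormed). A quick check, guarded in case the runtime predates that edition:

```javascript
// Guarded so this also runs on engines without ES2024 support.
if (typeof String.prototype.isWellFormed === "function") {
  console.log("ab\uD800".isWellFormed());  // false: lone high surrogate
  console.log("\u{1F4A9}".isWellFormed()); // true: properly paired
}
```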