Adding a UnicodeString type? #15
Making a string constructor (which is what it looks like) that has no syntactic method of producing it seems unwise, and the resulting value would be indistinguishable from a normal string, which also seems unwise. If |
See also whatwg/webidl#716 for some other ideas around what a newish string type might look like. |
I think it would be best to think of UnicodeString as a separate type from String, so The fact that functions like UnicodeString.prototype.split are capable of operating on non-Unicode strings was merely a convenience that maybe doesn't really make sense.
Well, we could provide the variable "u" on globalThis, and have it be an alias to UnicodeString.from(). It's technically not dedicated syntax, but pretty close.
Adding a new primitive for something that very few developers will ever encounter seems strange to me. |
I know it's not how you intended it, but if you define "encounter" as "creating a Unicode-related bug", then people are encountering this issue a lot. Many developers just don't realize they've encountered this type of issue, because we often don't test our string-processing functions with smiley face emojis, even though we should. The idea is that people would start using Unicode strings instead of normal strings everywhere, in order to get automatic protection against these kinds of bugs. It's not the prettiest idea, but apparently it's possible if it's been done before. Though, perhaps a better alternative would be to just provide a set of Unicode-aware functions on the String object itself, like String.prototype.uSplit(), String.prototype.uReplaceAll(), etc. Then, developers who are trying to be careful can just remember to always use the u-prefixed functions. |
I'd love to see some citation for "a lot". These bugs definitely happen, but how often will the fix be "check if the string is well-formed at runtime" versus "use the proper string methods"? |
"check if the string is well-formed at runtime" just lets you know that something went wrong. "use the proper string methods" would actually let you prevent that thing from going wrong. For example, say I'm making a webpage, and I want to render a snippet of an article. You can click on it to show more information. The naive and common solution would be as follows: snippetElement.innerText = allText.slice(0, 30) + '…' Now let's imagine a couple of different examples for article content, and see how their snippet would look. let allText = 'Welcome to my first article 👋🏾 - I am excited to start writing on this platform.'
snippetElement.innerText = allText.slice(0, 30) + '…'
// This will render "Welcome to my first article 👋…"
// Note the added comma after "Welcome" in this example article text
allText = 'Welcome, to my first article 👋🏾 - I am excited to start writing on this platform.'
snippetElement.innerText = allText.slice(0, 30) + '…'
// This will render "Welcome, to my first article �…"
Writing Unicode-aware code in JavaScript today is not easy. There's no good way to handle the surrogate pair issue without writing a bunch of custom string-manipulation functions or using a library. So no, I don't have any hard data about how often these kinds of bugs crop up, but hopefully it's clear how easy these bugs are to introduce and how difficult they are to fix. Really, any time you're trying to manipulate Unicode data, you're probably introducing bugs, because you're probably not using Unicode-aware functions. |
I agree that the oldest String methods being the most convenient makes such mistakes likely, but this example doesn't need any capability that isn't either imminent or already available.
// Semicolons matter here: without them, a line starting with "[" is parsed as a continuation of the previous statement.
const segmenter = new Intl.Segmenter('en', {granularity: 'grapheme'});
let allText;
allText = 'Welcome to my first article 👋🏾 - I am excited to start writing on this platform.';
[...allText].slice(0, 30).join('') + '…';
// concatenated code points: 'Welcome to my first article 👋🏾…'
Array.from(segmenter.segment(allText), g => g.segment).slice(0, 30).join('') + '…';
// concatenated grapheme clusters: 'Welcome to my first article 👋🏾 …'
// Note the added comma after "Welcome" in this example article text
allText = 'Welcome, to my first article 👋🏾 - I am excited to start writing on this platform.';
[...allText].slice(0, 30).join('') + '…';
// concatenated code points: 'Welcome, to my first article 👋…'
Array.from(segmenter.segment(allText), g => g.segment).slice(0, 30).join('') + '…';
// concatenated grapheme clusters: 'Welcome, to my first article 👋🏾…' |
I didn't know you could do that - that's good to know. It's still not the most user-friendly solution, and it's currently (unfortunately) not an easily discoverable one. Anyone who wants to know how to slice a string will receive the .slice() answer, not the Array.from(segmenter.segment(text), g => g.segment).slice(...).join('') answer, which is why having something like a .uSlice() would be really nice - it would make it much easier to write bug-free code. Though, what's surprising to me is that you have to provide it a language for it to know how to split the string's characters properly - I have no idea why that would be the case, and MDN's documentation is currently pretty slim (as seen here). This makes it sound like it's not really possible to provide a simple function such as "uSplit()", but I don't really know. Anyways, I'm thinking that having u-prefixed versions of different functions would be better than creating an entirely new primitive, and if we won't be making a new primitive, then there really isn't a relationship anymore between what I was proposing here and this original proposal - they can exist independently of each other. So, I'll go ahead and close this. |
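For illustration, the u-prefixed idea can be approximated in userland today. This is only a sketch: `uSlice` and `graphemeSlice` are hypothetical names (neither exists on String), and the 'en' locale passed to Intl.Segmenter is an assumption.

```javascript
// Hypothetical helpers for illustration; these are not built-in String methods.
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Slice by Unicode code points, so a surrogate pair is never cut in half.
function uSlice(text, start, end) {
  return Array.from(text).slice(start, end).join('');
}

// Slice by grapheme clusters (user-perceived characters).
function graphemeSlice(text, start, end) {
  return Array.from(graphemeSegmenter.segment(text), (s) => s.segment)
    .slice(start, end)
    .join('');
}

// The waving hand (U+1F44B) plus a skin-tone modifier (U+1F3FE), written as escapes.
const sampleText = 'Welcome, to my first article \u{1F44B}\u{1F3FE} - I am excited.';

console.log(sampleText.slice(0, 30) + '…');        // cuts the pair: ends in a lone surrogate
console.log(uSlice(sampleText, 0, 30) + '…');      // keeps the hand, drops the modifier
console.log(graphemeSlice(sampleText, 0, 30) + '…'); // keeps the whole emoji
```

Slicing by code points avoids lone surrogates but can still separate an emoji from its modifier, which is why the grapheme-based variant is usually the one you want for display text.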
@theScottyJam Languages are complicated: different languages have different rules for splitting words, lines, and grapheme clusters. Unicode tries to represent every language as accurately as it can, so it has to represent those language differences. https://www.unicode.org/reports/tr29/ It should be possible to split based on Unicode code points (that should work for every language). In practice, however, you generally don't want to do that; what you actually want is to split based on grapheme clusters. |
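A small sketch of the three ways to count the emoji used earlier in this thread may make the difference concrete. The variable names and the 'en' locale are assumptions made for the example.

```javascript
// U+1F44B (waving hand) followed by U+1F3FE (a skin-tone modifier).
const wave = '\u{1F44B}\u{1F3FE}';

// UTF-16 code units: what .length, .slice() and indexing operate on.
const codeUnits = wave.length;

// Unicode code points: what string iteration and Array.from() yield.
const codePoints = Array.from(wave).length;

// Grapheme clusters: what a reader perceives as one character.
const clusterSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const graphemes = Array.from(clusterSegmenter.segment(wave)).length;

console.log(codeUnits, codePoints, graphemes); // 4 2 1
```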
We're working on that too: mdn/content#8402 Note that Intl.Segmenter is currently Stage 3, although it should reach Stage 4 at the next TC39 meeting. |
Ah, thanks you two. That was an interesting read @Pauan, and it helped me understand why this is a trickier problem than I thought. And it makes sense that the documentation is slim if that feature is still in a proposal phase. Thanks! |
I know this idea is out of the scope of this proposal, but if we decide to go this route, it could change how this proposal gets implemented.
What if we added a new UnicodeString primitive? The String primitive could be thought of more as a binary string that can hold any arbitrary sequence of 16-bit code units, while a UnicodeString must only contain valid Unicode. What's more, it can provide the same API as a String, but all of its functions would be Unicode-aware (a feature that would be really nice to have in JavaScript).
If this were done, then the "toUSVString" feature requested in #13 could be provided like this:
That alone gives you the ability to check if a string is valid unicode like this:
This is assuming that "===" works across String and UnicodeString - I think it should, but that can be debated.
If needed, we could provide a separate function on UnicodeString that converts a String to a UnicodeString, but throws an error if the original string contained invalid unicode - this could also be used to detect an invalid unicode string. Or, we just add an isUSVString() function like this proposal is proposing.
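As a rough sketch of what such conversions might look like without a new primitive, the plain functions below stand in for the hypothetical UnicodeString operations. All three names (`hasLoneSurrogate`, `toUnicodeOrThrow`, `toUSV`) are invented for this example and are not part of the proposal.

```javascript
// Hypothetical stand-ins for the conversions discussed above.

// True if the string contains a surrogate code unit that is not part of a pair.
function hasLoneSurrogate(text) {
  // String iteration yields a valid surrogate pair as one two-unit string,
  // and a lone surrogate as a one-unit string.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xd800 && cp <= 0xdfff) return true;
  }
  return false;
}

// Throwing variant: rejects input that is not valid Unicode.
function toUnicodeOrThrow(text) {
  if (hasLoneSurrogate(text)) {
    throw new TypeError('string contains a lone surrogate');
  }
  return text;
}

// Lossy variant: replaces each lone surrogate with U+FFFD.
function toUSV(text) {
  let out = '';
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    out += cp >= 0xd800 && cp <= 0xdfff ? '\uFFFD' : ch;
  }
  return out;
}

const broken = 'abc\uD83D'; // ends with an unpaired high surrogate
console.log(toUSV(broken) === broken); // false: the input was not valid Unicode
console.log(toUSV('abc') === 'abc');   // true: nothing needed replacing
```

Comparing the converted string against the original gives the validity check described above, while the throwing variant gives the alternative way to detect an invalid string.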
A couple more examples on how the UnicodeString could work: