Adding a UnicodeString type? #15

Closed
theScottyJam opened this issue Sep 1, 2021 · 12 comments

Comments

@theScottyJam

I know this idea is out of the scope of this proposal, but if we decide to go this route, it could change how this proposal gets implemented.

What if we added a new UnicodeString primitive? The String primitive could be thought of more as a binary string that can hold any arbitrary sequence of 16-bit code units, while a UnicodeString must only contain valid Unicode. What's more, it can provide the same API as a String, but all of its functions would be Unicode-aware (a feature that would be really nice to have in JavaScript).

If this were done, then the "toUSVString" feature requested in #13 could be provided like this:

const unicodeString = new UnicodeString(someInvalidUnicode)

That alone gives you the ability to check if a string is valid unicode like this:

if (someInvalidUnicode === new UnicodeString(someInvalidUnicode)) ...

This is assuming that "===" works across String and UnicodeString - I think it should, but that can be debated.

If needed, we could provide a separate function on UnicodeString that converts a String to a UnicodeString, but throws an error if the original string contained invalid unicode - this could also be used to detect an invalid unicode string. Or, we just add an isUSVString() function like this proposal is proposing.
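To make the two options above concrete, here is a minimal sketch of a validity check and a throwing conversion written against plain strings. The names `isWellFormedSketch` and `toUnicodeStringOrThrow` are hypothetical and used only for illustration; they are not part of the proposal, and a real UnicodeString type would presumably return something other than an ordinary string.

```javascript
// Hypothetical helper: returns true when the string has no lone surrogates.
function isWellFormedSketch(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // A high surrogate must be immediately followed by a low surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // Skip the low surrogate that completes the pair.
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // A low surrogate that was not preceded by a high surrogate.
      return false;
    }
  }
  return true;
}

// Hypothetical helper: the "throws on invalid input" conversion described above.
// It returns the original string only because no UnicodeString type exists.
function toUnicodeStringOrThrow(str) {
  if (!isWellFormedSketch(str)) {
    throw new TypeError('String contains a lone surrogate');
  }
  return str;
}
```

With something like this, the equality check shown earlier would reduce to calling `isWellFormedSketch(someInvalidUnicode)`.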


A couple more examples on how the UnicodeString could work:

const unicodeStr = UnicodeString.from`my unicode string`
const u = UnicodeString.from
const unicodeStr2 = u`This "u" tag is a little python2-like, isn't it?`

u`abc` + 'def' // Coerces 'def' into a UnicodeString - the result will be a UnicodeString.

UnicodeString.prototype.split.call(someNormalStr, ',') // Works

// With the pipeline operator
const { split, replaceAll } = UnicodeString.prototype
someNormalStr
  |> replaceAll.call(%, 'x', 'X')
  |> split.call(%, ',')
@ljharb
Member

ljharb commented Sep 1, 2021

Making a string constructor (which is what it looks like) that has no syntactic method of producing it seems unwise, and the resulting value would be indistinguishable from a normal string, which also seems unwise.

If typeof x === 'string', then Object.getPrototypeOf(x) can only ever be String.prototype, so I have no idea how UnicodeString.prototype can be a thing unless we created a brand new primitive type whose typeof value was not "string".

@annevk
Member

annevk commented Sep 1, 2021

See also whatwg/webidl#716 for some other ideas around what a newish string type might look like.

@theScottyJam
Author

theScottyJam commented Sep 1, 2021

If typeof x === 'string', then Object.getPrototypeOf(x) can only ever be String.prototype, so I have no idea how UnicodeString.prototype can be a thing unless we created a brand new primitive type whose typeof value was not "string".

I think it would be best to think of UnicodeString as a separate type from String, so typeof u`whatever` === 'unicode-string'. Neither one can really be thought of as a subtype of the other; rather, they both implement the same set of functions but with differing behaviors. Perhaps it's better to think of both string types as implementing the same AbstractString interface.

The fact that functions like UnicodeString.prototype.split are capable of operating on non-Unicode strings was merely a convenience that maybe doesn't really make sense.

Making a string constructor (which is what it looks like) that has no syntactic method of producing it seems unwise, and the resulting value would be indistinguishable from a normal string, which also seems unwise.

Well, we could provide the variable "u" on globalThis, and have it be an alias to UnicodeString.from(). It's technically not dedicated syntax, but pretty close.

u`xyz` // JavaScript unicode string (not using dedicated syntax)
u'xyz' // Python 2 unicode string (using dedicated syntax)

@ljharb
Member

ljharb commented Sep 1, 2021

Adding a new primitive for something that very few developers will ever encounter seems strange to me.

@theScottyJam
Author

theScottyJam commented Sep 1, 2021

I know it's not how you intended it, but if you define "encounter" as "creating a Unicode-related bug", then people are encountering this issue a lot. Many developers just don't realize they've encountered this type of issue, because we're often not testing our string-processing functions with smiley face emojis, even though we should.

The idea is that people would start using unicode strings instead of normal strings everywhere, in order to provide automatic protection against these kinds of bugs. It's not the prettiest idea, but apparently it's possible if it's been done before.


Though, perhaps a better alternative would be to just provide a set of unicode-aware functions on the String object itself. Like String.prototype.uSplit(), String.prototype.uReplaceAll, etc. Then, developers who are trying to be careful can just remember to always use the u-prefixed functions.

@ljharb
Member

ljharb commented Sep 1, 2021

I'd love to see some citation for "a lot". These bugs definitely happen, but how often will the fix be "check if the string is well-formed at runtime" versus "use the proper string methods"?

@theScottyJam
Author

"check if the string is well-formed at runtime" just lets you know that something went wrong. "use the proper string methods" would actually let you prevent that thing from going wrong.

For example, say I'm making a webpage, and I want to render a snippet of an article. You can click on it to show more information. The naive and common solution would be as follows:

snippetElement.innerText = allText.slice(0, 30) + '…'

Now let's imagine a couple of different examples for article content, and see how their snippet would look.

let allText = 'Welcome to my first article 👋🏾 - I am excited to start writing on this platform.'
snippetElement.innerText = allText.slice(0, 30) + '…'
// This will render "Welcome to my first article 👋…"

// Note the added comma after "Welcome" in this example article text
allText = 'Welcome, to my first article 👋🏾 - I am excited to start writing on this platform.'
snippetElement.innerText = allText.slice(0, 30) + '…'
// This will render "Welcome, to my first article �…"

Writing Unicode-aware code in JavaScript today is not easy. There's no good way to handle the surrogate pair issue without writing a bunch of custom string-manipulation functions or using a library.

So no, I don't have any hard data about how often these kinds of bugs crop up, but hopefully it's clear how easy it is to introduce them and how difficult it is to fix them. Really, any time you're trying to manipulate Unicode data, you're probably introducing bugs, because you're probably not using Unicode-aware functions.
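As a rough illustration of why the second snippet breaks, the sketch below inspects the last code unit left by `slice(0, 30)`. The emoji is written with escapes (U+1F44B waving hand, U+1F3FE skin-tone modifier) so the code points are unambiguous; the variable names are made up for this example.

```javascript
// The article text with the added comma; the emoji is two code points,
// each stored as a surrogate pair (two 16-bit code units).
const articleText =
  'Welcome, to my first article \u{1F44B}\u{1F3FE} - I am excited to start writing on this platform.';

const wave = '\u{1F44B}\u{1F3FE}';
console.log(wave.length);      // 4 (code units)
console.log([...wave].length); // 2 (code points)

// slice() counts code units, so index 30 falls in the middle of the first pair.
const naiveSnippet = articleText.slice(0, 30);
const lastUnit = naiveSnippet.charCodeAt(naiveSnippet.length - 1);

// A trailing high surrogate (0xD800-0xDBFF) means a pair was cut in half.
const endsWithLoneHighSurrogate = lastUnit >= 0xd800 && lastUnit <= 0xdbff;
console.log(endsWithLoneHighSurrogate); // true
```

The 29 characters before the emoji occupy indices 0 to 28, so the slice keeps only the first half of the waving-hand pair, which is what renders as the replacement character.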

@gibson042

gibson042 commented Sep 2, 2021

Writing Unicode-aware code in JavaScript today is not easy. There's no good way to handle the surrogate pair issue without writing a bunch of custom string-manipulation functions or using a library.

I agree that the oldest String methods being the most convenient makes such mistakes likely, but this example doesn't need any capability that isn't either imminent or already available.

const segmenter = new Intl.Segmenter('en', {granularity: 'grapheme'})
let allText

allText = 'Welcome to my first article 👋🏾 - I am excited to start writing on this platform.';
[...allText].slice(0, 30).join('') + '…'
// concatenated code points: 'Welcome to my first article 👋🏾…'
Array.from(segmenter.segment(allText), g => g.segment).slice(0, 30).join('') + '…'
// concatenated grapheme clusters: 'Welcome to my first article 👋🏾 …'

// Note the added comma after "Welcome" in this example article text
allText = 'Welcome, to my first article 👋🏾 - I am excited to start writing on this platform.';
[...allText].slice(0, 30).join('') + '…'
// concatenated code points: 'Welcome, to my first article 👋…'
Array.from(segmenter.segment(allText), g => g.segment).slice(0, 30).join('') + '…'
// concatenated grapheme clusters: 'Welcome, to my first article 👋🏾…'

@theScottyJam
Author

I didn't know you could do that - that's good to know. It's still not the most user-friendly solution, and it's currently (unfortunately) not an easily discoverable solution. Anyone who wants to know how to slice a string will receive the .slice() answer, not the Array.from(segmenter.segment(text), g => g.segment).slice(...).join('') answer, which is why having something like a .uSlice() would be really nice - it would make it much easier to write bug-free code.
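For illustration only, a helper along the lines of the `.uSlice()` idea could wrap the Intl.Segmenter approach shown above. The name `uSlice` comes from this comment and does not exist in the language; the fixed `'en'` locale is an assumption made to keep the sketch short.

```javascript
// Assumption: a single English-locale grapheme segmenter is good enough here.
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical helper: like String.prototype.slice, but start and end
// count grapheme clusters instead of 16-bit code units.
function uSlice(text, start, end) {
  const graphemes = Array.from(graphemeSegmenter.segment(text), (g) => g.segment);
  return graphemes.slice(start, end).join('');
}

const article =
  'Welcome, to my first article \u{1F44B}\u{1F3FE} - I am excited to start writing on this platform.';
console.log(uSlice(article, 0, 30) + '…');
```

This segments the whole string on every call, so it trades performance for convenience; it is a sketch of the idea, not a proposed implementation.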

Though, what's surprising to me is the fact that you have to provide it a language for it to properly know how to split the string's characters - I have no idea why that would be the case, and MDN's documentation is currently pretty slim (as seen here). This makes it sound like it's not really possible to simply provide a simple function such as "uSplit()", but I don't really know.

Anyways, I'm thinking having u-prefixed versions of different functions would be better than creating an entirely new primitive, and if we won't be making a new primitive, then there really isn't a relationship anymore between what I was proposing here and this original proposal - they can exist independently of each other. So, I'll go ahead and close this.

@Pauan

Pauan commented Sep 2, 2021

@theScottyJam Languages are complicated; different languages have different rules for splitting words, lines, and grapheme clusters.

Unicode tries to accurately represent every language as best as it can, so it has to represent those language differences.

https://www.unicode.org/reports/tr29/

It should be possible to split based on Unicode code points (that should work for every language), however in practice you generally don't want to do that, instead what you actually want is to split based on grapheme clusters.
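A small sketch of the difference between the two ways of splitting, assuming an `'en'` locale for the segmenter. The function name `countUnits` is made up for this example.

```javascript
const clusterSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Reports the three "lengths" a string can have.
function countUnits(text) {
  return {
    codeUnits: text.length,                              // 16-bit code units
    codePoints: [...text].length,                        // Unicode code points
    graphemes: [...clusterSegmenter.segment(text)].length, // user-perceived characters
  };
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one cluster.
console.log(countUnits('e\u0301'));

// Waving hand plus skin-tone modifier: two code points, one cluster.
console.log(countUnits('\u{1F44B}\u{1F3FE}'));
```

Splitting on code points would separate the accent from its letter and the skin tone from the hand, which is why grapheme clusters are usually the unit you want for display-oriented slicing.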

@gibson042

MDN's documentation is currently pretty slim (as seen here)

We're working on that too: mdn/content#8402

Note that Intl.Segmenter is currently Stage 3, although it should reach Stage 4 at the next TC39 meeting.

@theScottyJam
Author

Ah, thanks you two. That was an interesting read @Pauan, and it helped me understand why this is a trickier problem than I thought. And it makes sense that the documentation is slim if that feature is still in a proposal phase.

Thanks!


5 participants