Add `toUSVString` / `toWellFormed` (alongside `isUSVString` / `isWellFormed`) #13

mathiasbynens · 2021-08-27T07:31:08Z

@annevk suggested this here: #11 (comment)

Having both methods be part of the same proposal makes sense to me.

Example toWellFormed userland implementation:

function toWellFormed(string) {
  return string.replaceAll(/\p{Surrogate}/gu, '\uFFFD');
}

The text was updated successfully, but these errors were encountered:

ckknight · 2021-08-31T22:08:39Z

A possible enhancement, allowing the replacement to be user-determined:

function toWellFormed(string, replacement = '\uFFFD') {
  return string.replaceAll(/\p{Surrogate}/gu, replacement);
}

michaelficarra · 2022-08-10T17:58:23Z

I can envision use cases for both isWellFormed and toWellFormed. However, since both could be implemented rather simply (as above), I see the main motivator for this proposal as the potential sub-linear time performance for isWellFormed that only engines could provide. Given that, I'm inclined to only include isWellFormed in this proposal.

mathiasbynens · 2022-08-11T05:17:24Z

Note that the hypothetical sub-linear time performance benefit applies to toWellFormed as well — in the case where the string is already well-formed, toWellFormed could essentially be a no-op.

If we assume that all toWellFormed calls are gated on an isWellFormed check in userland code, this benefit disappears — but I’m not sure if I’d make that assumption.

michaelficarra · 2022-08-12T16:24:17Z

If we assume that all toWellFormed calls are gated on an isWellFormed check in userland code, this benefit disappears — but I’m not sure if I’d make that assumption.

That is the assumption I was making. I'm curious to know why it's not safe to make that assumption.

annevk · 2022-08-15T07:50:06Z

There are many places in the web platform where we make a string well-formed before further processing it. (E.g., input to the URL parser.) Those code paths do not have separate branches for "is well-formed". I would assume userland to have similar cases.

michaelficarra · 2022-08-15T14:56:31Z

@annevk I don't think I follow the point you're trying to make. Assuming this proposal adds an efficient built-in String.prototype.isWellFormed, a user-land implementation of toWellFormed would look something like

function toWellFormed(string) {
  return string.isWellFormed() ? string : string.replaceAll(/\p{Surrogate}/gu, '\uFFFD');
}

Are you asking for a built-in toWellFormed? If so, in what ways would it be better than this user-land implementation?

annevk · 2022-08-17T16:03:34Z

I'm saying that the web platform has many code paths that do string.replaceAll(/\p{Surrogate}/gu, '\uFFFD') (I'm assuming here your version is correct, didn't test), regardless of the actual contents of string. And as such I expect that web developers would be well served by having equivalent functionality. To what extent that operation is then optimized would be up to engine developers.

michaelficarra · 2022-08-17T18:29:16Z

Okay, which of these better reflects your position?

You expect toWellFormed to be used frequently, and therefore think it is sufficiently motivated despite it being easy to implement in user space.
Though infrequent, you expect toWellFormed to be needed sometimes and there is a risk that it would be implemented differently than the platform does it, so you want to give them something easy to reach for to encourage them to do it the same as the platform.

Or am I still misunderstanding?

mathiasbynens · 2022-08-22T09:22:53Z

You expect toWellFormed to be used frequently, and therefore think it is sufficiently motivated despite it being easy to implement in user space.

isWellFormed is just as “easy” to implement in user space:

function isWellFormed(string) {
  return !/\p{Surrogate}/u.test(string);
}

My point is I don’t understand the distinction you’re making between isWellFormed and toWellFormed; they’re both easily implemented in userland, and they could both benefit from potential engine-level optimizations. IMHO they’re equally well-motivated and would fit nicely into a single proposal.

michaelficarra · 2022-08-22T16:51:36Z

@mathiasbynens In the presence of an engine-optimised isWellFormed, a user-space implementation of toWellFormed (that switches on the result of isWellFormed) is also optimised (I feel like we are going in circles). Given that and that the stated motivation of this proposal is entirely performance-based, what is the argument for including toWellFormed? I imagine it's one of the two I listed in my last comment, if there is one.

domenic · 2022-08-22T23:19:25Z

If the stated motivation of this proposal is entirely performance-based, perhaps it shouldn't advance unless there are actual commitments from engines to restructure their string representations to track this hypothetical bit? I'm not aware of any such plans at the moment.

michaelficarra · 2022-08-22T23:27:48Z

@domenic Agreed. I will make sure to discuss that during the presentation to committee.

bakkot · 2022-08-22T23:46:23Z

Personally I still want this for usability/clarity reasons, fwiw. The performance part is not important to me; the part I care about is that a reader who sees

if (str.isWellFormedUnicode()) ...

is much more likely to get what's going on than a reader who sees

if (!/\p{Surrogate}/u.test(str)) ...

since the latter requires a lot more background knowledge.

That said, it is my understanding that some engines (at least V8) do already track a distinction between ASCII and other strings (or rather "strings whose UTF-16 code units are all < 256", which is not quite "ASCII"), and therefore can at least have a fast path for this API for ASCII strings, which is already better performance (in that common case) than userland is capable of.

annevk · 2022-08-23T10:17:47Z

Yeah, I mainly see the benefit of these features in terms of clarity, not performance. And from that perspective you want both.

(I don't like the Unicode suffix though. Strings are already Unicode and we don't really expose that term directly anywhere I think.)

michaelficarra · 2022-08-26T18:10:28Z

FYI I've opened #20 to add String.prototype.toWellFormed. I'll wait until after feedback from plenary to merge, assuming the committee thinks it's worth it.

guybedford mentioned this issue Aug 27, 2021

some prior art in node #11

Closed

mathiasbynens mentioned this issue Aug 31, 2021

Pick a developer-friendly name (avoid “USV”) #12

Closed

theScottyJam mentioned this issue Sep 1, 2021

Adding a UnicodeString type? #15

Closed

michaelficarra added a commit that referenced this issue Aug 26, 2022

fixes #13: add String.prototype.toWellFormed

2696882

michaelficarra mentioned this issue Aug 26, 2022

fixes #13: add String.prototype.toWellFormed #20

Merged

michaelficarra closed this as completed in #20 Sep 13, 2022

michaelficarra added a commit that referenced this issue Sep 13, 2022

fixes #13: add String.prototype.toWellFormed (#20)

f91aaa4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `toUSVString` / `toWellFormed` (alongside `isUSVString` / `isWellFormed`) #13

Add `toUSVString` / `toWellFormed` (alongside `isUSVString` / `isWellFormed`) #13

mathiasbynens commented Aug 27, 2021

ckknight commented Aug 31, 2021

michaelficarra commented Aug 10, 2022 •

edited

mathiasbynens commented Aug 11, 2022

michaelficarra commented Aug 12, 2022

annevk commented Aug 15, 2022

michaelficarra commented Aug 15, 2022

annevk commented Aug 17, 2022

michaelficarra commented Aug 17, 2022

mathiasbynens commented Aug 22, 2022

michaelficarra commented Aug 22, 2022

domenic commented Aug 22, 2022

michaelficarra commented Aug 22, 2022

bakkot commented Aug 22, 2022 •

edited

annevk commented Aug 23, 2022

michaelficarra commented Aug 26, 2022

Add toUSVString / toWellFormed (alongside isUSVString / isWellFormed) #13

Add toUSVString / toWellFormed (alongside isUSVString / isWellFormed) #13

Comments

mathiasbynens commented Aug 27, 2021

ckknight commented Aug 31, 2021

michaelficarra commented Aug 10, 2022 • edited

mathiasbynens commented Aug 11, 2022

michaelficarra commented Aug 12, 2022

annevk commented Aug 15, 2022

michaelficarra commented Aug 15, 2022

annevk commented Aug 17, 2022

michaelficarra commented Aug 17, 2022

mathiasbynens commented Aug 22, 2022

michaelficarra commented Aug 22, 2022

domenic commented Aug 22, 2022

michaelficarra commented Aug 22, 2022

bakkot commented Aug 22, 2022 • edited

annevk commented Aug 23, 2022

michaelficarra commented Aug 26, 2022

Add `toUSVString` / `toWellFormed` (alongside `isUSVString` / `isWellFormed`) #13

Add `toUSVString` / `toWellFormed` (alongside `isUSVString` / `isWellFormed`) #13

michaelficarra commented Aug 10, 2022 •

edited

bakkot commented Aug 22, 2022 •

edited