Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add toUSVString / toWellFormed (alongside isUSVString / isWellFormed) #13

Closed
mathiasbynens opened this issue Aug 27, 2021 · 15 comments · Fixed by #20
Closed

Add toUSVString / toWellFormed (alongside isUSVString / isWellFormed) #13

mathiasbynens opened this issue Aug 27, 2021 · 15 comments · Fixed by #20

Comments

@mathiasbynens
Copy link
Member

@annevk suggested this here: #11 (comment)

Having both methods be part of the same proposal makes sense to me.

Example toWellFormed userland implementation:

function toWellFormed(string) {
  return string.replaceAll(/\p{Surrogate}/gu, '\uFFFD');
}
@ckknight
Copy link

A possible enhancement, allowing the replacement to be user-determined:

function toWellFormed(string, replacement = '\uFFFD') {
  return string.replaceAll(/\p{Surrogate}/gu, replacement);
}

@michaelficarra
Copy link
Member

michaelficarra commented Aug 10, 2022

I can envision use cases for both isWellFormed and toWellFormed. However, since both could be implemented rather simply (as above), I see the main motivator for this proposal as the potential sub-linear time performance for isWellFormed that only engines could provide. Given that, I'm inclined to only include isWellFormed in this proposal.

@mathiasbynens
Copy link
Member Author

Note that the hypothetical sub-linear time performance benefit applies to toWellFormed as well — in the case where the string is already well-formed, toWellFormed could essentially be a no-op.

If we assume that all toWellFormed calls are gated on an isWellFormed check in userland code, this benefit disappears — but I’m not sure if I’d make that assumption.

@michaelficarra
Copy link
Member

If we assume that all toWellFormed calls are gated on an isWellFormed check in userland code, this benefit disappears — but I’m not sure if I’d make that assumption.

That is the assumption I was making. I'm curious to know why it's not safe to make that assumption.

@annevk
Copy link
Member

annevk commented Aug 15, 2022

There are many places in the web platform where we make a string well-formed before further processing it. (E.g., input to the URL parser.) Those code paths do not have separate branches for "is well-formed". I would assume userland to have similar cases.

@michaelficarra
Copy link
Member

@annevk I don't think I follow the point you're trying to make. Assuming this proposal adds an efficient built-in String.prototype.isWellFormed, a user-land implementation of toWellFormed would look something like

function toWellFormed(string) {
  return string.isWellFormed() ? string : string.replaceAll(/\p{Surrogate}/gu, '\uFFFD');
}

Are you asking for a built-in toWellFormed? If so, in what ways would it be better than this user-land implementation?

@annevk
Copy link
Member

annevk commented Aug 17, 2022

I'm saying that the web platform has many code paths that do string.replaceAll(/\p{Surrogate}/gu, '\uFFFD') (I'm assuming here your version is correct, didn't test), regardless of the actual contents of string. And as such I expect that web developers would be well served by having equivalent functionality. To what extent that operation is then optimized would be up to engine developers.

@michaelficarra
Copy link
Member

Okay, which of these better reflects your position?

  1. You expect toWellFormed to be used frequently, and therefore think it is sufficiently motivated despite it being easy to implement in user space.
  2. Though infrequent, you expect toWellFormed to be needed sometimes and there is a risk that it would be implemented differently than the platform does it, so you want to give them something easy to reach for to encourage them to do it the same as the platform.

Or am I still misunderstanding?

@mathiasbynens
Copy link
Member Author

  • You expect toWellFormed to be used frequently, and therefore think it is sufficiently motivated despite it being easy to implement in user space.

isWellFormed is just as “easy” to implement in user space:

function isWellFormed(string) {
  return !/\p{Surrogate}/u.test(string);
}

My point is I don’t understand the distinction you’re making between isWellFormed and toWellFormed; they’re both easily implemented in userland, and they could both benefit from potential engine-level optimizations. IMHO they’re equally well-motivated and would fit nicely into a single proposal.

@michaelficarra
Copy link
Member

@mathiasbynens In the presence of an engine-optimised isWellFormed, a user-space implementation of toWellFormed (that switches on the result of isWellFormed) is also optimised (I feel like we are going in circles). Given that and that the stated motivation of this proposal is entirely performance-based, what is the argument for including toWellFormed? I imagine it's one of the two I listed in my last comment, if there is one.

@domenic
Copy link
Member

domenic commented Aug 22, 2022

If the stated motivation of this proposal is entirely performance-based, perhaps it shouldn't advance unless there are actual commitments from engines to restructure their string representations to track this hypothetical bit? I'm not aware of any such plans at the moment.

@michaelficarra
Copy link
Member

@domenic Agreed. I will make sure to discuss that during the presentation to committee.

@bakkot
Copy link
Collaborator

bakkot commented Aug 22, 2022

Personally I still want this for usability/clarity reasons, fwiw. The performance part is not important to me; the part I care about is that a reader who sees

if (str.isWellFormedUnicode()) ...

is much more likely to get what's going on than a reader who sees

if (!/\p{Surrogate}/u.test(str)) ...

since the latter requires a lot more background knowledge.


That said, it is my understanding that some engines (at least V8) do already track a distinction between ASCII and other strings (or rather "strings whose UTF-16 code units are all < 256", which is not quite "ASCII"), and therefore can at least have a fast path for this API for ASCII strings, which is already better performance (in that common case) than userland is capable of.

@annevk
Copy link
Member

annevk commented Aug 23, 2022

Yeah, I mainly see the benefit of these features in terms of clarity, not performance. And from that perspective you want both.

(I don't like the Unicode suffix though. Strings are already Unicode and we don't really expose that term directly anywhere I think.)

@michaelficarra
Copy link
Member

FYI I've opened #20 to add String.prototype.toWellFormed. I'll wait until after feedback from plenary to merge, assuming the committee thinks it's worth it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants