fixes #13: add String.prototype.toWellFormed (#20)
michaelficarra committed Sep 13, 2022
1 parent f2361da commit f91aaa4
Showing 2 changed files with 23 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -16,6 +16,8 @@ The WebAssembly [Component Model](https://github.com/WebAssembly/component-model

The proposal is to define in ECMA-262 a method to verify if a given ECMAScript string is well-formed or not. As a highly common scenario for interfaces between JavaScript/web APIs and those that operate on Unicode text, this test should be as efficient as possible, ideally scaling independently from the length of the string. In addition to improved performance, this method will also increase the clarity for readers of code where this test is being performed — especially those without extensive Unicode or regular expression knowledge.

The proposal also adds a method to ensure that a String is well-formed by replacing any lone or out-of-order surrogates, if any are present, with U+FFFD (REPLACEMENT CHARACTER). This operation mimics the operation already performed within various web and non-ECMAScript APIs, and reduces the chance that a consumer who would otherwise have to write this method does so in an incompatible or incorrect way.

## Algorithm

The validation algorithm is effectively the standard UTF-16 validation algorithm, iterating through the string and pairing UTF-16 surrogates, failing validation for any unpaired surrogates.
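
The pairing scan described above can be sketched in plain JavaScript. This is a minimal illustration, not the proposal's specification text: the function name `isWellFormedSketch` is hypothetical, and the sketch assumes its argument is already a string.

```javascript
// Hypothetical helper illustrating the UTF-16 validation scan.
// Returns false as soon as an unpaired surrogate is found.
function isWellFormedSketch(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // Leading surrogate: the next code unit must be a trailing surrogate.
      if (i + 1 >= str.length) return false;
      const next = str.charCodeAt(i + 1);
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++; // skip the trailing half of the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Trailing surrogate with no leading surrogate before it.
      return false;
    }
  }
  return true;
}

console.log(isWellFormedSketch("a\uD83D\uDE00b")); // paired surrogates
console.log(isWellFormedSketch("a\uD800b"));       // lone leading surrogate
```

A well-formed string costs one pass over its code units, and the scan stops early at the first unpaired surrogate.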
21 changes: 21 additions & 0 deletions spec.html
@@ -22,4 +22,25 @@ <h1>String.prototype.isWellFormed ( )</h1>
1. Return IsStringWellFormedUnicode(_S_).
</emu-alg>
</emu-clause>

<emu-clause number="11" id="sec-string.prototype.towellformed">
<h1>String.prototype.toWellFormed ( )</h1>
<p>This method returns a String representation of this object with all leading surrogates and trailing surrogates that are not part of a surrogate pair replaced with U+FFFD (REPLACEMENT CHARACTER).</p>
<p>It performs the following steps when called:</p>
<emu-alg>
1. Let _O_ be ? RequireObjectCoercible(*this* value).
1. Let _S_ be ? ToString(_O_).
1. Let _strLen_ be the length of _S_.
1. Let _k_ be 0.
1. Let _result_ be the empty String.
1. Repeat, while _k_ &lt; _strLen_,
  1. Let _cp_ be CodePointAt(_S_, _k_).
  1. If _cp_.[[IsUnpairedSurrogate]] is *true*, then
    1. Set _result_ to the string-concatenation of _result_ and 0xFFFD (REPLACEMENT CHARACTER).
  1. Else,
    1. Set _result_ to the string-concatenation of _result_ and UTF16EncodeCodePoint(_cp_.[[CodePoint]]).
  1. Set _k_ to _k_ + _cp_.[[CodeUnitCount]].
1. Return _result_.
</emu-alg>
</emu-clause>
</emu-clause>
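
For readers less familiar with spec notation, the `toWellFormed` steps above could be approximated in user-land JavaScript roughly as follows. This is a sketch under stated assumptions, not the normative algorithm: the name `toWellFormedSketch` is hypothetical, the receiver is passed as an ordinary argument instead of `this`, and the `CodePointAt` abstract operation is emulated inline with `charCodeAt`.

```javascript
// Hypothetical stand-in for String.prototype.toWellFormed.
function toWellFormedSketch(value) {
  // Approximates RequireObjectCoercible(this value).
  if (value === null || value === undefined) {
    throw new TypeError("value is not coercible to an object");
  }
  // A template literal applies ToString and, like the spec step, throws for Symbols.
  const s = `${value}`;
  const strLen = s.length;
  let k = 0;
  let result = "";
  while (k < strLen) {
    const first = s.charCodeAt(k);
    const isLead = first >= 0xD800 && first <= 0xDBFF;
    const isTrail = first >= 0xDC00 && first <= 0xDFFF;
    if (isLead && k + 1 < strLen) {
      const second = s.charCodeAt(k + 1);
      if (second >= 0xDC00 && second <= 0xDFFF) {
        // Valid surrogate pair: copy both code units (CodeUnitCount is 2).
        result += s[k] + s[k + 1];
        k += 2;
        continue;
      }
    }
    // Unpaired surrogates become U+FFFD; any other code unit is copied as-is.
    result += isLead || isTrail ? "\uFFFD" : s[k];
    k += 1;
  }
  return result;
}

console.log(toWellFormedSketch("ab\uD800") === "ab\uFFFD"); // true
```

The output always has the same number of code units as the input, because each unpaired surrogate is replaced by exactly one code unit.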
