Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util: expose toUSVString #39814

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
12 changes: 12 additions & 0 deletions doc/api/util.md
Expand Up @@ -2492,6 +2492,18 @@ const util = require('util');
util.log('Timestamped message.');
```


### `util.toUSVString(string)`
ronag marked this conversation as resolved.
Show resolved Hide resolved
<!-- YAML
added: REPLACEME
-->

* `string` {string}

Returns the `string` after replacing any surrogate code points
(or equivalently, any unpaired surrogate code units) with the
Unicode "replacement character" U+FFFD.

[Common System Errors]: errors.md#errors_common_system_errors
[Custom inspection functions on objects]: #util_custom_inspection_functions_on_objects
[Custom promisified functions]: #util_custom_promisified_functions
Expand Down
21 changes: 6 additions & 15 deletions lib/internal/url.js
Expand Up @@ -19,7 +19,6 @@ const {
ReflectApply,
ReflectGetOwnPropertyDescriptor,
ReflectOwnKeys,
RegExpPrototypeExec,
String,
StringPrototypeCharCodeAt,
StringPrototypeIncludes,
Expand All @@ -40,7 +39,12 @@ const {
isHexTable
} = require('internal/querystring');

const { getConstructorOf, removeColors } = require('internal/util');
const {
getConstructorOf,
removeColors,
toUSVString,
} = require('internal/util');

const {
ERR_ARG_NOT_ITERABLE,
ERR_INVALID_ARG_TYPE,
Expand Down Expand Up @@ -79,7 +83,6 @@ const {
domainToASCII: _domainToASCII,
domainToUnicode: _domainToUnicode,
encodeAuth,
toUSVString: _toUSVString,
parse,
setURLConstructor,
URL_FLAGS_CANNOT_BE_BASE,
Expand Down Expand Up @@ -113,18 +116,6 @@ const IteratorPrototype = ObjectGetPrototypeOf(
ObjectGetPrototypeOf([][SymbolIterator]())
);

const unpairedSurrogateRe =
/(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
function toUSVString(val) {
const str = `${val}`;
// As of V8 5.5, `str.search()` (and `unpairedSurrogateRe[@@search]()`) are
// slower than `unpairedSurrogateRe.exec()`.
const match = RegExpPrototypeExec(unpairedSurrogateRe, str);
if (!match)
return str;
return _toUSVString(str, match.index);
}

// Refs: https://html.spec.whatwg.org/multipage/browsers.html#concept-origin-opaque
const kOpaqueOrigin = 'null';

Expand Down
18 changes: 18 additions & 0 deletions lib/internal/util.js
Expand Up @@ -17,6 +17,7 @@ const {
Promise,
ReflectApply,
ReflectConstruct,
RegExpPrototypeExec,
RegExpPrototypeTest,
SafeMap,
SafeSet,
Expand All @@ -27,6 +28,10 @@ const {
SymbolFor,
} = primordials;

const {
toUSVString: _toUSVString,
} = internalBinding('url');

const {
hideStackFrames,
codes: {
Expand All @@ -53,6 +58,18 @@ const experimentalWarnings = new SafeSet();

const colorRegExp = /\u001b\[\d\d?m/g; // eslint-disable-line no-control-regex

const unpairedSurrogateRe =
/(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
/\p{Surrogate}/u;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Patch: #39891

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried both this and negative lookbehind but they are slower.

import Benchmark from 'benchmark';

const re1 =
  /(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
const re2 =
  /(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
const re3 = /\p{Surrogate}/u;

const str = 'foo bar baz qux\uD800';

const suite = new Benchmark.Suite();

suite.add('re1', function () {
  re1.exec(str);
});

suite.add('re2', function () {
  re2.exec(str);
});

suite.add('re3', function () {
  re3.exec(str);
});

suite.on('cycle', function (event) {
  console.log(String(event.target));
});

suite.on('complete', function () {
  console.log('Fastest is ' + this.filter('fastest').map('name'));
});

suite.run({ async: true });
re1 x 11,864,933 ops/sec ±1.84% (87 runs sampled)
re2 x 8,003,257 ops/sec ±1.38% (90 runs sampled)
re3 x 7,856,998 ops/sec ±1.54% (88 runs sampled)
Fastest is re1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s currently slower for the one specific scenario you’re testing, yes. Change the test string to something like:

const str = 'foo bar baz qux\uD800 foo \uDBFF foo \uDC00 foo \uDFFF';

…and now re3 is the second fastest:

re1 x 12,680,512 ops/sec ±0.75% (89 runs sampled)
re2 x 8,116,354 ops/sec ±0.68% (86 runs sampled)
re3 x 8,460,972 ops/sec ±3.11% (89 runs sampled)
Fastest is re1

Change it to

const str = '\uD800_\uDC00 foo \uD800 foo \uDBFF foo \uDC00 foo \uDFFF';

…and now re3 wins the race:

re1 x 22,303,784 ops/sec ±0.57% (89 runs sampled)
re2 x 21,924,367 ops/sec ±0.56% (86 runs sampled)
re3 x 23,288,295 ops/sec ±0.91% (87 runs sampled)
Fastest is re3

It’s possible to construct a benchmark that shows any three options as the “fastest”, but I don’t think it’s very meaningful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and now re3 wins the race

by a tiny margin and I think because the surrogate is at the beginning of the string.

Copy link
Member Author

@ronag ronag Aug 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, we should optimize for the case where there are no surrogates. We could even have false positives if that helps with performance since this is just an optimization to avoid calling into native code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A benchmark where re3 is the fastest option by 2%? 4%? 10%

A benchmark where re3 is the fastest by 100%.

These are artificially constructed microbenchmarks, not real-world benchmarks. IMHO any of the three options seems fine from a performance perspective, and given that, we might as well go with the most readable & idiomatic option.

I agree, no one will notice this in a real world application but switching to a slower option for a minor readability improvement seems a bit silly to me. The original regex is not that complex. Also, fwiw, I think re2 is the most readable / easier to grok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, we should optimize for the case where there are no surrogates.

@ronag I tend to agree — that’s probably the common case. All three solutions perform similarly in this case:

const str = 'a'.repeat(1024 * 1024);
re1 x 48,489,097 ops/sec ±2.01% (92 runs sampled)
re2 x 47,831,569 ops/sec ±0.67% (89 runs sampled)
re3 x 48,501,127 ops/sec ±0.68% (89 runs sampled)
Fastest is re3

Do we all agree that all other things equal, \p{Surrogate} is the simplest, most readable pattern? Because in that case, this discussion becomes a matter of choosing which specific use case we optimize for: the one from @lpinca’s benchmark (lone lead surrogate at the end of a short string) or the one from mine (lone trail surrogate at the start of a short string). I think both are edge cases, and I doubt one is more common than the other.

A benchmark where re3 is the fastest option by 2%? 4%? 10%

A benchmark where re3 is the fastest by 100%.

@lpinca So you’d reject a readability improvement unless it also happens to double performance? Woah, that’s a high bar :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a 2x performance drop does not justify a minor readability improvement in my opinion. Anyway I'm not blocking #39891. Just wanted to report this because I tried with the negative lookbehind regex (re2) some time ago and decided to not open a PR because it was slower.

Copy link
Member

@lpinca lpinca Aug 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you’d reject a readability improvement unless it also happens to double performance? Woah, that’s a high bar :)

FWIW I would be happy with this readability improvement if performance was more or less the same for all inputs. This seems to be the case only for some inputs. Some inputs show a 100% performance drop, in those cases I would like to see a 100% improvement. I don't want it to be always 100% faster. If performance is the same ± 10% for all inputs, then that's great.

function toUSVString(val) {
const str = `${val}`;
// As of V8 5.5, `str.search()` (and `unpairedSurrogateRe[@@search]()`) are
// slower than `unpairedSurrogateRe.exec()`.
const match = RegExpPrototypeExec(unpairedSurrogateRe, str);
if (!match)
return str;
return _toUSVString(str, match.index);
}

let uvBinding;

function lazyUv() {
Expand Down Expand Up @@ -487,6 +504,7 @@ module.exports = {
sleep,
spliceOne,
structuredClone,
toUSVString,
removeColors,

// Symbol used to customize promisify conversion
Expand Down
4 changes: 3 additions & 1 deletion lib/util.js
Expand Up @@ -72,7 +72,8 @@ const {
deprecate,
getSystemErrorMap,
getSystemErrorName: internalErrorName,
promisify
promisify,
toUSVString,
} = require('internal/util');

let internalDeepEqual;
Expand Down Expand Up @@ -368,6 +369,7 @@ module.exports = {
isPrimitive,
log,
promisify,
toUSVString,
TextDecoder,
TextEncoder,
types
Expand Down
2 changes: 2 additions & 0 deletions test/parallel/test-util.js
Expand Up @@ -148,6 +148,8 @@ assert.strictEqual(util.isFunction(function() {}), true);
assert.strictEqual(util.isFunction(), false);
assert.strictEqual(util.isFunction('string'), false);

assert.strictEqual(util.toUSVString('string\ud801'), 'string\ufffd');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we’d test lone lead surrogates, lone trail surrogates, and also surrogate pairs (to ensure those are unaffected). Patch: #39891


{
assert.strictEqual(util.types.isNativeError(new Error()), true);
assert.strictEqual(util.types.isNativeError(new TypeError()), true);
Expand Down