Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util: simplify util.toUSVString #39891

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mathiasbynens
Copy link
Contributor

This patch also adds some more tests for this new feature.

Follow-up to #39814.

@mathiasbynens mathiasbynens changed the title Simplify util.toUSVString util: simplify util.toUSVString Aug 26, 2021
@nodejs-github-bot nodejs-github-bot added needs-ci PRs that need a full CI run. util Issues and PRs related to the built-in util module. labels Aug 26, 2021
@@ -57,8 +57,7 @@ const experimentalWarnings = new SafeSet();

const colorRegExp = /\u001b\[\d\d?m/g; // eslint-disable-line no-control-regex

const unpairedSurrogateRe =
/(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])/;
const unpairedSurrogateRe = /\p{Surrogate}/u;
Copy link
Member

@lpinca lpinca Aug 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slower, see #39814 (comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily. It depends on the string you use it with, just like the current solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summarizing that discussion: the performance characteristics are different before/after this change. The trade-off is as follows:

  • For strings without surrogates (the common case?), performance is equivalent.
  • For strings with a lone trail surrogate at the start, the patch performs better.
  • For strings with a lone lead surrogate at the end, the old/current version performs better.

Copy link
Member

@lpinca lpinca Aug 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be pedandic:

  • For strings without surrogates (the common case?), performance is equivalent.
  • For strings with a lone trail surrogate at the start, the patch performs sligtly better (~5%).
  • For strings with a lone lead surrogate at the end, the old/current version performs much better (~50%).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I'm good with either approach but I'm happy to keep things as they are if folks think that {Surrogate} way is too slow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you stick to the current approach, then consider renaming unpairedSurrogateRe — the current name is misleading, since it doesn’t just match unpaired surrogates. E.g. in the case of '\u{10000}\uDC00' (i.e. a surrogate pair followed by a lone surrogate), it matches '\uDC00\uDC00' (i.e. the trail surrogate of the pair + the lone surrogate) as opposed to just the lone surrogate '\uDC00' — which is fine for this specific use case, but could become problematic if someone were to re-use this pattern elsewhere. For example, in a user-land implementation:

function toUSVString(str) {
  // This works fine:
  return str.replaceAll(/\p{Surrogate}/gu, '\uFFFD');
  // …but it subtly wouldn’t work correctly with the current `unpairedSurrogateRe`.
}

Copy link
Member

@ronag ronag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m +0. Though I think performance trumps maintainability per the priorities defined in https://github.com/nodejs/next-10/blob/main/VALUES_AND_PRIORITIZATION.md. I don’t necessarily personally agree with that but that’s the current consensus.

function toUSVString(val) {
const str = `${val}`;
// As of V8 5.5, `str.search()` (and `unpairedSurrogateRe[@@search]()`) are
// slower than `unpairedSurrogateRe.exec()`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of the meta-discussion we’re having. The problem with hardcoding micro-performance optimizations is that V8 keeps optimizing for idiomatic coding patterns over time, to the point where hand-written micro-optimizations often end up being slower than the idiomatic patterns. Unless there is some mechanism to track these hardcoded assumptions over time and adjust the code if necessary, short-term micro-optimizations end up hurting performance over the longer term.

This comment about search vs. exec might’ve been true for V8 v5.5, but in more recent versions the opposite seems to be the case. My latest commit switches to using StringPrototypeSearch which makes the new version (even with the “slow” regular expression) ~21% faster:

before patch x 21,177,919 ops/sec ±0.64% (90 runs sampled)
after patch x 25,642,796 ops/sec ±0.72% (86 runs sampled)

(My colleague Peter Marshall once gave a nice presentation about this. Relevant segment: https://www.youtube.com/watch?v=1bZxs5J5n3M&t=708s)

Copy link
Member

@lpinca lpinca Aug 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but in this specific case we are trading 50% performance for this change

/(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])/

/\p{Surrogate}/u

and using String.prototype.search() instead of RegExp.prototype.exec() does not fix the 50% performance drop.

I understand that in a future V8 version /\p{Surrogate}/u might be faster for all inputs, but that is not the case now. It also makes backporting stuff harder.

I think we should strive for what is better now, not what might be better in future, otherwise things like this #39778 do no make any sense.

Unless there is some mechanism to track these hardcoded assumptions over time and adjust the code if necessary, short-term micro-optimizations end up hurting performance over the longer term.

This is the key and the reason why I'm not blocking this (because there is no such mechanism).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/\p{Surrogate}/u [...] also makes backporting stuff harder.

How so?

Copy link
Member

@lpinca lpinca Aug 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imagine changes like this that are good in newer V8 versions but bad in older ones. Should they be backported to old but still suppported release lines? Should "dont-land-on" labels be used? Or should we simply not care?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how that applies here — we shipped \p{...} in V8 v6.4. Even the oldest Maintenance LTS version of Node.js (12 at the time of writing) comes with a much more recent V8 version.

Copy link
Member

@lpinca lpinca Aug 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did not make my point clear but I think you already answered my questions with "It doesn't matter if performance is worse in older V8 versions. All that matters is that the feature is supported".

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about simply not backporting it? 🤔

@mathiasbynens mathiasbynens force-pushed the simplify-util.tousvstring branch 2 times, most recently from ad9b1f5 to 4d54445 Compare August 27, 2021 10:50
Copy link
Member

@BridgeAR BridgeAR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't land this due to builds without intl not being supported. @mathiasbynens do you have an idea how to address this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-ci PRs that need a full CI run. util Issues and PRs related to the built-in util module.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants