Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exploring upstreaming "fast-string-width" #57

Closed
fabiospampinato opened this issue Feb 25, 2024 · 4 comments
Closed

Exploring upstreaming "fast-string-width" #57

fabiospampinato opened this issue Feb 25, 2024 · 4 comments

Comments

@fabiospampinato
Copy link

fabiospampinato commented Feb 25, 2024

Hey 馃憢 Following bun's announcement of a supposedly 100_000x faster built-in "stringWidth" function I tried to rewrite this function in JS with performance in mind, and I think I got some good results.

I would be interested in trying to upstream the optimizations, so that they can actually reach people, would this be something that you could be interested in also?

If so, I think there are currently the following differences between the implementations that would need to be handled one way or the other:

  1. Intl.Segmenter is borderline buggy in V8, in the sense that it's way slower than the JSC equivalent, and it gets massively slower and slower the longer the input string is, so I'm not using that and instead I'm stripping out all the combining marks with a regex, assuming that's what \p{M} actually matches, this stuff seems poorly documented. There may be other subtle differences here, I don't understand exactly what Intl.Segmenter does. Also Intl.Segmenter seems unavailable in Firefox at the moment, and by default I prefer if my code could run ~everywhere, so that's another reason for temporarily dropping Intl.Segmenter I guess.
  2. I'm using a slightly simplified regex for matching ANSI escapes. Potentially the regex from ansi-regex could be used instead, but in my tests it slowed things down a bit in some cases, even though in my benchmarks there was no additional ANSI escape to match that this more robust regex was accounting for, so maybe that could be tweaked somehow to be just as fast as the simplified regex, or potentially we could just switch to ansi-regex anyway, the difference in performance wasn't huge.
  3. I'm using a simpler regex for matching emojis, mostly out of convenience, but I haven't checked what the difference in performance would be like with the one from emoji-regex. The regex I'm using is potentially slightly problematic because it can consider emojis joined by a ZWJ as a single emoji even if they are not supposed to collapse into a single emoji like some family emojis are.
  4. I'm using a different options object, one that can be used to set the width of full-width, wide, emoji, and other classes of characters, mainly because how much space a string occupies depends on the context really, so if these numbers are not customizable there are scenarios where this function just can't give us the expected result.
  5. I'm inlining some of the lookup functions from get-east-asian-width, since those don't seem to be exported from a standalone package. I could have potentially called getEastAsianWidth directly, but mainly it just doesn't expose the options that I want.
  6. My implementation is measurably slower on CJK inputs for example, it would be nice to remove that regression somehow.
  7. I haven't implemented it yet, but since I've only seen this function being used for CLIs for the purpose of truncating and padding I think there's actually a better algorithm for that, so I'd like to also expose a function that somehow tells people where to truncate the string, ignoring the width of whatever follows the truncation point since we most probably don't actually care about that, in some cases.

Basically my reimplementation is not immediately mergable, but maybe it could be merged after some changes?

Either way pretty much the entire speedup comes from not using Intl.Segmenter and from trying to do as little as possible for common latin/ansi characters, by just matching them with some simple regexes, so maybe it's more practical to just port those two changes.

@sindresorhus
Copy link
Owner

sindresorhus commented Feb 26, 2024

I would be happy to improve performance here, but keep in mind that correctness is just as important.

I also want to make it clear though that string-width is already plenty fast for most use-cases and any improvement will have dubvious benefit in real-world scenarios. You don't generally call string-width millions of times a second. I think terminal animations are when it would be called most often, and even then we are talking 30 times a second.

What Node.js version did you use to measure the performance improvements?


  1. Instead of switching from Intl.Segmenter, I would recommend opening an issue on V8. You could fix performance for this specific package or help get Int.Segmenter improved and fix it for the whole ecosystem, which would have a much bigger impact. I'm also sceptical of the change because you don't seem to know exactly what Intl.Segmenter does.

  2. Stripping ANSI escapes correctly is complicated. I'm sure your simplified regex is missing many edge-cases. I would prefer to keep using strip-ansi.

  3. I plan to use /\p{RGI_Emoji}/v when targeting Node.js 20.

  4. Not sure I see the value of this. The example in your readme is flawed. Emojis may be 1.5 in your terminal, but that can vary. Also, setting emoji width to 1.5 will only work in that specific example where you have the string defined statically, and then you don't even need this package as you can just use a constant for the width. But I'm happy to consider if there are some convincing arguments.

  5. I'm happy to expose more things there. We exposed what we needed. Probably good to open an issue first though.

@fabiospampinato
Copy link
Author

fabiospampinato commented Feb 26, 2024

I also want to make it clear though that string-width is already plenty fast for most use-cases and any improvement will have dubvious benefit in real-world scenarios.

That's basically what I've said to Jarred, it's probably going to be fast enough in most cases, and for truncation there are better algorithms anyway.

However I didn't know that Intl.Segmenter goes quadratic, everything else being equal I'd rather build upon linear algorithms instead when possible.

What Node.js version did you use to measure the performance improvements?

I can reproduce the slowness of Intl.Segmenter in Node v18, v20 and v21. If you want to reproduce the problem yourself, under Node v21.6.2 (latest) with this code:

import stringWidth from 'string-width';
import fastStringWidth from 'fast-string-width';

const input = 'a'.repeat ( 25_000 );

console.time ( 'string-width' );
stringWidth ( input );
console.timeEnd ( 'string-width' );

console.time ( 'fast-string-width' );
fastStringWidth ( input );
console.timeEnd ( 'fast-string-width' );

I see this output:

string-width: 104.71ms
fast-string-width: 0.091ms

Using a 25k string seems kind of an edge case, but it's to illustrate that there is in fact a problem in V8's Intl.Segmenter implementation, and it's not like things could just be marginally faster and it wouldn't matter in any case.

Instead of switching from Intl.Segmenter, I would recommend opening an issue on V8. You could fix performance for this specific package or help get Int.Segmenter improved and fix it for the whole ecosystem, which would have a much bigger impact.

There was an issue already. And as of 3 hours ago it looks like the problem that makes Intl.Segmenter go quadratic has been fixed. Presumably it might take some years until this fix reaches most Node users.

I'm also sceptical of the change because you don't seem to know exactly what Intl.Segmenter does.

Sure, do you know of a case where the approach I'm using breaks because I'm not using Intl.Segmenter? I don't have complete knowledge of Unicode and these algorithms so I guess there could definitely be some. My current thinking is that if all combining marks are deleted what's left are the only things that we need to measure, as presumably there are no combining marks that change the width of whatever they are combined with, especially in a CLI-context where things have to fit within an integer number of columns, but that may indeed be an incorrect assumption.

Stripping ANSI escapes correctly is complicated. I'm sure your simplified regex is missing many edge-cases. I would prefer to keep using strip-ansi.

Sure 馃憤

I plan to use /\p{RGI_Emoji}/v when targeting Node.js 20.

Sure 馃憤

Not sure I see the value of this. The example in your readme is flawed. Emojis may be 1.5 in your terminal, but that can vary. Also, setting emoji width to 1.5 will only work in that specific example where you have the string defined statically, and then you don't even need this package as you can just use a constant for the width. But I'm happy to consider if there are some convincing arguments.

I'm not sure I'm following. In the example in the readme I'm not even using the returned number for anything, so where are you seeing the flaw?

The idea is that for example if you look at this picture CJK characters don't take 2 spaces each in CodeMirror, but 1.666 spaces each, I guess, so if word wrapping were to be enabled in CM, then the editor could potentially use this function with custom width values to understand if a line has to be wrapped or not, while if these numbers are not customizable this function wouldn't be able to tell the editor a number that makes sense for it.

Basically if one is under a situation where emojis/CJKs/whatever are rendered in something other than 2 spaces then that number must be customizable or string-width would be useless in that context.

I'm happy to expose more things there. We exposed what we needed. Probably good to open an issue first though.

馃憤


By the way Intl.Segmenter doesn't seem buggy in JSC and I still see a fairly big difference in performance there:

Screen Shot 2024-02-26 at 12 57 16

It's unclear what that difference will look like in real applications though.


To summarize, as I understand it: you prefer the way string-width currently works, and while using Intl.Segmenter could potentially be a performance problem the issue has been already fixed upstream so for users that's just going to fix itself over time, and the additional changes don't seem particularly worth it, and if I'd like to propose some new features I should open a dedicated issue for each of these features first. Right?

I'm ok with whatever path you think it's best, I'm just trying to understand if there's something you'd want me to do here, it seems like there may not actually be any particularly interesting change that you'd like me to make in string-width?

@sindresorhus
Copy link
Owner

There was an issue already. And as of 3 hours ago it looks like the problem that makes Intl.Segmenter go quadratic has been fixed. Presumably it might take some years until this fix reaches most Node users.

The Node.js team sometimes backport bug fixes.

do you know of a case where the approach I'm using breaks because I'm not using Intl.Segmenter?

No. I'm using Intl.Segmenter so I don't need to know.

To summarize, as I understand it: you prefer the way string-width currently works, and while using Intl.Segmenter could potentially be a performance problem the issue has been already fixed upstream so for users that's just going to fix itself over time, and the additional changes don't seem particularly worth it, and if I'd like to propose some new features I should open a dedicated issue for each of these features first. Right?

Correct

it seems like there may not actually be any particularly interesting change that you'd like me to make in string-width?

Yeah. It's a good exploration, but it seems like the most valuable improvements will come from Node.js itself.

@fabiospampinato
Copy link
Author

Closing as there seems to be nothing left to do.


do you know of a case where the approach I'm using breaks because I'm not using Intl.Segmenter?

No. I'm using Intl.Segmenter so I don't need to know.

By the way I'd like to point out how for the string in this issue the Intl.Segmenter approach is giving out the wrong result, while my approach of stripping out combining marks is producing the correct number, so it may be worth digging deeper into this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants