-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
matchAll and replaceAll scale nonlinearly #194
Comments
I should probably mention that it looked like a manual .exec() loop wasn't any faster than matchAll, but I didn't investigate that closely. |
I have discovered that
|
You do aware that this function is quadratic: function matchAll(input, regex) {
return [...input.matchAll(regex)];
} Specifically, the It looks like You can even the field by using a loop: function matchAll(input, regex) {
const matches = [];
for (const m of input.matchAll(regex)) {
matches.push(m);
}
return matches;
} But I doubt it'll be much different. It is likely that you measured the difference between a callback I call directly from C++ and a Node's implementation of iterators. See matchAll(). While One user even built an evaluator for that: https://github.com/vt-iwamoto/node-re2-benchmark (listed in this project's wiki). |
I fully expect
As far as I know appending to an array is amortized constant because it only has to reallocate a logarithmic number of times. No different from
Makes sense for overall performance, but surely it can't explain a difference in scaling? The callback is only executed a linear number of times. In comparing Also, I did confirm that |
I think you are right about that.
Again, here is a link to the |
I've confirmed this is entirely from UTF-16 / UTF-8 translation. When using This makes sense because, currently, the conversion runs on the entire input string for each call to |
It is good actionable data. We can consider moving it to C++, or switching to a buffer version in JS. TBH, the C++ implementation is the bigger endeavor that can require some code restructuring to reuse the internal implementation of In any case, PRs are welcome. |
Observed behavior
My goal is to get a list of all matches in a string. In production, switching from native regexes to RE2 caused an outage because RE2 has worse scaling behavior on that task, at least in some cases. Something that had previously taken milliseconds ended up taking minutes.
In benchmarking a toy input (a string of N
x
s, matching the regex/x/g
), I was surprised to find two things:matchAll
andreplaceAll
scale nonlinearly in the number of matches (or maybe the input size). Specifically, they seem to be almost O(n^2). 10x longer input produces about 100x longer runtime. Native regexes do not have this behavior. In my crude benchmarking,replaceAll
is rock-steady linear. AndmatchAll
might be just barely super-linear.replaceAll
with a replacer function that accumulates the matches into a list manually is faster (about 3x for me) than usingmatchAll
. This seems true for native regexes as well, though not to the same degree.The benchmarking is not robust at all, I wouldn't trust it for absolute numbers, especially since the input is very unrealistic, but it's plenty to see the scaling behavior.
Desired behavior
I have no expectation that node-re2 would out-perform native regexes here. But I do expect it to have good scaling. And I especially expect it to have scaling at least as good as native regexes. So, linear scaling would be my primary desire here. (Or - whatever the scaling that native regexes have, maybe that's not linear for some inputs. Other than the ones that trigger non-linear individual match behavior, ofc.)
Second and not very important, it's weird to me that
replaceAll
is the fastest way to get a list of matches, as I would have thought it must have quite a bit of overhead in constructing the new string. I suspect this must be frommatchAll
being iterator-based. If so, maybe it would be nice if node-re2 would offer some kind of special "get list of all matches" function that performs faster than either of them. Particularly if, for some strange reason, that could be done linearly but matchAll/replaceAll as spec'd cannot.As far as I can tell, these two functions are really part of the node-re2 package more than they are the underlying RE2 library, so I don't think fixing the scaling is likely to require a change to RE2 itself. But I could be wrong.
Benchmarking code and results
This was on Node v18.16.1 with RE2 1.20.5.
Code:
And here is some output, I stopped when it started taking too long (the numbers are ms):
The text was updated successfully, but these errors were encountered: