Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

matchAll and replaceAll scale nonlinearly #194

Open
matthewvalentine opened this issue Oct 27, 2023 · 7 comments · May be fixed by #200
Open

matchAll and replaceAll scale nonlinearly #194

matthewvalentine opened this issue Oct 27, 2023 · 7 comments · May be fixed by #200

Comments

@matthewvalentine
Copy link

matthewvalentine commented Oct 27, 2023

Observed behavior

My goal is to get a list of all matches in a string. In production, switching from native regexes to RE2 caused an outage because RE2 has worse scaling behavior on that task, at least in some cases. Something that had previously taken milliseconds ended up taking minutes.

In benchmarking a toy input (a string of N xs, matching the regex /x/g), I was surprised to find two things:

  1. Both matchAll and replaceAll scale nonlinearly in the number of matches (or maybe the input size). Specifically, they seem to be almost O(n^2). 10x longer input produces about 100x longer runtime. Native regexes do not have this behavior. In my crude benchmarking, replaceAll is rock-steady linear. And matchAll might be just barely super-linear.
  2. Using replaceAll with a replacer function that accumulates the matches into a list manually is faster (about 3x for me) than using matchAll. This seems true for native regexes as well, though not to the same degree.

The benchmarking is not robust at all, I wouldn't trust it for absolute numbers, especially since the input is very unrealistic, but it's plenty to see the scaling behavior.

Desired behavior

I have no expectation that node-re2 would out-perform native regexes here. But I do expect it to have good scaling. And I especially expect it to have scaling at least as good as native regexes. So, linear scaling would be my primary desire here. (Or - whatever the scaling that native regexes have, maybe that's not linear for some inputs. Other than the ones that trigger non-linear individual match behavior, ofc.)

Second and not very important, it's weird to me that replaceAll is the fastest way to get a list of matches, as I would have thought it must have quite a bit of overhead in constructing the new string. I suspect this must be from matchAll being iterator-based. If so, maybe it would be nice if node-re2 would offer some kind of special "get list of all matches" function that performs faster than either of them. Particularly if, for some strange reason, that could be done linearly but matchAll/replaceAll as spec'd cannot.

As far as I can tell, these two functions are really part of the node-re2 package more than they are the underlying RE2 library, so I don't think fixing the scaling is likely to require a change to RE2 itself. But I could be wrong.

Benchmarking code and results

This was on Node v18.16.1 with RE2 1.20.5.

Code:

const re2 = require('re2');

function matchAll(input, regex) {
  return [...input.matchAll(regex)];
}

function replaceAll(input, regex, replacement) {
  const matches = [];
  input.replaceAll(regex, m => {
    matches.push(m);
    return replacement;
  });
  return matches;
}

function run() {
  console.log('Size            matchAll      replaceAllEmpty    replaceAllSame    replaceAllDifferent');
  for (let i = 0; ; i++) {
    const n = Math.floor(Math.pow(10, 3 + i/3));
    console.log(n);
    for (const native of [true, false]) {
      const { matchAll, replaceAllEmpty, replaceAllSame, replaceAllDifferent } = benchmark(n, 1, native);
      function logRes(title, div) {
        console.log(
          `  ${title}`.padEnd(15),
          (matchAll / div).toFixed(3).padEnd(13),
          (replaceAllEmpty / div).toFixed(2).padEnd(18),
          (replaceAllSame / div).toFixed(2).padEnd(17),
          (replaceAllDifferent / div).toFixed(2)
        );
      }
      console.log(native ? ' Native' : ' RE2');
      logRes('Per 1k char', n / 1000);
      logRes('Per call', 1);
      console.log('');
    }
  }
}

function benchmark(inputSize, iters, native) {
  const input = (new Array(inputSize)).join('x');
  let regex = /x/g;
  if (!native) regex = new re2(regex);
  return {
    matchAll: timeit(iters, () => matchAll(input, regex)) / iters,
    replaceAllEmpty: timeit(iters, () => replaceAll(input, regex, '')) / iters,
    replaceAllSame: timeit(iters, () => replaceAll(input, regex, 'x')) / iters,
    replaceAllDifferent: timeit(iters, () => replaceAll(input, regex, 'yyyyy')) / iters,
  };
}

function timeit(n, fn) {
  const start = Date.now();
  for (let i = 0; i < n; i ++) {
    fn();
  }
  return Date.now() - start;
}

And here is some output, I stopped when it started taking too long (the numbers are ms):

Size            matchAll      replaceAllEmpty    replaceAllSame    replaceAllDifferent
1000
 Native
  Per 1k char   0.000         0.00               0.00              0.00
  Per call      0.000         0.00               0.00              0.00

 RE2
  Per 1k char   5.000         1.00               1.00              1.00
  Per call      5.000         1.00               1.00              1.00

2154
 Native
  Per 1k char   0.000         0.00               0.46              0.00
  Per call      0.000         0.00               1.00              0.00

 RE2
  Per 1k char   5.107         1.39               1.39              1.39
  Per call      11.000        3.00               3.00              3.00

4641
 Native
  Per 1k char   0.215         0.00               0.22              0.00
  Per call      1.000         0.00               1.00              0.00

 RE2
  Per 1k char   6.464         1.51               1.51              1.51
  Per call      30.000        7.00               7.00              7.00

10000
 Native
  Per 1k char   0.100         0.10               0.00              0.10
  Per call      1.000         1.00               0.00              1.00

 RE2
  Per 1k char   7.800         2.40               2.50              2.50
  Per call      78.000        24.00              25.00             25.00

21544
 Native
  Per 1k char   0.046         0.05               0.05              0.05
  Per call      1.000         1.00               1.00              1.00

 RE2
  Per 1k char   15.642        5.06               5.01              5.06
  Per call      337.000       109.00             108.00            109.00

46415
 Native
  Per 1k char   0.065         0.04               0.04              0.04
  Per call      3.000         2.00               2.00              2.00

 RE2
  Per 1k char   33.330        10.45              10.45             10.56
  Per call      1547.000      485.00             485.00            490.00

100000
 Native
  Per 1k char   0.080         0.04               0.04              0.06
  Per call      8.000         4.00               4.00              6.00

 RE2
  Per 1k char   66.640        22.24              22.11             22.12
  Per call      6664.000      2224.00            2211.00           2212.00

215443
 Native
  Per 1k char   0.097         0.04               0.07              0.04
  Per call      21.000        8.00               15.00             9.00

 RE2
  Per 1k char   142.116       47.44              47.96             47.87
  Per call      30618.000     10221.00           10333.00          10314.00

464158
 Native
  Per 1k char   0.153         0.05               0.04              0.05
  Per call      71.000        25.00              20.00             24.00

 RE2
  Per 1k char   305.876       102.56             102.59            102.92
  Per call      141975.000    47605.00           47616.00          47771.00

1000000
 Native
  Per 1k char   0.191         0.04               0.04              0.05
  Per call      191.000       43.00              45.00             48.00

 RE2
  Per 1k char   654.652       219.59             219.46            219.85
  Per call      654652.000    219593.00          219463.00         219851.00
@matthewvalentine
Copy link
Author

I should probably mention that it looked like a manual .exec() loop wasn't any faster than matchAll, but I didn't investigate that closely.

@matthewvalentine
Copy link
Author

matthewvalentine commented Oct 27, 2023

I have discovered that match with a global regex appears to scale linearly for this test. So

  1. If I can confirm that, then that's a good mitigation for me right now.
  2. It does suggest this should possibly be a solvable problem for matchAll, whatever the issue may be.
  3. Maybe, if it isn't solvable, this is ultimately down to capture groups? (Even though my example regex does not have any.) Since that's something match does not provide, but I believe matchAll does.

@uhop uhop self-assigned this Oct 29, 2023
@uhop uhop added the question label Oct 29, 2023
@uhop
Copy link
Owner

uhop commented Oct 29, 2023

You do aware that this function is quadratic:

function matchAll(input, regex) {
  return [...input.matchAll(regex)];
}

Specifically, the [...iterator] part, which adds new elements at the end reallocating its array from time to time.

It looks like replaceAll() does the same but differently.

You can even the field by using a loop:

function matchAll(input, regex) {
  const matches = [];
  for (const m of input.matchAll(regex)) {
    matches.push(m);
  }
  return matches;
}

But I doubt it'll be much different. It is likely that you measured the difference between a callback I call directly from C++ and a Node's implementation of iterators. See matchAll().

While RegExp uses a primitive algorithm (last time I checked) and used to be slower than RE2 many versions ago, nowadays it is JITed, which makes it fast in most cases. The only reasonable use case for RE2 is when we deal with "bad input", or cannot control/sanitize it, e.g., it comes from users. You can find examples of such inputs custom-tailored for a regular expression used in DOS attacks.

One user even built an evaluator for that: https://github.com/vt-iwamoto/node-re2-benchmark (listed in this project's wiki).

@matthewvalentine
Copy link
Author

matthewvalentine commented Oct 30, 2023

I fully expect RE2 to be much slower on average. Like you said, the goal here is predictable performance on bad input. This is a kind of input (a string with lots of matches) where it is not bad input for RegExp but IS bad input for node-re2.

You do aware that this function is quadratic:
Specifically, the [...iterator] part, which adds new elements at the end reallocating its array from time to time.

As far as I know appending to an array is amortized constant because it only has to reallocate a logarithmic number of times. No different from std::vector. So, apart from what the iterator is doing internally, I think that the loop itself is linear.

It is likely that you measured the difference between a callback I call directly from C++ and a Node's implementation of iterators.

Makes sense for overall performance, but surely it can't explain a difference in scaling? The callback is only executed a linear number of times.

In comparing RegExp and node-re2, the observed scaling is drastically different, even though the loop logic is identical. Different not just in constant multiple, but in exponent. It seems to me the difference can only come from what the iterator is doing internally (so, RegExp vs node-re2).

Also, I did confirm that node-re2's .match() function (with global flag) appears to be linear for this task. So it doesn't seem like a fundamental limitation of the task, nor of RE2 itself. We could throw away all the talk of RegExp and rephrase this issue ".matchAll() scales worse than .match()". Again, I'd expect it to perform worse, but not to scale worse.

@uhop
Copy link
Owner

uhop commented Nov 3, 2023

As far as I know appending to an array is amortized constant because it only has to reallocate a logarithmic number of times. No different from std::vector.

I think you are right about that.

It seems to me the difference can only come from what the iterator is doing internally (so, RegExp vs node-re2).

Again, here is a link to the node-re2's implementation of matchAll() — about a dozen lines. Feel free to improve it and submit a PR.

@matthewvalentine
Copy link
Author

matthewvalentine commented Nov 16, 2023

I've confirmed this is entirely from UTF-16 / UTF-8 translation. When using Buffer input and .useBuffers = true on the replacer, the nonlinearity disappears from both .matchAll() and .replace[All]().

This makes sense because, currently, the conversion runs on the entire input string for each call to .exec(), which means each iteration of .matchAll(). And UTF-16 index calculations are done from the beginning of the string, both for .exec() and .replace(). Meanwhile .match() only does the string conversion once, and does not have to return an index for each match.

@uhop
Copy link
Owner

uhop commented Nov 18, 2023

It is good actionable data. We can consider moving it to C++, or switching to a buffer version in JS. TBH, the C++ implementation is the bigger endeavor that can require some code restructuring to reuse the internal implementation of exec() keeping JS bridging (and type coercion) separate.

In any case, PRs are welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants