Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do UTF-16 / UTF-8 conversion in JS for matchAll and replace, fixes #194 #200

Conversation

matthewvalentine
Copy link

@matthewvalentine matthewvalentine commented Dec 8, 2023

This should fix #194. It's not complete in that it does not include sufficient comments, testing, or benchmarking (beyond confirming that it fixes the scaling issue). It may even be functionally broken. But I want to check that this is the general approach you'd prefer. There is very little change to C++ code, but the JS changes are fairly involved, and it's a very bolt-on-wrapper kind of solution.

There is presumably a certain amount of overhead for small inputs that didn't have scaling problems, due to the extra JS overhead. Doing the changes in C++ instead might avoid that. But I have not benchmarked it. I also don't know whether using the C++ getUtf16Length as I am is faster or slower than doing the same logic in JS.

In theory after these changes the C++ replace function could be simplified to only deal with buffer/UTF-8 inputs and outputs, since JS is handling the conversion.

Note that even with these changes, the scaling issue in #194 still exists for hand-written exec loops. That would be a much harder problem to solve. I think it would involve keeping state. And even with that, you'd still have to somehow be able to tell whether the input string to exec is the same string as before, preferably without holding a strong reference to the string.

@matthewvalentine
Copy link
Author

In fact, I know it is functionally broken because it fails the existing tests. Still, I'm look for feedback on the approach.

@matthewvalentine
Copy link
Author

matthewvalentine commented Dec 8, 2023

I split what used to be this comment into #201 since it is about master and not this PR.

Kept for posterity.

Hm, in trying to figure out why it breaks the tests, I've either come across a different bug or I am misunderstanding something. On master, without these changes, the indexes returned by execing a Buffer input seem incorrect.

This code:

const RE2 = require('./re2');
const r = new RE2('.', 'g');
const b = Buffer.from('test1test2');
while ((m = r.exec(b))) {
	console.log(m[0].toString(), m.index, r.lastIndex, String.fromCharCode(b[m.index]));
}

produces this output:

t 0 1 t
e 2 2 s
s 4 3 1
t 6 4 e
1 8 5 t
t 10 6
e 12 7
s 14 8
t 16 9
2 18 10

Compared to the same sort of thing with a string input:

t 0 1 t
e 1 2 e
s 2 3 s
t 3 4 t
1 4 5 1
t 5 6 t
e 6 7 e
s 7 8 s
t 8 9 t
2 9 10 2

The .index value in the Buffer input case doesn't seem to correspond to the value returned as the match, and also doesn't line up with the resulting .lastIndex. In fact, .index even starts exceeding the buffer length.

Anyway, a least one group of test failures for the changes in this PR are because, by switching to all-buffers, matchAll is now inheriting that same behvior.

@uhop
Copy link
Owner

uhop commented Dec 22, 2023

I was thinking about a fix like that:

  • If it is a buffer: do it like we do now.
  • If it is a string: convert it to a buffer. See above.
    • Converting a string to a UTF-8 buffer (and back) is trivial. Node's Buffer supports it directly.

The only problem is to convert offsets from UTF-8 to UTF-16. Plus we need to convert buffers to strings. We can do matchAll() as a post-processing step. So the whole matchAll() can be a JS function. We will need a modified GetUtf16Length() function, but it looks like you already started on that.

Relevant Buffer APIs:

replaceAll() is a little more involved because of callbacks from C++. That one can be implemented like this:

  • If it is a buffer: do it like we do now.
  • If it is a string: convert it to a buffer. See above.

The post-processing step should be in the C++ code before/if calling a JS callback. We can indicate it using a custom flag, e.g., using '\b' as an indicator. Checking this boolean flag can be added to the existing code, which does this conversion already (internally re2 works with UTF-8 buffers anyway).

This way the replaceAll() code would include:

const re = new RE2(this, this.flags + '\b');

It will hide the fact that the source is a buffer.

A possible twist: we can do the post-processing for matchAll() in C++ too if it is more efficient. I doubt it'll save much but the proof in the benchmarking. For that, we can use the same '\b' flag.

I hope it makes sense. The idea was to minimize C++ changes yet fix inefficiencies. If readers think that my rambles were incoherent do not hesitate to ask questions and poke holes in the ideas.

@uhop uhop closed this in e82ca27 May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

matchAll and replaceAll scale nonlinearly
2 participants