Do UTF-16 / UTF-8 conversion in JS for matchAll and replace, fixes #194 #200
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This should fix #194. It's not complete in that it does not include sufficient comments, testing, or benchmarking (beyond confirming that it fixes the scaling issue). It may even be functionally broken. But I want to check that this is the general approach you'd prefer. There is very little change to C++ code, but the JS changes are fairly involved, and it's a very bolt-on-wrapper kind of solution.
There is presumably a certain amount of overhead for small inputs that didn't have scaling problems, due to the extra JS overhead. Doing the changes in C++ instead might avoid that. But I have not benchmarked it. I also don't know whether using the C++
getUtf16Length
as I am is faster or slower than doing the same logic in JS.In theory after these changes the C++
replace
function could be simplified to only deal with buffer/UTF-8 inputs and outputs, since JS is handling the conversion.Note that even with these changes, the scaling issue in #194 still exists for hand-written
exec
loops. That would be a much harder problem to solve. I think it would involve keeping state. And even with that, you'd still have to somehow be able to tell whether the input string toexec
is the same string as before, preferably without holding a strong reference to the string.