Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid transpilation for . with su flags #23

Open
RReverser opened this issue Nov 21, 2018 · 6 comments
Open

Invalid transpilation for . with su flags #23

RReverser opened this issue Nov 21, 2018 · 6 comments

Comments

@RReverser
Copy link

RReverser commented Nov 21, 2018

The example in README with su flag works for a single character, but output seemed odd, so I tried to fuzz it for various strings and looks like output is incorrect / incompatible with native implementation (unless it's a bug in V8 instead).

Let's try an example with two any characters and an ill-formed string:

> s = '\0\uDC00'
'\u0000�'
> r1 = /^.{2}$/su
/.{2}/su
> r1.test(s)
true

As you can see, native regexp implementation (Node.js 11.0.0) matches it as expected.

Let's try with regexpu + useUnicodeFlag:

> r2 = new RegExp(rewritePattern('^.{2}$', 'su', { dotAllFlag: true, useUnicodeFlag: true }), 'u')
/^[\0-\u{10FFFF}]{2}$/u
> r2.test(s)
true

So far, so good.

Now what if we ask it to expand Unicode ranges?

> r3 = new RegExp(rewritePattern('^.{2}$', 'su', { dotAllFlag: true }))
/^(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]){2}$/
> r3.test(s)
false

Oops, looks like a bug.

I didn't dig into reasons yet, but I suspect it has something to do with [^\uD800-\uDBFF]|^ part.

@mathiasbynens
Copy link
Owner

This is a known limitation of regenerate: https://github.com/mathiasbynens/regenerate#regenerateprototypetostringoptions

Note that lone low surrogates cannot be matched accurately using regular expressions in JavaScript. Regenerate’s output makes a best-effort approach but there can be false negatives in this regard.

Nowadays we have lookbehind support in JS, but a transpiler such as regexpu can’t rely on that (at least not until lookbehinds are widely supported).

Here’s the old relevant comment: mathiasbynens/regenerate#28 (comment) I’d love to hear your thoughts on this!

@RReverser
Copy link
Author

Hmm, yeah, I guess it's not an easy thing to solve in a general case and needs more research.

However, for #24 instead of having post-rewrite optimisation layer, I was rather thinking of having specialisations for common cases so that they wouldn't have to be expanded in the first place.

And, specifically for ., which, arguably, is likely to be a common case, having a specialisation would both produce more optimal output and remove it from such false negatives (for that particular false negative) if we convert /./u to something like /(?:.|[\uD800-\uDBFF][\uDC00-\uDFFF])/.

Arguably this is just a partial solution, but then, current implementation already covers only part of the full range anyway.

What do you think?

@RReverser
Copy link
Author

RReverser commented Nov 21, 2018

Actually, maybe we could do the same for any case where we know the codepoint range (that is, when original regex doesn't explicitly try to match lone surrogates either)...

@mathiasbynens
Copy link
Owner

I was rather thinking of having specialisations for common cases so that they wouldn't have to be expanded in the first place.

I’m open to this as well. 👍🏻

@RReverser
Copy link
Author

Digging into this deeper, I'm starting to think the issue can be solved deeper in regenerate. In splitBMP it tries to extract lone surrogates even when a range could be used as a whole (like with dot above) and splitting could be avoided. Hmm...

@RReverser
Copy link
Author

Ah I found original issue mathiasbynens/regexpu#16 now, and indeed the "simpler" transpilation as shown above would break it again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants