Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Named backreferences will always cause a syntax error for non-Unicode regexes in strict parsing mode #23

Open
RunDevelopment opened this issue Jun 3, 2021 · 5 comments
Labels
bug Something isn't working pending for spec bug

Comments

@RunDevelopment
Copy link

RunDevelopment commented Jun 3, 2021

When parsing a non-Unicode regex that contains named backreferences with the strict: true option, a syntax error will always be throws regardless of whether the regex is actually correct or not.

Example:

const { RegExpValidator } = require("regexpp")

const validator = new RegExpValidator({ strict: true, ecmaVersion: 2020 })
validator.validatePattern(/(?<foo>A)\k<foo>/.source, undefined, undefined, false)

This produces the following error:

SyntaxError: Invalid regular expression: /(?<foo>A)\k<foo>/: Invalid escape
    at RegExpValidator.raise ([...]\regexpp\.temp\src\validator.ts:847:15)
    at RegExpValidator.consumeAtomEscape ([...]\regexpp\.temp\src\validator.ts:1475:18)
    at RegExpValidator.consumeReverseSolidusAtomEscape ([...]\regexpp\.temp\src\validator.ts:1245:22)
    at RegExpValidator.consumeAtom ([...]\regexpp\.temp\src\validator.ts:1213:18)
    at RegExpValidator.consumeTerm ([...]\regexpp\.temp\src\validator.ts:1027:23)
    at RegExpValidator.consumeAlternative ([...]\regexpp\.temp\src\validator.ts:1000:53)
    at RegExpValidator.consumeDisjunction ([...]\regexpp\.temp\src\validator.ts:976:18)
    at RegExpValidator.consumePattern ([...]\regexpp\.temp\src\validator.ts:901:14)
    at RegExpValidator.validatePattern ([...]\regexpp\.temp\src\validator.ts:531:14)
    at validateRegExpPattern (my-project\app.ts:12:75)

However, the regex /(?<foo>A)\k<foo>/ is valid. As stated in the proposal:

In this proposal, \k<foo> in non-Unicode RegExps will continue to match the literal string "k<foo>" unless the RegExp contains a named group, in which case it will match that group or be a syntax error, depending on whether or not the RegExp has a named group named foo.

Since the regex contains a named capturing group, \k<foo> has to be parsed as a backreference. Since Annex B doesn't say anything about named backreferences, regexpp should parse this regex even with strict: true.

However, regexpp parses it as an invalid(?) escape and throws an error in strict mode. This is because validation is done is two passes (1, 2). The bug occurs because the n flag isn't set in the first pass causing the syntax error. This can be seen in the stack trace: the second-last line - at RegExpValidator.validatePattern ([...]\validator.ts:531:14) - is the first parsing pass.

The fix for this bug is to determine whether the regex contains named groups ahead of time, similar to how the number of capturing groups is counted before parsing. I will make a PR.

@mysticatea
Copy link
Owner

Hi. Thank you for your detailed report!

Hmm. I'm not 100% sure if it's a spec violation. The two passes parsing came from 22.2.3.2.3 Static Semantics: ParsePattern ( patternText, u ). It says, "if no u flag, parse the pattern without N parameter, then if no syntax errors exist and named capturing groups exist, re-parse the pattern with N parameter." And unfortunately, since 22.2.1 Patterns, \k is a syntax error if no N parameter.

Please tell me if I went wrong.

@mysticatea
Copy link
Owner

So maybe, this is a spec bug, and IdentityEscape production needs [~N] k.

@RunDevelopment
Copy link
Author

RunDevelopment commented Jun 13, 2021

I agree with you. I also read the specification like this. \k<foo> is a syntax error according to the spec. The only reason v8 parses the above regex is because of Annex B.

This is quite interesting because whether this is a bug or not comes down to the semantic of the strict option. Right now strict is implemented as "Annex B syntax is disabled for every parsing pass". In #24, I implemented strict as "Annex B syntax is disabled for the last parsing pass" (well, not quite).

As a question to the creator of the library: What is the semantic of the strict option?

I hope that it's the latter version because that would be a lot more useful IMO.

@mysticatea
Copy link
Owner

mysticatea commented Jun 13, 2021

I opened an issue: tc39/ecma262#2434

I don't think it's intentional that named capturing groups requires u flag. Let's see how TC39 fixes that.


As a question to the creator of the library: What is the semantic of the strict option?

It disables Annex B completely.

@MichaelDeBoey
Copy link

For people watching this issue: we've already started with our own fork in order to not hold the wider community back anymore: https://github.com/eslint-community/regexpp

@mysticatea We would still love to move the original repo to the new @eslint-community though.


This PR is released in @eslint-community/regexpp v4.4.1
https://github.com/eslint-community/regexpp/releases/tag/v4.4.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pending for spec bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants