Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: (|a)* matches more text than (|a)+ does #136

Closed
adonovan opened this issue May 12, 2021 · 2 comments · Fixed by #137
Closed

regexp: (|a)* matches more text than (|a)+ does #136

adonovan opened this issue May 12, 2021 · 2 comments · Fixed by #137
Assignees

Comments

@adonovan
Copy link
Collaborator

adonovan commented May 12, 2021

From golang/go#46123 (the Go implementation of the algorithm):
@BurntSushi noticed that in Rust regex (rust-lang/regex#779):

(|a)* matching aaa matches the entire text.
(|a)+ matching aaa matches only the empty string at the start of the text.
This is a bit of an unlikely corner case, but it's certainly a significant violation of the rules of regexps for * to match more than +. The correct answer is for (|a)* to match an empty string too, because each iteration prefers the empty string, so even an infinite number of those matches will never fall back to matching an a. (This behavior is what Perl and PCRE do.)

Go's regexp package and RE2 have the same bug, which ultimately derives from a mismatch between the e* picture in my first regexp article and the backtracking prioritization matching in the second article. The implementation of e* needs to be a little different to get the match prioritization correct in the case where e has a preferred empty match.

I found this mismatch between RE2 and Perl in the e* case when running Glenn Fowler's regexp test cases against RE2 long ago, but at the time I thought the mismatch was unavoidable in the proper NFA approach. @BurntSushi's observation that e+ handles the empty match correctly proves otherwise. The correct fix is to leave the compilation of e+ alone and then compile e* as (e+)?. (I will add a note to the article as well.)

This is a tiny bug fix but may of course result in different matches for certain programs, causing problems for programs that rely on the current buggy behavior. That is always the case for any bug fix, of course. Fixing the bug only in Go would also make Go diverge from RE2 and Rust.

@junyer, @BurntSushi, and I discussed both the potential for breakage and the goal of keeping Go, RE2, and Rust aligned. We decided the potential breakage is minimal and outweighed by the benefits of fixing the bug, to better match Perl and PCRE, and to reestablish the properties that e* never matches more than e+ and that e+ and ee* are always the same. We agreed to make the change in all of our implementations.

@rsc
Copy link

rsc commented May 12, 2021

The final Go fix is here: https://go-review.googlesource.com/c/go/+/318750/4/src/regexp/syntax/compile.go.
RE2 and Rust are using the same logic (only change e* when e can match an empty string).

@sjamesr
Copy link
Contributor

sjamesr commented May 12, 2021

Thank you, I'll port the fix to RE2/J shortly.

sjamesr added a commit to sjamesr/re2j that referenced this issue May 27, 2021
Original go fix:

golang/go@2a61b3c

Change description from that commit:

In Perl mode, (|a)* should match an empty string at the start of the
input. Instead it matches as many a's as possible.
Because (|a)+ is handled correctly, matching only an empty string,
this leads to the paradox that e* can match more text than e+
(for e = (|a)) and that e+ is sometimes different from ee*.

The current code treats e* and e+ as the same structure, with
different entry points. In the case of e* the preference list ends up
not quite in the right order, in part because the "before e" and
"after e" states are the same state. Splitting them apart fixes the
preference list, and that can be done by compiling e* as if it were
(e+)?.

Fixes google#136.
sjamesr added a commit to sjamesr/re2j that referenced this issue May 27, 2021
Ports fix of golang/go#46123 to RE2/J.

Original go fix:

golang/go@2a61b3c

Change description from that commit:

In Perl mode, (|a)* should match an empty string at the start of the
input. Instead it matches as many a's as possible.
Because (|a)+ is handled correctly, matching only an empty string,
this leads to the paradox that e* can match more text than e+
(for e = (|a)) and that e+ is sometimes different from ee*.

The current code treats e* and e+ as the same structure, with
different entry points. In the case of e* the preference list ends up
not quite in the right order, in part because the "before e" and
"after e" states are the same state. Splitting them apart fixes the
preference list, and that can be done by compiling e* as if it were
(e+)?.

Fixes google#136.
sjamesr added a commit to sjamesr/re2j that referenced this issue May 27, 2021
Ports fix of golang/go#46123 to RE2/J.

Original go fix:

golang/go@2a61b3c

Change description from that commit:

In Perl mode, (|a)* should match an empty string at the start of the
input. Instead it matches as many a's as possible.
Because (|a)+ is handled correctly, matching only an empty string,
this leads to the paradox that e* can match more text than e+
(for e = (|a)) and that e+ is sometimes different from ee*.

The current code treats e* and e+ as the same structure, with
different entry points. In the case of e* the preference list ends up
not quite in the right order, in part because the "before e" and
"after e" states are the same state. Splitting them apart fixes the
preference list, and that can be done by compiling e* as if it were
(e+)?.

Fixes google#136.
sjamesr added a commit to sjamesr/re2j that referenced this issue May 27, 2021
Ports fix of golang/go#46123 to RE2/J.

Original go fix:

golang/go@2a61b3c

Change description from that commit:

In Perl mode, (|a)* should match an empty string at the start of the
input. Instead it matches as many a's as possible.
Because (|a)+ is handled correctly, matching only an empty string,
this leads to the paradox that e* can match more text than e+
(for e = (|a)) and that e+ is sometimes different from ee*.

The current code treats e* and e+ as the same structure, with
different entry points. In the case of e* the preference list ends up
not quite in the right order, in part because the "before e" and
"after e" states are the same state. Splitting them apart fixes the
preference list, and that can be done by compiling e* as if it were
(e+)?.

Fixes google#136.
sjamesr added a commit that referenced this issue May 28, 2021
Ports fix of golang/go#46123 to RE2/J.

Original go fix:

golang/go@2a61b3c

Change description from that commit:

In Perl mode, (|a)* should match an empty string at the start of the
input. Instead it matches as many a's as possible.
Because (|a)+ is handled correctly, matching only an empty string,
this leads to the paradox that e* can match more text than e+
(for e = (|a)) and that e+ is sometimes different from ee*.

The current code treats e* and e+ as the same structure, with
different entry points. In the case of e* the preference list ends up
not quite in the right order, in part because the "before e" and
"after e" states are the same state. Splitting them apart fixes the
preference list, and that can be done by compiling e* as if it were
(e+)?.

Fixes #136.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants