Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RE2 - Syntax change across RE2, Go, Rust made #1588

Closed
jfesler opened this issue May 13, 2021 · 7 comments
Closed

RE2 - Syntax change across RE2, Go, Rust made #1588

jfesler opened this issue May 13, 2021 · 7 comments

Comments

@jfesler
Copy link

jfesler commented May 13, 2021

Bug Description

@BurntSushi noticed that in Rust regex (rust-lang/regex#779):

(|a)* matching aaa matches the entire text.
(|a)+ matching aaa matches only the empty string at the start of the text.

This is being fixed by RE2, Go, and Rust

golang/go#46123
google/re2j#136
rust-lang/regex#779

Reproduction steps

Set regex101 to use "go"
Set text area to aaa
Test (|a)*
Test (|a)+

Expected Outcome

Both should match `` (none)

Browser

Version 14.0.3 (16610.4.3.1.7)

OS

macOS 11.2.3 (20D91)

@working-name
Copy link
Collaborator

I disagree with the expected result as far as normal regex operation goes. You have a blank alternation there which should match any position in the string. The fact that it's in a capture group also means it should capture what it matches, but it would only do so for the last iteration of its matches due to having a quantifier after it. It should never ignore 1 or more a characters in the string just because the first alternation is blank.

(|a)+ or (|a)* in a normal situation should match at any non-consumed (already matched) position in the string and any number of repeating a characters, including zero. But there would always be a match returned. From what I read in the links you posted it seems they want to stick with matching aaa plus the odd blank matches due to the empty alternation. The bug here really is that + makes the engine ignore the non-blank alternation.

As far as the regex itself goes, it's a bit odd to have (|a)+ - you generally don't use empty alternates. The same outcome can be achieved by a* directly, or (?:more-complex-stuff)*.

@BurntSushi
Copy link

As far as the regex itself goes, it's a bit odd to have (|a)+ - you generally don't use empty alternates. The same outcome can be achieved by a* directly, or (?:more-complex-stuff)*.

The oddness of the regex isn't really in question. It's clearly a pathological case, and it's minimized from a more complex regex. I originally reported this with (?m)(^|a)+ matching against a\naaa\n.

The bug here really is that + makes the engine ignore the non-blank alternation.

No. The empty string matches at every position. So repeating it would cause an infinite repetition since there is never any advance.

To be really clear: we (maintainers of Rust regex, Go regex and RE2) are changing this to a result that we believe is both correct and is also consistent with PCRE and Perl.

@working-name
Copy link
Collaborator

As far as the regex itself goes, it's a bit odd to have (|a)+ - you generally don't use empty alternates. The same outcome can be achieved by a* directly, or (?:more-complex-stuff)*.

The oddness of the regex isn't really in question. It's clearly a pathological case, and it's minimized from a more complex regex. I originally reported this with (?m)(^|a)+ matching against a\naaa\n.

I agree, which is why it was my last remark, outside of the main discussion, tried to provide alternates.

The bug here really is that + makes the engine ignore the non-blank alternation.

No. The empty string matches at every position. So repeating it would cause an infinite repetition since there is never any advance.

Not sure why this would be the case. Once a position is matched the "head" advances along in the string. It would however have the opportunity to become a catastrophic backtracking issue as part of a larger regex. But by itself it should not:

https://regex101.com/r/YDmvwO/1

To be really clear: we (maintainers of Rust regex, Go regex and RE2) are changing this to a result that we believe is both correct and is also consistent with PCRE and Perl.

That is awesome, and thank you!

@BurntSushi
Copy link

BurntSushi commented May 13, 2021

Once a position is matched the "head" advances along in the string.

You're talking about iteration. But within the context of a repetition, that isn't the case. The repetition is saying, "match as much of this possible." But matching an empty position means nothing is advanced.

If you match (|a)+ against aaa once, then the match you should and do get back is (0, 0). And that's exactly what regex101 is telling you: once it finds the match at (0, 0), it's forcefully advancing the search position to the next byte because it has detected that there is an empty match. If the repetition worked as you say, then you should get (0, 3) back. (Interestingly, Regex101 is producing additional matches, but that might be an artifact of how PHP implements iteration, which is outside the regex engine internals.)

The bug here is that when you try to match (|a)* once against aaa, then Rust/Go/RE2 all report (0, 3), where as PCRE2 and Perl report (0, 0). And we believe the PCRE2 and Perl output is correct for "leftmost-first" or "preference-order" match semantics.

@working-name
Copy link
Collaborator

Okay I see where the misunderstanding was. We're saying the same thing but I assumed /g flag due to the initial description mentioning matches the entire text.

My apologies, yes, we're on the same page.

@firasdib
Copy link
Owner

firasdib commented Jul 5, 2021

I assume this is an implementation detail within the engine, and not my implementation. If so, I can easily recompile a newer version.

@firasdib
Copy link
Owner

firasdib commented Feb 9, 2022

Thanks for creating this issue, I've finally gotten around to updating it (and improving performance of the golang implementation)

@firasdib firasdib closed this as completed Feb 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants