Introduce R opt-in flag for the CRLF mode #131

Open
wants to merge 4 commits into
base: main

Conversation

mmizutani

@mmizutani mmizutani commented Jan 3, 2024

Description

This PR introduces the inline regex flag R ((?R:)) for switching CRLF mode on and off.

Changes

  • Extended the parser to support a new opt-in regex flag R for switching the CRLF mode.
  • Extended the backtracking VM engine to support the CRLF mode.

Background

CRLF mode has been available in the Rust regex crate via an opt-in flag since v1.9.0, and in the corresponding versions of the regex-automata crate:

A new inline flag, R, which enables CRLF mode. This makes . match any Unicode scalar value except for \r and \n, and also makes (?m:^) and (?m:$) match after and before both \r and \n, respectively, but never between a \r and \n.

but it is still not supported in the fancy-regex crate:

regex must be valid: ParseError(2, UnknownFlag("(?R"))

CRLF mode was introduced to the Rust regex crate to bring the treatment of a carriage return \r by the any-character syntax (.) and the line anchors (^, $) more in line with that of a line feed \n, in a backward-compatible way.

Before the introduction of CRLF mode, or with CRLF mode turned off (the default in the regex crate), a carriage return \r is not fully handled as a new line character, unlike \n, and is treated more like an ordinary, non-new-line character:

  • When the multiline mode is on and the s (dot matches new line) mode is off ((?m-s)), the any-character syntax (.) matches \r but not \n in the Rust regex crate engine, as if . = [^\n] rather than . = [^\r\n]. This behavior is in line with the default behavior of other major regex engines, but some use cases might prefer \r to be treated as a new line character just like \n.
  • Line anchors (^ and $) in the multiline mode ((?m:)) can match at the boundary between \r and \n of \r\n in the Rust regex crate engine, although that position is not the end of a line in a strict sense. In Oniguruma and some other regex engines, ^ and $ do not match in the middle of \r\n, or the treatment of \r, \n, and other less common Unicode new line characters is configurable, as in PCRE2.
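
As a rough illustration of the first bullet, the three behaviors of . can be written as a plain predicate. This is a minimal std-only sketch, not code from the regex or fancy-regex crates; the function name dot_matches and its parameters are hypothetical.

```rust
/// Hypothetical helper (not part of any crate): does `.` match `c`?
/// `dot_all` stands for the s flag, `crlf` for the R flag.
fn dot_matches(c: char, dot_all: bool, crlf: bool) -> bool {
    if dot_all {
        // (?s): `.` matches everything, so R makes no difference.
        true
    } else if crlf {
        // (?R): `.` behaves like [^\r\n].
        c != '\r' && c != '\n'
    } else {
        // Default: `.` behaves like [^\n], so a lone \r is matched.
        c != '\n'
    }
}

fn main() {
    assert!(dot_matches('\r', false, false));
    assert!(!dot_matches('\n', false, false));
    assert!(!dot_matches('\r', false, true));
    assert!(dot_matches('\n', true, true));
    println!("ok");
}
```

The m flag does not appear as a parameter because it has no effect on what . matches.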

Being able to selectively enable CRLF mode in fancy-regex would be useful in some special cases, e.g., when porting regex strings written for a PCRE2 engine compiled with the PCRE2_NEWLINE_ANYCRLF option.

Note

CRLF mode is a bit tricky to understand, since the R inline flag changes matching behavior indirectly by altering the effects of two other inline flags: multiline (m) and dot matches new line (s).
In a nutshell,

As BurntSushi has pointed out in the comment below, we can think about CRLF mode as changing a single thing: the line terminator goes from \n to \r\n|\r|\n. Everything else follows from that, including the behavior of the s and m flags.
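
One way to picture this "single change" is to split a haystack into lines under each terminator definition. The sketch below is std-only and purely illustrative; split_lines is a hypothetical helper, not an API of regex or fancy-regex.

```rust
/// Hypothetical helper: split `text` into lines.
/// With `crlf == false` the only terminator is \n;
/// with `crlf == true` the terminator is \r\n, \r, or \n.
fn split_lines(text: &str, crlf: bool) -> Vec<&str> {
    let bytes = text.as_bytes();
    let mut lines = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        // Length in bytes of the terminator starting at `i`, or 0 if none.
        let term_len = match bytes[i] {
            b'\n' => 1,
            b'\r' if crlf => {
                if bytes.get(i + 1) == Some(&b'\n') { 2 } else { 1 }
            }
            _ => 0,
        };
        if term_len > 0 {
            lines.push(&text[start..i]);
            i += term_len;
            start = i;
        } else {
            i += 1;
        }
    }
    lines.push(&text[start..]);
    lines
}

fn main() {
    let text = "a\r\nb\rc\nd";
    // Only \n terminates a line, so \r stays inside the line content.
    assert_eq!(split_lines(text, false), vec!["a\r", "b\rc", "d"]);
    // \r\n counts as one terminator, and a lone \r also ends a line.
    assert_eq!(split_lines(text, true), vec!["a", "b", "c", "d"]);
}
```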

Specifically, from the perspective of the developers of fancy-regex, this implies the following branching of behaviors of ^, $ and . based on the presence or absence of the R flag:

| Flags | ^ and $ behave as | Effects on matching |
| --- | --- | --- |
| (?), (?R) | Assertion::StartText, Assertion::EndText | ^ and $ match the start and end of the whole input text. That is, the R flag makes no difference. |
| (?m) | Assertion::StartLine { crlf: false }, Assertion::EndLine { crlf: false } | ^ and $ match the start and end of lines. They can match between \r and \n of \r\n. |
| (?mR) | Assertion::StartLine { crlf: true }, Assertion::EndLine { crlf: true } | ^ and $ match the start and end of lines. They never match between \r and \n of \r\n; \r\n as a whole is treated as a single newline. |

| Flags | . behaves as | Effects on matching |
| --- | --- | --- |
| (?s), (?sR) | Insn::Any | . matches any character, including \r and \n. That is, the R flag makes no difference. |
| (?) | Insn::AnyExceptLF | . matches any character including \r but excluding \n. |
| (?R) | Insn::AnyExceptCRLF | . matches any character excluding \r and \n. |
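
The ^ and $ behaviors in multiline mode can be sketched as two position checks over the haystack bytes. This is a std-only approximation of the semantics described above, not the code in this PR; is_start_line, is_end_line, and positions are hypothetical names standing in for Assertion::StartLine { crlf } and Assertion::EndLine { crlf }.

```rust
/// Hypothetical stand-in for Assertion::StartLine { crlf }.
fn is_start_line(h: &[u8], at: usize, crlf: bool) -> bool {
    if at == 0 {
        return true;
    }
    let prev = h[at - 1];
    if prev == b'\n' {
        return true;
    }
    // In CRLF mode a lone \r also ends a line, but never split \r\n.
    crlf && prev == b'\r' && h.get(at) != Some(&b'\n')
}

/// Hypothetical stand-in for Assertion::EndLine { crlf }.
fn is_end_line(h: &[u8], at: usize, crlf: bool) -> bool {
    if at == h.len() {
        return true;
    }
    let next = h[at];
    if !crlf {
        return next == b'\n';
    }
    // In CRLF mode, match before \r, or before a \n not preceded by \r.
    next == b'\r' || (next == b'\n' && (at == 0 || h[at - 1] != b'\r'))
}

/// Collect every offset in 0..=len where the assertion `f` holds.
fn positions(h: &[u8], crlf: bool, f: fn(&[u8], usize, bool) -> bool) -> Vec<usize> {
    (0..=h.len()).filter(|&at| f(h, at, crlf)).collect()
}

fn main() {
    let h = b"a\r\nb";
    // Without R, $ matches at offset 2, i.e. between \r and \n.
    assert_eq!(positions(h, false, is_end_line), vec![2, 4]);
    // With R, $ matches before the whole \r\n instead.
    assert_eq!(positions(h, true, is_end_line), vec![1, 4]);
    // ^ matches after \r\n in both modes.
    assert_eq!(positions(h, false, is_start_line), vec![0, 3]);
    assert_eq!(positions(h, true, is_start_line), vec![0, 3]);
}
```

For a haystack with a lone \r, such as "a\rb", only the CRLF variant of ^ matches after the \r.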

Related discussion and commits

BurntSushi added a commit to BurntSushi/rebar that referenced this pull request Jan 3, 2024
@BurntSushi

When the multiline mode is on and the state (dot new line) mode is off ((?m-s:)), the any syntax (.) matches \n but not \r in the Rust regex crate engine as if . = [^\n] not . = [^\r\n].

To clarify, the only way for . to match \n is when the s flag is enabled. The m flag has zero impact on the match semantics of .. I think your wording is wrong, but your . = [^\n] is correct.

In the roughly equivalent mode of Oniguruma ((?-m:)) and some other regex engines, . matches both \n and \r.

This doesn't seem right either:

fn main() {
    let regex = onig::Regex::new("(?-m:.)").unwrap();
    println!("{:?}", regex.find("\n"));
}

Has this output:

$ cargo run -q
None

But this:

fn main() {
    let regex = onig::Regex::new("(?m:.)").unwrap();
    println!("{:?}", regex.find("\n"));
}

Has this output:

$ cargo run -q
Some((0, 1))

Oniguruma is actually the odd regex engine here. The vast majority of regex engines treat the s and m flags in the same way as the regex crate. I would not recommend copying Oniguruma on this point. I think Oniguruma might have gotten this behavior from Spencer's regex library, but I can't find a source for that.

The CRLF mode is a bit tricky to understand since the R inline flag indirectly changes matching behaviors by altering the effects of two other inline flags: multiline (m) and state (dot new line) (s).

Think about CRLF mode as changing a single thing: the line terminator goes from \n to \r\n|\r|\n. Everything else follows from that, including the behavior of the s and m flags.

@mmizutani
Author

@BurntSushi
Thank you for your points, as well as the comprehensive script for checking the behavior of various regex engines. I have corrected my PR description, in particular this passage:

When the multiline mode is on and the s (dot matches new line) mode is off ((?m-s)), the any-character syntax (.) matches \r but not \n in the Rust regex crate engine, as if . = [^\n] rather than . = [^\r\n]. This behavior is in line with the default behavior of other major regex engines, but some use cases might prefer \r to be treated as a new line character just like \n, in a consistent way.

I have replaced Oniguruma with PCRE2 to reduce noise in explaining when the CRLF mode might be useful in fancy-regex. The naming of regex flags in Oniguruma is indeed rather unique and confusing for users of other regex engines.

@CAD97 CAD97 mentioned this pull request Feb 14, 2024