Introduce `R` opt-in flag for the CRLF mode #131
To clarify, the only way for
This doesn't seem right either:

    fn main() {
        let regex = onig::Regex::new("(?-m:.)").unwrap();
        println!("{:?}", regex.find("\n"));
    }

Has this output:

But this:

    fn main() {
        let regex = onig::Regex::new("(?m:.)").unwrap();
        println!("{:?}", regex.find("\n"));
    }

Has this output:
Oniguruma is actually the odd regex engine here. The vast majority of regex engines treat the
Think about CRLF mode as changing a single thing: the line terminator goes from |
@BurntSushi
I have replaced Oniguruma with PCRE2 to reduce noise in explaining when the CRLF mode might be useful in |
Description
This PR introduces the inline regex flag `R` (`(?R:)`) for switching the CRLF mode on and off.

Changes

- Inline flag `R` for switching the CRLF mode.

Background
The CRLF mode has been available in the Rust `regex` crate via an opt-in flag since v1.9.0, and in the corresponding versions of the `regex-automata` crate, but it is still not supported in the `fancy-regex` crate:

The CRLF mode was introduced to the Rust `regex` crate to make the treatment of a carriage return `\r` by the any-character-or-byte syntax (`.`) and the line anchors (`^`, `$`) more aligned with that of a line feed `\n`, in a backward-compatible way.

Before the introduction of the CRLF mode, or with the CRLF mode turned off (the default in the `regex` crate), a carriage return `\r` is not fully handled as a new line character, unlike `\n`,
and is treated more like an ordinary, non-new-line character:

- With the `s` flag off (`(?m-s)`), the any syntax (`.`) matches `\r` but not `\n` in the Rust `regex` crate engine, as if `. = [^\n]` rather than `. = [^\r\n]`. This behavior is in line with the default behavior of other major regex engines, but some use cases might prefer `\r` to be treated as a new line character just like `\n`.
- `^` and `$` in the multiline mode (`(?m:)`) can match at the boundary between `\r` and `\n` of `\r\n` in the Rust `regex` crate engine, although that position is not the end of a line in a strict sense. In Oniguruma and some other regex engines, `^` and `$` do not match in the middle of `\r\n`, or the treatment of `\r`, `\n`, and other less common Unicode new line characters is configurable, as in PCRE2.

Being able to selectively enable the CRLF mode in `fancy-regex` would be useful in some special cases, e.g., when porting regex strings written for the PCRE2 engine compiled with the `PCRE2_NEWLINE_ANYCRLF` option.

Note
The CRLF mode is a bit tricky to understand since the `R` inline flag indirectly changes matching behavior by altering the effects of two other inline flags: multiline (`m`) and dot-matches-new-line (`s`).

In a nutshell, as BurntSushi has pointed out in the comments, we can think about CRLF mode as changing a single thing: the line terminator goes from `\n` to `\r\n|\r|\n`. Everything else follows from that, including the behavior of the `s` and `m` flags.

Specifically, from the perspective of the developers of `fancy-regex`, this implies the following branching of behaviors of `^`, `$` and `.` based on the presence or absence of the `R`
flag:

| Flags | `^` and `$` behave as | Description |
| --- | --- | --- |
| `(?)`, `(?R)` | `Assertion::StartText`, `Assertion::EndText` | `^` and `$` match the start and end of the whole input text. That is, the `R` flag makes no difference. |
| `(?m)` | `Assertion::StartLine { crlf: false }`, `Assertion::EndLine { crlf: false }` | `^` and `$` match the start and end of lines. They can match between `\r` and `\n` of `\r\n`. |
| `(?mR)` | `Assertion::StartLine { crlf: true }`, `Assertion::EndLine { crlf: true }` | `^` and `$` match the start and end of lines. They never match between `\r` and `\n` of `\r\n`. `\r\n` as a whole is treated like a newline. |

| Flags | `.` behaves as | Description |
| --- | --- | --- |
| `(?s)`, `(?sR)` | `Insn::Any` | `.` matches any character including `\r` and `\n`. That is, the `R` flag makes no difference. |
| `(?)` | `Insn::AnyExceptLF` | `.` matches any character including `\r` but excluding `\n`. |
| `(?R)` | `Insn::AnyExceptCRLF` | `.` matches any character excluding `\r` and `\n`. |
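As a rough illustration of the branching above, here is a minimal, std-only Rust sketch of how `^`, `$` and `.` could decide whether they match under each flag combination. It is not the code in this PR or in the `regex` crate: the function names (`start_matches`, `end_matches`, `dot_matches`) and their boolean parameters are hypothetical, offsets are byte offsets, and `at <= haystack.len()` is assumed.

```rust
/// Hypothetical helper: would `^` match at byte offset `at`?
/// Assumes `at <= haystack.len()`.
fn start_matches(haystack: &[u8], at: usize, multiline: bool, crlf: bool) -> bool {
    if at == 0 {
        return true; // start of the whole text always matches
    }
    if !multiline {
        return false; // `(?)` and `(?R)`: start of text only
    }
    let prev = haystack[at - 1];
    if !crlf {
        return prev == b'\n'; // `(?m)`: only `\n` terminates a line
    }
    // `(?mR)`: a line starts after `\n`, or after a `\r` that is not
    // immediately followed by `\n` (never in the middle of `\r\n`).
    prev == b'\n' || (prev == b'\r' && haystack.get(at) != Some(&b'\n'))
}

/// Hypothetical helper: would `$` match at byte offset `at`?
/// Assumes `at <= haystack.len()`.
fn end_matches(haystack: &[u8], at: usize, multiline: bool, crlf: bool) -> bool {
    if at == haystack.len() {
        return true; // end of the whole text always matches
    }
    if !multiline {
        return false; // `(?)` and `(?R)`: end of text only
    }
    let next = haystack[at];
    if !crlf {
        return next == b'\n'; // `(?m)`: may match between `\r` and `\n`
    }
    // `(?mR)`: a line ends before `\r`, or before a `\n` that is not
    // immediately preceded by `\r`.
    next == b'\r' || (next == b'\n' && (at == 0 || haystack[at - 1] != b'\r'))
}

/// Hypothetical helper: would `.` match the character `c`?
fn dot_matches(c: char, dot_all: bool, crlf: bool) -> bool {
    if dot_all {
        true // `(?s)` and `(?sR)`: any character
    } else if crlf {
        c != '\r' && c != '\n' // `(?R)`: any character except CR and LF
    } else {
        c != '\n' // `(?)`: any character except LF
    }
}

fn main() {
    let haystack = b"a\r\nb";
    // Offset 2 lies between `\r` and `\n`.
    println!("(?m)  $ at 2: {}", end_matches(haystack, 2, true, false));
    println!("(?mR) $ at 2: {}", end_matches(haystack, 2, true, true));
    println!("(?mR) ^ at 3: {}", start_matches(haystack, 3, true, true));
    println!("(?)   . on CR: {}", dot_matches('\r', false, false));
    println!("(?R)  . on CR: {}", dot_matches('\r', false, true));
}
```

The sketch keeps the three decisions as plain functions so that each row of the tables maps to one branch. A real engine would resolve the flags at compile time into distinct assertions or instructions rather than passing booleans at match time.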
Related discussion and commits