Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

permit some no-op escape sequences for compatibility purposes #501

Closed
KiChjang opened this issue Jul 28, 2018 · 36 comments
Closed

permit some no-op escape sequences for compatibility purposes #501

KiChjang opened this issue Jul 28, 2018 · 36 comments

Comments

@KiChjang
Copy link
Member

Some of the regexes found in https://github.com/ua-parser/uap-core is throwing errors when parsed with the regex crate:

regex parse error:
    (?:\/[A-Za-z0-9\.]+)? *([A-Za-z0-9 \-_\!\[\]:]*(?:[Aa]rchiver|[Ii]ndexer|[Ss]craper|[Bb]ot|[Ss]pider|[Cc]rawl[a-z]*))/(\d+)(?:\.(\d+)(?:\.(\d+))?)?
       ^^
error: unrecognized escape sequence

This sounds like we're deviating from the regex spec here. Can someone confirm?

@BurntSushi
Copy link
Member

Which regex spec are you referring to?

Otherwise, yes, this crate disallows unnecessary escapes. This is to permit the addition of new escapes in a backwards compatible way. It is plausible that we could allow certainly no-op escapes (such as for /), but this may or may not solve the large problem.

@KiChjang
Copy link
Member Author

KiChjang commented Jul 31, 2018

So I was referring to the ES6 spec for regexes. I believe that there are production-grade projects such as the one I linked in my first post which does contain unnecessary backslashes, and I think this crate should definitely provide a way to accept these regexes as-is, without modifying the regexes contained within to conform to the regex syntax introduced in this crate.

For the project I linked, here's a shortlist of what I did to make it compile with this crate:

  • \/ -> /
  • \! -> !
  • \ ->
  • |) -> )? (empty alternations)

@BurntSushi
Copy link
Member

So I was referring to the ES6 spec for regexes.

This crate definitely does not, and never will, conform to the ES6 specification for regexes. It is a complete non-goal.

I am much more sympathetic to your practical concerns. I'm generally strongly opposed to allowing escapes to always work even when they are no-ops, but I could get on board with selecting a set of commonly escaped characters in the wild.

It might also be smart to try to fix those projects such that they don't use unnecessary escapes.

Empty alternations are something I'd also like to support, but a bug in the compiler prevents it for now.

@RReverser
Copy link
Contributor

We actually made own Rust wrapper for uap-core, and yes, we had to fix unnecessary escapes. Perhaps we could open-source it to avoid duplication of efforts?

@BurntSushi
Copy link
Member

@RReverser Were escapes the only reason for that? Or were there other issues that needed to be papered over?

@RReverser
Copy link
Contributor

the only reason for that

Only reason for what? Implementing Rust version of uap-core?

@BurntSushi
Copy link
Member

@RReverser In the process of doing that, you said you had to fix unnecessary escapes. Was there anything else you needed to do with the regexes specifically to make them work with Rust's regex engine?

@RReverser
Copy link
Contributor

RReverser commented Aug 16, 2018

Ah. Well, another fix I had to do was to replace \d to match only ASCII digits, since it turned out to include Unicode ones as well, which should not be allowed from the point of UA parser, although in all other regards strings should be still matched in Unicode-aware mode (I suppose you remember our discussion about this).

Other than that, no other fixes were necessary, although I did write a bunch of extra analysis and rewrites using regex-syntax to optimise common cases.

@BurntSushi BurntSushi changed the title Unable to parse some of the regexes in ua-parser/uap-core permit some no-op escape sequences for compatibility purposes Oct 8, 2018
@BurntSushi
Copy link
Member

In an effort to keep conversation on this topic in one place, I'm going to respond to @zackw's issue #522 here:

To recap, my high level current thinking on this topic:

I'm generally strongly opposed to allowing escapes to always work even when they are no-ops, but I could get on board with selecting a set of commonly escaped characters in the wild.

The problem here is that I don't know how much effort we should expend to make the syntax compatible with other regex engines. Surely, we can all agree that 100% compatibility can never happen or be expected. So we have to choose some set of features that gets us part of the way there. I just don't know which things to choose.

The purpose for the current behavior is to permit backwards compatible additions to the syntax of regexes. Some regex engines were developed with very little foresight. Python's regex engine, for example, will permit just about anything to be escaped. Some escapes are significant, but the escapes that aren't significant behave as if they weren't escaped at all. This makes it impossible to add new escape sequences since existing escapes are valid and have specified match semantics.

This regex library chooses to return an error for non-significant escape sequences precisely because we consider turning invalid syntax into valid syntax to be a backwards compatible change, and we can legitimately get away with that by enforcing it.

This framework does not specifically forbid insignificant escape sequences. We can simply choose the ones we want to explicitly allow. , " and ' are all candidates, but they are far from the only ones. Moreover, even if we were to add new escape sequences in the future, it would probably be poor judgment to use an escape sequence that is commonly used in other regex engines as an insignificant escape sequence. e.g., Prescribing special meaning to \" while another regex engine just treats it as a literal " is probably poor form.

I think the most compelling use cases, from my perspective, are huge libraries of regexes. However, in practice, it seems quite difficult to just expect to be able to compile them without any other changes. As @RReverser notes above, the \d, \w and \s escape sequences are all Unicode aware by default, which is not common in other regex engines, which were likely built in a time before Unicode was as widespread as it is today. Therefore, even if we fix the cases of escape sequences, you still wind up with subtle match differences. If pressed, I am sure I could come up with a list of several more cases. What I'm trying to say here is that it may be unreasonable to expect a large existing library of regexes---written specifically for one particular regex engine---to just work out of the box on a different regex engine, even if it might in practice in some number cases.

@zackw
Copy link

zackw commented Oct 8, 2018

I understand and appreciate your position that no-op escapes should not be added just because they are no-op escapes in other regex engines.

I would like to offer an independent argument for no-op \", \' and \/ (didn't think of \/ before, but yes, that one too) based on the fact that ", ', and / are commonly used to delimit regex literals in many different languages and contexts, and \", \', \/ are commonly understood to escape the delimitation (i.e. extend the regex literal past a point where it would otherwise have ended). In many cases, backslashes that escape delimitation will be stripped by the "outer" parser before the regex engine sees them, but not all (e.g. the Python raw strings I mentioned in #522). Therefore, no-op \", \' and \/ is not just a matter of compatibility with other regex engines, but other surrounding contexts besides Rust source code.

@BurntSushi
Copy link
Member

BurntSushi commented Oct 8, 2018

@zackw Ah I see. I don't think I appreciated that point about Python's raw strings when I first read it in #522. Thanks for mentioning that again.

OK, so how about we start off by making ", ', (space, 0x20), / and ! escapeable but no-op? This will also cement the syntax such that these characters can never be used as a valid escape sequence that does anything other than match the literal being escaped. I think for these characters, that would be reasonable.

We can add more no-op escapes moving forward if they creep up, but I think these are probably the ones I see escaped most often.

@KiChjang
Copy link
Member Author

KiChjang commented Oct 8, 2018

@BurntSushi That sounds reasonable to me. About the empty alternations -- is there a tracking issue in rustc for the bug you mentioned?

@BurntSushi
Copy link
Member

BurntSushi commented Oct 8, 2018

@KiChjang Errm, the "compiler" in this context refers to the regex compiler, not rustc. Sorry about the mixup. But no, there is no issue for it because I don't yet understand the bug and it only manifests when empty alternations are allowed. There probably should be an issue for the feature of empty alternations though.

@zackw
Copy link

zackw commented Oct 8, 2018

+1 from me on no-op treatment for ", ', /, and space. +0 on !. I don't know of any context where ! is used as a delimiter, except sed's alternative-delimiter notation, m!...! where ! can be any single ASCII character; since it can be any character, that notation shouldn't be an argument for anything.

I see that @KiChjang originally asked for \! to be a no-op because of an existing body of JavaScript regexes that use it. JavaScript regex syntax defines \h to be equivalent to h for all characters h where \h has not already been assigned a special meaning (if I'm reading https://tc39.github.io/ecma262/#sec-patterns-static-semantics-character-value correctly), which is exactly the thing @BurntSushi didn't like up above. Could we have some specific examples, with context, of regexes using \! instead of bare ! please? I want to understand why they were written that way.

@zackw
Copy link

zackw commented Oct 8, 2018

@BurntSushi Regarding empty alternations, it might be a good idea to file an issue just so it's on record as a known problem and something you intend to support in the future.

@BurntSushi
Copy link
Member

Aye. I opened #524.

@KiChjang
Copy link
Member Author

KiChjang commented Oct 8, 2018

@zackw To be honest, I don't know why the regexes in the project I linked escapes !s, but here are the lines where it would escape it: https://github.com/ua-parser/uap-core/blob/23bfabe34b86f29f4840c9dd1ef6129e685581e3/regexes.yaml#L98-L100

@zackw
Copy link

zackw commented Oct 8, 2018

    [A-Za-z0-9 \-_\!\[\]:]*
    [A-Za-z0-9 _\!\[\]:]*

It's not at all clear to me what either of those character classes are supposed to do. And the context makes it sound like they're not supposed to be different, either, for added bafflement. I'm actually left wondering whether someone thought ! was a metacharacter within JS regex character classes (it isn't; it's a metacharacter within shell glob character classes, but that's a totally different ball of wax).

@RReverser
Copy link
Contributor

Supporting " and / would solve all the cases where I had to manually (and carefully) unescape regexes before passing to Regex::new in several projects, so big 👍 here. I guess ' also makes sense, but never seen ! as being important.

More generally, I think it should be possible to allow no-op escapes for any non-ASCII-alphanumeric characters without breaking forward compatibility, but starting with a limited set is probably better for now.

@okdana
Copy link

okdana commented Jul 27, 2020

I wanted to mention that one of the reasons you might see 'no-op' escapes in the wild is that some languages' regex-escape functions produce them.

For example, Perl's quotemeta() escapes all non-word ASCII characters, and PHP's preg_quote() escapes all 'special' punctuation characters (even ones like ! that are only special when combined with an always-special character like ? that would be escaped anyway). Python's re.escape() used to work like PHP too, but it's been made more selective recently. I don't think JavaScript has a built-in escape function, but it does support look-around, so maybe there's some library that escapes ! for the same reason.

As far as patterns written out by humans, i'm just guessing, but i can think of two reasons they'd do it: (1) the author understands how the escaping works but chooses to rely on the 'no-op' feature so they don't have to remember which characters are special and when (not a bad reason imo), or (2) the author doesn't understand how the escaping works and simply cargo-cults it from the output of those functions, or from the first type of person.

@anweiss
Copy link

anweiss commented Oct 28, 2020

Is there an easy way to remove unnecessary escapes from a given string using this crate? If not, is there a list of characters that don't require escapes? I'm attempting to convert an existing PCRE-compatible regex to an expression that this crate can parse. Thanks!

@BurntSushi
Copy link
Member

The crate certainly does not provide any such operation. It wouldn't really make sense to IMO.

As for a list of all meta characters, I think the only stable way to do that is to use is_meta_character from the regex-syntax crate. is_meta_character returns true only for characters that must be escaped in order to use their literal form. Other characters, such as ASCII space, can be escaped but do not need to be escaped.

Now, is_meta_character doesn't give you a list, but you can generate one by just trying all inputs. And since is_meta_character promises that the list will never expand or contract in a semver compatible release, the generated list will be stable. Or you could just use the fact that all meta characters are ASCII, so you only need to check 128 possible inputs instead of the full range of Unicode scalar values.

@anweiss
Copy link

anweiss commented Oct 29, 2020

Thanks @BurntSushi for the explanation! Super helpful! Will look at is_meta_character.

@BurntSushi
Copy link
Member

That appears to be a regex101 thing, or something related to how they've configured PCRE2 (which has many many options). But / is not a meta character and it does not need to be escaped. You can also use ripgrep to try out PCRE2:

$ echo '/c' | rg -oP '(?:/c|b)'
/c

Or even pcre2grep itself:

$ echo '/c' | pcre2grep -o '(?:/c|b)'
/c

BurntSushi added a commit that referenced this issue Mar 3, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
@BurntSushi
Copy link
Member

This should be actually happening soon, where all ASCII characters except for [0-9A-Za-z<>] will be escapeable, even if the escape is superfluous. As part of this, I've introduced a new regex_syntax::is_escapeable function:

pub fn is_escapeable_character(c: char) -> bool {
    // Certainly escapeable if it's a meta character.
    if is_meta_character(c) {
        return true;
    }
    // Any character that isn't ASCII is definitely not escapeable. There's
    // no real need to allow things like \☃ right?
    if !c.is_ascii() {
        return false;
    }
    // Otherwise, we basically say that everything is escapeable unless it's a
    // letter or digit. Things like \3 are either octal (when enabled) or an
    // error, and we should keep it that way. Otherwise, letters are reserved
    // for adding new syntax in a backwards compatible way.
    match c {
        '0'..='9' | 'A'..='Z' | 'a'..='z' => false,
        // While not currently supported, we keep these as not escapeable to
        // give us some flexibility with respect to supporting the \< and
        // \> word boundary assertions in the future. By rejecting them as
        // escapeable, \< and \> will result in a parse error. Thus, we can
        // turn them into something else in the future without it being a
        // backwards incompatible change.
        '<' | '>' => false,
        _ => true,
    }
}

BurntSushi added a commit that referenced this issue Mar 4, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Mar 5, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Mar 15, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Mar 15, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Mar 15, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Mar 20, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Mar 21, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Apr 15, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Apr 15, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Apr 17, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Apr 17, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Apr 17, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Apr 17, 2023
This resolves a long-standing (but somewhat minor) complaint that folks
have with the regex crate: it does not permit escaping punctuation
characters in cases where those characters do not need to be escaped. So
things like \/, \" and \! would result in parse errors. Most other regex
engines permit these, even in cases where they aren't needed.

I had been against doing this for future evolution purposes, but it's
incredibly unlikely that we're ever going to add a new meta character to
the syntax. I literally cannot think of any conceivable future in which
that might happen.

However, we do continue to ban escapes for [0-9A-Za-z<>], because it is
conceivable that we might add new escape sequences for those characters.
(And 0-9 are already banned by virtue of them looking too much like
backreferences, which aren't supported.) For example, we could add
\Q...\E literal syntax. Or \< and \> as start and end word boundaries,
as found in POSIX regex engines.

Fixes #501
BurntSushi added a commit that referenced this issue Apr 20, 2023
1.8.0 (2023-04-20)
==================
This is a sizeable release that will be soon followed by another sizeable
release. Both of them will combined close over 40 existing issues and PRs.

This first release, despite its size, essentially represent preparatory work
for the second release, which will be even bigger. Namely, this release:

* Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
* Upgrades its dependency on `aho-corasick` to the recently release 1.0
version.
* Upgrades its dependency on `regex-syntax` to the simultaneously released
`0.7` version. The changes to `regex-syntax` principally revolve around a
rewrite of its literal extraction code and a number of simplifications and
optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will
contain a soup-to-nuts rewrite of every regex engine. This will be done by
bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into
this repository, and then changing the `regex` crate to be nothing but an API
shim layer on top of `regex-automata`'s API.

These tandem releases are the culmination of about 3
years of on-and-off work that [began in earnest in March
2020](#656).

Because of the scale of changes involved in these releases, I would love to
hear about your experience. Especially if you notice undocumented changes in
behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please
see the commit log, which reflects a linear and decently documented history
of all changes.

New features:

* [FEATURE #501](#501):
Permit many more characters to be escaped, even if they have no significance.
More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be
escaped. Also, a new routine, `is_escapeable_character`, has been added to
`regex-syntax` to query whether a character is escapeable or not.
* [FEATURE #547](#547):
Add `Regex::captures_at`. This filles a hole in the API, but doesn't otherwise
introduce any new expressive power.
* [FEATURE #595](#595):
Capture group names are now Unicode-aware. They can now begin with either a `_`
or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints
can be any sequence of alpha-numeric codepoints, along with `_`, `.`, `[` and
`]`. Note that replacement syntax has not changed.
* [FEATURE #810](#810):
Add `Match::is_empty` and `Match::len` APIs.
* [FEATURE #905](#905):
Add an `impl Default for RegexSet`, with the default being the empty set.
* [FEATURE #908](#908):
A new method, `Regex::static_captures_len`, has been added which returns the
number of capture groups in the pattern if and only if every possible match
always contains the same number of matching groups.
* [FEATURE #955](#955):
Named captures can now be written as `(?<name>re)` in addition to
`(?P<name>re)`.
* FEATURE: `regex-syntax` now supports empty character classes.
* FEATURE: `regex-syntax` now has an optional `std` feature. (This will come
to `regex` in the second release.)
* FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications
made to it.
* FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF
mode. This will be supported in `regex` proper in the second release.
* FEATURE: `regex-syntax` now has proper support for "regex that never
matches" via `Hir::fail()`.
* FEATURE: The `hir::literal` module of `regex-syntax` has been completely
re-worked. It now has more documentation, examples and advice.
* FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed
to `utf8`, and the meaning of the boolean has been flipped.

Performance improvements:

* PERF: The upgrade to `aho-corasick 1.0` may improve performance in some
cases. It's difficult to characterize exactly which patterns this might impact,
but if there are a small number of longish (>= 4 bytes) prefix literals, then
it might be faster than before.

Bug fixes:

* [BUG #514](#514):
Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
* BUGS [#516](#516),
[#731](#731):
Fix a number of issues with printing `Hir` values as regex patterns.
* [BUG #610](#610):
Add explicit example of `foo|bar` in the regex syntax docs.
* [BUG #625](#625):
Clarify that `SetMatches::len` does not (regretably) refer to the number of
matches in the set.
* [BUG #660](#660):
Clarify "verbose mode" in regex syntax documentation.
* BUG [#738](#738),
[#950](#950):
Fix `CaptureLocations::get` so that it never panics.
* [BUG #747](#747):
Clarify documentation for `Regex::shortest_match`.
* [BUG #835](#835):
Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
* [BUG #846](#846):
Add more clarifying documentation to the `CompiledTooBig` error variant.
* [BUG #854](#854):
Clarify that `regex::Regex` searches as if the haystack is a sequence of
Unicode scalar values.
* [BUG #884](#884):
Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
* [BUG #893](#893):
Optimize case folding since it can get quite slow in some pathological cases.
* [BUG #895](#895):
Reject `(?-u:\W)` in `regex::Regex` APIs.
* [BUG #942](#942):
Add a missing `void` keyword to indicate "no parameters" in C API.
* [BUG #965](#965):
Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
* [BUG #975](#975):
Clarify documentation for `\pX` syntax.
BurntSushi added a commit that referenced this issue Apr 20, 2023
1.8.0 (2023-04-20)
==================
This is a sizeable release that will be soon followed by another sizeable
release. Both of them will combined close over 40 existing issues and PRs.

This first release, despite its size, essentially represent preparatory work
for the second release, which will be even bigger. Namely, this release:

* Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
* Upgrades its dependency on `aho-corasick` to the recently release 1.0
version.
* Upgrades its dependency on `regex-syntax` to the simultaneously released
`0.7` version. The changes to `regex-syntax` principally revolve around a
rewrite of its literal extraction code and a number of simplifications and
optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will
contain a soup-to-nuts rewrite of every regex engine. This will be done by
bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into
this repository, and then changing the `regex` crate to be nothing but an API
shim layer on top of `regex-automata`'s API.

These tandem releases are the culmination of about 3
years of on-and-off work that [began in earnest in March
2020](#656).

Because of the scale of changes involved in these releases, I would love to
hear about your experience. Especially if you notice undocumented changes in
behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please
see the commit log, which reflects a linear and decently documented history
of all changes.

New features:

* [FEATURE #501](#501):
Permit many more characters to be escaped, even if they have no significance.
More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be
escaped. Also, a new routine, `is_escapeable_character`, has been added to
`regex-syntax` to query whether a character is escapeable or not.
* [FEATURE #547](#547):
Add `Regex::captures_at`. This filles a hole in the API, but doesn't otherwise
introduce any new expressive power.
* [FEATURE #595](#595):
Capture group names are now Unicode-aware. They can now begin with either a `_`
or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints
can be any sequence of alpha-numeric codepoints, along with `_`, `.`, `[` and
`]`. Note that replacement syntax has not changed.
* [FEATURE #810](#810):
Add `Match::is_empty` and `Match::len` APIs.
* [FEATURE #905](#905):
Add an `impl Default for RegexSet`, with the default being the empty set.
* [FEATURE #908](#908):
A new method, `Regex::static_captures_len`, has been added which returns the
number of capture groups in the pattern if and only if every possible match
always contains the same number of matching groups.
* [FEATURE #955](#955):
Named captures can now be written as `(?<name>re)` in addition to
`(?P<name>re)`.
* FEATURE: `regex-syntax` now supports empty character classes.
* FEATURE: `regex-syntax` now has an optional `std` feature. (This will come
to `regex` in the second release.)
* FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications
made to it.
* FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF
mode. This will be supported in `regex` proper in the second release.
* FEATURE: `regex-syntax` now has proper support for "regex that never
matches" via `Hir::fail()`.
* FEATURE: The `hir::literal` module of `regex-syntax` has been completely
re-worked. It now has more documentation, examples and advice.
* FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed
to `utf8`, and the meaning of the boolean has been flipped.

Performance improvements:

* PERF: The upgrade to `aho-corasick 1.0` may improve performance in some
cases. It's difficult to characterize exactly which patterns this might impact,
but if there are a small number of longish (>= 4 bytes) prefix literals, then
it might be faster than before.

Bug fixes:

* [BUG #514](#514):
Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
* BUGS [#516](#516),
[#731](#731):
Fix a number of issues with printing `Hir` values as regex patterns.
* [BUG #610](#610):
Add explicit example of `foo|bar` in the regex syntax docs.
* [BUG #625](#625):
Clarify that `SetMatches::len` does not (regretably) refer to the number of
matches in the set.
* [BUG #660](#660):
Clarify "verbose mode" in regex syntax documentation.
* BUG [#738](#738),
[#950](#950):
Fix `CaptureLocations::get` so that it never panics.
* [BUG #747](#747):
Clarify documentation for `Regex::shortest_match`.
* [BUG #835](#835):
Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
* [BUG #846](#846):
Add more clarifying documentation to the `CompiledTooBig` error variant.
* [BUG #854](#854):
Clarify that `regex::Regex` searches as if the haystack is a sequence of
Unicode scalar values.
* [BUG #884](#884):
Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
* [BUG #893](#893):
Optimize case folding since it can get quite slow in some pathological cases.
* [BUG #895](#895):
Reject `(?-u:\W)` in `regex::Regex` APIs.
* [BUG #942](#942):
Add a missing `void` keyword to indicate "no parameters" in C API.
* [BUG #965](#965):
Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
* [BUG #975](#975):
Clarify documentation for `\pX` syntax.
crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue May 2, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.7.3` -> `1.8.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex</summary>

### [`v1.8.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#&#8203;181-2023-04-21)

\==================
This is a patch release that fixes a bug where a regex match could be reported
where none was found. Specifically, the bug occurs when a pattern contains some
literal prefixes that could be extracted *and* an optional word boundary in the
prefix.

Bug fixes:

-   [BUG #&#8203;981](rust-lang/regex#981):
    Fix a bug where a word boundary could interact with prefix literal
    optimizations and lead to a false positive match.

### [`v1.8.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#&#8203;180-2023-04-20)

\==================
This is a sizeable release that will be soon followed by another sizeable
release. Both of them will combined close over 40 existing issues and PRs.

This first release, despite its size, essentially represents preparatory work
for the second release, which will be even bigger. Namely, this release:

-   Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
-   Upgrades its dependency on `aho-corasick` to the recently released 1.0
    version.
-   Upgrades its dependency on `regex-syntax` to the simultaneously released
    `0.7` version. The changes to `regex-syntax` principally revolve around a
    rewrite of its literal extraction code and a number of simplifications and
    optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will
contain a soup-to-nuts rewrite of every regex engine. This will be done by
bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into
this repository, and then changing the `regex` crate to be nothing but an API
shim layer on top of `regex-automata`'s API.

These tandem releases are the culmination of about 3
years of on-and-off work that [began in earnest in March
2020](rust-lang/regex#656).

Because of the scale of changes involved in these releases, I would love to
hear about your experience. Especially if you notice undocumented changes in
behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please
see the commit log, which reflects a linear and decently documented history
of all changes.

New features:

-   [FEATURE #&#8203;501](rust-lang/regex#501):
    Permit many more characters to be escaped, even if they have no significance.
    More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be
    escaped. Also, a new routine, `is_escapeable_character`, has been added to
    `regex-syntax` to query whether a character is escapeable or not.
-   [FEATURE #&#8203;547](rust-lang/regex#547):
    Add `Regex::captures_at`. This filles a hole in the API, but doesn't otherwise
    introduce any new expressive power.
-   [FEATURE #&#8203;595](rust-lang/regex#595):
    Capture group names are now Unicode-aware. They can now begin with either a `_`
    or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints
    can be any sequence of alpha-numeric codepoints, along with `_`, `.`, `[` and
    `]`. Note that replacement syntax has not changed.
-   [FEATURE #&#8203;810](rust-lang/regex#810):
    Add `Match::is_empty` and `Match::len` APIs.
-   [FEATURE #&#8203;905](rust-lang/regex#905):
    Add an `impl Default for RegexSet`, with the default being the empty set.
-   [FEATURE #&#8203;908](rust-lang/regex#908):
    A new method, `Regex::static_captures_len`, has been added which returns the
    number of capture groups in the pattern if and only if every possible match
    always contains the same number of matching groups.
-   [FEATURE #&#8203;955](rust-lang/regex#955):
    Named captures can now be written as `(?<name>re)` in addition to
    `(?P<name>re)`.
-   FEATURE: `regex-syntax` now supports empty character classes.
-   FEATURE: `regex-syntax` now has an optional `std` feature. (This will come
    to `regex` in the second release.)
-   FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications
    made to it.
-   FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF
    mode. This will be supported in `regex` proper in the second release.
-   FEATURE: `regex-syntax` now has proper support for "regex that never
    matches" via `Hir::fail()`.
-   FEATURE: The `hir::literal` module of `regex-syntax` has been completely
    re-worked. It now has more documentation, examples and advice.
-   FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed
    to `utf8`, and the meaning of the boolean has been flipped.

Performance improvements:

-   PERF: The upgrade to `aho-corasick 1.0` may improve performance in some
    cases. It's difficult to characterize exactly which patterns this might impact,
    but if there are a small number of longish (>= 4 bytes) prefix literals, then
    it might be faster than before.

Bug fixes:

-   [BUG #&#8203;514](rust-lang/regex#514):
    Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
-   BUGS [#&#8203;516](rust-lang/regex#516),
    [#&#8203;731](rust-lang/regex#731):
    Fix a number of issues with printing `Hir` values as regex patterns.
-   [BUG #&#8203;610](rust-lang/regex#610):
    Add explicit example of `foo|bar` in the regex syntax docs.
-   [BUG #&#8203;625](rust-lang/regex#625):
    Clarify that `SetMatches::len` does not (regretably) refer to the number of
    matches in the set.
-   [BUG #&#8203;660](rust-lang/regex#660):
    Clarify "verbose mode" in regex syntax documentation.
-   BUG [#&#8203;738](rust-lang/regex#738),
    [#&#8203;950](rust-lang/regex#950):
    Fix `CaptureLocations::get` so that it never panics.
-   [BUG #&#8203;747](rust-lang/regex#747):
    Clarify documentation for `Regex::shortest_match`.
-   [BUG #&#8203;835](rust-lang/regex#835):
    Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
-   [BUG #&#8203;846](rust-lang/regex#846):
    Add more clarifying documentation to the `CompiledTooBig` error variant.
-   [BUG #&#8203;854](rust-lang/regex#854):
    Clarify that `regex::Regex` searches as if the haystack is a sequence of
    Unicode scalar values.
-   [BUG #&#8203;884](rust-lang/regex#884):
    Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
-   [BUG #&#8203;893](rust-lang/regex#893):
    Optimize case folding since it can get quite slow in some pathological cases.
-   [BUG #&#8203;895](rust-lang/regex#895):
    Reject `(?-u:\W)` in `regex::Regex` APIs.
-   [BUG #&#8203;942](rust-lang/regex#942):
    Add a missing `void` keyword to indicate "no parameters" in C API.
-   [BUG #&#8203;965](rust-lang/regex#965):
    Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
-   [BUG #&#8203;975](rust-lang/regex#975):
    Clarify documentation for `\pX` syntax.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNS42MS4wIiwidXBkYXRlZEluVmVyIjoiMzUuNjYuMyIsInRhcmdldEJyYW5jaCI6ImRldmVsb3AifQ==-->

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1874
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants