plan: moving regex engines to regex-automata #656

BurntSushi · 2020-03-24T14:21:20Z

In discussions in #524, it became clearer to me that it might be good to say more about what my short/long term plans are for the regex crate.

TL;DR - Push all regex engines down into the regex-automata crate, and bring that in as an internal-but-published dependency of regex as an official part of this repository. It will become a second required dependency (with the first being regex-syntax).

The main motivation for doing this is to clean up the regex internals, specifically with the intent to make it easier to add more optimizations. Currently, the infrastructure is a bit of a hodge podge and it can be quite difficult to know where and when a particular regex engine is being used. On top of that, since every regex engine is strictly internal, testing each engine independently is quite annoying, as it requires exporting undocumented APIs. As a result, bugs have cropped up where certain engines are inconsistent with other engines. Similarly, benchmarking each engine independently of the others is difficult, which in turn makes it difficult to focus on optimizing certain scenarios that come up in the course of regex execution.

On top of all of this, putting the regex engines into their own crate with a proper API means that folks can use those engines directly. It also provides an outlet of sorts to provide lower level APIs that I otherwise wouldn't want cluttering up the main regex crate too much. The main thing on my mind here is streaming APIs (see #425), but there may be other things of use.

I've actually been working towards this goal for a long time now. Arguably, I started around 3 years ago when I began rewriting the regex-syntax crate. I then moved to rewriting memchr and aho-corasick as well. In the process, I pushed complexity down as much as I could. For example, the HIR in regex-syntax no longer exposes any regex flags (such as case insensitivity), which makes regex compilers simpler. As another example, aho-corasick now supports leftmost-first matching, which matches the regex crate's match semantics precisely, and now makes it easier to defer to aho-corasick for an alternation of literals. Yet another example is the Teddy algorithm that uses SIMD for finding matches of multiple literals very quickly. That has been pushed down into aho-corasick as well, which means more people get to benefit from it and it no longer needs to be maintained inside of regex proper.

A little over a year ago, I started work on regex-automata. The main use case for regex-automata was to generate fully compiled DFAs that could be cheaply embedded into a program and deserialized cheaply for easy matching. (For example, they power the grapheme, word and sentence iterators in the bstr crate.)

In the course of building regex-automata, I had to write another HIR -> NFA compiler, with the intent that this would one day become the compiler inside the regex crate. It has many improvements over the existing one. Firstly, it is much simpler by doing away with the "instruction hole" mechanism used in the current compiler. Secondly, it is also simpler in that it does away with the dichotomy between a Unicode NFA and a byte-based NFA, which has been the cause of many bugs and increases the complexity of the matching engines. Thirdly, by getting rid of the Unicode/bytes dichotomy in favor of just using bytes, we will be able to compile one fewer NFA than the regex crate currently does (one Unicode NFA for the Pike VM and the backtracker, 1 forwards byte NFA and 1 backwards byte NFA for the lazy DFA). Fourthly, it is more aggressive about epsilon removal and does away with the binary "alt" instructions in favor of flattening the byte code, which should decrease overhead substantially when matching. Finally, it uses Daciuk's algorithm (the same one used in the fst crate) to compile nearly minimal UTF-8 automata in linear time, which substantially reduces the size of the corresponding NFAs and DFAs, while also making them faster to execute by virtue of having fewer epsilon transitions.
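To make the "flattened alternation" point concrete, here is a minimal sketch of a byte-based NFA in which an alternation is a single `Union` state with any number of alternates, rather than a chain of binary "alt" instructions. The `State` enum, `add_closure`, `is_match` and `sample_nfa` are hypothetical names for illustration only; this is not the regex-automata API, and the search is a simplified anchored whole-haystack match.

```rust
/// A toy byte-based NFA state (illustrative only, not regex-automata's types).
#[derive(Debug)]
enum State {
    /// Consume one byte in `start..=end`, then move to `next`.
    ByteRange { start: u8, end: u8, next: usize },
    /// Epsilon transitions to every alternate. An alternation with N
    /// branches is one flat state instead of N-1 nested binary splits.
    Union { alternates: Vec<usize> },
    /// The match state.
    Match,
}

/// Add `id` to `set`, following epsilon transitions through `Union` states.
fn add_closure(nfa: &[State], id: usize, set: &mut Vec<usize>, seen: &mut [bool]) {
    if seen[id] {
        return;
    }
    seen[id] = true;
    match &nfa[id] {
        State::Union { alternates } => {
            for &alt in alternates {
                add_closure(nfa, alt, set, seen);
            }
        }
        _ => set.push(id),
    }
}

/// Anchored search: reports whether the whole haystack matches.
fn is_match(nfa: &[State], start: usize, haystack: &[u8]) -> bool {
    let mut current = Vec::new();
    add_closure(nfa, start, &mut current, &mut vec![false; nfa.len()]);
    for &byte in haystack {
        let mut next = Vec::new();
        let mut seen = vec![false; nfa.len()];
        for &id in &current {
            if let State::ByteRange { start: lo, end: hi, next: target } = &nfa[id] {
                if (*lo..=*hi).contains(&byte) {
                    add_closure(nfa, *target, &mut next, &mut seen);
                }
            }
        }
        current = next;
    }
    current.iter().any(|&id| matches!(nfa[id], State::Match))
}

/// A hand-built NFA for `a(b|c|d)`; state 0 is the start state.
fn sample_nfa() -> Vec<State> {
    vec![
        State::ByteRange { start: b'a', end: b'a', next: 1 },
        State::Union { alternates: vec![2, 3, 4] },
        State::ByteRange { start: b'b', end: b'b', next: 5 },
        State::ByteRange { start: b'c', end: b'c', next: 5 },
        State::ByteRange { start: b'd', end: b'd', next: 5 },
        State::Match,
    ]
}

fn main() {
    let nfa = sample_nfa();
    for haystack in ["ab", "ac", "ad", "ae"] {
        println!("{haystack}: {}", is_match(&nfa, 0, haystack.as_bytes()));
    }
}
```

With a flat `Union`, computing the epsilon closure for a three-way alternation visits one state rather than two nested splits, which is the kind of overhead reduction the paragraph above describes.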

regex-automata also contains new test infrastructure, and most of the tests in the regex crate have already been ported to it. This test infrastructure should make it much easier to add new tests and also much faster to run while testing more engines. Today's infrastructure is a terrible macro soup, and just building it takes forever because every test is compiled for every regex engine that we test. The new test infrastructure instead loads tests from TOML files and runs them using its own harness. The downside is that this pushes more things outside of Rust's standard unit test harness, but it's well worth it.
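As a rough illustration of the data-driven approach, the sketch below keeps test cases as plain data and runs each one against every engine at run time, so the cases are not compiled once per engine. Everything here is hypothetical: `TestCase`, `Engine`, `run_suite` and the two toy literal-search "engines" are stand-ins, and the cases are hard-coded rather than loaded from TOML to keep the example dependency-free.

```rust
/// One test case, as it might be loaded from a data file.
struct TestCase {
    name: &'static str,
    needle: &'static str,
    haystack: &'static str,
    expected: bool,
}

/// A toy "engine": reports whether `needle` occurs in `haystack`.
type Engine = fn(&str, &str) -> bool;

/// Engine 1: defer to the standard library's substring search.
fn engine_std(needle: &str, haystack: &str) -> bool {
    haystack.contains(needle)
}

/// Engine 2: a naive byte-window comparison.
fn engine_naive(needle: &str, haystack: &str) -> bool {
    let (n, h) = (needle.as_bytes(), haystack.as_bytes());
    if n.is_empty() {
        return true; // `windows(0)` would panic, and "" matches everywhere
    }
    h.windows(n.len()).any(|w| w == n)
}

/// Stand-in for test cases loaded from a file.
fn sample_tests() -> Vec<TestCase> {
    vec![
        TestCase { name: "basic", needle: "bar", haystack: "foobar", expected: true },
        TestCase { name: "missing", needle: "baz", haystack: "foobar", expected: false },
        TestCase { name: "empty-needle", needle: "", haystack: "foobar", expected: true },
        TestCase { name: "too-long", needle: "foobarbaz", haystack: "foobar", expected: false },
    ]
}

/// Run every test against every engine, returning "engine/test" for each failure.
fn run_suite(tests: &[TestCase], engines: &[(&str, Engine)]) -> Vec<String> {
    let mut failures = Vec::new();
    for test in tests {
        for (engine_name, engine) in engines {
            if engine(test.needle, test.haystack) != test.expected {
                failures.push(format!("{}/{}", engine_name, test.name));
            }
        }
    }
    failures
}

fn main() {
    let engines = [("std", engine_std as Engine), ("naive", engine_naive as Engine)];
    let failures = run_suite(&sample_tests(), &engines);
    println!("{} failure(s): {:?}", failures.len(), failures);
}
```

Adding an engine is then one more entry in the `engines` array, and any disagreement between engines shows up as a named failure.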

As of a couple months ago, I've begun preparing regex-automata to absorb the matching engines from regex proper. This will include the lazy DFA, the Pike VM and the backtracker. regex-automata also provides its own ahead-of-time-compiled DFAs, which I hope to use as yet another engine for cases where the regex is small and full DFA compilation is cheap. This organization should drastically decrease the complexity of adding new regex engines. I hope this will allow me to finally bring in things like one-pass DFA (see #467), as well as some other ideas I've had for years, such as a bit-parallel NFA and even a JIT powered by Cranelift, basically with the goal of fixing one of the weakest parts of the regex crate right now: the performance of matching with capturing groups. (N.B. If a JIT were added, it would be an optional opt-in feature.)

Once regex-automata has absorbed the regex engines inside of regex, my plan at that point is to move regex-automata into this repository and start integrating it into regex proper. At that point, the main responsibility of regex will be to glue everything together, plan the appropriate optimizations, including literal oriented optimizations, and provide the higher level API that it does today. At this point in time, I do not plan on making any breaking changes to regex during all of this.

As a stretch goal, one of the things I'd like to do is to bring the cheap deserialization of regex-automata's DFAs to the regex crate proper. This would permit users to generate pre-compiled Regex objects that can be embedded into any program. This would allow one to completely avoid paying for compilation time. This does require a lot of work though, so I don't know if this will happen initially. (Doing this basically implies that every single representation of a finite state machine, including all of the literal prefilters, needs to be able to work directly from a more primitive representation such as a &[u8].)
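To illustrate what "work directly from a &[u8]" can mean, here is a toy sketch of a DFA whose transition table is borrowed straight from a byte buffer, so "deserialization" is validation plus a borrow, with no copying or rebuilding. The byte layout, `ToyDfa`, `from_bytes` and `serialize_a_plus_b` are all invented for this example; regex-automata's real serialization format is different and far more involved.

```rust
/// A toy DFA that searches directly out of a borrowed byte buffer.
///
/// Hypothetical layout: `[state_count, start, match_state, table...]`, where
/// `table` holds `state_count * 256` next-state bytes and state 0 is dead.
struct ToyDfa<'a> {
    start: u8,
    match_state: u8,
    table: &'a [u8],
}

impl<'a> ToyDfa<'a> {
    /// Validate the buffer and borrow the table from it. Nothing is copied.
    fn from_bytes(bytes: &'a [u8]) -> Option<ToyDfa<'a>> {
        let (header, table) = (bytes.get(..3)?, bytes.get(3..)?);
        let state_count = header[0];
        if table.len() != usize::from(state_count) * 256 {
            return None;
        }
        // Every state ID in the buffer must refer to an existing state.
        if header[1] >= state_count || header[2] >= state_count {
            return None;
        }
        if table.iter().any(|&next| next >= state_count) {
            return None;
        }
        Some(ToyDfa { start: header[1], match_state: header[2], table })
    }

    /// Anchored search: reports whether the whole haystack matches.
    fn is_match(&self, haystack: &[u8]) -> bool {
        let mut state = self.start;
        for &byte in haystack {
            state = self.table[usize::from(state) * 256 + usize::from(byte)];
            if state == 0 {
                return false; // dead state: no match is possible anymore
            }
        }
        state == self.match_state
    }
}

/// Build the serialized form of a DFA for `a+b`. In practice this would be
/// done ahead of time and the bytes embedded in the program.
fn serialize_a_plus_b() -> Vec<u8> {
    // 4 states: 0 = dead, 1 = start, 2 = saw one or more 'a', 3 = match.
    let mut bytes = vec![4u8, 1, 3];
    bytes.resize(3 + 4 * 256, 0);
    let mut set = |from: usize, byte: u8, to: u8| {
        bytes[3 + from * 256 + usize::from(byte)] = to;
    };
    set(1, b'a', 2);
    set(2, b'a', 2);
    set(2, b'b', 3);
    bytes
}

fn main() {
    let serialized = serialize_a_plus_b();
    let dfa = ToyDfa::from_bytes(&serialized).expect("valid toy DFA");
    for haystack in ["ab", "aaab", "b", "aba"] {
        println!("{haystack}: {}", dfa.is_match(haystack.as_bytes()));
    }
}
```

The parenthetical above is about the harder part: every engine and prefilter would need a representation like this, not just one DFA.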

Once this Great Regex Engine Refactor is complete, the result will have been a complete rewrite of the regex crate.

ethanpailes · 2020-03-24T14:53:49Z

You probably already plan to do something like this, but just in case you have not already planned on it, I think it would be useful for the regex-automata crate to have a set of features that people can use to pull in just a single engine. I'm thinking in particular about situations where people might care more about code size than performance and are happy with just pulling in the PikeVM. It might even be useful to have a separate regex-engine-lite crate that just has the PikeVM so that people who don't know the difference between the different matchers can pull it in.

BurntSushi · 2020-03-24T14:55:50Z

@ethanpailes Yeah I may consider that. Each engine on its own doesn't usually take up too much code and usually doesn't require any extra dependencies. But sure, chopping it up with features like I did for regex is definitely on the table. I'll probably avoid adding new *-lite crates though.

Voultapher · 2020-03-24T18:34:04Z

Still amazed to witness someone as clever and dedicated as you. The programming community would be a sadder place without you. Thank you for your wonderful hard work.

giovanniberti · 2020-03-24T19:36:36Z

> TL;DR - Push all regex engines down into the regex-automata crate, and bring that in as an internal-but-published dependency of regex as an official part of this repository. It will become a second required dependency (with the first being regex-syntax).
> [...]
> Once this Great Regex Engine Refactor is complete, the result will have been a complete rewrite of the regex crate.

Given that you are pushing regex engines out of the regex crate, shouldn't it be called the Great Regex Engine Pushout Refactor, or GREP-R for short? 😉

Btw, love your work and the dedication you put into maintaining one of the most important crates in the rust ecosystem! 💪

chrisduerr · 2020-03-24T23:28:00Z

> Once regex-automata has absorbed the regex engines inside of regex, my plan at that point is to move regex-automata into this repository

> At this point in time, I do not plan on making any breaking changes to regex during all of this.

While this sounds like a great effort for regex and the unification of the two different crates, I'd be curious what this means for users of regex-automata. I'm not particularly concerned about breaking changes themselves, but is there any plan to remove functionality in favor of unification?

It seems like the basics of regex-automata are fundamentally required for both regex and regex-automata, with a DFA and transitions between states, but I'd like to make sure that use cases like streaming regex parsers are still captured in that new ecosystem. Especially since even regex-automata treats that as a secondary use case, since it's rarely desired of course.

Fundamentally it sounds to me like regex-automata users will just benefit from more regex engine choices without any drawbacks, which would obviously be great.

BurntSushi · 2020-03-24T23:40:12Z

> Fundamentally it sounds to me like regex-automata users will just benefit from more regex engine choices without any drawbacks, which would obviously be great.

That's the intent yes. I have no plans to remove functionality from regex-automata. :-)

This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag, 'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also causes '.' to *not* match \r in addition to \n (unless the 's' flag is enabled of course). The intended semantics are that CRLF mode makes \r\n, \r and \n line terminators but with one key property: \r\n is treated as a single line terminator. That is, ^/$ do not match between \r and \n. This partially addresses #244 by adding syntax support. Currently, if you try to use this new flag, the regex compiler will report an error. We intend to finish support for this once #656 is complete. (Indeed, at time of writing, CRLF matching works in regex-automata.)
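The intended semantics can be sketched as two position predicates over a byte haystack. This is a minimal illustration written from the description above, not code taken from the crate: the function names `is_start_crlf` and `is_end_crlf` are assumptions, and `at` is assumed to satisfy `at <= haystack.len()`.

```rust
/// Does a CRLF-aware `(?m:^)` match at byte offset `at`?
///
/// True at the start of the haystack, after `\n`, and after a `\r` that is
/// not immediately followed by `\n`.
fn is_start_crlf(haystack: &[u8], at: usize) -> bool {
    at == 0
        || haystack[at - 1] == b'\n'
        || (haystack[at - 1] == b'\r'
            && (at >= haystack.len() || haystack[at] != b'\n'))
}

/// Does a CRLF-aware `(?m:$)` match at byte offset `at`?
///
/// True at the end of the haystack, before `\r`, and before a `\n` that is
/// not immediately preceded by `\r`.
fn is_end_crlf(haystack: &[u8], at: usize) -> bool {
    at == haystack.len()
        || haystack[at] == b'\r'
        || (haystack[at] == b'\n' && (at == 0 || haystack[at - 1] != b'\r'))
}

fn main() {
    let haystack = b"a\r\nb\rc";
    for at in 0..=haystack.len() {
        println!(
            "offset {at}: start={} end={}",
            is_start_crlf(haystack, at),
            is_end_crlf(haystack, at),
        );
    }
}
```

The key property is visible at the offset between `\r` and `\n`: neither predicate holds there, so `\r\n` behaves as a single line terminator.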

An empty character class is effectively a way to write something that can never match anything. The regex crate has pretty much always returned an error for such things because it was never taught how to handle "always fail" states. Partly because I just didn't think about it when initially writing the regex engines and partly because it isn't often useful. With that said, it should be supported for completeness and because there is no real reason to not support it. Moreover, it can be useful in certain contexts where regexes are generated and you want to insert an expression that can never match. It's somewhat contrived, but it happens when the interface is a regex pattern. Previously, the ban on empty character classes was implemented in the regex-syntax crate. But with the rewrite in #656 getting closer and closer to landing, it's now time to relax this restriction. However, we do keep the overall restriction in the 'regex' API by returning an error in the NFA compiler. Once #656 is done, the new regex engines will permit this case.
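As a small sketch of the "generated regexes" case: a helper that builds an alternation from a list of branch patterns needs something to emit when the list is empty, because an empty pattern would match everywhere instead of nowhere. The helper name `alternation` is hypothetical, and the example assumes `[a&&b]` (the intersection of two disjoint classes) is accepted as an empty character class once the restriction described above is lifted. The branches are assumed to already be valid patterns.

```rust
/// Pattern text for an expression that can never match: an empty character
/// class, written as the intersection of two disjoint classes.
const NEVER_MATCHES: &str = "[a&&b]";

/// Join branch patterns into one alternation. With no branches, the result
/// must match nothing, so an empty character class is emitted.
fn alternation(branches: &[&str]) -> String {
    if branches.is_empty() {
        return NEVER_MATCHES.to_string();
    }
    branches
        .iter()
        .map(|branch| format!("(?:{})", branch))
        .collect::<Vec<_>>()
        .join("|")
}

fn main() {
    println!("{}", alternation(&["foo", "ba+r"]));
    println!("{}", alternation(&[]));
}
```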

BurntSushi · 2023-04-17T18:37:03Z

I have a PR up for phase 1 of 2 of this transition: #977

Basically, this PR is preparing for the regex crate to cut over to regex-automata for its internals. That will include bringing regex-automata into this repository. I don't have a specific timeline for when that will happen, but I do want to let this first phase bake for a little bit. Hopefully within a month. regex-automata itself is ready.

1.8.0 (2023-04-20)
==================
This is a sizeable release that will be soon followed by another sizeable release. Combined, they will close over 40 existing issues and PRs.

This first release, despite its size, essentially represents preparatory work for the second release, which will be even bigger. Namely, this release:

* Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
* Upgrades its dependency on `aho-corasick` to the recently released 1.0 version.
* Upgrades its dependency on `regex-syntax` to the simultaneously released `0.7` version. The changes to `regex-syntax` principally revolve around a rewrite of its literal extraction code and a number of simplifications and optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will contain a soup-to-nuts rewrite of every regex engine. This will be done by bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into this repository, and then changing the `regex` crate to be nothing but an API shim layer on top of `regex-automata`'s API. These tandem releases are the culmination of about 3 years of on-and-off work that [began in earnest in March 2020](#656).

Because of the scale of changes involved in these releases, I would love to hear about your experience. Especially if you notice undocumented changes in behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please see the commit log, which reflects a linear and decently documented history of all changes.

New features:

* [FEATURE #501](#501): Permit many more characters to be escaped, even if they have no significance. More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be escaped. Also, a new routine, `is_escapeable_character`, has been added to `regex-syntax` to query whether a character is escapeable or not.
* [FEATURE #547](#547): Add `Regex::captures_at`. This fills a hole in the API, but doesn't otherwise introduce any new expressive power.
* [FEATURE #595](#595): Capture group names are now Unicode-aware. They can now begin with either a `_` or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints can be any sequence of alpha-numeric codepoints, along with `_`, `.`, `[` and `]`. Note that replacement syntax has not changed.
* [FEATURE #810](#810): Add `Match::is_empty` and `Match::len` APIs.
* [FEATURE #905](#905): Add an `impl Default for RegexSet`, with the default being the empty set.
* [FEATURE #908](#908): A new method, `Regex::static_captures_len`, has been added which returns the number of capture groups in the pattern if and only if every possible match always contains the same number of matching groups.
* [FEATURE #955](#955): Named captures can now be written as `(?<name>re)` in addition to `(?P<name>re)`.
* FEATURE: `regex-syntax` now supports empty character classes.
* FEATURE: `regex-syntax` now has an optional `std` feature. (This will come to `regex` in the second release.)
* FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications made to it.
* FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF mode. This will be supported in `regex` proper in the second release.
* FEATURE: `regex-syntax` now has proper support for "regex that never matches" via `Hir::fail()`.
* FEATURE: The `hir::literal` module of `regex-syntax` has been completely re-worked. It now has more documentation, examples and advice.
* FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed to `utf8`, and the meaning of the boolean has been flipped.

Performance improvements:

* PERF: The upgrade to `aho-corasick 1.0` may improve performance in some cases. It's difficult to characterize exactly which patterns this might impact, but if there are a small number of longish (>= 4 bytes) prefix literals, then it might be faster than before.

Bug fixes:

* [BUG #514](#514): Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
* BUGS [#516](#516), [#731](#731): Fix a number of issues with printing `Hir` values as regex patterns.
* [BUG #610](#610): Add explicit example of `foo|bar` in the regex syntax docs.
* [BUG #625](#625): Clarify that `SetMatches::len` does not (regrettably) refer to the number of matches in the set.
* [BUG #660](#660): Clarify "verbose mode" in regex syntax documentation.
* BUG [#738](#738), [#950](#950): Fix `CaptureLocations::get` so that it never panics.
* [BUG #747](#747): Clarify documentation for `Regex::shortest_match`.
* [BUG #835](#835): Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
* [BUG #846](#846): Add more clarifying documentation to the `CompiledTooBig` error variant.
* [BUG #854](#854): Clarify that `regex::Regex` searches as if the haystack is a sequence of Unicode scalar values.
* [BUG #884](#884): Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
* [BUG #893](#893): Optimize case folding since it can get quite slow in some pathological cases.
* [BUG #895](#895): Reject `(?-u:\W)` in `regex::Regex` APIs.
* [BUG #942](#942): Add a missing `void` keyword to indicate "no parameters" in C API.
* [BUG #965](#965): Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
* [BUG #975](#975): Clarify documentation for `\pX` syntax.
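The capture group naming rule from FEATURE #595 can be sketched as a small validator. This is an illustration written from the changelog wording, not the crate's own code: the function name `is_valid_capture_name` is hypothetical, and it assumes the standard library's `char::is_alphabetic` and `char::is_alphanumeric` are a reasonable stand-in for "alphabetic" and "alpha-numeric" codepoints.

```rust
/// Check a capture group name against the rule described above: the first
/// codepoint is `_` or alphabetic, and every later codepoint is
/// alphanumeric or one of `_`, `.`, `[`, `]`.
fn is_valid_capture_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) if first == '_' || first.is_alphabetic() => {}
        // Empty names and names with any other first codepoint are rejected.
        _ => return false,
    }
    chars.all(|c| c.is_alphanumeric() || matches!(c, '_' | '.' | '[' | ']'))
}

fn main() {
    for name in ["year", "_x1", "a.b[0]", "naïve", "1abc", "a-b", ""] {
        println!("{name:?}: {}", is_valid_capture_name(name));
    }
}
```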

BurntSushi · 2023-04-28T12:51:08Z

For anyone following this thread, regex 1.9 (the next release) is going to close this issue. Finally. I plan to release it after I've finished documentation and polishing work. (There is a fair bit to do here.)

I am also currently planning to release a new crate, regex-lite, simultaneously with regex 1.9. You can see more about that and why specifically I chose to do it with regex 1.9 here: #961 (comment)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.7.3` -> `1.8.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex</summary>

### [`v1.8.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#181-2023-04-21)

This is a patch release that fixes a bug where a regex match could be reported where none was found. Specifically, the bug occurs when a pattern contains some literal prefixes that could be extracted *and* an optional word boundary in the prefix.

Bug fixes:

- [BUG #981](rust-lang/regex#981): Fix a bug where a word boundary could interact with prefix literal optimizations and lead to a false positive match.

</details>

Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1874

I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here.

Closes #244
Closes #259
Closes #476
Closes #644
Closes #675
Closes #824
Closes #961
Closes #68
Closes #510
Closes #787
Closes #891
Closes #429
Closes #517
Closes #579
Closes #779
Closes #850
Closes #921
Closes #976
Closes #1002
Closes #656

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex (regex)</summary>

### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07)

[Compare Source](rust-lang/regex@1.9.0...1.9.1)

This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy.

Bug fixes:

- [BUG #1027](rust-lang/regex#1027): Change the allocation strategy for the backtracker to be less aggressive.

### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05)

[Compare Source](rust-lang/regex@1.8.4...1.9.0)

This release marks the end of a [years long rewrite of the regex crate internals](rust-lang/regex#656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements as well.

In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions:

- A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex.
- A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`.
- `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
- The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library.
- Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate.

New features:

- [FEATURE #244](rust-lang/regex#244): One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`.
- [FEATURE #259](rust-lang/regex#259): Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
- [FEATURE #476](rust-lang/regex#476): `std` is now an optional feature. `regex` may be used with only `alloc`.
- [FEATURE #644](rust-lang/regex#644): `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
- [FEATURE #675](rust-lang/regex#675): Anchored search APIs are now available in `regex-automata 0.3`.
- [FEATURE #824](rust-lang/regex#824): Add new `Captures::extract` method for easier capture group access.
- [FEATURE #961](rust-lang/regex#961): Add `regex-lite` crate with smaller binary sizes and faster compile times.
- [FEATURE #1022](rust-lang/regex#1022): Add `TryFrom` implementations for the `Regex` type.

Performance improvements:

- [PERF #68](rust-lang/regex#68): Added a one-pass DFA engine for faster capture group matching.
- [PERF #510](rust-lang/regex#510): Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan for `@`.
- [PERF #787](rust-lang/regex#787), [PERF #891](rust-lang/regex#891): Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have specific issues devoted to them.)

Bug fixes:

- [BUG #429](rust-lang/regex#429): Fix matching bugs related to `\B` and inconsistencies across internal engines.
- [BUG #517](rust-lang/regex#517): Fix matching bug with capture groups.
- [BUG #579](rust-lang/regex#579): Fix matching bug with word boundaries.
- [BUG #779](rust-lang/regex#779): Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
- [BUG #850](rust-lang/regex#850): Fix matching bug inconsistency between NFA and DFA engines.
- [BUG #921](rust-lang/regex#921): Fix matching bug where literal extraction got confused by `$`.
- [BUG #976](rust-lang/regex#976): Add documentation to replacement routines about dealing with fallibility.
- [BUG #1002](rust-lang/regex#1002): Use corpus rejection in fuzz testing.

</details>

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957
Reviewed-by: crapStone <crapstone01@gmail.com>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
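
The 1.9.0 notes quoted above say that in CRLF mode `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, but never between a `\r` and `\n`. As a rough illustration of that rule only, here is a std-only sketch that computes where such assertions would hold in a byte haystack. The function names are hypothetical and this is not the crate's implementation; it assumes `at <= haystack.len()`.

```rust
/// Sketch: would a CRLF-aware end-of-line assertion hold at offset `at`?
/// Hypothetical helper, written from the rule in the release notes.
fn is_end_crlf(haystack: &[u8], at: usize) -> bool {
    at == haystack.len()
        // Just before a `\r` always counts as a line end.
        || haystack[at] == b'\r'
        // Just before a `\n` counts, unless that would split a `\r\n` pair.
        || (haystack[at] == b'\n' && (at == 0 || haystack[at - 1] != b'\r'))
}

/// Sketch: would a CRLF-aware start-of-line assertion hold at offset `at`?
fn is_start_crlf(haystack: &[u8], at: usize) -> bool {
    at == 0
        // Just after a `\n` always counts as a line start.
        || haystack[at - 1] == b'\n'
        // Just after a `\r` counts, unless a `\n` follows immediately.
        || (haystack[at - 1] == b'\r'
            && (at == haystack.len() || haystack[at] != b'\n'))
}

/// All offsets (including `haystack.len()`) where the end assertion holds.
fn line_ends_crlf(haystack: &[u8]) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&at| is_end_crlf(haystack, at))
        .collect()
}

/// All offsets (including `haystack.len()`) where the start assertion holds.
fn line_starts_crlf(haystack: &[u8]) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&at| is_start_crlf(haystack, at))
        .collect()
}
```

For the haystack `a\r\nb\n`, the sketch reports line ends at offsets 1, 4 and 5 and line starts at offsets 0, 3 and 5; offset 2, which sits between the `\r` and the `\n`, is neither.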

BurntSushi mentioned this issue Mar 24, 2020

support empty alternations #524

Closed

This was referenced Mar 28, 2020

Panic "BUG: reverse match implies forward match" under certain conditions #659

Closed

incorrect case for word boundaries #579

Closed

BurntSushi mentioned this issue May 8, 2020

add anchored search APIs #675

Closed

BurntSushi added the plan label May 9, 2020

BurntSushi mentioned this issue May 15, 2020

Report partial matches #678

Closed

This was referenced Oct 13, 2020

Add a onepass dfa matcher. #467

Closed

fuzz: compiling '\P{any}' panics by tripping an assertion in the compiler #722

Closed

BurntSushi mentioned this issue Mar 22, 2021

Analyze complexity before actual compile? BurntSushi/regex-automata#11

Closed

BurntSushi mentioned this issue Apr 14, 2021

Split Inst enum into BytesInst and UnicodeInst enums #761

Closed

This was referenced May 11, 2021

some regexes fail to satisfy that (re)+ is always equal to (re)(re)* #779

Closed

Fix some clippy lints up to rust 1.41.1 #780

Closed

BurntSushi mentioned this issue May 21, 2021

Sub-match extraction BurntSushi/regex-automata#14

Closed

This was referenced Jun 11, 2021

Search via regex-automata helix-editor/helix#211

Closed

execute a regex on text streams #425

Open

BurntSushi mentioned this issue Dec 7, 2021

[Feature request]: match multiple regular expressions simultaneously obtaining capture groups #822

Closed

BurntSushi mentioned this issue Mar 26, 2022

captures_iter and find_iter handle newline differently #850

Closed

CAD97 mentioned this issue Mar 31, 2022

Feature request: unbuffered text matching #852

Closed

BurntSushi mentioned this issue Apr 8, 2022

byte regex can produce empty matches between UTF-8 code units #484

Closed

BurntSushi mentioned this issue May 1, 2022

Add convenience methods for extracting bits of text #824

Closed

BurntSushi mentioned this issue May 17, 2022

inconsistent matches when capturing group is present #517

Closed

BurntSushi mentioned this issue May 25, 2022

RegexSet does not benefit from perf-literal #865

Closed

BurntSushi mentioned this issue Jun 9, 2022

expose the Input trait #867

Closed

BurntSushi mentioned this issue Jul 19, 2022

aho-corasick should be applied for cases like \b(literal1|literal2|...|literalN)\b #891

Closed

BurntSushi mentioned this issue Mar 22, 2023

Add a flag to show patterns alongside matches BurntSushi/ripgrep#2471

Closed

BurntSushi mentioned this issue Apr 9, 2023

Non-syntactic way to match from the beginning of text, or "implicit \A". #974

Closed

BurntSushi mentioned this issue Apr 17, 2023

first phase of migrating to regex-automata #977

Merged

BurntSushi mentioned this issue Apr 20, 2023

release: 1.8.0 #979

Merged

This was referenced Jun 19, 2023

Request support for reporting partial matches. #1014

Closed

Time for a new release? BurntSushi/ripgrep#2540

Closed

BurntSushi closed this as completed in aa64e6d Jul 5, 2023

