
Analyze complexity before actual compile? #11

Closed

audunhalland opened this issue Mar 22, 2021 · 11 comments

@audunhalland

Hi, I'm wondering if there exists a theoretical way to cheaply (ish?) analyze the resulting automaton complexity ahead of time, without actually building it.

My use case is that I have an opportunity to choose a pattern implementation dynamically, between a plain Regex that is faster to compile but (in some cases) much slower to match, and an automaton, which seems to always be much faster to match against. It would be helpful to have such a measurement up front, in order to help pick the tradeoff. I know that the docs say no untrusted patterns, but the user does not write the pattern directly: The patterns are generated based on "wildcards" (*) in search words.
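For concreteness, here's a minimal sketch of the kind of dynamic choice in question, assuming the regex 1.x and regex-automata 0.1 APIs. The `looks_cheap_to_compile` predicate is a hypothetical placeholder; a workable implementation of it is exactly what this issue is asking about.

```rust
// Sketch only: pick a matcher implementation per pattern, assuming some
// up-front complexity analysis exists. `looks_cheap_to_compile` is hypothetical.
enum Matcher {
    Plain(regex::Regex),
    Dfa(regex_automata::Regex),
}

// Hypothetical placeholder for the up-front complexity analysis.
fn looks_cheap_to_compile(_pattern: &str) -> bool {
    true
}

fn build_matcher(regex_pattern: &str, dfa_pattern: &str) -> Matcher {
    if looks_cheap_to_compile(dfa_pattern) {
        // Slower to compile, but consistently fast to match.
        let dfa = regex_automata::RegexBuilder::new()
            .anchored(true)
            .build(dfa_pattern)
            .unwrap();
        Matcher::Dfa(dfa)
    } else {
        // Faster to compile, but sometimes much slower to match.
        Matcher::Plain(regex::Regex::new(regex_pattern).unwrap())
    }
}

impl Matcher {
    fn is_match(&self, haystack: &str) -> bool {
        match self {
            Matcher::Plain(re) => re.is_match(haystack),
            Matcher::Dfa(re) => re.is_match(haystack.as_bytes()),
        }
    }
}
```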

If this sounds impossible, then so be it :)

Otherwise, thanks for a great lib!

@BurntSushi
Owner

It would be helpful to have a real example. It could be that a plain Regex should just be made faster. :-)

At some point in the future, I plan to use regex-automata inside the regex crate. While the initial goal is just to get all the regex engines ported over to regex-automata, eventually the idea is that it will help fuel more optimizations. One such optimization is exactly the case you've described: in some limited cases, it might actually make sense to just build the whole DFA instead of using the lazy DFA. Doing this requires some kind of analysis that compiling the full DFA won't be too expensive. So I will have to solve this problem to some extent at one point or another. If I do, I think it's reasonable that this analysis could be exposed in regex-automata (or maybe regex-syntax, depending on its nature).

Interestingly, Becchi and Crowley published "A Hybrid Finite Automaton for Practical Deep Packet Inspection", which tries to answer precisely this question: can we tell if a given regex is going to cause exponential state explosion? It's not quite the same as, "will the DFA be too big to be practically useful," but maybe some insight can be gleaned from it. It has been years since I've read the paper, and since I haven't tried to tackle this problem yet, I haven't re-read it and can't remember much about it. When I do try to tackle this problem (I expect it will be a year or more before I do), I plan to absorb that paper and see what can be done with it.

Failing all of that, my plan is to just use heuristics. E.g., if the regex has no large bounded repetitions, has no large Unicode character classes, and is otherwise under a certain size, then try to build the DFA for it. I suspect this would be coupled with an option in this crate (which I plan to add) that would allow the construction of the DFA to fail once it got beyond a certain size (configured by the caller). My guess is that such heuristics would probably work well and might even be necessary, especially since we care about more than just exponentially large DFAs. For example, there is nothing exponential about \w{10}, but when Unicode is enabled (the default), its full DFA is still quite large!
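As a rough sketch of what such a heuristic could look like against regex-syntax 0.6's Hir (the thresholds below are invented for illustration and would need tuning against real patterns):

```rust
use regex_syntax::hir::{Class, Hir, HirKind, RepetitionKind, RepetitionRange};

const MAX_REPEAT: u32 = 8; // hypothetical cap on bounded repetitions
const MAX_CLASS_RANGES: usize = 16; // hypothetical cap on class size

// Walk the Hir and reject anything that looks like it could produce a
// large DFA: big bounded repetitions or big (e.g. Unicode) classes.
fn looks_cheap(hir: &Hir) -> bool {
    match hir.kind() {
        // Large bounded repetitions multiply states, e.g. \w{10}.
        HirKind::Repetition(rep) => {
            let bound_ok = match rep.kind {
                RepetitionKind::Range(RepetitionRange::Exactly(n))
                | RepetitionKind::Range(RepetitionRange::AtLeast(n)) => n <= MAX_REPEAT,
                RepetitionKind::Range(RepetitionRange::Bounded(_, m)) => m <= MAX_REPEAT,
                _ => true,
            };
            bound_ok && looks_cheap(&rep.hir)
        }
        // Unicode classes like \w expand to hundreds of ranges.
        HirKind::Class(Class::Unicode(cls)) => cls.ranges().len() <= MAX_CLASS_RANGES,
        HirKind::Group(g) => looks_cheap(&g.hir),
        HirKind::Concat(subs) | HirKind::Alternation(subs) => subs.iter().all(looks_cheap),
        _ => true,
    }
}

fn pattern_looks_cheap(pattern: &str) -> bool {
    match regex_syntax::Parser::new().parse(pattern) {
        // The overall size cap here is also invented.
        Ok(hir) => pattern.len() < 1024 && looks_cheap(&hir),
        Err(_) => false,
    }
}
```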

Note that for the past several months, I've been working on a ground-up rewrite of this crate in the ag/work branch. (I may force-push to that branch, so beware.) It has largely reached feature parity with the current version of the crate, but it also adds lots of other stuff (like RegexSet functionality). The main engineering goal is to make it suitable for use in the regex crate, and that is the primary forcing function here.

@audunhalland
Author

audunhalland commented Mar 22, 2021

Thanks for the feedback, very interesting.

Some examples, just grabbed out of a test in my debug build:

Regex:
^(different)$|^(have)$|^(many)$|^(patterns)$|^(test)$|^(to)$|^wildca|^cl[\x{0000}-\x{024f}]*k!$
anchored DFA, equivalent to the above (just without groups):
different|have|many|patterns|test|to|((wildca)[\x{0000}-\x{024f}]*)|(cl[\x{0000}-\x{024f}]*k!)

(I've used a "Latin range" Unicode trick, 0000-024f, in order to reduce the DFA compile time. This is cheaper than .*.)

Input string "clugulububiffjoglufagofibonggrik!". This is matched 400 times against each pattern impl:
Regex compile: 957.59µs
Regex match: 10.339507ms
DFA compile: 4.50037ms
DFA match: 1.570276ms

However, I also have a test for a much more complex pattern:
Regex:
^b[\x{0000}-\x{024f}]*r$|^int[\x{0000}-\x{024f}]*rnt$|^m[\x{0000}-\x{024f}]*n$|^mul[\x{0000}-\x{024f}]*gens$|^ne[\x{0000}-\x{024f}]*ne$|^og[\x{0000}-\x{024f}]*å$|^wil[\x{0000}-\x{024f}]*rd$|^br|^e|^go|^hel|^m|^mang|^me|^mønste|^no|^ogs|^tes|^v|^veldi|^wildc|eksten$|gså$|ldcards$
anchored DFA:
((br|e|go|hel|m|mang|me|mønste|no|ogs|tes|v|veldi|wildc)[\x{0000}-\x{024f}]*)|(b[\x{0000}-\x{024f}]*r)|(int[\x{0000}-\x{024f}]*rnt)|(m[\x{0000}-\x{024f}]*n)|(mul[\x{0000}-\x{024f}]*gens)|(ne[\x{0000}-\x{024f}]*ne)|(og[\x{0000}-\x{024f}]*å)|(wil[\x{0000}-\x{024f}]*rd)|([\x{0000}-\x{024f}]*(eksten|gså|ldcards))
(this is Norwegian, sorry about that 😬 )

Input string "muligens", matched 400 times:
Regex compile: 1.867883ms
Regex match: 1.863705ms
DFA compile: 24.825708ms
DFA match: 666.785µs

I found this a bit interesting because the 2nd pattern was an order of magnitude cheaper to match using Regex. It could be because the input string is shorter. But OTOH, the DFA compilation time has "exploded".

edit: I tested the first Regex pattern using a shorter input, "clugrik!" (a match for the ^cl[\x{0000}-\x{024f}]*k!$ "clause"), and this takes the Regex match time from 10.339507ms for the long input down to 4.44021ms for the short input. But that is still 4 times slower than the complex 2nd pattern 🤔

edit 2: I have verified that the input patterns to Regex and Automata are equivalent; I have two distinct functions generating them, so the "clauses" end up in a different order. (I'll also mention that the patterns for the Regex implementation use match groups for clauses that do not include wildcards, i.e. clauses completely anchored from beginning to end.)
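(For reference, a minimal sketch of a harness that would produce timings like the ones above, assuming the regex 1.x and regex-automata 0.1 APIs; the actual benchmark code is linked later in this thread.)

```rust
use std::time::Instant;

fn main() {
    let regex_pattern =
        r"^(different)$|^(have)$|^(many)$|^(patterns)$|^(test)$|^(to)$|^wildca|^cl[\x{0000}-\x{024f}]*k!$";
    let dfa_pattern =
        r"different|have|many|patterns|test|to|((wildca)[\x{0000}-\x{024f}]*)|(cl[\x{0000}-\x{024f}]*k!)";
    let input = "clugulububiffjoglufagofibonggrik!";

    let t = Instant::now();
    let re = regex::Regex::new(regex_pattern).unwrap();
    println!("Regex compile: {:?}", t.elapsed());

    let t = Instant::now();
    for _ in 0..400 {
        re.is_match(input);
    }
    println!("Regex match: {:?}", t.elapsed());

    let t = Instant::now();
    let dfa = regex_automata::RegexBuilder::new()
        .anchored(true)
        .build(dfa_pattern)
        .unwrap();
    println!("DFA compile: {:?}", t.elapsed());

    let t = Instant::now();
    for _ in 0..400 {
        dfa.is_match(input.as_bytes());
    }
    println!("DFA match: {:?}", t.elapsed());
}
```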

@BurntSushi
Owner

@audunhalland Thanks for the examples! Would it be possible to share your benchmark program in full? I'd love to look at precisely the code you're writing. If so, I can try to take a closer look at the performance profile here and either try to fix things or at least explain them. The performance profile of a regex::Regex is quite complex, since it has oodles of different kinds of optimizations. Many of them are heuristics, which means they can be slower than the "standard" approach in some cases, but the intent is that it's a lot faster in common cases. Alas, this is partially why I'm taking the approach I am with regex-automata: sometimes you want more predictable performance.

I am also quite interested in your precise program because, for example, your bigger regex takes a lot longer to compile in my branch than it does for you. And I'd like to make sure I am comparing apples-to-apples here, but I can't do that without seeing precisely how you've compiled your pattern.

In my branch, there is a regex-cli tool that will compile DFAs and what not and print various info about them:

$ regex-cli debug dfa dense -q '((br|e|go|hel|m|mang|me|mønste|no|ogs|tes|v|veldi|wildc)[\x{0000}-\x{024f}]*)|(b[\x{0000}-\x{024f}]*r)|(int[\x{0000}-\x{024f}]*rnt)|(m[\x{0000}-\x{024f}]*n)|(mul[\x{0000}-\x{024f}]*gens)|(ne[\x{0000}-\x{024f}]*ne)|(og[\x{0000}-\x{024f}]*å)|(wil[\x{0000}-\x{024f}]*rd)|([\x{0000}-\x{024f}]*(eksten|gså|ldcards))'
            parse time:  81.044µs
        translate time:  38.838µs
      compile nfa time:  211.982µs
compile dense dfa time:  285.643066ms
      dense dfa memory:  7573932
 dense alphabet length:  46
          dense stride:  64

So in this case, it took 285ms to compile the DFA. And it is 7.5MB in size! Yikes.

@BurntSushi
Owner

Note that in my rewrite, the crate supports all the same regex features (except Unicode word boundaries). So you can write anchor assertions. When I do that, the DFA gets much smaller:

$ regex-cli debug dfa dense -q '^b[\x{0000}-\x{024f}]*r$|^int[\x{0000}-\x{024f}]*rnt$|^m[\x{0000}-\x{024f}]*n$|^mul[\x{0000}-\x{024f}]*gens$|^ne[\x{0000}-\x{024f}]*ne$|^og[\x{0000}-\x{024f}]*å$|^wil[\x{0000}-\x{024f}]*rd$|^br|^e|^go|^hel|^m|^mang|^me|^mønste|^no|^ogs|^tes|^v|^veldi|^wildc|eksten$|gså$|ldcards$'
            parse time:  47.86µs
        translate time:  23.17µs
      compile nfa time:  137.524µs
compile dense dfa time:  5.838039ms
      dense dfa memory:  106896
 dense alphabet length:  48
          dense stride:  64

And notice that its compile time is snappier. I made some improvements in that department in my rewrite as well. :-)

This made me realize that you're probably compiling your DFAs in anchored mode (and yeah... you said that, I missed it, sorry!). Once I do that with your original regex, compilation time goes way down:

$ regex-cli debug dfa dense -q -a '((br|e|go|hel|m|mang|me|mønste|no|ogs|tes|v|veldi|wildc)[\x{0000}-\x{024f}]*)|(b[\x{0000}-\x{024f}]*r)|(int[\x{0000}-\x{024f}]*rnt)|(m[\x{0000}-\x{024f}]*n)|(mul[\x{0000}-\x{024f}]*gens)|(ne[\x{0000}-\x{024f}]*ne)|(og[\x{0000}-\x{024f}]*å)|(wil[\x{0000}-\x{024f}]*rd)|([\x{0000}-\x{024f}]*(eksten|gså|ldcards))'
            parse time:  159.533µs
        translate time:  74.817µs
      compile nfa time:  431.497µs
compile dense dfa time:  6.098245ms
      dense dfa memory:  113980
 dense alphabet length:  46
          dense stride:  64

But at least with my branch, you don't need two separate regexes. You can use the same one for both.

Anyway, would still love to see your benchmark program if that's feasible to do, so I could explore your problems more precisely.

@audunhalland
Author

The code is part of a "Highlighter" module in a search engine application using https://github.com/tantivy-search/tantivy. It's not open source and I'm developing it for a client, but sure, I can extract just this module into a separate repo. I can probably get to it tonight.

@BurntSushi
Owner

@audunhalland Yeah, I mean the benchmarking aspect. E.g., you mention running something 400 times; I guess I assumed that was a small little benchmark program. But yeah, the important bit is making sure I know how you're compiling things and what not.

@audunhalland
Author

Here's the code:
https://github.com/audunhalland/regex-test/blob/main/src/token_matcher.rs#L172

main has been set up to run only this test. With cargo run --release, compiling the automata is of course much faster.

@BurntSushi
Owner

@audunhalland Thank you! That's really useful data. I don't see anything to be done at the moment other than to perhaps invent your own heuristics from the patterns you're seeing. Once I get to the point where I need to do that, I'll be sure to include these regexes in the benchmark suite. :-)

@cessen

cessen commented Aug 1, 2021

I suspect this would be coupled with an option in this crate (which I plan to add) that would allow the construction of the DFA to fail once it got beyond a certain size (configured by the caller).

We were discussing this recently over at the Helix project, actually. We're planning to switch over to regex-automata for searching relatively soon, but ideally want to avoid exponential explosion when building the DFA.

A stop-gap solution we're considering is to abuse the StateID parameter of RegexBuilder::build_with_size() to limit the size of the DFA (by using u16, for example). Then we report an error to the user if their regex results in a too-large DFA. However, that's not the intended purpose of the StateID APIs, of course, and it also gives a very limited set of max sizes to choose from. So having an API that's actually meant for this would be great.

(Of course, ultimately we probably want to use something like a lazy DFA or an NFA. But in practice this would be a good solution for the time being, especially given the typical searches done in a text editor.)
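A sketch of that stop-gap, assuming the regex-automata 0.1 API (error handling simplified):

```rust
// Requesting u16 state IDs means construction fails once the DFA needs
// more states than u16 can address, which doubles as a crude size limit.
type SmallRegex = regex_automata::Regex<regex_automata::DenseDFA<Vec<u16>, u16>>;

fn build_size_limited(pattern: &str) -> Result<SmallRegex, String> {
    regex_automata::RegexBuilder::new()
        .build_with_size::<u16>(pattern)
        // Surface this as a "pattern too large" style error to the user.
        .map_err(|e| format!("regex too large (or invalid): {}", e))
}
```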

@BurntSushi
Owner

Yeah that is a creative use. regex-automata 0.2 will have an API that gives a more explicit way to error if the DFA gets too large.

And as of about a week ago, I do have the lazy DFA implemented. That will probably give you what you want, although there are some restrictions on how the state IDs are used. Namely, I believe the ultimate restriction will be: "the only state ID you can feed to the transition function is the state ID previously returned by the transition function." (These restrictions can be given up in exchange for allowing unbounded memory growth.)
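Purely as illustration, the restriction can be expressed against a hypothetical lazy-DFA interface (these names are made up, not the real API):

```rust
// Hypothetical interface, for illustration only.
trait LazyDfa {
    type StateId: Copy;
    fn start_state(&self, cache: &mut Cache) -> Self::StateId;
    fn next_state(&self, cache: &mut Cache, sid: Self::StateId, byte: u8) -> Self::StateId;
    fn is_match_state(&self, sid: Self::StateId) -> bool;
}

struct Cache; // placeholder for the lazy DFA's transition cache

fn is_match<D: LazyDfa>(dfa: &D, cache: &mut Cache, haystack: &[u8]) -> bool {
    let mut sid = dfa.start_state(cache);
    for &b in haystack {
        // Legal: `sid` is always the ID returned by the previous call.
        // Stashing an old ID and feeding it back later would not be,
        // since the cache may have evicted and recycled that state.
        sid = dfa.next_state(cache, sid, b);
        if dfa.is_match_state(sid) {
            return true;
        }
    }
    false
}
```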

@BurntSushi
Owner

I think this is perhaps an interesting future direction, but I don't think there is anything actionable at present here.
