Skip to content
This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Latest commit

 

History

History
165 lines (139 loc) · 7.91 KB

PLANS.md

File metadata and controls

165 lines (139 loc) · 7.91 KB

pattern_limit should not be defined inside nfa::thompson, but rather at the top-level.


Main problem right now is exemplified by the set60 and set70 failing tests. In particular, when finding the starting position while matching multiple regexes simultaneously, the reverse search is messed up. The reverse search doesn't depend on which regex matched in the forward direction, which means it won't always find the correcting starting location. Unfortunately, the only way to fix this, as far as I can tell, is to add a group of start states for every regex in the DFA. Then once we do the reverse search, we need to choose the correct start state based on which regex matched in the forward direction.

This is a nasty change.

So it looks like this only applies when doing an overlapping search in reverse to find the start of a match. That means we should make this configurable but enable it by default for the reverse automata. It should be configurable so that folks can construct a regex that doesn't have the ability to do overlapping searches correctly. If an overlapping search is attempted with a reverse automaton that lacks starting states for each pattern, then the implementation should panic.

BUT! It is also convenient to provide this option in general for folks that want a DFA that can match any pattern while also being able to match a specific pattern.

Straw man:

  • Update dense::Config to have a starts_for_each_pattern option. It should be disabled by default.
  • In RegexBuilder::build_many_with_size tweak the reverse DFA configuration to have the aforementioned option enabled.
  • It would be interesting to add new APIs to Regex that support matching specific patterns, but I think this is a complication. If we did want to do this, then we should just add it to the _at variants and leave the rest of the API untouched.
  • Add a pattern_id: Option<PatternID> parameter to each of the five *_at methods on the dfa::Automaton trait. A value of None retains the existing behavior. A Some value means that the starting state for that specific pattern must be chosen, which in turn implies an anchored search. (This means starts_for_each_pattern has utility for single-pattern DFAs since it makes it possible to build a DFA that can do both unanchored and anchored searches.)
  • Thread this new parameter down into the various functions in dfa::search all the way down into init_fwd and init_rev. These functions will then pass it to dfa.start_state_{forward,reverse}.
  • This is where things get gruesome since we now need to completely re-work how start states are represented in dense and sparse DFAs and it needs to be configurable. It looks like the Start type from dfa::automaton can basically remain unchanged, since it still represents one of the four possible starting states that will need to be applied for every pattern.
  • For dfa::dense, change StartList to StartTable. Currently, its only header is the state ID count, which is always 4. We'll want to change this to the stride and add a new header value that encodes the number of patterns. When the number of patterns is zero, then existing behavior is preserved and represents the case where starts_for_each_pattern is disabled (or in the case of an empty DFA). When non-zero, a table of starting state IDs is encoded with each row corresponding to the 4 starting states for each pattern. Before this table (even if it's empty), the 4 starting states for the entire DFA are encoded.
  • For dfa::sparse, do the same as above. They are essentially the same right now anyway, with the only difference being that sparse DFAs use &[u8] instead of &[S] (because sparse DFAs don't have any alignment requirements).
  • Modify DFA::empty to accept a starts_for_each_pattern bool that, when true, creates a start table with the header, the start states for the entire DFA and a row of start states for each pattern. When false, no rows are added.
  • Expose whether there are starting states for each pattern via a predicate on the DFA.
  • Modify the determinizer's add_starts method to basically do what it does, but also do it for each pattern when the DFA is configured for it. It should continue to reuse states as appropriate or not generate new states if they aren't needed. This will want to use the NFA::start_pattern method, which provides the starting NFA state ID for the given pattern.
  • Fix the dense->sparse conversion. At this point, this piece should be fairly straight-forward since the sparse representation of starting states is basically identical to the dense representation.

At this point, I think the bug should resolve itself.

^^^^ DONE! IT WORKS!


Add top-level SyntaxConfig (or some such) that has all of the regex-syntax options forwarded, but with automata oriented docs. Then use this for all of the engines instead of having to repeat every option for every builder.


These produce different results. PCRE2 looks correct. Basically, we should be using the context around the at position correctly, which we aren't doing right now. Seems tricky to get right, particularly when confirming the match with a reverse DFA.

Maybe our 'at' functions need to take a full range... Sigh. This is indeed what RE2 does. GAH.

fn main() { let re = regex::Regex::new(r"(?-u)\b\sbar").unwrap(); let s = "foo bar baz"; println!("{:?}", re.find_at(s, 3).map(|m| m.as_str()));

let re = pcre2::bytes::Regex::new(r"\b\sbar").unwrap();
let s = "foo bar baz";
println!("{:?}", re.find_at(s.as_bytes(), 3).unwrap());

}

^^^^ This is fixed now, but we still need to find a way to add test coverage for "context" searches. It'd be nice to do this automatically, but we'll probably just added a new 'context = [start, end]' option.


  • Create regex-test crate, based on glob-test. Try to anticipate the needs for the full regex test suite.
    • See if we can clean up tests.
      • Provide a way to mark a test as expensive.
      • Provide a way to test is_match_at and find_at.
      • Test shortest_match_at too? Huge pain. Add tests for it.
      • Port ALL tests from the regex crate. Will probably need a way to mark a test as skipped.
      • Document tests better.
  • Find a way to remove byteorder dependency.
  • Reorganize crate API:
    • Have errors contain Box<Error+Send+Sync> instead of String.
    • Make errors non-exhaustive.
    • Audit StateID trait for safety.
    • Brainstorm hard about DFA trait and the fact that DenseDFA and SparseDFA have inefficient implementations of some methods. Maybe use multiple traits? Answer: get rid of premultiply/classes knobs and just enable them by default. Should remove a huge amount of code.
    • Check whether unsafe is really needed to eliminate bounds checks. Use micro-benchmarks and bigger CLI workloads using regex-automata-debug.
    • Re-write module docs for dfa as they are no longer top-level. Keep most.
    • Retain any pertinent top-level crate docs, but don't rewrite yet.
    • Clean up builders if we can. e.g., Determinizer, minimizer, it's all a mess right now.
    • Clean up and add 'always_match' and 'never_match' constructors for every regex engine.
    • See about supporting ^, $, \A, \z, \b and \B in DFAs. Do the non-Unicode version of \b unfortunately. Carefully scrutinize how the regex crate's lazy DFA does it and try to make it comprehensible. Done! Except for the part about making it comprehensible.
  • Rethink prefilters?
  • Add regex-automata-generate CLI tool. This should just be a copy of the ucd-generate dfa and ucd-generate regex commands.

Then build new public nfa sub-module.

  • For Unicode \b, generate \w DFA (forwards and reverse) and embed it into source for fast checking. That way, we don't need to ever do explicit UTF-8 decoding anywhere. Yay.

Then lazy sub-module.

Then onepass.

Then jit.

... and beyond? CRAZY. But it can be done! Build STRONG base layers.