
Feature request: unbuffered text matching #852

Closed · CAD97 opened this issue Mar 31, 2022 · 7 comments

@CAD97 (Contributor) commented Mar 31, 2022

(Probably blocked on #656 restructuring, and potentially better exposed as part of regex-automata.)

TL;DR:

In the advanced/"lower level" impl section, or perhaps on applicable Regex types provided by regex-automata:

fn Regex::is_match_streaming<O: Display>(&self, text: O) -> bool;

pub fn is_match_streaming<O: Display>(&self, text: O) -> bool

Returns true if and only if there is a match for the regex in the display representation of the object given.

If you have or can get &str for the text to match, you should use is_match instead. This method is potentially useful to reduce allocations, as you don't have to first write the object's display representation into a buffer to match it.

However, this can lead to less performant regex execution. Normal regex execution performs multiple optimizations that rely on having the full text available to scan, which cannot be done in a streaming fashion. Instead, this method always has to run the most general regex VM byte-by-byte over the input.

To test against streaming text which is not solely a single object's Display impl, use format_args! to manufacture a Display impl with the text you want to match.

Examples

// Fictional people from a name generator.
// Any similarity to actual persons, living or dead, is purely coincidental.
let tiffany = "Ms Tiffany Northman";
let raymond = "Dr Raymond Hyde";
let regex = Regex::new("(?i)tiffany .* loves .* raymond").unwrap();
assert!(regex.is_match_streaming(format_args!("{tiffany} loves {raymond}")));

Motivation

TL;DR: the matchers crate, as used in tracing-subscriber.

The regex crate implements regular expression matching on strings and byte arrays. However, in order to match the output of implementations of fmt::Debug and fmt::Display, or by any code which writes to an instance of fmt::Write or io::Write, it is necessary to first allocate a buffer, write to that buffer, and then match the buffer against a regex.

In cases where it is not necessary to extract substrings, but only to test whether or not output matches a regex, it is not strictly necessary to allocate and write this output to a buffer. This crate provides a simple interface on top of the lower-level regex-automata library that implements fmt::Write and io::Write for regex patterns. This may be used to test whether streaming output matches a pattern without buffering that output.

Users who need to extract substrings based on a pattern or who already have buffered data should probably use the regex crate instead.
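
For concreteness, a minimal sketch of what using that interface looks like. This assumes matchers 0.1's Pattern type with display_matches/debug_matches helpers; the method names here are from memory and may differ:

use matchers::Pattern;

fn main() {
    // Build the streaming matcher once, up front.
    let pattern = Pattern::new("(?i)tiffany").unwrap();
    // The value's Display/Debug output is fed into the DFA byte-by-byte;
    // no intermediate String is ever allocated.
    assert!(pattern.display_matches(&"Ms Tiffany Northman"));
    assert!(!pattern.debug_matches(&"Dr Raymond Hyde"));
}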

tracing's span/event structured logging facilities capture non-primitive types by their Debug formatting. Such formatting is expected to be relatively cheap (cheaper to do multiple times than to allocate a buffer and do it a single time, anyway). In most cases, the captured values will only be formatted once, into a large buffer used for recording the event, with all recorded fields printing into the same buffer.

Where the regex comes in is filtering. Filters can observe the recorded fields and filter spans/events in or out depending on the value of a field, and this is primarily done via regex matching. When doing this filtering, tracing-subscriber wants to avoid allocating a buffer for the fields under test, just to throw it away afterwards. This is done by having the formatting machinery drive a regex-automata DFA directly, byte-by-byte.

tracing-subscriber filters are not supposed to be resilient to untrusted construction, but using the regex crate instead of regex-automata would reduce the failure case from regex compilation taking near-unbounded CPU/RAM to filter application taking O(n) time in the length of the filter string.

Implementation sketch

For simplicity's sake, I'm providing an implementation sketch using regex_automata@0.2.0::hybrid::dfa::DFA. This implementation is derived from matchers', which is MIT licensed¹.

impl DFA {
    fn is_match_streaming(
        &self,
        cache: &mut Cache,
        text: impl Display,
    ) -> Result<bool, CacheError> {
        let mut matcher = StreamingMatcher {
            dfa: self,
            // NB: we need a start_state method that doesn't take the
            //     text as input, like the one in regex-automata@0.1
            state: self.start_state_anchored(cache, None),
            cache,
        };
        use io::Write;
        write!(&mut matcher, "{}", text).expect("infallible Write impl");
        matcher.finish()
    }
}

struct StreamingMatcher<'a> {
    dfa: &'a DFA,
    state: Result<LazyStateID, CacheError>,
    cache: &'a mut Cache,
}

impl<'a> io::Write for StreamingMatcher<'a> {
    fn write(&mut self, bytes: &[u8]) -> io::Result<usize> {
        for &byte in bytes {
            if self.is_dead() {
                // Once the DFA is dead (or the cache errored), no
                // further input can change the outcome.
                break;
            }
            self.advance(byte);
        }
        // Claim the whole buffer was consumed so that write! (which
        // uses write_all) doesn't loop retrying the remainder.
        Ok(bytes.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl StreamingMatcher<'_> {
    fn advance(&mut self, input: u8) {
        let Ok(state) = self.state else { return };
        // TODO: use next_state_untagged/next_state_untagged_unchecked
        // TODO: replace CacheError with MatchError::GaveUp
        self.state = self.dfa.next_state(self.cache, state, input);
    }

    fn is_dead(&self) -> bool {
        // TODO: handle quit state
        match self.state {
            Ok(state) => state.is_dead(),
            Err(_) => true,
        }
    }

    fn finish(self) -> Result<bool, CacheError> {
        let state = self.state?;
        let state = self.dfa.next_eoi_state(self.cache, state)?;
        Ok(state.is_match())
    }
}
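
For illustration, driving the sketch end-to-end would look something like this. is_match_streaming and start_state_anchored are the hypothetical methods from the sketch above; DFA::new and create_cache are real regex-automata 0.2 hybrid APIs:

use regex_automata::hybrid::dfa::DFA;

fn main() {
    let dfa = DFA::new("(?i)tiffany .* loves .* raymond").unwrap();
    let mut cache = dfa.create_cache();
    let tiffany = "Ms Tiffany Northman";
    let raymond = "Dr Raymond Hyde";
    // format_args! manufactures a Display impl, so the formatted text
    // streams straight into the DFA without hitting a buffer.
    let matched = dfa
        .is_match_streaming(&mut cache, format_args!("{tiffany} loves {raymond}"))
        .unwrap();
    assert!(matched);
}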

(I do not know how such an implementation would handle Unicode word boundaries; as I understand it, those are handled by annoying special cases and not by the bytewise (hybrid) DFA.)

By my quick scanning through the regex implementation, the PikeVM is always driven byte-by-byte, and could be driven in a streaming fashion this same way.

Footnotes

  1. I believe the code as presented here is simple/axiomatic enough that it is uncopyrightable. As much as is permissible by law, I release copyright on my modifications to the code, to the detriment of my heirs. (My modifications can be considered licensed under the Unlicense.)

@BurntSushi (Member) commented

This looks like a special case of #425 to me. And yes, I would expect that this would show up in regex-automata. Or rather, I think what I would prefer for the time being is that folks would develop these sorts of matching routines by writing their own regex engines. It sounds a little crazy maybe, but that's actually one of my goals with regex-automata: it should be possible to use the finite state machines to write your own search implementation using those finite state machines. (Just like you did above.)

> // NB: we need a start_state method that doesn't take the
> //     text as input, like the one in regex-automata@0.1
> state: self.start_state_anchored(cache, None),

Hmmm, yes, indeed. The start state computation does require doing look-behind (for start anchors and one half of the word boundary assertion). But this is at most one byte and it only happens when start > 0. When start == 0 (which I would imagine is the case for this problem), then bytes is never accessed. So I do believe that we can provide a more elementary method that would work here. The start_state_reverse case is harder, because the check is no longer start == 0, but rather, end == bytes.len(). But, if you're doing a reverse search, then it's likely you have the buffer in the first place anyway.

I do in part think that #425 subsumes this, but I would like to keep this open since there is a very clear and actionable use case that drives things here. It's also a bit simpler than the general case.

> (I do not know how such an implementation would handle Unicode word boundaries; as I understand it, those are handled by annoying special cases and not by the bytewise (hybrid) DFA.)
>
> By my quick scanning through the regex implementation, the PikeVM is always driven byte-by-byte, and could be driven in a streaming fashion this same way.

Right, the DFAs cannot and probably will never be able to handle Unicode word boundaries. You can use its heuristic support for it, and if the search quits, you could in theory give up and switch over to the PikeVM. But... doing this without a buffer seems tricky/impossible. So the heuristic might not be possible. It seems likely then that you'll need to do:

if nfa.has_unicode_word_boundary() {
    // do streaming version of PikeVM
} else {
    // do streaming version of lazy DFA
}

The PikeVM has its own problems with streaming, regrettably. For example, it will try to look up to 4 bytes behind and up to 4 bytes ahead in order to deal with Unicode word boundaries. (Up to 1 byte behind or ahead for other types of assertions, currently.)

So there are problems to be worked out... One simple way forward for y'all would be to do this instead:

if nfa.has_unicode_word_boundary() {
    // copy string to buffer, run normal regex
} else {
    // use lazy DFA
}

And that might be good enough. Unless, of course, you find that most queries in practice really want Unicode word boundaries.
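
The buffering arm of that fallback needs nothing beyond std and the regex crate. A minimal sketch (buffered_is_match is just an illustrative name):

use std::fmt::{Display, Write};
use regex::Regex;

// Buffer the value's Display output, then run an ordinary search over it.
fn buffered_is_match(regex: &Regex, value: impl Display) -> bool {
    let mut buf = String::new();
    // Writing into a String cannot fail.
    write!(buf, "{value}").expect("fmt::Write for String is infallible");
    regex.is_match(&buf)
}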

Although, it seems likely that you are currently blocked by the start_state method. But there is a hack for you. If you are indeed always starting the search with start == 0, then you should be able to call start_state_forward(cache, None, &[], 0, 0) and that will ultimately do what you want. I can't guarantee that it will always work, but it should work today.
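
In context, that hack looks like this (a sketch against the 0.2 hybrid API):

use regex_automata::hybrid::dfa::DFA;

fn main() {
    let dfa = DFA::new("abc").unwrap();
    let mut cache = dfa.create_cache();
    // With an empty haystack and start == end == 0, the look-behind
    // that start state computation performs never reads the input, so
    // this yields the start state a streaming search needs.
    let start = dfa
        .start_state_forward(&mut cache, None, &[], 0, 0)
        .unwrap();
    let _ = start;
}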

Finally, I feel compelled to remind you of the warning at the top of the regex-automata 0.2 docs. :-) https://docs.rs/regex-automata/0.2.0/regex_automata/index.html

@CAD97 (Author) commented Mar 31, 2022

> Finally, I feel compelled to remind you of the warning at the top of the regex-automata 0.2 docs. :-)

matchers currently uses regex-automata@0.1, so yeah, this is more a "nice future" kind of feature request, nothing's really blocked. :-)

Once regex-automata@0.2 is more cooked, matchers can implement these patterns on top of the building blocks provided.

(I'd like to remove the (?:){HUGE} footgun, but filters are currently assumed to be trusted, so it's low/"nice future" priority.)

@BurntSushi (Member) commented

Aye. And to be clear, I hope there are no more 0.2.x releases. If things go my way, the next release will be 0.3.0 and that will roughly coincide with: 1) moving regex-automata into this repo and 2) building regex on top of regex-automata. Those are my victory conditions. :-)

> (I'd like to remove the (?:){HUGE} footgun, but filters are currently assumed to be trusted, so it's low/"nice future" priority.)

Hmmm. I don't believe regex-automata has ever suffered from this footgun? But maybe I'm wrong.

@BurntSushi (Member) commented

On the topic of footguns, if you haven't read the security advisory, you might enjoy doing so: GHSA-m5pq-gvj9-9vr8

In summary, there are always ways of making regex compilation take quite a bit of time. The (?:){HUGE} footgun was a little special in that it was particularly egregious and also kinda didn't respect our API guarantee for size limits. But there is no API guarantee that, e.g., "regex compilation will always take less than 1 second." :)

@CAD97 (Author) commented Mar 31, 2022

Yeah, regex-automata suffers from a similar issue; I just timed regex_automata::hybrid::dfa::DFA::new("(?:){4294967295}").unwrap() and it took 101.5 seconds in release mode. For comparison, regex::Regex::new with regex@1.5.4 took 71.5 seconds.
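
For reference, the timing was a plain wall-clock measurement around the construction call:

use std::time::Instant;

fn main() {
    let start = Instant::now();
    // All of the time goes into eagerly expanding the bounded
    // repetition while building the NFA.
    let _dfa = regex_automata::hybrid::dfa::DFA::new("(?:){4294967295}").unwrap();
    println!("built in {:?}", start.elapsed());
}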

But you also very explicitly don't offer mitigation for this in regex-automata:

> Compilation can take an exponential amount of time and space in the size of the regex pattern. While most patterns do not exhibit worst case exponential time, such patterns do exist. For example, [01]*1[01]{N} will build a DFA with 2^(N+1) states. For this reason, untrusted patterns should not be compiled with this library. (In the future, the API may expose an option to return an error if the DFA gets too big.) [regex-automata@0.1.10, regex-automata@0.2.0::dfa]

This might break the lazy DFA's guarantees, though, since its documentation sort of implies being linear in pattern length / text length, and I don't think this case is:

> It avoids the worst case exponential time complexity of DFA compilation by guaranteeing that it will only build at most one state per byte searched. While the worst case here can lead to a very high constant, it will never be exponential. [regex-automata@0.2.0::hybrid]

@BurntSushi (Member) commented Mar 31, 2022

OK... so much to untangle here. Haha. This is why I didn't want to publish regex-automata 0.2. It just isn't done yet. I'll just lay out a sequence of statements that I think collectively will help to clarify things:

  • Creating a lazy DFA does essentially no additional work beyond building the NFA. The actual additional work that a lazy DFA needs to do isn't done until search time.
  • That (?:){HUGE} takes a long time doesn't necessarily mean things aren't linear. The catch here is that the size of the regex is unfortunately not len(pattern), but rather, len(pattern-with-all-bounded-repetitions-expanded).
  • Currently, regex-automata sets no size limits on things. If you enable a size limit, then your lazy DFA construction should/will fail in a reasonable amount of time. For example, try regex-cli debug nfa thompson '(?:){200000}' --nfa-size-limit 2000000 (see the sketch after this list for the library-side equivalent).
  • The exponential time complexity warning at build time definitely only applies to regex_automata::dfa in 0.2.0.
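
Here is the library-side sketch referenced in the third bullet. It assumes the thompson::Config::nfa_size_limit knob (which regex-cli's --nfa-size-limit flag maps to) is available on the hybrid builder in the version you're using:

use regex_automata::{hybrid::dfa::DFA, nfa::thompson};

fn main() {
    // Cap the Thompson NFA at roughly 2MB; construction then fails
    // fast instead of expanding the repetition without bound.
    let result = DFA::builder()
        .thompson(thompson::Config::new().nfa_size_limit(Some(2_000_000)))
        .build("(?:){200000}");
    assert!(result.is_err());
}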

So for example, if I do end up sticking with no size limits by default for regex-automata, then that would be pretty clearly documented at the crate level for example. I don't know if I will stick to "no limits" as a default though. (Certainly a limit will remain a default in the regex crate proper.)

@BurntSushi (Member) commented

I'm going to close this out in favor of the monster #425 issue. Namely, I don't think we need more than one issue for the streaming use case.

Once #656 is done, my hope is that it will enable some experimentation. Note that this may include writing your own PikeVM if you need Unicode word boundary support. It is possibly easier than it sounds, given that regex-automata will expose a way to compile an NFA. But I don't currently have any plans to pursue streaming APIs in regex-automata proper. I want to see people build their own first so we can get a better idea of what is actually required, and whether it makes sense to add streaming APIs.

@BurntSushi closed this as not planned on Mar 6, 2023.