Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report partial matches #678

Closed
Ekleog opened this issue May 14, 2020 · 7 comments
Closed

Report partial matches #678

Ekleog opened this issue May 14, 2020 · 7 comments

Comments

@Ekleog
Copy link

Ekleog commented May 14, 2020

Hello,

First, this issue's description may overlap a bit with #425, by attempting to do what #25 wanted.
However, I believe the solution described below to handle it would probably be much easier, and I think it would be a good thing to handle my use case even if #425 were implemented (even though it could be worked around by using longest_potential_match_len in such a world).
So, I thought it was worth it suggesting this intermediate solution, that would hopefully be reasonable to implement even though #425 appears to not be as of today, and solve at least some issues :)

Problem description

I'm trying to use regexes from streaming nom parsers, for parsing SMTP information.

In order to do so, the nom way of doing things is basically to, when receiving a packet, optimistically parse it, and if the parser reports that it needs additional data to proceed, then wait for enough data to come in and parse again -- this way most matches are done in one pass, even though it means one additional case for the rare case where a message was split between two packets.

However, in order to use regexes from this, there has to be a way to be able to know whether the regex didn't match because it couldn't match, or if it didn't match because it didn't have a complete match yet.

Proposed solution

The solution I can think of is simple with only basic theory of automata -- not having any idea how regex's engine works in practice, I cannot say whether it would be simple in practice, though.

The idea is the following: if the automaton has reached a failing state before the end of the input, return that there can be no match. If it has not, and reaches the end of the input in a start or intermediate state, return that there may be a match that would come up if there were more input.

Note that this scheme works well only for regexes that start with ^ or similar -- but any regex that doesn't would potentially match later on, which means that this would be working as intended.

As a further evolution that might be enough to handle part of #425's problem too, it might make sense to also report when the regex was last in its start state, so that the caller could just buffer starting from the last time the regex was in its start state and re-run from that last position the next time. It's less optimized than doing it all in one pass, but hopefully would be fast enough (by making the user rescan only the parts of the buffer that could contain a match), while unlocking additional use cases.

What do you think about this idea? Would it be simple enough to deserve implementing, until #425 would be solved, or would I be better off implementing this myself based on the regex_automata crate?

@BurntSushi
Copy link
Member

I think you would definitely be better off using regex-automata. It might not be difficult to implement your suggestion in the actual regex engines themselves, but exposing that information in the API is non-trivial.

@BurntSushi
Copy link
Member

it might make sense to also report when the regex was last in its start state, so that the caller could just buffer starting from the last time the regex was in its start state and re-run from that last position the next time

Note that this isn't correct in general. The start state might itself consume input.

@Ekleog
Copy link
Author

Ekleog commented May 14, 2020

Oh that's right, so I guess this further evolution would require additional changes, by splitting the start state between a start state that hasn't consumed any input yet and a start state that has consumed input, and this would make things much harder to implement for the extension because then any transition to the start state would have to be considered, to either point to the nothing-consumed start state or to the something-consumed start state.

As for how to expose this information in the API… intuitively, as it doesn't incur any performance cost, I'd have done it by taking the methods that return Option<T> and instead return Result<T, enum { WillNeverMatch, MayMatchLater }>. Now, I don't know how you feel about backwards-compatibility and API-breaking changes, and I'm not really sure how to incrementally improve the current API, as all good function names are taken :/ something like adding a _streaming at the end?

(Anyway, I'm going to try it out with regex-automata for now, which probably solves my current issue, thank you for your feedback! :))

@BurntSushi
Copy link
Member

BurntSushi commented May 15, 2020

Now, I don't know how you feel about backwards-compatibility and API-breaking changes

I'm sure there will be a regex 2.0 some day, but I won't do such a thing lightly and this kind of change definitely doesn't elevate to that level unfortunately. (Even if I agreed with your proposal. :P)

I'd have done it by taking the methods that return Option<T> and instead return Result<T, enum { WillNeverMatch, MayMatchLater }>

My initial reaction to this is that it makes the API more complex for a rather niche use case.

With that said, I have actually experimented with such an API in regex-automata. In particular, I've been working on #656, which involves re-working most of regex-automata. In particular, I've added try_* variants of a lot of the matching methods. For example: Regex::try_find_leftmost returns Result<Option<Match>, NoMatch> instead of just Option<Match>. (Warning: I have not put much, if any, effort into docs at the moment. They are likely outdated and wrong in places.)

Now, NoMatch isn't quite what you want here, since it actually indicates an error that occurred while searching. In this case, the caller can't make any assumptions about whether a match did or didn't occur. (Specifically, it corresponds to one of two cases: the machine hit a case it could not process and quit, or the machine "decided" it was inefficient and gave up.)

When I originally wrote NoMatch, it actually had three variants: the two I just described in addition to None. In other words, the API was Result<Match, NoMatch>. It turned out that this was actually quite inconvenient to use, because in some contexts, you want to propagate the "quit" and "gave up" variants as normal errors and handle the "didn't match" case differently. So it leads to a lot of explicit case analysis. It also presents problems for implementing iterators. The Option<Match> API extends naturally to iterators, but the Result<Match, NoMatch> API does not. You wind up having a next method that returns Option<Result<Match, NoMatch>>, but where the NoMatch::None variant can never occur, because it signals the end of iteration. So it's just super awkward.

So the API you would want, I think, would most naturally be:

// This corresponds to `NoMatch` in the regex-automata docs I linked.
type MatchError = ...;

enum NoMatch {
    WillNeverMatch(usize),
    MayMatchLater(usize),
}

fn try_find_leftmost(...) -> Result<Result<Match, NoMatch>, MatchError>;

This is a little ugly, but maybe it is the right thing to do at the regex-automata level. On the other hand, I don't think it would be that bad to demand that callers implement their own search loop for these kinds of cases, which wouldn't be too bad. Especially since this is really only useful for the anchored case.

More generally, my hope is that regex-automata will become the right place for these kinds of lower level and more complex APIs that I'm not sure really belong in regex proper. regex-automata is also something I'm more willing to evolve rapidly and am generally okay with it having a much higher compexity budget that regex. Namely, regex is something that ideally everyone can use and use easily without really know much about how regexes work. But in order to use regex-automata effectively, I would generally hope that folks would expect to put in a bigger investment to understand how it works.

So I think that's my high level feedback for the time being.

@Ekleog
Copy link
Author

Ekleog commented May 17, 2020

What, you don't want to make an API-breaking change just for my very specific use case? :P (more seriously, I was more thinking it could be something to stash on top of a list of changes queued for a potential 2.0 release whenever it comes)

So first: I have implemented what you suggested with regex-automata, here -- it works great, thank you for the hint!

For your idea of API for try_find_leftmost, it looks indeed like what would be best! The reason why I was suggesting replacement of Option<Match> with Result<Match, WillNeverMatch|MayMatchLater> is that it would allow simply replacing if let Some(foo) with if let Ok(foo), given the error variant would just be a split of the previous None case between a WillNeverMatch and a MayMatchLater case :)

With your try_find_leftmost, it's equivalent to going from Result<Option<Match>, MatchError> that AFAIU is your current API to Result<Result<Match, NoMatch>, MatchError> -- the same thing I was suggesting :)

About users having to write their own search loop, it totally makes sense to me! I think it should also be possible to upstream this specific loop into nom (related: rust-bakery/nom#1155), so that each other user wouldn't have to rewrite their own :)

@BurntSushi
Copy link
Member

With your try_find_leftmost, it's equivalent to going from Result<Option<Match>, MatchError> that AFAIU is your current API to Result<Result<Match, NoMatch>, MatchError> -- the same thing I was suggesting :)

Right. My main point I was trying to express was how awkward it is, I guess.

Note that when trying to use regex-automata in other crates, it would be good to mention that it is first and foremost an implementation detail of regex. This doesn't mean one shouldn't use it of course, but that the same kind of API stability one might expect with regex won't really apply to regex-automata.

@BurntSushi
Copy link
Member

I'm going to close this in favor of #656. Namely, once regex-automata 0.3 is released, you should have more options. You might still need to implement your own search routine using a lazy DFA for example, but this is eminently doable in a couple dozen lines. Otherwise, I don't think it's worth complicating every search API to report this information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants