Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search via regex-automata #211

Closed
wants to merge 5 commits into from
Closed

Search via regex-automata #211

wants to merge 5 commits into from

Conversation

archseer
Copy link
Member

No description provided.

@archseer archseer requested a review from pickfire June 10, 2021 14:08
/// NOTE: based on DFA::find_at
fn find(&self, text: RopeSlice, dfa: &impl DFA) -> Option<usize> {
// TODO: check this inside main search
// if dfa.is_anchored() && start > 0 {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally I should keep these checks, but then start & end need to be passed into find/rfind instead of doing the slicing inside prev/next

@kirawi
Copy link
Member

kirawi commented Jun 10, 2021

Are optimizations within the scope of this PR, or do you want to defer that to a later PR?

Comment on lines +136 to +141
let mut state = dfa.start_state();
let mut last_match = if dfa.is_dead_state(state) {
return None;
} else if dfa.is_match_state(state) {
Some(0)
} else {
None
};

let mut chunk_byte_offset = 0;

for chunk in text.chunks() {
for (i, &b) in chunk.as_bytes().iter().enumerate() {
state = unsafe { dfa.next_state_unchecked(state, b) };
if dfa.is_match_or_dead_state(state) {
if dfa.is_dead_state(state) {
return last_match;
}
last_match = Some(chunk_byte_offset + i + 1);
}
}
chunk_byte_offset += chunk.len();
}

last_match
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks about right to me, although you should definitely write tests for it.

Also, I'd suggest using dfa.next_state and avoid the unsafe unless you can witness a measurable perf improvement. The only thing next_state_unchecked does is elide bound checks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pickfire This is almost a direct copy of DFA::find_at/rfind_at, I've mainly just added another loop that feeds chunks into the dfa.

return;
}

match Searcher::new(input) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed to be blocking the input, even when I search (debug) "hello world" every character takes like few seconds and I need to wait. Maybe we should spawn it as a blocking task?

@pickfire
Copy link
Contributor

Is it possible to leverage something like regex to get more optimizations for now? But come back later to optimize it ourselves?

@pickfire
Copy link
Contributor

It seemed to be fast enough when I run it on release mode searching for stuff within sqlite.c but on debug mode when I just search "hello world", after hello, every character is like blocking for a few seconds even when it never will work (since not even hello is there). Maybe we could have pass in history to searcher so if there are no results and (if possible) similar query then we stop searching altogether? The regex part I don't quite understand myself.

I think the best person to review this is probably @BurntSushi himself. Let's summon him and hope he can save us.

@pickfire
Copy link
Contributor

pickfire commented Jun 10, 2021

I got a panic, in https://github.com/clibs/sqlite/blob/master/sqlite3.c, search for This file, press n and press N.

backtrace
thread 'main' panicked at 'Attempt to slice past end of RopeSlice: slice end 5385123, RopeSlice length 5385122', /home/ivan/.cargo/registry/src/github.com-1ecc6299db9ec823/ropey-1.2.0/src/slice.rs:784:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/panicking.rs:493:5
   1: std::panicking::begin_panic_fmt
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/panicking.rs:435:5
   2: ropey::slice::RopeSlice::slice
             at /home/ivan/.cargo/registry/src/github.com-1ecc6299db9ec823/ropey-1.2.0/src/slice.rs:784:9
   3: helix_core::search::Searcher::search_prev
             at ./helix-core/src/search.rs:113:20
   4: helix_term::commands::_search::{{closure}}
             at ./helix-term/src/commands.rs:770:29
   5: core::option::Option<T>::or_else
             at /home/ivan/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/option.rs:818:21
   6: helix_term::commands::_search
             at ./helix-term/src/commands.rs:768:36
   7: helix_term::commands::search_prev
             at ./helix-term/src/commands.rs:817:5
   8: helix_term::ui::editor::EditorView::command_mode
             at ./helix-term/src/ui/editor.rs:553:21
   9: <helix_term::ui::editor::EditorView as helix_term::compositor::Component>::handle_event
             at ./helix-term/src/ui/editor.rs:641:33
  10: helix_term::compositor::Compositor::handle_event
             at ./helix-term/src/compositor.rs:112:19
  11: helix_term::application::Application::handle_terminal_events
             at ./helix-term/src/application.rs:144:32
  12: helix_term::application::Application::event_loop::{{closure}}
             at ./helix-term/src/application.rs:108:21
  13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /home/ivan/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/mod.rs:80:19
  14: helix_term::application::Application::run::{{closure}}
             at ./helix-term/src/application.rs:266:9
  15: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /home/ivan/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/mod.rs:80:19
  16: hx::main::{{closure}}
             at ./helix-term/src/main.rs:96:5
  17: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /home/ivan/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/mod.rs:80:19
  18: tokio::park::thread::CachedParkThread::block_on::{{closure}}
             at /home/ivan/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/park/thread.rs:263:54
  19: tokio::coop::with_budget::{{closure}}
             at /home/ivan/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/coop.rs:106:9
  20: std::thread::local::LocalKey<T>::try_with
             at /home/ivan/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:272:16
  21: std::thread::local::LocalKey<T>::with
             at /home/ivan/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:248:9
  22: tokio::coop::with_budget
             at /home/ivan/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/coop.rs:99:5
  23: tokio::coop::budget
             at /home/ivan/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/coop.rs:76:5
  24: tokio::park::thread::CachedParkThread::block_on
             at /home/ivan/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/park/thread.rs:263:31
  25: tokio::runtime::enter::Enter::block_on
             at /home/ivan/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/enter.rs:151:13
  26: tokio::runtime::thread_pool::ThreadPool::block_on
             at /home/ivan/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/thread_pool/mod.rs:71:9
  27: tokio::runtime::Runtime::block_on
             at /home/ivan/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/mod.rs:452:43
  28: hx::main
             at ./helix-term/src/main.rs:98:5
  29: core::ops::function::FnOnce::call_once
             at /home/ivan/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:227:5

let snapshot = doc.selection(view_id).clone();

let prompt = Prompt::new(
"search:".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"search:".to_string(),
"search: ".to_string(),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kak doesn't have any space between the prompt and the text either. We can change that but it should be done in a follow-up so we change all the prompts.

@pickfire

This comment has been minimized.

@kirawi
Copy link
Member

kirawi commented Jun 10, 2021

It seemed to be fast enough when I run it on release mode searching for stuff within sqlite.c but on debug mode when I just search "hello world", after hello, every character is like blocking for a few seconds even when it never will work (since not even hello is there). Maybe we could have pass in history to searcher so if there are no results and (if possible) similar query then we stop searching altogether? The regex part I don't quite understand myself.

I think the best person to review this is probably @BurntSushi himself. Let's summon him and hope he can save us.

I think we can just check if the string contains() each invalid pattern, and only adding to the history if there's a new query that got 0 matches. I'm also wondering whether we can make the DFAs Lazy.

@BurntSushi
Copy link

I'm not sure when I'll get a chance to look at this PR, but have y'all seen this section of the docs? Are those tradeoffs acceptable to you? https://docs.rs/regex-automata/0.1.10/regex_automata/#differences-with-the-regex-crate

DFA compilation, especially when Unicode mode is enabled, can be very very slow.

@kirawi
Copy link
Member

kirawi commented Jun 10, 2021

I'm not sure when I'll get a chance to look at this PR, but have y'all seen this section of the docs? Are those tradeoffs acceptable to you? https://docs.rs/regex-automata/0.1.10/regex_automata/#differences-with-the-regex-crate

DFA compilation, especially when Unicode mode is enabled, can be very very slow.

Yes, but we're looking into possible optimizations.

@BurntSushi
Copy link

Sorry, if I'm going to be helpful, I think I need more context. Are there any specific questions y'all have?

@kirawi
Copy link
Member

kirawi commented Jun 10, 2021

Archseer would probably be better to ask, though he's asleep right now.

@kirawi
Copy link
Member

kirawi commented Jun 10, 2021

I prototyped some of the optimizations I suggested, it's a lot better now I think mainly by ignoring known invalid queries (* on alphanumeric only). Admittedly, I don't think the Lazy behaviour made it faster (* on alphanumeric only), that or I missed something. I don't know how difficult it would be to make it async, but maybe it would also be possible to do the regex in a separate thread so the main rendering isn't blocked.

impl Searcher {
pub fn new(pattern: &str) -> Result<Searcher, RegexError> {
// Check case info for smart case
let has_uppercase = pattern.chars().any(|c| c.is_uppercase());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spotted that, should probably copy that behavior over. The current detection matches alacritty, might be worth pointing out there as well.


/// Returns the end offset of the longest match. If no match exists, then None is returned.
/// NOTE: based on DFA::find_at
fn find(&self, text: RopeSlice, dfa: &impl DFA) -> Option<usize> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this code is running afoul of this warning here: https://docs.rs/regex-automata/0.1.10/regex_automata/dense/enum.DenseDFA.html#the-dfa-trait

I would suggest unwrapping the DenseDFA enum into a PremultipliedByteClass, and then just that everywhere.

(In regex-automata 0.2, this perf footgun will disappear since a dense DFA will no longer be an enum. It will just be a PremultipliedByteClass.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I should have paid more attention to the docs.

What else are you changing in 0.2?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a total rewrite. This is the high level plan: rust-lang/regex#656

For DFAs in particular, they are getting support for everything except for Unicode word boundary. (So (?-u:\b), ^ and $ will work.) And they will also have multi-pattern support. But otherwise, all DFAs will be pre-multiplied with byte classes enabled, and state IDs will always be u32. (So there are much fewer type parameters.)

And lots of compilation perf improvements. But not enough to make building large Unicode patterns bearable.

@BurntSushi
Copy link

OK, so the reason why I pointed out the docs to trade offs is that building a DFA up-front can be a serious performance footgun. You can't escape it by "implementing optimizations" on top of it. The key here is that building a DFA can take exponential time and use an exorbitant amount of memory. But maybe this is OK, I'm not sure, it depends on what you're using it for.

@kirawi
Copy link
Member

kirawi commented Jun 10, 2021

OK, so the reason why I pointed out the docs to trade offs is that building a DFA up-front can be a serious performance footgun. You can't escape it by "implementing optimizations" on top of it. The key here is that building a DFA can take exponential time and use an exorbitant amount of memory. But maybe this is OK, I'm not sure, it depends on what you're using it for.

In rust-lang/regex#425, you mentioned hyperscan, do you think that could be a solid alternative for this PR and would you recommend it?

@BurntSushi
Copy link

IIRC, Hyperscan has long compile times too, but probably not as bad as full DFAs. Hard to say.

I've never used Hyperscan in anger. I just know it's probably the most sophisticated open source regex engine. If you can accept it in your build chain and are okay with x86-only, then it's worth a shot?

I think the real answer here is that you need to find a way to benchmark stuff. e.g., If a user types \w{30}, how long does it take? Is it an acceptable amount of time?

@kirawi
Copy link
Member

kirawi commented Jun 11, 2021

It's a bit unfortunate that hyperscan was unwilling to accept support for other architectures : (

@BurntSushi
Copy link

Well, it's a ton of work. All of the SIMD stuff I do is x86 only too. There are just fallbacks for everything. With Hyperscan, AIUI, SIMD is more deeply threaded through the code.

@archseer
Copy link
Member Author

@BurntSushi Thanks for taking a look!

I've been testing it primarily with simple patterns and it seemed to be instant in release mode, but I admit I didn't test it with more complex inputs, \w{30} is definitely unacceptable (I had to kill the process).

This is how the user interacts with search. We show an interactive query, then on Enter we store the final Searcher and reuse it when the user presses n/N to cycle through matches.

The ideas to optimize this:

  • We should probably default to unicode(false) (and perhaps provlde a key combination to toggle this in the search prompt, but this could freeze the editor if the user turns it on with \w{30}). It's not ideal, but since we're searching through code it seems reasonable.

  • Right now we build both the forward and backward searching DFAs, resulting in four compiles. The interactive preview is only using forward search so we could delay compiling the backwards ones until Enter is pressed to accept the query.

  • Seeing the note in the readme about the lack of literal optimizations: I was going to do a cheap trick and scan the pattern for alphanumeric + _ patterns, then use a separate searcher implementation that would simply use regular substring search.

    Just realized this is not that helpful though because we turn on smart case by default, so this could only really be applied if there's uppercase characters in the pattern. I would love some sort of a literal optimization feature in regex-automata or some sort of wrapper crate though, even if that was disabled by a feature flag by default.

  • We could probably build the DFAs in parallel

  • A debounce could be introduced to avoid continuously compiling while the user is typing

@pickfire
Copy link
Contributor

@BurntSushi Is it possible to have like an incrementally constructed DFA or regex? Since in our case the user will continuously type stuff or delete stuff, like from hello worl to hello world but in our case we have to recreate the whole DFA from scratch, is there a way to incrementally update the existing ones?

@BurntSushi
Copy link

* Seeing the note in the readme about the lack of literal optimizations: I was going to do a cheap trick and scan the pattern for alphanumeric + `_` patterns, then use a separate searcher implementation that would simply use regular substring search.
  Just realized this is not that helpful though because we turn on smart case by default, so this could only really be applied if there's uppercase characters in the pattern. I would love some sort of a literal optimization feature in regex-automata or some sort of wrapper crate though, even if that was disabled by a feature flag by default.

I haven't yet decided how this is going to work at the API level. I don't even know if literal optimizations should be in the domain of something like regex-automata, although, there will certainly have to be some kind of Prefilter mechanism built into the API so that search implementations can make use of it.

With that said, you don't need to examine the pattern yourself. You can extract prefixes using the regex-syntax crate: https://docs.rs/regex-syntax/0.6.25/regex_syntax/hir/literal/struct.Literals.html --- The API isn't great, but I haven't had a chance to revisit it yet. But once you've extracted some prefix literals, you can modify your search routine to use them for a faster search if possible.

Is it possible to have like an incrementally constructed DFA or regex? Since in our case the user will continuously type stuff or delete stuff, like from hello worl to hello world but in our case we have to recreate the whole DFA from scratch, is there a way to incrementally update the existing ones?

Not to my knowledge, although that isn't my area of specialty. It's possible there are some techniques for it, but I don't know them.


regex-automata 0.1 was built to solve a particular use case: embed fully compiled DFAs into programs. In particular, I used them to implement grapheme/word/sentence segmentation in bstr. On the way there, I exposed the lower level state transition APIs and made it possible to load DFAs in a no_std context. Outside of that, you're really stretching what it was intended to be used for. It is tempting to use it here precisely because of its ability to work on streams, but it's probably a mismatch in every other conceivable way.

In my effort to port the various internal regex engines over to regex-automata, my goal there is to expose APIs that are as low level as possible. My hope is that for at least some of those regex engines, it will be possible to execute them on a stream. The Pike VM, for example, seems amenable to this, although I'm not sure. (With the tradeoff being that the Pike VM is slower for searching than a typical DFA.) But it will be months (maybe a year, at my current pace) before this lands.

@BurntSushi
Copy link

One thing that would be useful to me, personally, would be a write-up on how other text editors (emacs, vim) handle their regex searches. Do they roll their own regex engine? Do they make sacrifices like putting a limit on the maximum length of a match? Do they give up search speed in exchange for more flexibility? And so on.

@BurntSushi
Copy link

One thing that would be useful to me, personally, would be a write-up on how other text editors (emacs, vim) handle their regex searches. Do they roll their own regex engine? Do they make sacrifices like putting a limit on the maximum length of a match? Do they give up search speed in exchange for more flexibility? And so on.

I just took a look at the source code of the regex engine for neovim, and these sorts of higher level questions are really hard to answer. After reading regex source for an hour, I don't think I made any headway into answering any of these questions other than the fact that, yes, they do roll their own regex engine.

@archseer
Copy link
Member Author

Kakoune has a custom implementation as well, it's probably a lot more compact than the neovim one: https://github.com/mawww/kakoune/blob/master/src/regex_impl.cc

@archseer
Copy link
Member Author

I'm tempted to merge regex-automata as a stop gap simply for the reverse search for the time being. It seems to perform okayish with simple regrexes and unicode off as long as the queries are kept simple (although \w{30}a absolutely freezes)

@BurntSushi
Copy link

Yeah, \w{30} has about 9,000 states where as (?-u)\w{30} is about 40 states. Very big difference.

The kakoune tip was helpful. It looks like both it and vim use NFA simulations, which means (I think) they give up search speed in favor of more flexibility and (much) faster compile times. For kakoune in particular, it looks like the only interface it requires is an iterator that can move forwards or backwards over the input. (And this isn't quite like a Rust Iterator, since kakoune's regex engine might revisit some parts of the text multiple times. So I think it's more like a cursor.) That API seems doable from inside the Pike VM I think, but there are lots of optimizations inhibits by byte/char-at-a-time processing. :-/ So in that case, y'all might need to handle the prefix literal optimizations yourself.

But this is a long ways off. But I'll definitely be keeping your use case in mind!

@archseer
Copy link
Member Author

No I mean: compiling \w{30} built with .unicode(false) seems to be reasonably fast, but \w{30}a with .unicode(false) I have to kill it after a minute, consuming 10GB+.

If we went down the self-implemented route until rust-automata is better suited for us, what would you recommend? Implementing a tiny Pike VM on top of regex-syntax? Seems a lot of work to get proper Unicode support done though.

@BurntSushi
Copy link

No I mean: compiling \w{30} built with .unicode(false) seems to be reasonably fast, but \w{30}a with .unicode(false) I have to kill it after a minute, consuming 10GB+.

Oh... That's... weird. I can't seem to reproduce that:

$ git clone https://github.com/BurntSushi/regex-automata /tmp/regex-automata
$ cd /tmp/regex-automata
$ cargo build --release --manifest-path regex-automata-debug/Cargo.toml
$ time ./regex-automata-debug/target/release/regex-automata-debug debug '(?-u)\w{30}a' | wc -l
46

real    0.006
user    0.005
sys     0.000
maxmem  8 MB
faults  0

Ah I see... But if I compile it in reverse, it does indeed take forever. And indeed, compiling (?-u)a\w{30} in the forward direction also takes a long time. And looking at the regex, yup, this is one of those cases where DFA compilation is exponential:

$ ./regex-automata-debug/target/release/regex-automata-debug debug '(?-u)a\w{5}' | wc -l
47
$ ./regex-automata-debug/target/release/regex-automata-debug debug '(?-u)a\w{6}' | wc -l
79
$ ./regex-automata-debug/target/release/regex-automata-debug debug '(?-u)a\w{7}' | wc -l
143
$ ./regex-automata-debug/target/release/regex-automata-debug debug '(?-u)a\w{8}' | wc -l
271
$ ./regex-automata-debug/target/release/regex-automata-debug debug '(?-u)a\w{9}' | wc -l
527
$ ./regex-automata-debug/target/release/regex-automata-debug debug '(?-u)a\w{10}' | wc -l
1039
$ ./regex-automata-debug/target/release/regex-automata-debug debug '(?-u)a\w{15}' | wc -l
32783

The issue is the combination of the bounded repeat and an overlapping character preceding it. e.g., If a were something not in \w, then all is fine:

$ ./regex-automata-debug/target/release/regex-automata-debug debug '(?-u)#\w{15}' | wc -l
31

Such is life in the world of fully compiled DFAs. :-/

@archseer
Copy link
Member Author

archseer commented Mar 1, 2022

Going to close this draft since it has gone stale, we're relying on regex for now until regex-automata is ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regex Automata
4 participants