Use a trie to speed up index construction #887
Could you add a high-level description of what the PR does so the PR is self-contained? Is there an issue we could link to this PR? Also, there is no need to add "WIP" to the title; this is what "Draft PR" means :)
regex.py
I believe I've found a bug in […]. For the token […]. Edit: it appears this is being addressed in #904.
Seeing great results with this so far!
Pretty close to Full results
Fixes #795
Draft: Awaiting #930 so we can use token transition key sequences in the trie.
Problem
In `regex.py`, Outlines compiles an index of legal tokens for each state of the FSM (`state_scan_tokens`).
On the `main` branch we use a naive approach: `_walk_fsm` is called `num_tokens * num_states_per_token` times, which is inefficient and the current bottleneck in index construction.
Solution
We can improve this by using a Trie (implementation details to come)
Edit: thanks to the new token -> transition key seq index (#904) preliminary benchmarks show tries are substantially smaller and faster 🏎️
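To make the idea concrete, here is a minimal sketch of the difference between the two strategies. It is not the Outlines implementation: the toy DFA, the vocabulary, and every function name below are hypothetical, and it steps over characters rather than the token transition key sequences from #904. The naive version re-walks the FSM for every (state, token) pair; the trie version walks the trie and the FSM together, so a shared prefix is stepped once per start state and a dead prefix prunes its whole subtree.

```python
from collections import defaultdict

# Hypothetical toy DFA for the regex "ab*": TRANSITIONS[state][char] -> next state.
TRANSITIONS = {0: {"a": 1}, 1: {"b": 1}}
# Hypothetical vocabulary; real tokenizers have tens of thousands of tokens.
VOCAB = ["a", "ab", "abb", "b", "bb", "c"]


def walk(state, token):
    """Consume `token` from `state`; return the end state, or None if rejected."""
    for ch in token:
        state = TRANSITIONS.get(state, {}).get(ch)
        if state is None:
            return None
    return state


def naive_index(states):
    """Naive approach: one full walk per (state, token) pair."""
    index = defaultdict(dict)
    for s in states:
        for tok in VOCAB:
            end = walk(s, tok)
            if end is not None:
                index[s][tok] = end
    return dict(index)


def build_trie(vocab):
    """Nested-dict trie; the key None marks the end of a complete token."""
    root = {}
    for tok in vocab:
        node = root
        for ch in tok:
            node = node.setdefault(ch, {})
        node[None] = tok
    return root


def trie_index(states, trie):
    """Walk trie and DFA in lockstep so each shared prefix is stepped once per state."""
    index = defaultdict(dict)
    for s in states:
        stack = [(trie, s)]
        while stack:
            node, state = stack.pop()
            for key, child in node.items():
                if key is None:
                    # `child` is the token string that ends at this trie node.
                    index[s][child] = state
                    continue
                nxt = TRANSITIONS.get(state, {}).get(key)
                if nxt is not None:  # dead prefix: skip the entire subtree
                    stack.append((child, nxt))
    return dict(index)
```

Both functions should produce the same mapping from start state to `{token: end_state}`, which is also the property the TODO item about comparing indexes is meant to check.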
TODO:
- […] `main` once "Fix null byte `\x00` issue in byte level fsm resulting in `KeyError` in `BetterFSM::FSMInfo`" (#930) is merged
- Create an FSM index with `main`, pickle it, then create an FSM index with this branch and compare results.