Use a trie to speed up index construction #887

Draft: wants to merge 1 commit into main


lapp0
Contributor

@lapp0 lapp0 commented May 10, 2024

Fixes #795

Draft: Awaiting #930 so we can use token transition key sequences in the trie.

Problem

In regex.py, Outlines compiles an index of legal tokens for each state of the FSM (state_scan_tokens).

On the main branch we use this naive approach:

  • for each state S_n in the FSM
    • for each token in vocabulary
      • simulate FSM traversal character by character starting from S_n; if the traversal succeeds, add the token to S_n

Calling _walk_fsm once for every (state, token) pair, i.e. num_states * num_tokens times, is inefficient and is the current bottleneck in index construction.
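For reference, the naive scan described above can be sketched as follows. This is a toy illustration, not the code in regex.py: the `(state, char) -> next_state` transition table, `walk_fsm`, and `naive_index` are hypothetical names chosen for the example, and the tiny FSM stands in for the regex `ab*`.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy representation: (state, character) -> next state.
Transitions = Dict[Tuple[int, str], int]


def walk_fsm(transitions: Transitions, start_state: int, token: str) -> Optional[int]:
    """Walk the FSM one character at a time; return the end state or None on failure."""
    state = start_state
    for ch in token:
        next_state = transitions.get((state, ch))
        if next_state is None:
            return None  # no legal transition for this character
        state = next_state
    return state


def naive_index(
    transitions: Transitions, states: List[int], vocabulary: Dict[str, int]
) -> Dict[int, Dict[int, int]]:
    """For every state, try every token: num_states * num_tokens full walks."""
    index: Dict[int, Dict[int, int]] = {}
    for state in states:
        for token, token_id in vocabulary.items():
            end_state = walk_fsm(transitions, state, token)
            if end_state is not None:
                index.setdefault(state, {})[token_id] = end_state
    return index


# Toy FSM for `ab*`: state 0 --a--> state 1, state 1 --b--> state 1.
transitions = {(0, "a"): 1, (1, "b"): 1}
vocabulary = {"a": 0, "b": 1, "ab": 2, "bb": 3, "ba": 4}
naive_result = naive_index(transitions, [0, 1], vocabulary)
print(naive_result)
```

Every token is re-walked from scratch for every state, even when many tokens share the same prefix, which is the redundancy this PR targets.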

Solution

We can improve this by storing the vocabulary in a trie, so that tokens sharing a prefix share the FSM walk for that prefix (implementation details to come).

Edit: thanks to the new token -> transition key sequence index (#904), preliminary benchmarks show the tries are substantially smaller and faster 🏎️
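Since the implementation details are not in this description yet, here is a minimal sketch of the general idea only. It assumes a character-keyed trie and a toy `(state, char) -> next_state` table, whereas the PR intends to key the trie on token transition key sequences; `TrieNode`, `build_trie`, `scan_state`, and `trie_index` are hypothetical names for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy representation: (state, character) -> next state.
Transitions = Dict[Tuple[int, str], int]


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.token_ids: List[int] = []  # tokens that end exactly at this node


def build_trie(vocabulary: Dict[str, int]) -> TrieNode:
    """Insert every token once; shared prefixes share trie nodes."""
    root = TrieNode()
    for token, token_id in vocabulary.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_ids.append(token_id)
    return root


def scan_state(transitions: Transitions, start_state: int, root: TrieNode) -> Dict[int, int]:
    """Walk the trie and the FSM together, pruning subtrees the FSM rejects."""
    found: Dict[int, int] = {}
    stack = [(root, start_state)]
    while stack:
        node, state = stack.pop()
        for ch, child in node.children.items():
            next_state = transitions.get((state, ch))
            if next_state is None:
                continue  # every token under this prefix is illegal; skip them all
            for token_id in child.token_ids:
                found[token_id] = next_state
            stack.append((child, next_state))
    return found


def trie_index(
    transitions: Transitions, states: List[int], vocabulary: Dict[str, int]
) -> Dict[int, Dict[int, int]]:
    root = build_trie(vocabulary)  # built once, reused for every state
    index: Dict[int, Dict[int, int]] = {}
    for state in states:
        found = scan_state(transitions, state, root)
        if found:
            index[state] = found
    return index


# Toy FSM for `ab*`: state 0 --a--> state 1, state 1 --b--> state 1.
transitions = {(0, "a"): 1, (1, "b"): 1}
vocabulary = {"a": 0, "b": 1, "ab": 2, "bb": 3, "ba": 4}
trie_result = trie_index(transitions, [0, 1], vocabulary)
print(trie_result)
```

Each shared prefix is walked once per state, and a rejected prefix eliminates its whole subtree in a single step, instead of failing separately for every token that starts with it.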

TODO:

@rlouf
Member

rlouf commented May 10, 2024

Could you add a high-level description of what the PR does so the PR is self-contained? Is there an issue we could link to this PR? Also there is no need to add "WIP" to the title, this is what "Draft PR" means :)

@rlouf changed the title from "WIP: Vocab Trie To Speed Up regex.py" to "Use a trie to speed up index construction" on May 10, 2024
@lapp0
Contributor Author

lapp0 commented May 15, 2024

I believe I've found a bug in regex.py's reduced_vocabulary().

For token 188 in the gpt2 tokenizer ('\x00'), token_tuple_np is empty (array([''], dtype='<U2')); however, it isn't added to empty_token_ids.

Edit: appears it's being addressed in #904

@lapp0
Contributor Author

lapp0 commented May 31, 2024

Seeing great results with this so far!

state_scan_tokens (cumulative time, in seconds)

  • before: 10.391
  • after: 1.925

state_scan_tokens is now nearly on par with interegular's to_fsm (1.763 s), so to_fsm is close to becoming the largest share of the index compilation time.

Full results

trie:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.010    0.010    5.326    5.326 /home/andrew/p/outlines/profile_null_byte_fix.py:24(profile_email_guide)
        1    0.000    0.000    5.317    5.317 /home/andrew/p/outlines/outlines/fsm/guide.py:140(__init__)
        1    0.001    0.001    5.317    5.317 /home/andrew/p/outlines/outlines/caching.py:113(wrapper)
        1    0.001    0.001    5.316    5.316 /home/andrew/p/outlines/outlines/fsm/guide.py:108(create_states_mapping)
        1    0.032    0.032    3.369    3.369 /home/andrew/p/outlines/outlines/fsm/regex.py:885(create_fsm_index_tokenizer)
        1    0.518    0.518    3.254    3.254 /home/andrew/p/outlines/outlines/fsm/regex.py:732(create_fsm_index_end_to_end)
      389    1.924    0.005    1.925    0.005 /home/andrew/p/outlines/outlines/fsm/regex.py:647(state_scan_tokens)
     10/1    0.001    0.000    1.763    1.763 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:447(to_fsm)
      523    0.268    0.001    1.567    0.003 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:969(crawl)
    128/2    0.001    0.000    1.545    0.772 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:453(<genexpr>)
    118/1    0.003    0.000    1.545    1.545 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:370(to_fsm)
     13/1    0.000    0.000    1.398    1.398 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:280(to_fsm)
      131    0.002    0.000    0.868    0.007 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:310(concatenate)
       10    0.000    0.000    0.600    0.060 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:451(union)
       10    0.000    0.000    0.600    0.060 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:913(parallel)
    67256    0.585    0.000    0.585    0.000 {method 'index' of 'list' objects}
        1    0.000    0.000    0.465    0.465 /nix/store/sc2wsadi9mk4kq5r1h6gvi8z2r9c1cpq-python3.11-numba-0.59.1/lib/python3.11/site-packages/numba/experimental/jitclass/base.py:119(__call__)
        1    0.465    0.465    0.465    0.465 <string>:2(ctor)
      269    0.018    0.000    0.291    0.001 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:112(union)
   114135    0.267    0.000    0.267    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:925(follow)
       13    0.000    0.000    0.224    0.017 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:364(__add__)
   198170    0.202    0.000    0.223    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:347(follow)
      930    0.211    0.000    0.211    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:93(__init__)
      125    0.000    0.000    0.182    0.001 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:374(star)
        1    0.000    0.000    0.150    0.150 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:249(reduce)
        2    0.007    0.003    0.149    0.075 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:558(reversed)
  1153581    0.135    0.000    0.135    0.000 {method 'add' of 'set' objects}
        1    0.122    0.122    0.122    0.122 /home/andrew/p/outlines/outlines/fsm/regex.py:716(get_all_token_transitions)
    24185    0.086    0.000    0.118    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:580(follow)
   743602    0.077    0.000    0.077    0.000 {method 'setdefault' of 'dict' objects}
    65660    0.057    0.000    0.059    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:384(follow)
      269    0.011    0.000    0.055    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:114(<dictcomp>)
      255    0.001    0.000    0.044    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:409(times)
    57272    0.017    0.000    0.044    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:114(<genexpr>)
    801/1    0.002    0.000    0.040    0.040 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:69(get_alphabet)
     10/1    0.000    0.000    0.040    0.040 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:423(_get_alphabet)
    128/2    0.000    0.000    0.040    0.020 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:425(<genexpr>)
    118/1    0.001    0.000    0.040    0.040 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:330(_get_alphabet)
    787/2    0.000    0.000    0.040    0.020 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:331(<genexpr>)
     13/1    0.000    0.000    0.040    0.040 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:270(_get_alphabet)
        1    0.040    0.040    0.040    0.040 /home/andrew/p/outlines/outlines/fsm/regex.py:911(<dictcomp>)
       19    0.000    0.000    0.032    0.002 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:445(__mul__)
        1    0.000    0.000    0.030    0.030 /home/andrew/p/outlines/outlines/models/transformers.py:113(__hash__)
        1    0.000    0.000    0.030    0.030 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/datasets/fingerprint.py:226(hash)
        1    0.000    0.000    0.030    0.030 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/datasets/utils/_dill.py:106(dumps)

main:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.010    0.010   13.219   13.219 /home/andrew/p/outlines/profile_null_byte_fix.py:23(profile_email_guide)
        1    0.000    0.000   13.208   13.208 /home/andrew/p/outlines/outlines/fsm/guide.py:140(__init__)
        1    0.001    0.001   13.208   13.208 /home/andrew/p/outlines/outlines/caching.py:113(wrapper)
        1    0.000    0.000   13.208   13.208 /home/andrew/p/outlines/outlines/fsm/guide.py:108(create_states_mapping)
        1    0.016    0.016   11.280   11.280 /home/andrew/p/outlines/outlines/fsm/regex.py:829(create_fsm_index_tokenizer)
        1    0.545    0.545   11.185   11.185 /home/andrew/p/outlines/outlines/fsm/regex.py:684(create_fsm_index_end_to_end)
      389   10.389    0.027   10.391    0.027 /home/andrew/p/outlines/outlines/fsm/regex.py:651(state_scan_tokens)
     10/1    0.001    0.000    1.740    1.740 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:447(to_fsm)
      523    0.275    0.001    1.594    0.003 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:969(crawl)
    128/2    0.001    0.000    1.520    0.760 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:453(<genexpr>)
    118/1    0.003    0.000    1.519    1.519 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:370(to_fsm)
     13/1    0.000    0.000    1.371    1.371 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:280(to_fsm)
      131    0.002    0.000    0.836    0.006 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:310(concatenate)
       10    0.000    0.000    0.599    0.060 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:451(union)
       10    0.000    0.000    0.599    0.060 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:913(parallel)
    67256    0.581    0.000    0.581    0.000 {method 'index' of 'list' objects}
   114135    0.273    0.000    0.273    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:925(follow)
      269    0.174    0.001    0.244    0.001 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:112(union)
       13    0.000    0.000    0.242    0.019 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:364(__add__)
   198170    0.211    0.000    0.232    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:347(follow)
      125    0.000    0.000    0.190    0.002 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:374(star)
        1    0.000    0.000    0.152    0.152 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:249(reduce)
        2    0.005    0.003    0.152    0.076 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:558(reversed)
  1154007    0.136    0.000    0.136    0.000 {method 'add' of 'set' objects}
    24185    0.089    0.000    0.121    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:580(follow)
   743602    0.075    0.000    0.075    0.000 {method 'setdefault' of 'dict' objects}
    65660    0.061    0.000    0.063    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:384(follow)
      269    0.011    0.000    0.056    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:114(<dictcomp>)
        1    0.047    0.047    0.047    0.047 /home/andrew/p/outlines/outlines/fsm/regex.py:855(<dictcomp>)
    57272    0.018    0.000    0.045    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:114(<genexpr>)
      255    0.000    0.000    0.045    0.000 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/fsm.py:409(times)
    801/1    0.002    0.000    0.043    0.043 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:69(get_alphabet)
     10/1    0.000    0.000    0.043    0.043 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:423(_get_alphabet)
    128/2    0.000    0.000    0.043    0.021 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:425(<genexpr>)
    118/1    0.001    0.000    0.043    0.043 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:330(_get_alphabet)
    787/2    0.000    0.000    0.042    0.021 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:331(<genexpr>)
     13/1    0.000    0.000    0.042    0.042 /home/andrew/p/outlines/.myenv2/lib/python3.11/site-packages/interegular/patterns.py:270(_get_alphabet)
      359    0.003    0.000    0.034    0.000 /nix/store/qd7h3vn2bff6jjigdvq0xh91q49sm1ng-python3.11-tqdm-4.66.4/lib/python3.11/site-packages/tqdm/std.py:1198(update)
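
Tables in this format are what Python's cProfile prints when the stats are sorted by cumulative time. The profiling script itself (profile_null_byte_fix.py) is not shown here, so the following is only a sketch of how a comparable report could be produced; `build_index` is a hypothetical stand-in for the real workload.

```python
import cProfile
import io
import pstats


def build_index() -> int:
    # Hypothetical stand-in workload; the real script profiles guide construction.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
build_index()
profiler.disable()

# Sort by cumulative time and keep the top entries, as in the tables above.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The `tottime` column excludes time spent in callees while `cumtime` includes it, which is why state_scan_tokens shows nearly identical values in both columns.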

Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Accelerate the index construction process
2 participants