
cli (tjs idea) #32

Draft · wants to merge 1 commit into main

Conversation

Conni2461 (Member)

Assuming you have a file called files, generated with e.g. fd --hidden -I > files:

hyperfine --warmup 3 "cat files | ./build/cli fzf.h" "cat files | fzf --filter fzf.h"

fzf.h is the "prompt term" here.

Single threaded:

Benchmark #1: cat files | ./build/cli fzf.h
  Time (mean ± σ):     125.7 ms ±   6.1 ms    [User: 107.0 ms, System: 32.3 ms]
  Range (min … max):   116.8 ms … 143.7 ms    25 runs
 
Benchmark #2: cat files | fzf --filter fzf.h
  Time (mean ± σ):      89.8 ms ±   8.9 ms    [User: 133.1 ms, System: 31.6 ms]
  Range (min … max):    72.9 ms … 110.1 ms    34 runs
 
Summary
  'cat files | fzf --filter fzf.h' ran
    1.40 ± 0.16 times faster than 'cat files | ./build/cli fzf.h'

Simple multi-threaded attempt (adding to the list/sorting is still missing):

Benchmark #1: cat files | ./build/cli fzf.h
  Time (mean ± σ):     174.4 ms ±  14.0 ms    [User: 279.8 ms, System: 157.0 ms]
  Range (min … max):   151.1 ms … 200.5 ms    16 runs
 
Benchmark #2: cat files | fzf --filter fzf.h
  Time (mean ± σ):      94.8 ms ±   8.0 ms    [User: 136.4 ms, System: 31.1 ms]
  Range (min … max):    81.3 ms … 115.4 ms    27 runs
 
Summary
  'cat files | fzf --filter fzf.h' ran
    1.84 ± 0.21 times faster than 'cat files | ./build/cli fzf.h'
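For reference, a minimal sketch of what such a simple multi-threaded attempt could look like: the input lines are split into contiguous chunks and each worker scores its chunk independently. All names here are made up for illustration (score_line is a trivial stand-in for the real scoring call, not this repo's API), and the missing merge/sort step is left out.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NUM_THREADS 4

/* Trivial stand-in scorer; the real code would call the fzf scoring here. */
static int score_line(const char *line, const char *pattern) {
  return strstr(line, pattern) ? (int)strlen(pattern) : 0;
}

typedef struct {
  char **lines;      /* all input lines */
  int *scores;       /* output: score per line */
  size_t begin, end; /* half-open range this worker owns */
  const char *pattern;
} chunk_t;

static void *score_chunk(void *arg) {
  chunk_t *c = arg;
  for (size_t i = c->begin; i < c->end; i++)
    c->scores[i] = score_line(c->lines[i], c->pattern);
  return NULL;
}

/* Split the lines into NUM_THREADS chunks and score them in parallel. */
static void score_all(char **lines, int *scores, size_t n, const char *pattern) {
  pthread_t threads[NUM_THREADS];
  chunk_t chunks[NUM_THREADS];
  size_t per = (n + NUM_THREADS - 1) / NUM_THREADS;
  for (int t = 0; t < NUM_THREADS; t++) {
    size_t begin = (size_t)t * per;
    chunks[t] = (chunk_t){lines, scores, begin < n ? begin : n,
                          begin + per < n ? begin + per : n, pattern};
    pthread_create(&threads[t], NULL, score_chunk, &chunks[t]);
  }
  for (int t = 0; t < NUM_THREADS; t++)
    pthread_join(threads[t], NULL);
}

int main(void) {
  char *lines[] = {"src/fzf.h", "src/fzf.c", "README.md", "build/cli"};
  int scores[4];
  score_all(lines, scores, 4, "fzf.h");
  for (int i = 0; i < 4; i++)
    printf("%s -> %d\n", lines[i], scores[i]);
  return 0;
}
```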

@Conni2461 (Member, Author)

It turns out that C is pretty fast (that's single threaded):

hyperfine --warmup 3 "./build/cli fzf.h < files" "fzf --filter fzf.h < files"
Benchmark #1: ./build/cli fzf.h < files
  Time (mean ± σ):      62.6 ms ±   1.6 ms    [User: 50.6 ms, System: 11.6 ms]
  Range (min … max):    59.6 ms …  66.8 ms    47 runs
 
Benchmark #2: fzf --filter fzf.h < files
  Time (mean ± σ):      70.2 ms ±   4.2 ms    [User: 117.6 ms, System: 19.7 ms]
  Range (min … max):    63.1 ms …  82.6 ms    43 runs
 
Summary
  './build/cli fzf.h < files' ran
    1.12 ± 0.07 times faster than 'fzf --filter fzf.h < files'

@Conni2461 (Member, Author) commented Aug 9, 2021

Yeah, 65% faster if I read the output correctly 🤣

hyperfine --warmup 3 "./build/cli fzf.h < files" "fzf --filter fzf.h < files"
Benchmark #1: ./build/cli fzf.h < files
  Time (mean ± σ):      45.1 ms ±   1.6 ms    [User: 39.8 ms, System: 4.8 ms]
  Range (min … max):    42.3 ms …  51.8 ms    65 runs
 
Benchmark #2: fzf --filter fzf.h < files
  Time (mean ± σ):      74.5 ms ±   8.9 ms    [User: 120.4 ms, System: 19.3 ms]
  Range (min … max):    64.2 ms … 114.8 ms    42 runs
 
Summary
  './build/cli fzf.h < files' ran
    1.65 ± 0.21 times faster than 'fzf --filter fzf.h < files'

@kkharji (Member) commented Aug 9, 2021

Yeah, 21% if I read the output correctly 🤣


@dedikodu-gossip

POG!!

@fredizzimo (Contributor)

Even if it's 65% faster, it might still be worth trying to improve.

One fairly simple change would be to switch the scoring calculation from int16_t to float. For single operations floats are very comparable in speed to int16_t: sometimes a bit faster, sometimes a bit slower, depending on the operation and the processor.

The difference is that the compiler can do much more optimization for floats using SIMD extensions like SSE and AVX with the right compiler options, although there is some integer support in modern SIMD instruction sets.

Additionally, under strict aliasing the floating point type is distinct from the char and int types of the other arrays, so the compiler doesn't always need to assume aliasing is going on, which would force values to be re-read from memory instead of using cached results in registers. Of course there is still a lot of potential aliasing, and I'm not sure how well the compiler can analyse what's going on, so some manual annotation might improve things further.

Using floats will of course cause a bit more memory pressure and therefore potentially more cache misses, which are slow. But the slab is re-used, and unless you match really long strings, everything fits easily into the cache, so that should not be an issue.

But of course there could be other things going on that would make it slower with floats instead of faster, so benchmarking is needed. For example, it's possible that the integer version already uses whatever SIMD extensions are applicable, or that the opportunity for optimization does not exist the way the code is currently written. But since the algorithm is essentially doing matrix calculations, it might still be possible to rewrite it to take proper advantage of SIMD.
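To make the aliasing point concrete, here is a minimal sketch of the kind of loop that benefits. This is not the actual fzf-native scoring code; score_row and its inputs are made up for illustration. With float cells and restrict-qualified rows the compiler knows the arrays cannot overlap, so it can keep values in registers and auto-vectorize (e.g. gcc -O3 -march=native -fopt-info-vec):

```c
#include <stdio.h>

/* Hypothetical inner loop: each cell is the max of extending the running
 * match vs. starting fresh at a bonus position. No loop-carried
 * dependency and no possible aliasing, so the loop vectorizes cleanly. */
static void score_row(float *restrict cur, const float *restrict prev,
                      const float *restrict bonus, int len) {
  for (int i = 0; i < len; i++) {
    float extend = prev[i] + 1.0f; /* extend the running match */
    float start = bonus[i];        /* or start fresh at a bonus position */
    cur[i] = extend > start ? extend : start;
  }
}

int main(void) {
  float prev[8] = {0}, cur[8];
  float bonus[8] = {2, 0, 0, 1, 0, 0, 1, 0};
  score_row(cur, prev, bonus, 8);
  for (int i = 0; i < 8; i++)
    printf("%.1f ", cur[i]);
  putchar('\n');
  return 0;
}
```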

@Conni2461 (Member, Author)

All of this sounds good. I think you are way more qualified to do this than I am. 😆

But we are not done making telescope faster. The idea here was to have a pipe that lets us move the score calculation off the neovim thread. Another idea, which I talked about yesterday with tj, is that if we bundle fzf-native in telescope core and require people to run make on update, we could make use of that required build step: rather than just putting this submodule in telescope, we could slowly move some core telescope components to C and write bindings for them.

For example, right now we do the actual sorting in Lua with a linked list that basically only correctly sorts the number of items that will be displayed (~10 for dropdown, ~50 for normal). That's fine, but it makes scrolling through all results a little bit harder, because we then need to look through the whole linked list again if we are looking for the best item at index n + 1, etc.

What I was thinking is keeping the store in C as a (score, index) tuple (afaik we can't store the actual table reference in C, but I think we can store the index of the table ref that lies at i in the results table). We could then build a heap in C and maybe sort the first 500-1000 correctly without a performance penalty, because we have seen a factor 10 performance improvement between the C part and Lua (fzy-native vs the fzy Lua version).

I should mention that I am not sure a heap is optimal here; I just know it ended up being faster than the linked list approach I tried first, and I had built a heap before, so I tried that. It brought the time down from 116.8 ms to 45.1 ms.

See nvim-telescope/telescope.nvim#988
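For illustration, a minimal sketch of the (score, index) max-heap idea, assuming the heap only ever stores the score plus the index into the Lua results table. The names and layout here are made up, not the PR's actual implementation:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* The heap never sees the Lua table itself, only a score and the index
 * of the corresponding entry in the results table. */
typedef struct {
  int32_t score;
  int32_t index;
} heap_item_t;

typedef struct {
  heap_item_t *items;
  size_t len, cap;
} max_heap_t;

static void heap_push(max_heap_t *h, heap_item_t item) {
  if (h->len == h->cap) { /* grow; error handling omitted */
    h->cap = h->cap ? h->cap * 2 : 64;
    h->items = realloc(h->items, h->cap * sizeof *h->items);
  }
  size_t i = h->len++;
  h->items[i] = item;
  while (i > 0) { /* sift up until the parent is not smaller */
    size_t parent = (i - 1) / 2;
    if (h->items[parent].score >= h->items[i].score)
      break;
    heap_item_t tmp = h->items[parent];
    h->items[parent] = h->items[i];
    h->items[i] = tmp;
    i = parent;
  }
}

static heap_item_t heap_pop(max_heap_t *h) { /* caller ensures len > 0 */
  heap_item_t top = h->items[0];
  h->items[0] = h->items[--h->len];
  size_t i = 0;
  for (;;) { /* sift down toward the larger child */
    size_t l = 2 * i + 1, r = l + 1, best = i;
    if (l < h->len && h->items[l].score > h->items[best].score) best = l;
    if (r < h->len && h->items[r].score > h->items[best].score) best = r;
    if (best == i)
      break;
    heap_item_t tmp = h->items[best];
    h->items[best] = h->items[i];
    h->items[i] = tmp;
    i = best;
  }
  return top;
}

int main(void) {
  max_heap_t h = {0};
  heap_push(&h, (heap_item_t){40, 0});
  heap_push(&h, (heap_item_t){90, 1});
  heap_push(&h, (heap_item_t){70, 2});
  /* pop as many best-scored indexes as will be displayed */
  while (h.len > 0) {
    heap_item_t best = heap_pop(&h);
    printf("index %d (score %d)\n", (int)best.index, (int)best.score);
  }
  free(h.items);
  return 0;
}
```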

@matu3ba (Contributor) commented Sep 8, 2021

"I just need to mention it. I am not sure that a heap is optimal here"

Optimal is a preallocated heap with offsets and length to objects xor padded objects. The latter looks slower though, since line length can differ alot. The first could be tried, but would be wasteful. We dont know the count of hits, so any estimation might be wrong.

Second-optimal are arena/region-allocators, where you just bump the capacity (pointer) to add more stuff inside instead having the malloc overhead on every call. If the memory page is full, the arena allocator takes the next one.

The index stores the offset to text chunks. The control structure stores pointer to the indexes.

So something like

control_block
 |    |
 |    |->index0:  | offset0 | offset1 | ... |
 |------>index1:  | offset0 | offset1 | ... |

and offset0 -----> text chunk0  (assume this is a contiguous memory chunk from the region allocator)
    offset1 -----> text chunk1
            ...

Then the problem of cache locality boils down to looking up the text defined by the offsets. I am not familiar with what is stored and how the sorting should work, so I can't tell how the control_block would look exactly.

Note that this looks like a fundamental redesign, so it should be done later and not in this PR, to limit the scope.

I'm not sure if adding an arena/region allocator is worth it, though. https://github.com/cgaebel/arena_alloc looks good enough for that.
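For reference, the bump-allocation idea in a minimal sketch. This is a made-up page-chain design, not the cgaebel/arena_alloc API, and alignment handling is omitted:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARENA_PAGE_SIZE (64 * 1024)

/* Pages are chained; allocation just bumps an offset, so malloc is only
 * paid once per page. A real version would round `used` up to the
 * required alignment before handing out the pointer. */
typedef struct arena_page {
  struct arena_page *next;
  size_t used;
  char data[ARENA_PAGE_SIZE];
} arena_page_t;

typedef struct {
  arena_page_t *head;
} arena_t;

static void *arena_alloc(arena_t *a, size_t size) {
  if (size > ARENA_PAGE_SIZE)
    return NULL; /* oversized requests are out of scope here */
  if (!a->head || a->head->used + size > ARENA_PAGE_SIZE) {
    arena_page_t *page = malloc(sizeof *page); /* take the next page */
    if (!page)
      return NULL;
    page->next = a->head;
    page->used = 0;
    a->head = page;
  }
  void *p = a->head->data + a->head->used;
  a->head->used += size;
  return p;
}

static void arena_free_all(arena_t *a) {
  while (a->head) {
    arena_page_t *next = a->head->next;
    free(a->head);
    a->head = next;
  }
}

int main(void) {
  arena_t a = {0};
  char *line = arena_alloc(&a, 6);
  memcpy(line, "fzf.h", 6); /* copy a text chunk into the arena */
  puts(line);
  arena_free_all(&a);
  return 0;
}
```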

@Conni2461 (Member, Author) commented Sep 8, 2021

I was talking about the heap data structure, like I implemented here in this PR (a max-heap, to be more specific). I wasn't talking about allocation. fzf-native actually only calculates the score; telescope sorts it in a data structure (currently a linked list, and we only sort the first n elements, the displayed ones). I was thinking maybe we could improve that and other core elements in telescope. That's all, but it doesn't affect this repository either way.

This PR is just me playing around with an idea that tj mentioned. I am not trying to get this merged anytime soon (I could just merge it), but I haven't figured out what I want to do with it.

Still thanks for your comments :)
