Complexity of codegen algorithm #132
I've created a fork for profiling at https://github.com/derekdreery/rust-phf/tree/prof_work/phf_bench. I've been using it to generate callgrind files, and KCacheGrind to view them, but I'm not really expert enough at reading the output to work out what it means. Can anyone help? Here is the output:
The implementation is based on this paper: http://cmph.sourceforge.net/papers/esa09.pdf
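For context while reading the discussion below, here is a minimal sketch of the paper's CHD-style construction as I understand it: hash every key to a `(g, f1, f2)` triple, bucket by `g`, then search for per-bucket displacements. The displacement formula and the ~5-keys-per-bucket sizing are illustrative assumptions, not phf's exact code.

```rust
// Sketch of a CHD-style displacement search, assuming each key has already
// been hashed to a (g, f1, f2) triple by some seeded hash function.
fn try_generate(hashes: &[(u32, u32, u32)], table_len: usize) -> Option<Vec<(u32, u32)>> {
    let buckets_len = hashes.len() / 5 + 1;

    // Group keys into buckets by their first hash value.
    let mut buckets: Vec<Vec<usize>> = vec![Vec::new(); buckets_len];
    for (i, &(g, _, _)) in hashes.iter().enumerate() {
        buckets[g as usize % buckets_len].push(i);
    }

    // Place the largest buckets first: they are the hardest to fit.
    let mut order: Vec<usize> = (0..buckets_len).collect();
    order.sort_by_key(|&b| std::cmp::Reverse(buckets[b].len()));

    let mut occupied = vec![false; table_len];
    let mut disps = vec![(0u32, 0u32); buckets_len];

    for &b in &order {
        let mut found = None;
        'search: for d1 in 0..table_len as u32 {
            for d2 in 0..table_len as u32 {
                if let Some(slots) = place(&buckets[b], hashes, &occupied, table_len, d1, d2) {
                    found = Some((d1, d2, slots));
                    break 'search;
                }
            }
        }
        // If no displacement pair works, the whole attempt fails and the
        // caller retries with a different hash seed. This nested loop is
        // where the runtime blows up when attempts-per-bucket grows.
        let (d1, d2, slots) = found?;
        disps[b] = (d1, d2);
        for s in slots {
            occupied[s] = true;
        }
    }
    Some(disps)
}

// Try one (d1, d2) pair for a bucket; None means a collision.
fn place(
    bucket: &[usize],
    hashes: &[(u32, u32, u32)],
    occupied: &[bool],
    table_len: usize,
    d1: u32,
    d2: u32,
) -> Option<Vec<usize>> {
    let mut slots = Vec::with_capacity(bucket.len());
    for &i in bucket {
        let (_, f1, f2) = hashes[i];
        let slot = (d2.wrapping_add(f1.wrapping_mul(d1)).wrapping_add(f2) as usize) % table_len;
        if occupied[slot] || slots.contains(&slot) {
            return None;
        }
        slots.push(slot);
    }
    Some(slots)
}
```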
That's strange - my experiments suggest that if the construction is O(n^k) then 2 <= k <= 3, but the paper says k = 1. The time definitely does not increase linearly with size as things currently stand.
Ok, so I'm looking at branches, and certain branches are hit a huge number of times - there's one that executes 600_000_000 times in a run of 100_000 keys. I'm pretty sure that's suspect from the linearity point of view.
It could very well be that we don't correctly implement the algorithm. I don't think anyone's really used this crate for maps larger than ~100k elements before.
Sounds like a contribution opportunity :D
I've done some more analysis, and I think the problem comes from the number of attempts it takes to find the displacements. Some results:
Here, attempts is the average across all the bins. So there's some wide randomness in the time. My first thought is that it might be caused by the hashing function not being random enough, but I don't really know - I'm not an expert in this at all.
If anyone wants to recreate these experiments, I have a fork at https://github.com/derekdreery/rust-phf/tree/prof_work. The data comes from a 60MB file of known compromised passwords, so it's real-world data. I probably would benefit from input from a real mathematician - I'm not sure I have the skills to do the analysis here, although I'm happy to keep experimenting if that's useful.
I have found that performance is heavily dependent on the quality of the hash function. I tried switching to FNV a while ago and even small key sets failed to solve in a reasonable amount of time.
It uses the RNG to produce uniformly distributed seeds for attempts to find a perfect hash function. We keep the RNG locked to a fixed seed so the generated code doesn't change every time you recompile.
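Concretely, the loop being described might look something like this sketch (`FIXED_SEED` and `solve_with_sip_keys` are hypothetical names, not phf's actual items):

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

// Illustrative constant: any fixed value keeps codegen deterministic.
const FIXED_SEED: u64 = 1234567890;

fn generate(keys: &[&str]) -> Vec<(u32, u32)> {
    // A fixed RNG seed makes the generated code identical across recompiles.
    let mut rng = StdRng::seed_from_u64(FIXED_SEED);
    loop {
        // Uniformly distributed SipHash keys for this attempt...
        let (k0, k1) = (rng.gen::<u64>(), rng.gen::<u64>());
        // ...and if the displacement search fails, draw new ones and retry.
        if let Some(disps) = solve_with_sip_keys(keys, k0, k1) {
            return disps;
        }
    }
}

// Hypothetical solver: hash every key with (k0, k1) into (g, f1, f2)
// triples and run the displacement search sketched earlier in the thread.
fn solve_with_sip_keys(_keys: &[&str], _k0: u64, _k1: u64) -> Option<Vec<(u32, u32)>> {
    unimplemented!()
}
```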
I see, so you just use it to get a single value to seed the hashing function, and if that fails you get a new seed and try again. Thanks for the clarification.
So to summarize where I am, the question is: is there a way for us to make life easier for the displacement-finding loop? In their experiments, the paper's authors report 3 seconds to create a map for 1_000_000 keys, which is far better than my 20min attempt, so I'm sure it's possible.
IIRC the paper calls for three separate hash functions to compute the values that are then folded in with the displacements, but we currently just split the one u64 we get out of the hasher into three chunks of bits. Could that possibly be the source of the issue?
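Something like the following sketch, where the 21-bit widths are an assumption for illustration:

```rust
// Carve one 64-bit SipHash output into three narrow values, instead of
// computing three independent hashes as the paper calls for.
fn split(hash: u64) -> (u32, u32, u32) {
    const BITS: u32 = 21; // roughly 64 / 3 bits per value (illustrative)
    const MASK: u64 = (1 << BITS) - 1;
    let g = (hash & MASK) as u32;
    let f1 = ((hash >> BITS) & MASK) as u32;
    let f2 = ((hash >> (2 * BITS)) & MASK) as u32;
    (g, f1, f2)
}
```

If all three values are carved from one 64-bit output like this, they are both narrow and potentially correlated, which would give the displacement search many more collisions to fight through.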
I'll try that and see; if not, I might download the code they used and compare it to this implementation.
Wouldn't it be possible to use the same hash function 3 times by using 3 different initialization vectors?
It would be, but then the new hash function would have to be at least 3 times faster than SipHash 1-3 for it to be worth it.
Not necessarily, if it reduces degenerate cases like this.
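A sketch of the three-initialization-vectors idea using the siphasher crate (the key constants here are arbitrary illustrations); the trade-off discussed above is that this hashes every key three times:

```rust
use siphasher::sip::SipHasher13;
use std::hash::{Hash, Hasher};

// Run SipHash 1-3 three times with three distinct fixed key pairs.
fn hash3(key: &str, seeds: [(u64, u64); 3]) -> [u64; 3] {
    seeds.map(|(k0, k1)| {
        let mut hasher = SipHasher13::new_with_keys(k0, k1);
        key.hash(&mut hasher);
        hasher.finish()
    })
}

fn main() {
    // Arbitrary example seeds: three independent hash values per key.
    let [g, f1, f2] = hash3("hunter2", [(0, 1), (2, 3), (4, 5)]);
    println!("{g:x} {f1:x} {f2:x}");
}
```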
Comparing the performance between these two commits:
In the former, both SipHasher and FNV produce a solution in ~1.5s for 500,000 keys. For 1M, FNV solves in ~4s and SipHasher in ~2s. The only things I changed are that it now generates 3 hashes instead of 1 and takes the top 32 bits of each result (which this page suggests carry more entropy than the low bits). For both, I am running the same benchmark. However, FNV still fails to solve in some runs.
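The "top 32 bits" part presumably amounts to something like this one-liner (my reading of the description above, not the actual diff):

```rust
// Keep the high half of a 64-bit hash; the page linked above suggests the
// high bits are better mixed than the low bits.
fn top32(hash: u64) -> u32 {
    (hash >> 32) as u32
}
```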
I'm also thinking: to add more entropy while remaining deterministic, why not seed the RNG with a hash of the input?
We don't really care about having entropy here, though. The quality of the output of the RNG is AFAIK identical regardless of how you computed the seed.
I was just thinking that some variety in the RNG seed might reduce the chance of encountering a degenerate case.
I was thinking that expanding the hashes to the full 64 bits might improve the solution, but it actually slows things down compared to abonander@c288801, so that result is probably near-optimal without a deeper dive.
Leaving this open until we're sure #164 resolved the issue.
I've had a go with my password checker and can generate code for all 14_000_000 passwords! If you want to try it out: https://github.com/derekdreery/common-passwords. Note that I've put a limit of 1_000_000 passwords in the build.rs file to make it quicker, but it does build (very slowly) with all 14_000_000. With 1_000_000 passwords interned, the wasm size is 17M and lookups are ~instantaneous. :)
That's excellent! I've done some further investigation into speeding up the algorithm, but unfortunately it's not as trivial as I thought: #162 (comment)
Closing as the issue itself has been fixed, I believe.
I've got a list of 14_000_000 passwords that I want to create a hash from. I'm finding that up to say 500_000 it completes more or less instantly, but at 1_000_000 it takes a long time (20mins+). I would guess anecdotally that the `build` step is O(n^2) or thereabouts. Has anyone else done perf work on the generation algorithm?

Some timings:

- 100_000 - 1 sec
- 500_000 - 40 sec
- 1_000_000 - abort after 20 mins

Rough complexity: using the first 2 results gives time = k * size^n where n ≈ 2.3 and k ≈ 3e-12, so for my 14_000_000 passwords I'm looking at about 24 hours :'( (modulo horrendous assumptions)
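For anyone wanting to check that extrapolation, here is the arithmetic as a tiny Rust snippet (same horrendous assumption that time = k * size^n throughout):

```rust
// Fit time = k * size^n to the two completed timings
// (100_000 keys -> 1 sec, 500_000 keys -> 40 sec), then extrapolate.
fn main() {
    let n = 40f64.ln() / 5f64.ln();   // log(time ratio) / log(size ratio) ≈ 2.29
    let k = 1.0 / 100_000f64.powf(n); // ≈ 3.5e-12
    let hours = k * 14_000_000f64.powf(n) / 3600.0;
    println!("n ≈ {n:.2}, k ≈ {k:.1e}, 14M keys ≈ {hours:.0} hours");
}
```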