
Complexity of codegen algorithm #132

Closed
derekdreery opened this issue Nov 23, 2018 · 27 comments


@derekdreery

derekdreery commented Nov 23, 2018

I've got a list of 14_000_000 passwords that I want to create a hash from. I'm finding that up to around 500_000 keys it completes more or less instantly, but at 1_000_000 it takes a long time (20+ minutes). Anecdotally, I'd guess the build step is O(n^2) or thereabouts. Has anyone else done perf work on the generation algorithm?

Some timings:

  • 100_000 - 1 sec
  • 500_000 - 40 sec
  • 1_000_000 - abort after 20 mins

Rough complexity

Using the first two results gives time = k * size^n where n ≈ 2.3 and k ≈ 3e-12, so for my 14_000_000 passwords I'm looking at about 24 hours :'( (modulo horrendous assumptions).
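As a sanity check on that estimate, here's the power-law fit through the two measured points (a sketch assuming the model time = k * size^n; `fit_power_law` is my own name for illustration):

```rust
/// Fit `time = k * size^n` through two measured (size, time) points.
fn fit_power_law((s1, t1): (f64, f64), (s2, t2): (f64, f64)) -> (f64, f64) {
    let n = (t2 / t1).ln() / (s2 / s1).ln();
    let k = t1 / s1.powf(n);
    (n, k)
}

fn main() {
    // Measured: 100_000 keys -> 1 s, 500_000 keys -> 40 s.
    let (n, k) = fit_power_law((100_000.0, 1.0), (500_000.0, 40.0));
    // Extrapolate to the full 14M-password set.
    let predicted_secs = k * 14_000_000_f64.powf(n);
    println!(
        "n = {:.2}, k = {:.1e}, predicted = {:.1} hours",
        n, k, predicted_secs / 3600.0
    );
}
```

This reproduces n ≈ 2.3, k ≈ 3e-12, and a prediction in the region of a day.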

@derekdreery
Author

I've created a fork for profiling at https://github.com/derekdreery/rust-phf/tree/prof_work/phf_bench . I've been using

```shell
valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes --simulate-cache=yes ../target/release/phf_bench
```

to generate callgrind files, and KCacheGrind to view them, but I'm not enough of an expert at reading the output to work out what it means.

Can anyone help? Here is the output:

callgrind.out.3154.txt

@sfackler
Collaborator

The implementation is based on this paper: http://cmph.sourceforge.net/papers/esa09.pdf

@derekdreery
Author

derekdreery commented Nov 24, 2018

That's strange - my experiments suggest that if the construction is O(n^k) then 2 <= k <= 3, but the paper says k = 1. As it stands, the time definitely does not increase linearly with size.

@derekdreery
Author

OK, so I'm looking at branches, and certain branches execute a huge number of times - there's one that runs 600_000_000 times in a run with 100_000 keys. From the linear point of view, that's pretty suspect.

@sfackler
Collaborator

It could very well be that we don't correctly implement the algorithm. I don't think anyone's really used this crate for maps larger than ~100k elements before.

@derekdreery
Author

Sounds like a contribution opportunity :D

@derekdreery
Author

derekdreery commented Nov 27, 2018

I've done some more analysis, and I think the problem comes from the number of attempts it takes to find the displacements. Some results:

| Keys | Attempts (avg) | Running time |
| --- | --- | --- |
| 100_000 | 1608 | 239ms |
| 200_000 | 25755 | 6.0s |
| 200_001 | 19723 | 4.6s |
| 300_000 | 2870 | 1.1s |
| 400_000 | 47484 | 23s |
| 500_000 | 60653 | 39s |

Here attempts is the average across all the buckets (of which there are number of keys / 5).

So there's wide variance in the running time. My first thought is that it might be caused by the hash function not being random enough, but I don't really know - I'm not an expert in this at all.

@derekdreery
Author

If anyone wants to recreate these experiments, I have a fork at https://github.com/derekdreery/rust-phf/tree/prof_work. The data comes from a 60MB file of known compromised passwords, so it's real-world data. In the phf_bench project you can run `cargo bench` or `cargo run --release` to run the tests. The size of the key set is hard-coded.

I would probably benefit from some input from a real mathematician - I'm not sure I have the skills to do the analysis here - although I'm happy to keep experimenting if that's useful.

@sfackler
Collaborator

I have found that performance is heavily dependent on the quality of the hash function. I tried switching to FNV a while ago and even small key sets failed to solve in a reasonable amount of time.

@derekdreery
Author

derekdreery commented Nov 28, 2018

I'm a bit confused about the hashing stage. The code boils down to

```rust
let mut rng = XorShiftRng::from_seed(FIXED_SEED);
let key = rng.gen();
// try again if the first key doesn't work
```

Since the seed is fixed, this yields a fixed value - or a fixed sequence of values - on every run (although during my testing I've never seen the first key fail). This means it doesn't really use randomness at all.

I'm looking at the distribution of the hashing stage:

[attached histograms: hash spreads for 100_000, 200_000, 200_001 and 500_000 keys]

The spreads of the hashes look pretty uniform to me. All in all, I'm no nearer to an answer.

@sfackler
Collaborator

It uses the RNG to produce uniformly distributed seeds for attempts to find a perfect hash function. We keep the RNG locked into a fixed seed so generated code doesn't change every time you recompile.
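The retry loop described here can be sketched roughly like this (toy std-only code; the hand-rolled xorshift stands in for the `rand` crate's `XorShiftRng`, and `FIXED_SEED`/`find_seed` are illustrative names, not phf's actual code):

```rust
// A fixed-seed xorshift RNG yields candidate hash seeds deterministically,
// so every compile walks the same sequence and the generated code is stable.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        // Marsaglia's xorshift64; stand-in for the crate's XorShiftRng.
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

fn find_seed(mut accept: impl FnMut(u64) -> bool) -> u64 {
    const FIXED_SEED: u64 = 0x0123_4567_89ab_cdef; // placeholder, not phf's value
    let mut rng = XorShift64(FIXED_SEED);
    loop {
        let candidate = rng.next();
        if accept(candidate) {
            return candidate; // first seed for which generation succeeds
        }
        // deterministic retry: the next RNG output is the next candidate
    }
}

fn main() {
    // Toy acceptance test standing in for "the displacement search solved".
    let seed = find_seed(|s| s % 7 == 0);
    println!("0x{seed:x}");
}
```

Because the walk is deterministic, two builds with the same key set always land on the same seed.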

@derekdreery
Author

I see, so you just use it to get a single value to seed the hashing function, and if that fails you get a new seed and try again. Thanks for the clarification.

@derekdreery
Author

derekdreery commented Nov 28, 2018

So to summarize where I am:

  • The slowness comes from the stage that calculates the displacements.
  • The time taken is highly non-linear in the number of entries - it isn't even monotonic.
  • The hashing seems pretty uniform.

So the question is: is there a way to make life easier for the displacement-finding loop? In their experiments the paper's authors report about 3 seconds to create a map for 1_000_000 keys, far less than my 20-minute attempt, so I'm sure it's possible.
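For anyone following along, the displacement-finding loop in question looks roughly like this (a toy sketch of the CHD bucket search as I understand it from the paper, not phf's actual implementation; the displace formula d2 + f1*d1 + f2 mod table length follows the paper):

```rust
use std::collections::HashSet;

fn displace(f1: u32, f2: u32, d1: u32, d2: u32, len: u32) -> u32 {
    (d2.wrapping_add(f1.wrapping_mul(d1)).wrapping_add(f2)) % len
}

/// Try displacement pairs (d1, d2) until every key in the bucket lands in a
/// free, distinct slot. The number of attempts per bucket is what blows up.
fn solve_bucket(
    bucket: &[(u32, u32)],    // (f1, f2) hash parts for each key in the bucket
    occupied: &mut Vec<bool>, // table slots already claimed by earlier buckets
) -> Option<(u32, u32)> {
    let len = occupied.len() as u32;
    for d1 in 0..len {
        'next: for d2 in 0..len {
            let mut claimed: HashSet<usize> = HashSet::new();
            for &(f1, f2) in bucket {
                let slot = displace(f1, f2, d1, d2, len) as usize;
                if occupied[slot] || !claimed.insert(slot) {
                    continue 'next; // collision: try the next displacement pair
                }
            }
            for slot in claimed {
                occupied[slot] = true;
            }
            return Some((d1, d2));
        }
    }
    None // no pair works; the outer loop would retry with a new hash seed
}

fn main() {
    let mut occupied = vec![false; 8];
    // Two toy keys in one bucket.
    println!("{:?}", solve_bucket(&[(3, 5), (4, 1)], &mut occupied)); // prints Some((0, 0))
}
```

The high average "attempts" figures in the table above correspond to how many (d1, d2) pairs this inner search burns through before a bucket is placed.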

@sfackler
Collaborator

IIRC the paper calls for three separate hash functions to compute the values that are then folded in with the displacements, but we currently just split the one u64 we get out of the hasher into three parts. Could that possibly be the source of the issue?
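For concreteness, splitting one u64 into three values might look something like this (the bit widths here are hypothetical, purely to illustrate the idea; I haven't checked phf's actual layout):

```rust
fn split(hash: u64) -> (u32, u32, u32) {
    // Carve three values out of one 64-bit hash. With only 64 bits to share,
    // the three parts are narrower (and not independent) compared with three
    // separately keyed hash functions, which may hurt the displacement search.
    let g = (hash & 0x3ff) as u32;                // low 10 bits
    let f1 = ((hash >> 10) & 0x7ff_ffff) as u32;  // next 27 bits
    let f2 = ((hash >> 37) & 0x7ff_ffff) as u32;  // top 27 bits
    (g, f1, f2)
}

fn main() {
    let (g, f1, f2) = split(0xdead_beef_dead_beef);
    println!("g={g} f1={f1} f2={f2}");
}
```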

@derekdreery
Author

I'll try that and see; if it doesn't help, I might download the code they used and compare it to this implementation.

@abonander
Collaborator

Wouldn't it be possible to use the same hash function 3 times by using 3 different initialization vectors?

@sfackler
Collaborator

sfackler commented Jul 2, 2019

It would be, but then the new hash function would have to be at least 3 times faster than SipHash 1-3 for it to be worth it.

@abonander
Collaborator

Not necessarily, if it reduces degenerate cases like this.

@abonander
Collaborator

abonander commented Jul 3, 2019

Comparing the performance between these two commits:

In the former, both Siphasher and FNV produce a solution in ~1.5s for 500,000 keys. For 1M, FNV solves in ~4s and Siphasher in ~2s. The only things I changed are that it now generates 3 hashes instead of 1 and takes the top 32 bits of each result (which this page suggests have higher entropy than the low bits). For both, I'm running cargo run --release inside phf_bench/. (I also changed the RNG seed just to see what would happen.)
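The three-seeded-hashes idea can be sketched like so (using std's `DefaultHasher` as a stand-in for SipHash; `triple_hash` and the seed handling are my own illustration, not the actual commit):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Derive three independent hash values per key by hashing the key under
// three different seeds, keeping the high 32 bits of each result.
fn triple_hash(key: &str, seeds: [u64; 3]) -> [u32; 3] {
    seeds.map(|seed| {
        let mut h = DefaultHasher::new(); // std's SipHash; stand-in for siphasher
        seed.hash(&mut h);
        key.hash(&mut h);
        (h.finish() >> 32) as u32 // top 32 bits, per the entropy note above
    })
}

fn main() {
    let hs = triple_hash("hunter2", [1, 2, 3]);
    println!("{hs:?}");
}
```

The point is that each of the three values now draws on a full 64-bit hash computation, rather than sharing one u64 between them.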

However, FNV still fails to solve when running cargo test for the whole repo so Siphasher is probably the better choice overall. The only tests that are failing right now are the compile-fail tests for phf_macros.

@abonander
Collaborator

abonander commented Jul 3, 2019

I'm also thinking: to add more entropy while remaining deterministic, why not seed the RNG with the hash of env!("OUT_DIR")? Or maybe initialize the hasher with that data.

@sfackler
Collaborator

sfackler commented Jul 3, 2019

We don't really care about having entropy here, though. The quality of the output of the RNG is AFAIK identical regardless of how you computed the seed.

@abonander
Collaborator

I was just thinking that some variety in the RNG seed might reduce the chance of hitting a degenerate case.

@abonander abonander reopened this Jul 6, 2019
@abonander
Collaborator

abonander commented Jul 6, 2019

I was thinking that expanding the hashes to the full 64 bits might improve solution time, but it actually slows things down compared to abonander@c288801, so that result is probably near-optimal without a deeper dive.

@abonander
Collaborator

Leaving this open until we're sure #164 resolved the issue.

@derekdreery
Author

I've had a go with my password checker and can generate code for all 14_000_000 passwords!

If you want to try it out: https://github.com/derekdreery/common-passwords

Note that I've put a limit of 1_000_000 passwords in the build.rs file to make it quicker, but it does build (very slowly) with all 14_000_000. With 1_000_000 passwords interned, the wasm size is 17M and the run time is ~instantaneous. :)

@abonander
Collaborator

That's excellent! I've investigated speeding up the algorithm further, but unfortunately it's not as trivial as I thought: #162 (comment)

@JohnTitor
Member

Closing as the issue itself has been fixed, I believe.
