
Complexity of codegen algorithm #132

Closed
derekdreery opened this issue Nov 23, 2018 · 27 comments


@derekdreery

derekdreery commented Nov 23, 2018

I've got a list of 14_000_000 passwords that I want to create a hash from. I'm finding that up to around 500_000 keys it completes more or less instantly, but at 1_000_000 it takes a long time (20+ minutes). Anecdotally, I'd guess the build step is O(n^2) or thereabouts. Has anyone else done perf work on the generation algorithm?

Some timings:

  • 100_000 - 1 sec
  • 500_000 - 40 sec
  • 1_000_000 - abort after 20 mins

Rough complexity

Using the first two results gives time = k * size^n where n ≈ 2.3 and k ≈ 3e-12, so for my 14_000_000 passwords I'm looking at about 24 hours :'( (modulo horrendous assumptions).
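As a sanity check on that estimate, here's the power-law fit through the two measured points (a sketch assuming the model time = k * size^n; `fit_power_law` is my own name for illustration):

```rust
/// Fit `time = k * size^n` through two measured (size, time) points.
fn fit_power_law((s1, t1): (f64, f64), (s2, t2): (f64, f64)) -> (f64, f64) {
    let n = (t2 / t1).ln() / (s2 / s1).ln();
    let k = t1 / s1.powf(n);
    (n, k)
}

fn main() {
    // Measured: 100_000 keys -> 1 s, 500_000 keys -> 40 s.
    let (n, k) = fit_power_law((100_000.0, 1.0), (500_000.0, 40.0));
    // Extrapolate to the full 14M-password set.
    let predicted_secs = k * 14_000_000_f64.powf(n);
    println!(
        "n = {:.2}, k = {:.1e}, predicted = {:.1} hours",
        n, k, predicted_secs / 3600.0
    );
}
```

This reproduces n ≈ 2.3, k ≈ 3e-12, and a prediction in the region of a day.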

@derekdreery
Author

I've created a fork for profiling at https://github.com/derekdreery/rust-phf/tree/prof_work/phf_bench . I've been using

```shell
valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes --simulate-cache=yes ../target/release/phf_bench
```

to generate callgrind files, and KCacheGrind to view them, but I'm not enough of an expert at reading the output to work out what it means.

Can anyone help? Here is the output:

callgrind.out.3154.txt

@sfackler
Collaborator

The implementation is based on this paper: http://cmph.sourceforge.net/papers/esa09.pdf

@derekdreery
Author

derekdreery commented Nov 24, 2018

That's strange - my experiments suggest that if the construction is O(n^k) then 2 <= k <= 3, but the paper says k = 1. As it stands, the time definitely does not increase linearly with size.

@derekdreery
Author

OK, so I'm looking at branches, and certain branches execute a huge number of times - there's one that runs 600_000_000 times in a run with 100_000 keys. From the linear point of view, that's pretty suspect.

@sfackler
Collaborator

It could very well be that we don't correctly implement the algorithm. I don't think anyone's really used this crate for maps larger than ~100k elements before.

@derekdreery
Author

Sounds like a contribution opportunity :D

@derekdreery
Author

derekdreery commented Nov 27, 2018

I've done some more analysis, and I think the problem comes from the number of attempts it takes to find the displacements. Some results:

| Keys | Attempts (avg) | Running time |
| --- | --- | --- |
| 100_000 | 1608 | 239ms |
| 200_000 | 25755 | 6.0s |
| 200_001 | 19723 | 4.6s |
| 300_000 | 2870 | 1.1s |
| 400_000 | 47484 | 23s |
| 500_000 | 60653 | 39s |

Here attempts is the average across all the buckets (of which there are number of keys / 5).

So there's wide variance in the running time. My first thought is that it might be caused by the hash function not being random enough, but I don't really know - I'm not an expert in this at all.

@derekdreery
Author

If anyone wants to recreate these experiments, I have a fork at https://github.com/derekdreery/rust-phf/tree/prof_work. The data comes from a 60MB file of known compromised passwords, so it's real-world data. In the phf_bench project you can run `cargo bench` or `cargo run --release` to run the tests. The size of the key set is hard-coded.

I would probably benefit from some input from a real mathematician - I'm not sure I have the skills to do the analysis here - although I'm happy to keep experimenting if that's useful.

@sfackler
Collaborator

I have found that performance is heavily dependent on the quality of the hash function. I tried switching to FNV a while ago and even small key sets failed to solve in a reasonable amount of time.

@derekdreery
Author

derekdreery commented Nov 28, 2018

I'm a bit confused about the hashing stage. The code boils down to

```rust
let mut rng = XorShiftRng::from_seed(FIXED_SEED);
let key = rng.gen();
// try again if the first key doesn't work
```

Since the seed is fixed, this yields a fixed value - or a fixed sequence of values - on every run (although during my testing I've never seen the first key fail). This means it doesn't really use randomness at all.

I'm looking at the distribution of the hashing stage:

[attached histograms: hash spreads for 100_000, 200_000, 200_001 and 500_000 keys]

The spreads of the hashes look pretty uniform to me. All in all, I'm no nearer to an answer.

@sfackler
Collaborator

It uses the RNG to produce uniformly distributed seeds for attempts to find a perfect hash function. We keep the RNG locked into a fixed seed so generated code doesn't change every time you recompile.
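The retry loop described here can be sketched roughly like this (toy std-only code; the hand-rolled xorshift stands in for the `rand` crate's `XorShiftRng`, and `FIXED_SEED`/`find_seed` are illustrative names, not phf's actual code):

```rust
// A fixed-seed xorshift RNG yields candidate hash seeds deterministically,
// so every compile walks the same sequence and the generated code is stable.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        // Marsaglia's xorshift64; stand-in for the crate's XorShiftRng.
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

fn find_seed(mut accept: impl FnMut(u64) -> bool) -> u64 {
    const FIXED_SEED: u64 = 0x0123_4567_89ab_cdef; // placeholder, not phf's value
    let mut rng = XorShift64(FIXED_SEED);
    loop {
        let candidate = rng.next();
        if accept(candidate) {
            return candidate; // first seed for which generation succeeds
        }
        // deterministic retry: the next RNG output is the next candidate
    }
}

fn main() {
    // Toy acceptance test standing in for "the displacement search solved".
    let seed = find_seed(|s| s % 7 == 0);
    println!("0x{seed:x}");
}
```

Because the walk is deterministic, two builds with the same key set always land on the same seed.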

@derekdreery
Author

I see, so you just use it to get a single value to seed the hashing function, and if that fails you get a new seed and try again. Thanks for the clarification.

@derekdreery
Author

derekdreery commented Nov 28, 2018

So to summarize where I am:

  • The slowness comes from the stage that calculates the displacements.
  • The time taken is highly non-linear in the number of entries - it isn't even monotonic.
  • The hashing seems pretty uniform.

So the question is: is there a way to make life easier for the displacement-finding loop? In their experiments the paper's authors report about 3 seconds to create a map for 1_000_000 keys, far less than my 20-minute attempt, so I'm sure it's possible.
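For anyone following along, the displacement-finding loop in question looks roughly like this (a toy sketch of the CHD bucket search as I understand it from the paper, not phf's actual implementation; the displace formula d2 + f1*d1 + f2 mod table length follows the paper):

```rust
use std::collections::HashSet;

fn displace(f1: u32, f2: u32, d1: u32, d2: u32, len: u32) -> u32 {
    (d2.wrapping_add(f1.wrapping_mul(d1)).wrapping_add(f2)) % len
}

/// Try displacement pairs (d1, d2) until every key in the bucket lands in a
/// free, distinct slot. The number of attempts per bucket is what blows up.
fn solve_bucket(
    bucket: &[(u32, u32)],    // (f1, f2) hash parts for each key in the bucket
    occupied: &mut Vec<bool>, // table slots already claimed by earlier buckets
) -> Option<(u32, u32)> {
    let len = occupied.len() as u32;
    for d1 in 0..len {
        'next: for d2 in 0..len {
            let mut claimed: HashSet<usize> = HashSet::new();
            for &(f1, f2) in bucket {
                let slot = displace(f1, f2, d1, d2, len) as usize;
                if occupied[slot] || !claimed.insert(slot) {
                    continue 'next; // collision: try the next displacement pair
                }
            }
            for slot in claimed {
                occupied[slot] = true;
            }
            return Some((d1, d2));
        }
    }
    None // no pair works; the outer loop would retry with a new hash seed
}

fn main() {
    let mut occupied = vec![false; 8];
    // Two toy keys in one bucket.
    println!("{:?}", solve_bucket(&[(3, 5), (4, 1)], &mut occupied)); // prints Some((0, 0))
}
```

The high average "attempts" figures in the table above correspond to how many (d1, d2) pairs this inner search burns through before a bucket is placed.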

@sfackler
Collaborator

IIRC the paper calls for three separate hash functions to compute the values that are then folded in with the displacements, but we currently just split the one u64 we get out of the hasher into three parts. Could that possibly be the source of the issue?
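For concreteness, splitting one u64 into three values might look something like this (the bit widths here are hypothetical, purely to illustrate the idea; I haven't checked phf's actual layout):

```rust
fn split(hash: u64) -> (u32, u32, u32) {
    // Carve three values out of one 64-bit hash. With only 64 bits to share,
    // the three parts are narrower (and not independent) compared with three
    // separately keyed hash functions, which may hurt the displacement search.
    let g = (hash & 0x3ff) as u32;                // low 10 bits
    let f1 = ((hash >> 10) & 0x7ff_ffff) as u32;  // next 27 bits
    let f2 = ((hash >> 37) & 0x7ff_ffff) as u32;  // top 27 bits
    (g, f1, f2)
}

fn main() {
    let (g, f1, f2) = split(0xdead_beef_dead_beef);
    println!("g={g} f1={f1} f2={f2}");
}
```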

@derekdreery
Author

I'll try that and see; if it doesn't help, I might download the code they used and compare it to this implementation.

@abonander
Collaborator

Wouldn't it be possible to use the same hash function 3 times by using 3 different initialization vectors?

@sfackler
Collaborator

sfackler commented Jul 2, 2019

It would be, but then the new hash function would have to be at least 3 times faster than SipHash 1-3 for it to be worth it.

@abonander
Collaborator

Not necessarily, if it reduces degenerate cases like this.

@abonander
Collaborator

abonander commented Jul 3, 2019

Comparing the performance between these two commits:

In the former, both Siphasher and FNV produce a solution in ~1.5s for 500,000 keys. For 1M, FNV solves in ~4s and Siphasher in ~2s. The only things I changed are that it now generates 3 hashes instead of 1 and takes the top 32 bits of each result (which this page suggests have higher entropy than the low bits). For both, I'm running cargo run --release inside phf_bench/. (I also changed the RNG seed just to see what would happen.)
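The three-seeded-hashes idea can be sketched like so (using std's `DefaultHasher` as a stand-in for SipHash; `triple_hash` and the seed handling are my own illustration, not the actual commit):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Derive three independent hash values per key by hashing the key under
// three different seeds, keeping the high 32 bits of each result.
fn triple_hash(key: &str, seeds: [u64; 3]) -> [u32; 3] {
    seeds.map(|seed| {
        let mut h = DefaultHasher::new(); // std's SipHash; stand-in for siphasher
        seed.hash(&mut h);
        key.hash(&mut h);
        (h.finish() >> 32) as u32 // top 32 bits, per the entropy note above
    })
}

fn main() {
    let hs = triple_hash("hunter2", [1, 2, 3]);
    println!("{hs:?}");
}
```

The point is that each of the three values now draws on a full 64-bit hash computation, rather than sharing one u64 between them.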

However, FNV still fails to solve when running cargo test for the whole repo so Siphasher is probably the better choice overall. The only tests that are failing right now are the compile-fail tests for phf_macros.

@abonander
Collaborator

abonander commented Jul 3, 2019

I'm also thinking: to add more entropy while remaining deterministic, why not seed the RNG with the hash of env!("OUT_DIR")? Or maybe initialize the hasher with that data.

@sfackler
Collaborator

sfackler commented Jul 3, 2019

We don't really care about having entropy here, though. The quality of the output of the RNG is AFAIK identical regardless of how you computed the seed.

@abonander
Collaborator

I was just thinking that some variety in the RNG seed might reduce the chance of hitting a degenerate case.

@abonander abonander reopened this Jul 6, 2019
@abonander
Collaborator

abonander commented Jul 6, 2019

I was thinking that expanding the hashes to the full 64 bits might improve solution time, but it actually slows things down compared to abonander@c288801, so that result is probably near-optimal without a deeper dive.

@abonander
Collaborator

Leaving this open until we're sure #164 resolved the issue.

@derekdreery
Author

I've had a go with my password checker and can generate code for all 14_000_000 passwords!

If you want to try it out: https://github.com/derekdreery/common-passwords

Note that I've put a limit of 1_000_000 passwords in the build.rs file to make it quicker, but it does build (very slowly) with all 14_000_000. With 1_000_000 passwords interned, the wasm size is 17M and the run time is ~instantaneous. :)

@abonander
Collaborator

That's excellent! I've investigated speeding up the algorithm further, but unfortunately it's not as trivial as I thought: #162 (comment)

@JohnTitor
Member

Closing as the issue itself has been fixed, I believe.
