
Optimize duplicut for SSDs #22

Open
nil0x42 opened this issue Sep 20, 2020 · 1 comment

Comments

@nil0x42 (Owner)

nil0x42 commented Sep 20, 2020

HDD vs SSD

On an HDD, sequential access is relatively fast, while random access is terribly slow. That's why duplicut, written back in 2014, was optimized with that in mind.
At the time it made no sense to have multiple threads concurrently reading a massive wordlist, so sequential access with a single thread was more performant whenever all lines fit in the hashmap at once.

Now that we have entered the SSD era, concurrency could deliver much better performance, since random access is far faster.

@solardiz suggested OpenMP, which would probably increase performance a lot.

TODO

  • compare duplicut/unique/rling on HDD to verify my assumption
  • compare duplicut/unique/rling on massive wordlist (>30GB)

@solardiz I'd love your suggestions & opinion about duplicut & ways to optimize it 😄

@solardiz (Contributor)

My idea was to continue reading the input (or previously-written part of output when we're low on RAM) sequentially (tricky to do otherwise when the input is lines of varying lengths), but buffer it rather than process it against the hash table line by line. Once the buffer fills up, process it with multiple threads (mark for removal entries that are seen in the global hash table). Then repeat for the next buffer's worth of input. I guess a reasonable buffer size can be a few MB (maybe similar to L3 cache size). A complication is dealing with duplicates within a buffer - perhaps that needs to be taken care of separately, maybe using a separate smaller hash table, and maybe sequentially.
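
A rough sketch of that buffered approach, in C with OpenMP (all names here, such as `line_t`, `hashmap_lookup` and `hashmap_insert`, are hypothetical placeholders, not duplicut's real internals), could look like this:

```c
/*
 * Illustrative sketch of the buffering idea above, not duplicut code.
 * Compile with -fopenmp.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    char   *ptr;      /* start of the line inside the read buffer */
    size_t  len;      /* line length */
    bool    remove;   /* set when the line is a duplicate */
} line_t;

bool hashmap_lookup(const char *line, size_t len);  /* read-only, safe to call in parallel */
bool hashmap_insert(const char *line, size_t len);  /* returns false if already present */

/* Called once a few-MB buffer has been filled by the sequential reader. */
void process_buffer(line_t *lines, size_t nlines)
{
    /* phase 1 (parallel): mark lines already seen in earlier buffers */
#pragma omp parallel for schedule(dynamic, 1024)
    for (size_t i = 0; i < nlines; i++)
        if (hashmap_lookup(lines[i].ptr, lines[i].len))
            lines[i].remove = true;

    /* phase 2 (sequential): handle duplicates *within* this buffer,
     * and add the surviving lines to the global hash table */
    for (size_t i = 0; i < nlines; i++)
        if (!lines[i].remove && !hashmap_insert(lines[i].ptr, lines[i].len))
            lines[i].remove = true;
}
```

In this sketch the parallel phase only does read-only lookups against the global table, which sidesteps synchronization; the within-buffer duplicates mentioned above are then caught by the sequential insert pass (a smaller per-buffer table would work there too).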

Random reading of input is also possible, perhaps skipping until the start of a new line and somehow processing the partial lines on block boundaries separately.
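
For the record, a minimal sketch of that block-aligned random reading (again purely illustrative, with a hypothetical `handle_line()` callback) might be:

```c
/* Illustrative only: pread() a block at an arbitrary offset, skip the
 * partial line at its start, stop before the partial line at its end;
 * the cut-off fragments on block boundaries get handled separately. */
#include <string.h>
#include <unistd.h>

void handle_line(const char *line, size_t len);   /* hypothetical */

void process_block(int fd, off_t offset)
{
    char buf[64 * 1024];                          /* block size is arbitrary here */
    ssize_t n = pread(fd, buf, sizeof(buf), offset);
    if (n <= 0)
        return;

    char *p = buf, *end = buf + n;

    /* skip the partial first line: it is the previous block's tail */
    if (offset != 0) {
        char *nl = memchr(buf, '\n', (size_t)n);
        if (!nl)
            return;                               /* no complete line in this block */
        p = nl + 1;
    }

    /* iterate over complete lines only; the trailing fragment is left
     * to whatever separate partial-line handling is chosen */
    char *nl;
    while (p < end && (nl = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        handle_line(p, (size_t)(nl - p));
        p = nl + 1;
    }
}
```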

I suggested OpenMP because that's what we already use in JtR, and because it's easy to use in this way. Since you already use explicit pthreads, you probably shouldn't mix different threading technologies. You can implement the above with either technology.
