HAT-Trie

I've been wanting to play around with the HAT-Trie (pdf) described by Dr. Askitis. The jist of it is that it is supposed to have all the wonderful performance characteristics of a trie -- it's sorted, it's got a bit of compression built into the thing, etc. -- but in a cache-conscious way. In particular, much of this strength comes from the fact that the tree doesn't branch all the way down to hold individual words in leaves, but leaves actually contain a cache-conscious unsorted associative data structure, the array-hash.

There are a few reasons for this repository:

Askitis has licensed his code in such a way to prevent commercial use.
I'd like to gain an appreciation for this and similar data structures
I'd like to benchmark this relative to other data structures, in absence of the implementation from the paper (worrying about license pollution)

As a goal I'd like to see a canonical open source (and commercially usable) implementation emerge. May not be this code base, of course. In fact, there are a couple of implementations, in C and in C++ that have served as inspiration and good conversation. This is in hopes of continuing that conversation.

Array Hash

Before the HAT-Trie can be implemented, we must have the array-hash. It's very similar to traditional hash table, except that bins are not linked lists, but are contiguous chunks of memory with the keys and values interleaved. The hope is that since there are fewer levels of indirection, cache performance may be improved.

I tried two implementations: 1) using a char* directly and encoding the keys and values manually and 2) using std::vector which internally uses the same technique, but has the added bonus of being much easier to deal with. The char* version is available under the char branch, and the (more complete) std::vector approach can be found under the vector branch.

In terms of performance, std::tr1::unordered_map reliably beat both implementations after about 10-15k keys are inserted, and both std::vector and char* generally perform better for fewer than about 4k keys. Interestingly enough, however, std::vector<std::pair<std::string,T> > performed comparably well when compared to the char* version. It also makes the code significantly more readable and much easier to deal with.

For this test, I used std::tr1::unordered_map<std::string,size_t>, and for keys I used the distinct_t dataset hosted on Askitis' site. I tested with both g++ 4.2.2 and clang 2.1 on a 2011-ish MacBook Pro. I encourage you to run make bench yourself to compare this implementation of array-hash, std::map (for fun) and std::tr1::unordered_map.

Hashes

One of my initial inclinations to try to improve upon the performance to edge out std::tr1::unordered_map was to try different hashes. I had initially used my personal favorite superfasthash, but also tried out crapwow which claims to be more performant than superfast. The array-hash class here accepts hash functor as a template argument, and defaults to crapwow.

In terms of performance, they were essentially indistinguishable in these tests

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
deps		deps
include		include
.gitignore		.gitignore
.gitmodules		.gitmodules
Makefile		Makefile
README.md		README.md
bench.cpp		bench.cpp
bench.gnuplot		bench.gnuplot
test.cpp		test.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

deps

deps

include

include

.gitignore

.gitignore

.gitmodules

.gitmodules

Makefile

Makefile

README.md

README.md

bench.cpp

bench.cpp

bench.gnuplot

bench.gnuplot

test.cpp

test.cpp

Repository files navigation

HAT-Trie

Array Hash

Hashes

About

Releases

Packages

Languages

dlecocq/hat-trie

Folders and files

Latest commit

History

Repository files navigation

HAT-Trie

Array Hash

Hashes

About

Resources

Stars

Watchers

Forks

Languages