Skip to content

dlecocq/hat-trie

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HAT-Trie

I've been wanting to play around with the HAT-Trie (pdf) described by Dr. Askitis. The jist of it is that it is supposed to have all the wonderful performance characteristics of a trie -- it's sorted, it's got a bit of compression built into the thing, etc. -- but in a cache-conscious way. In particular, much of this strength comes from the fact that the tree doesn't branch all the way down to hold individual words in leaves, but leaves actually contain a cache-conscious unsorted associative data structure, the array-hash.

There are a few reasons for this repository:

  1. Askitis has licensed his code in such a way to prevent commercial use.
  2. I'd like to gain an appreciation for this and similar data structures
  3. I'd like to benchmark this relative to other data structures, in absence of the implementation from the paper (worrying about license pollution)

As a goal I'd like to see a canonical open source (and commercially usable) implementation emerge. May not be this code base, of course. In fact, there are a couple of implementations, in C and in C++ that have served as inspiration and good conversation. This is in hopes of continuing that conversation.

Array Hash

Before the HAT-Trie can be implemented, we must have the array-hash. It's very similar to traditional hash table, except that bins are not linked lists, but are contiguous chunks of memory with the keys and values interleaved. The hope is that since there are fewer levels of indirection, cache performance may be improved.

I tried two implementations: 1) using a char* directly and encoding the keys and values manually and 2) using std::vector which internally uses the same technique, but has the added bonus of being much easier to deal with. The char* version is available under the char branch, and the (more complete) std::vector approach can be found under the vector branch.

In terms of performance, std::tr1::unordered_map reliably beat both implementations after about 10-15k keys are inserted, and both std::vector and char* generally perform better for fewer than about 4k keys. Interestingly enough, however, std::vector<std::pair<std::string,T> > performed comparably well when compared to the char* version. It also makes the code significantly more readable and much easier to deal with.

For this test, I used std::tr1::unordered_map<std::string,size_t>, and for keys I used the distinct_t dataset hosted on Askitis' site. I tested with both g++ 4.2.2 and clang 2.1 on a 2011-ish MacBook Pro. I encourage you to run make bench yourself to compare this implementation of array-hash, std::map (for fun) and std::tr1::unordered_map.

Hashes

One of my initial inclinations to try to improve upon the performance to edge out std::tr1::unordered_map was to try different hashes. I had initially used my personal favorite superfasthash, but also tried out crapwow which claims to be more performant than superfast. The array-hash class here accepts hash functor as a template argument, and defaults to crapwow.

In terms of performance, they were essentially indistinguishable in these tests

About

An Exploratory Implementation of HAT-Trie and Array-Hash

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages