Incorrect HashSet documentation? #175

Open
ryzhyk opened this issue Feb 19, 2021 · 2 comments
ryzhyk commented Feb 19, 2021

Documentation for HashSet contains the following sentence:

Values will have a predictable order based on the hasher being used. Unless otherwise specified, this will be the standard RandomState hasher.

I don't believe this is accurate. Even when replacing RandomState with a deterministic hasher, the order of values during enumeration is still only predictable assuming there are no collision nodes in the HAMT. In the presence of collisions, the order depends on the order in which values were added to the collision node. Here is a simple repro that generates two sets with identical contents, but different enumeration order.

use im::hashset::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::BuildHasherDefault;

fn main() {
    // Two hashsets with deterministic hashers.
    let mut set1 = HashSet::with_hasher(<BuildHasherDefault<DefaultHasher>>::default());
    let mut set2 = HashSet::with_hasher(<BuildHasherDefault<DefaultHasher>>::default());

    for i in 0..100_001 {
        // Insert elements in ascending order into set1 and in descending order into set2.
        // This ensures that collision nodes, if any, will hold their values in different orders.
        set1.insert(i);
        set2.insert(100_000 - i);
    }

    // This assertion succeeds: `set1` and `set2` contain the same elements.
    assert_eq!(set1, set2);

    let vec1: Vec<_> = set1.into_iter().collect();
    let vec2: Vec<_> = set2.into_iter().collect();

    // This assertion fails: the order of elements is not exactly the same.
    assert_eq!(vec1, vec2);
}
ryzhyk commented Feb 20, 2021

Here is another passage from the documentation that is inaccurate for the same reason:

Please note that the order is consistent between sets using the same hasher, but no other ordering guarantee is offered. Items will not come out in insertion order or sort order. They will, however, come out in the same order every time for the same set.

ryzhyk commented Feb 20, 2021

It gets worse: the Ord and PartialOrd impls for HashSet also rely on consistent iteration order and therefore don't work correctly:

use im::hashset::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::BuildHasherDefault;
use std::cmp::Ordering;

fn main() {

    let mut set1 = HashSet::with_hasher(<BuildHasherDefault<DefaultHasher>>::default());
    let mut set2 = HashSet::with_hasher(<BuildHasherDefault<DefaultHasher>>::default());

    for i in 0..15_001 {
        set1.insert(i);
        set2.insert(15_000 - i);
    }

    // The sets are equal according to `Eq`...
    assert_eq!(set1, set2);
    // ...but not according to `Ord`, causing this assertion to fail.
    assert_eq!(set1.cmp(&set2), Ordering::Equal);
}

ryzhyk added a commit to ddlog-dev/im-rs that referenced this issue Feb 20, 2021
Add ordered iterators over HAMT and `HashSet`.  Given two sets with the
same elements and the same hashers, the new iterators enumerate these
sets in the same order.  This is in contrast to existing iterators that
do not guarantee consistent ordering for elements with the same hash
values.  Consistent ordering is achieved by sorting collision nodes on
the fly during iteration.

We use the new iterators to fix the implementation of `Ord` and `PartialOrd`
for `HashSet` (bodil#175).
ryzhyk added a commit to ddlog-dev/im-rs that referenced this issue Feb 21, 2021
Add ordered iterators over HAMT and `HashSet`.  Given two sets with the
same elements and the same hashers, the new iterators enumerate these
sets in the same order.  This is in contrast to existing iterators that
do not guarantee consistent ordering for elements with the same hash
values.  Consistent ordering is achieved by sorting collision nodes on
the fly during iteration.

We use the new iterators to fix the implementation of `Ord`, `PartialOrd`,
and `Hash` for `HashSet` (bodil#175).
ryzhyk added a commit to ddlog-dev/differential-datalog that referenced this issue Feb 25, 2021
At the API level functional hashsets (aka immutable hashsets) behave
just like regular hashsets; however their internal implementation
supports cloning a hashset in time O(1), by sharing the entire internal
state between the clone and the original.  Modifying the clone updates
only the affected state in a copy-on-write fashion, with the rest of the
state still shared with the parent.

Example use case (added to `test-stream.sh`): computing the set of all
unique ids that appear in a stream.  At every iteration, we add all
newly observed ids to the set of ids computed so far.  This would
normally amount to cloning and modifying a potentially large set in time
`O(n)`, where `n` is the size of the set.  With functional sets, the
cost is `O(1)`.

Functional data types are generally a great match for working with immutable
collections, e.g., collections stored in DDlog relations.  We therefore plan
to introduce more functional data types in the future, possibly even
replacing the standard collections (`Set`, `Map`, `Vec`) with functional
versions.

Implementation: we implement the library as a DDlog wrapper around the
`im` crate.  Unfortunately, the crate is no longer maintained and in
fact it had some correctness issues described here:
bodil/im-rs#175.  I forked the crate and fixed
the bugs in my fork:
ddlog-dev/im-rs@46f13d8.

We may need to switch to a different crate in the future, e.g., `rpds`,
which is less popular but seems to be better maintained.

Performance considerations.  While functional sets are faster to copy,
they are still expensive to hash and compare (just like normal sets, but
potentially even more so due to their more complex internal design).  My
initial implementation of the unique-ids use case stored aggregates in
a relation.  It was about as slow as the implementation using
non-functional sets, with most of the time spent comparing sets as
they were deleted from/inserted into relations.  The stream-based
implementation is >20x faster as it does not compute deltas, and is 8x
faster than an equivalent implementation using regular sets.
ryzhyk added a commit to ddlog-dev/differential-datalog that referenced this issue Feb 25, 2021
ryzhyk added a commit to ddlog-dev/differential-datalog that referenced this issue Feb 25, 2021
ryzhyk added a commit to vmware/differential-datalog that referenced this issue Feb 28, 2021