Incorrect HashSet documentation? #175

Open
ryzhyk opened this issue Feb 19, 2021 · 2 comments
ryzhyk commented Feb 19, 2021

Documentation for HashSet contains the following sentence:

Values will have a predictable order based on the hasher being used. Unless otherwise specified, this will be the standard RandomState hasher.

I don't believe this is accurate. Even when replacing RandomState with a deterministic hasher, the order of values during enumeration is still only predictable assuming there are no collision nodes in the HAMT. In the presence of collisions, the order depends on the order in which values were added to the collision node. Here is a simple repro that generates two sets with identical contents, but different enumeration order.

use im::hashset::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::BuildHasherDefault;

fn main() {
    // Two hashsets with deterministic hashers.
    let mut set1 = HashSet::with_hasher(<BuildHasherDefault<DefaultHasher>>::default());
    let mut set2 = HashSet::with_hasher(<BuildHasherDefault<DefaultHasher>>::default());

    for i in 0..100_001 {
        // Insert elements in ascending order into set1 and in descending order into set2.
        // This ensures that collision nodes, if any, will hold their values in different orders.
        set1.insert(i);
        set2.insert(100_000 - i);
    }

    // This assertion succeeds: `set1` and `set2` contain the same elements.
    assert_eq!(set1, set2);

    let vec1: Vec<_> = set1.into_iter().collect();
    let vec2: Vec<_> = set2.into_iter().collect();

    // This assertion fails: the order of elements is not exactly the same.
    assert_eq!(vec1, vec2);
}
ryzhyk commented Feb 20, 2021

Here is another passage from the documentation that is inaccurate for the same reason:

Please note that the order is consistent between sets using the same hasher, but no other ordering guarantee is offered. Items will not come out in insertion order or sort order. They will, however, come out in the same order every time for the same set.

ryzhyk commented Feb 20, 2021

It gets worse: the Ord and PartialOrd impls for HashSet also rely on consistent iteration order and therefore don't work correctly:

use im::hashset::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::BuildHasherDefault;
use std::cmp::Ordering;

fn main() {

    let mut set1 = HashSet::with_hasher(<BuildHasherDefault<DefaultHasher>>::default());
    let mut set2 = HashSet::with_hasher(<BuildHasherDefault<DefaultHasher>>::default());

    for i in 0..15_001 {
        set1.insert(i);
        set2.insert(15_000 - i);
    }

    // The sets are equal according to `Eq`...
    assert_eq!(set1, set2);
    // ...but not according to `Ord`, causing this assertion to fail.
    assert_eq!(set1.cmp(&set2), Ordering::Equal);
}

ryzhyk added a commit to ddlog-dev/im-rs that referenced this issue Feb 20, 2021
Add ordered iterators over HAMT and `HashSet`.  Given two sets with the
same elements and the same hashers, the new iterators enumerate these
sets in the same order.  This is in contrast to existing iterators that
do not guarantee consistent ordering for elements with the same hash
values.  Consistent ordering is achieved by sorting collision nodes on
the fly during iteration.

We use the new iterators to fix the implementation of `Ord` and `PartialOrd`
for `HashSet` (bodil#175).
ryzhyk added a commit to ddlog-dev/im-rs that referenced this issue Feb 21, 2021
Add ordered iterators over HAMT and `HashSet`.  Given two sets with the
same elements and the same hashers, the new iterators enumerate these
sets in the same order.  This is in contrast to existing iterators that
do not guarantee consistent ordering for elements with the same hash
values.  Consistent ordering is achieved by sorting collision nodes on
the fly during iteration.

We use the new iterators to fix the implementation of `Ord`, `PartialOrd`,
and `Hash` for `HashSet` (bodil#175).
ryzhyk added a commit to ddlog-dev/differential-datalog that referenced this issue Feb 25, 2021
At the API level functional hashsets (aka immutable hashsets) behave
just like regular hashsets; however their internal implementation
supports cloning a hashset in time O(1), by sharing the entire internal
state between the clone and the original.  Modifying the clone updates
only the affected state in a copy-on-write fashion, with the rest of the
state still shared with the parent.

Example use case (added to `test-stream.sh`): computing the set of all
unique ids that appear in a stream.  At every iteration, we add all
newly observed ids to the set of ids computed so far.  This would
normally amount to cloning and modifying a potentially large set in time
`O(n)`, where `n` is the size of the set.  With functional sets, the
cost is `O(1)`.

Functional data types are generally a great match for working with immutable
collections, e.g., collections stored in DDlog relations.  We therefore plan
to introduce more functional data types in the future, possibly even
replacing the standard collections (`Set`, `Map`, `Vec`) with functional
versions.

Implementation: we implement the library as a DDlog wrapper around the
`im` crate.  Unfortunately, the crate is no longer maintained and in
fact it had some correctness issues described here:
bodil/im-rs#175.  I forked the crate and fixed
the bugs in my fork:
ddlog-dev/im-rs@46f13d8.

We may need to switch to a different crate in the future, e.g., `rpds`,
which is less popular but seems to be better maintained.

Performance considerations.  While functional sets are faster to copy,
they are still expensive to hash and compare (just like normal sets, but
potentially even more so due to their more complex internal design).  My
initial implementation of the unique-ids use case stored aggregates in
a relation.  It was about as slow as the implementation using
non-functional sets, with most of the time spent comparing sets as
they were deleted from/inserted into relations.  The stream-based
implementation is >20x faster as it does not compute deltas, and is 8x
faster than an equivalent implementation using regular sets.
ryzhyk added a commit to ddlog-dev/differential-datalog that referenced this issue Feb 25, 2021
ryzhyk added a commit to ddlog-dev/differential-datalog that referenced this issue Feb 25, 2021
ryzhyk added a commit to vmware/differential-datalog that referenced this issue Feb 28, 2021