Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse-optimized variant of union_with_key? #166

Open
SimonSapin opened this issue Dec 21, 2020 · 0 comments
Open

Reuse-optimized variant of union_with_key? #166

SimonSapin opened this issue Dec 21, 2020 · 0 comments

Comments

@SimonSapin
Copy link
Contributor

Maps have a union_with_key method, like union but with a callback to decide what to do when a key exists on both sides:

    pub fn union_with_key<F>(self, other: Self, mut f: F) -> Self
    where
        F: FnMut(&K, V, V) -> V

Both HashMap and OrdMap are internally immutable trees with sharable copy-on-write nodes. Rc::make_mut is used to mutate a node, cloning it if it was previously shared. When many possibly-large maps exist that are created by cloning a previous map and making a possibly-small number of changes, this sharing can be very significant and minimizing the number of make_mut calls can be impactful.

Would you be open to adding a new API for this?

Stage 1

The implementation of union_with_key consumes and iterates one the input map, while it mutates then returns the other one. Because the signature of the callback involves owned V values however, it needs to always call remove then insert. This leads to more make_mut calls than necessary when the value that was already in the map being mutated ends up being used.

Adding a new method where the callback takes borrowed values would help in that case:

enum MergeResult<V> {
    UseLeftValue,
    UseRightValue,
    UseNewValue(V),
}
    pub fn union_with_merge<F>(self, other: Self, mut merge: F) -> Self
    where
        F: FnMut(&K, /* left: */ &V, /* right: */ &V) -> MergeResult<V>
    {
        for (key, right_value) in other {
            match self.get(&key) {  // get() does not use make_mut() where remove() does
                None => {
                    self.insert(key, right_value);
                }
                Some(left_value) => {
                    match merge(&key, left_value, &right_value) {
                        MergeResult::UseLeftValue => {}, // No insert() here
                        MergeResult::UseRightValue => {
                            self.insert(key, right_value);
                        },
                        MergeResult::UseNewValue(new_value) => {
                            self.insert(key, new_value);
                        },
                    }
                }
            }
        }
        self
    }

(Omitted: swapping the two maps to mutate the larger one: #163. That swap is where MergeResult::UseRightValue becomes useful.)

Stage 2

This is a bit fuzzy. More of an intuition that something might be possible, than a fully-formed plan.

Writing this issue is motivated by an optimization effort of an algorithm in Mercurial (copy tracing). Currently that algorithm does the equivalent of Stage 1 above when one of the two maps to merge is much (2×) larger than the other. When they are of similar size it goes even further to maximize re-use: call OrdMap::diff and collect all respective key-value pairs that we’d need to insert to transform each of the two maps into a merged map. Only then we pick the side with the smaller set of changes.

Since #113 diff short-circuits shared internal subtrees based on ptr_eq. So more node sharing make diff faster and avoids redundant merge computations, which in turn increases sharing in a virtuous cycle.

Still, collecting two sets of changes before applying one of them feels like more work than necessary.

A union_with_merge method in the im crate could potentially walk internal trees directly instead of using ConsumingIter, merge one sub-tree at a time, short-circuit shared sub-trees based on ptr_eq, and… handwave… have some better algorithm than the one based of OrdMap::diff.

Walking internal trees while mutating them to insert merged value and maintaining all invariants would likely be tricky (diff is comparatively easier since read-only). But does this sound possible at all?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant