Optimize insertion to only use a single lookup #277

Merged (1 commit, Apr 1, 2023)

Conversation

@Zoxc (Contributor) commented Jul 9, 2021

This changes the insert method to use a single lookup for insertion instead of 2 separate lookups.

This reduces runtime of rustc on check builds by an average of 0.5% on local benchmarks (1% for winapi). The compiler size is reduced by 0.45%.

unwrap_unchecked is lifted from libstd since it's currently unstable and that probably requires some attribution.
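To make the idea concrete, here is a minimal sketch of a combined lookup, assuming a simplified open-addressing table with linear probing rather than hashbrown's real group-based layout (all names here are illustrative):

```rust
// Sketch only: while probing for an equal key we also remember the first
// DELETED slot we saw, so a miss can be turned into an insertion slot without
// walking the probe sequence a second time.
enum Slot<K, V> {
    Empty,
    Deleted,
    Full(K, V),
}

struct Table<K, V> {
    slots: Vec<Slot<K, V>>, // length is a power of two
}

impl<K: Eq, V> Table<K, V> {
    /// Ok(index) if the key is present, Err(index) with the slot to insert into.
    fn find_or_slot(&self, hash: u64, key: &K) -> Result<usize, usize> {
        let mask = self.slots.len() - 1;
        let mut pos = (hash as usize) & mask;
        let mut insert_slot = None;
        loop {
            match &self.slots[pos] {
                Slot::Full(k, _) if k == key => return Ok(pos),
                Slot::Full(..) => {}
                // Remember the first tombstone, but keep searching: the key we
                // are looking for may still live later in the probe sequence.
                Slot::Deleted if insert_slot.is_none() => insert_slot = Some(pos),
                Slot::Deleted => {}
                // The probe sequence ends at an empty slot; reuse the tombstone
                // if we saw one, otherwise insert right here.
                Slot::Empty => return Err(insert_slot.unwrap_or(pos)),
            }
            pos = (pos + 1) & mask; // linear probing, for simplicity
        }
    }
}
```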

@Amanieu (Member) left a comment:

Nice work! I'd like to see the impact on the full set of rust-perf benchmarks first though.

(6 review threads on src/raw/mod.rs, resolved)
@Zoxc (Author) commented Jul 13, 2021

I did some newer rustc benchmarks and it seems performance is now more neutral. The size reduction is down to 0.28%.

clap:check                        1.9650s   1.9607s  -0.22%
helloworld:check                  0.0444s   0.0442s  -0.58%
hyper:check                       0.2991s   0.2997s  +0.20%
regex:check                       1.1642s   1.1614s  -0.24%
syn:check                         1.7067s   1.7051s  -0.09%
syntex_syntax:check               6.9250s   6.9132s  -0.17%
winapi:check                      8.3790s   8.3943s  +0.18%

Total                            20.4834s  20.4786s  -0.02%
Summary                           3.5000s   3.4954s  -0.13%

Adding a branch hint to reserve improved things (code size reduction at 0.27%):

clap:check                        1.9603s   1.9487s  -0.59%
helloworld:check                  0.0436s   0.0436s  +0.10%
hyper:check                       0.2984s   0.2983s  -0.02%
regex:check                       1.1535s   1.1471s  -0.56%
syn:check                         1.7065s   1.6952s  -0.66%
syntex_syntax:check               6.8963s   6.8691s  -0.39%
winapi:check                      8.3529s   8.3326s  -0.24%

Total                            20.4115s  20.3347s  -0.38%
Summary                           3.5000s   3.4881s  -0.34%

I think rustc ends up being more sensitive to inlining / code layout than to the real changes here. I may try the new LLVM pass manager and see if it behaves the same way.

@Amanieu (Member) commented Jul 13, 2021

check builds are a bit misleading because they stop before monomorphization happens. One concern I have with the code as it is currently is that there is now much more code that is generic over T. This will likely cause a compilation time regression in debug/opt builds due to the increased amount of code generated by HashMap.
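For reference, a sketch of the generic/non-generic split being weighed here, with made-up types rather than hashbrown's actual RawTable/RawTableInner:

```rust
// Sketch only: the bulk of the probing logic lives in a type that never
// mentions T, so it is compiled once; the thin wrapper that touches T is the
// only part monomorphized per HashMap<K, V>. Names are hypothetical.
struct TableInner {
    bucket_mask: usize, // control-byte bookkeeping would live here too
}

impl TableInner {
    // Shared by every instantiation: no T anywhere in the signature or body.
    fn find_insert_slot(&self, hash: u64) -> usize {
        (hash as usize) & self.bucket_mask // placeholder probe logic
    }
}

struct Table<T> {
    inner: TableInner,
    data: Vec<Option<T>>,
}

impl<T> Table<T> {
    // Monomorphized per T, but deliberately kept small.
    fn insert(&mut self, hash: u64, value: T) -> usize {
        let index = self.inner.find_insert_slot(hash);
        self.data[index] = Some(value);
        index
    }
}
```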

@Zoxc (Author) commented Jul 13, 2021

I'm using check builds to measure the run-time of hashbrown and to explicitly exclude compile time. Some compile time could still sneak in, since applying hashbrown to stage0 only is no longer trivial (with both Cargo and Rust 2018 not supporting conditional crates).

When insert is fully inlined, this PR reduces the lookups from 3 to 1, which is probably why it sees a code size improvement (with optimizations), so I'm not sure if release build times will increase. Debug builds could still suffer though. Some dynamic dispatch may be helpful there, assuming it can be optimized away in release builds. Do you know of any crates smaller than cargo that rely heavily on HashMap which I could throw into my benchmark suite?
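For reference, a sketch of the kind of type erasure being discussed (illustrative names and a simplified linear scan, not hashbrown's group-based probing):

```rust
// The probe loop takes the equality check as `&dyn Fn(usize) -> bool`, so its
// body is compiled once rather than once per (K, V). In release builds LLVM
// can usually devirtualize and inline through the `dyn` call once the concrete
// closure is visible at the call site.
fn search_erased(ctrl: &[u8], h2: u8, eq: &dyn Fn(usize) -> bool) -> Option<usize> {
    ctrl.iter()
        .enumerate()
        .find(|&(index, &byte)| byte == h2 && eq(index))
        .map(|(index, _)| index)
}

// A generic caller erases its concrete closure at the boundary.
fn find<T: PartialEq>(ctrl: &[u8], data: &[T], h2: u8, value: &T) -> Option<usize> {
    search_erased(ctrl, h2, &|index| data[index] == *value)
}
```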

@Amanieu (Member) commented Jul 14, 2021

You might want to measure using cargo-llvm-lines to compare the amount of LLVM IR generated before & after this change. See #205 for a test program that you can use to measure.

@Zoxc (Author) commented Jul 14, 2021

I did some measurements of code size in debug mode. I also tried making find_potential and search less generic by using dynamic dispatch. LLVM seems to be able to inline through it and produce similar code in release mode with that change.

Type                 Before        PR                    PR with dynamic dispatch
llvm-lines           15081         15843 (5.05%)         15100 (0.13%)
rustc-driver bytes   226,703,360   227,664,384 (0.42%)   226,968,064 (0.12%)
cargo bytes          31,037,952    31,209,984 (0.55%)    31,093,248 (0.18%)

src/raw/mod.rs Outdated
@@ -1206,6 +1222,90 @@ impl<A: Allocator + Clone> RawTableInner<A> {
}
}

/// Finds the position to insert something in a group.
#[inline]
unsafe fn find_insert_slot_in_group(
@Amanieu (Member) commented:
I don't think there is any situation when calling this function could be unsafe. The invariants of RawTable guarantee that accessing the first Group::WIDTH control bytes is always valid.

src/raw/mod.rs Outdated
}

/// Searches for an element in the table,
/// or a potential slot where that element could be inserted.
@Amanieu (Member) commented:
As with #279, add a comment explaining why we are using dynamic dispatch here.

src/raw/mod.rs Outdated
Comment on lines 1285 to 1697
let index = self.find_insert_slot_in_group(&group, &probe_seq);

if likely(index.is_some()) {
    // Only stop the search if the group is empty. The element might be
    // in a following group.
    if likely(group.match_empty().any_bit_set()) {
        // Use a tombstone if we found one
        return if unlikely(tombstone.is_some()) {
            (tombstone.unwrap(), false)
        } else {
            (index.unwrap(), false)
        };
    } else {
        // We found a tombstone, record it so we can return it as a potential
        // insertion location.
        tombstone = index;
    }
}
@Amanieu (Member) commented:
I would reorganize this code like this, which makes it much clearer:

// We didn't find the element we were looking for in the group, try to get an
// insertion slot from the group if we don't have one yet.
if insert_slot.is_none() {
    insert_slot = self.find_insert_slot_in_group(&group, &probe_seq);
}

// Only stop the search if the group contains at least one empty element.
// Otherwise, the element that we are looking for might be in a following group.
if likely(group.match_empty().any_bit_set()) {
    return (insert_slot.unchecked_unwrap(), false);
}

(tombstone is renamed to insert_slot)

@Zoxc (Author) commented Jul 15, 2021

I verified that this is still a performance win for rustc after #279 landed. New benchmark run (size reduction now at 0.19%):

clap:check                        1.9295s   1.9188s  -0.55%
helloworld:check                  0.0440s   0.0437s  -0.62%
hyper:check                       0.2950s   0.2928s  -0.76%
regex:check                       1.1416s   1.1341s  -0.66%
syn:check                         1.6833s   1.6720s  -0.67%
syntex_syntax:check               6.7729s   6.7454s  -0.41%
winapi:check                      8.2021s   8.1426s  -0.73%

Total                            20.0683s  19.9494s  -0.59%
Summary                           3.5000s   3.4781s  -0.63%

Any suggestions on what to do with the rehash_in_place benchmark? That seems to benchmark reusing tombstones at exactly max capacity, which will expand with this PR. Should it be changed to use half capacity so it will trigger rehashes?

@Zoxc (Author) commented Jul 15, 2021

I made a test crate with 117 HashMap instances to see what effect this has on compile times:

use std::collections::HashMap;
use std::fmt::Debug;
use std::hash::Hash;

fn map<K: Hash + Debug + Eq + Clone, V: Debug>(k: K, v: V) {
    let mut map = HashMap::new();
    map.insert(k.clone(), v);
    map.reserve(1000);
    dbg!(map.get(&k), map.iter().next());
}

fn values<K: Hash + Debug + Eq + Clone>(k: K) {
    map(k.clone(), ());
    map(k.clone(), "");
    map(k.clone(), true);
    map(k.clone(), 1i8);
    map(k.clone(), 1u8);
    map(k.clone(), 1u32);
    map(k.clone(), 1i32);
    map(k.clone(), vec![1u32]);
    map(k.clone(), vec![1i32]);
}

fn main() {
    values(());
    values("");
    values(true);
    values(1i8);
    values(1u8);
    values(1u64);
    values(1i64);
    values(1usize);
    values(1isize);
    values(String::new());
    values(vec![""]);
    values(vec![1u32]);
    values(vec![1i32]);
}

Results:

hashmap-instances:check           0.0628s   0.0625s  -0.60%
hashmap-instances:debug           7.8956s   8.0046s  +1.38%
hashmap-instances:release        32.9421s  32.1129s  -2.52%

@Amanieu (Member) commented Jul 16, 2021

At this point, I'd like to see some real benchmarks from rust-perf. You should open a PR in rust-lang/rust where you replace hashbrown in std with your branch. That way we can use the rust timer bot to get some numbers on compilation time. (Like we did for rust-lang/rust#77566)

bors added a commit that referenced this pull request Jul 21, 2021
Make rehashing and resizing less generic

This makes the code in `rehash_in_place`, `resize` and `reserve_rehash` less generic over `T`. It also improves the performance of rustc. That performance increase is partially attributed to the use of `#[inline(always)]`.

This is the effect on rustc runtime:
```
clap:check                        1.9523s   1.9327s  -1.00%
hashmap-instances:check           0.0628s   0.0624s  -0.57%
helloworld:check                  0.0438s   0.0436s  -0.50%
hyper:check                       0.2987s   0.2970s  -0.59%
regex:check                       1.1497s   1.1402s  -0.82%
syn:check                         1.7004s   1.6851s  -0.90%
syntex_syntax:check               6.9232s   6.8546s  -0.99%
winapi:check                      8.3220s   8.2857s  -0.44%

Total                            20.4528s  20.3014s  -0.74%
Summary                           4.0000s   3.9709s  -0.73%
```
`rustc_driver`'s code size is increased by 0.02%.

This is the effect this has on compile time for my [HashMap compile time benchmark](#277 (comment)):
```
hashmap-instances:check           0.0636s   0.0632s  -0.61%
hashmap-instances:release        33.0166s  32.2487s  -2.33%
hashmap-instances:debug           7.8677s   7.2012s  -8.47%

Total                            40.9479s  39.5131s  -3.50%
Summary                           1.5000s   1.4430s  -3.80%
```
The `hashmap-instances:debug` compile time could be further improved if there was a way to apply `#[inline(always)]` only on release builds.
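One way to approximate that, sketched here as an assumption rather than anything hashbrown does today, is to gate the attribute on debug_assertions with cfg_attr:

```rust
// Hypothetical: apply the aggressive inline hint only when the crate is built
// without debug assertions (typically release profiles), keeping debug builds
// cheaper to compile. This mirrors how the `inline-more` feature gates plain
// `#[inline]`, but is not the crate's current policy.
#[cfg_attr(not(debug_assertions), inline(always))]
#[cfg_attr(debug_assertions, inline)]
fn h2(hash: u64) -> u8 {
    // Top 7 bits of the hash, as used for control bytes in SwissTable designs.
    (hash >> 57) as u8
}
```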
@Zoxc Zoxc mentioned this pull request Jul 21, 2021
bors added a commit that referenced this pull request Jul 21, 2021
Inline small functions

This adds `#[inline]` to small functions which should be beneficial to inline.

rustc compilation performance (code size of `rustc_driver` up by 0.09%):
```
clap:check                        1.9486s   1.9416s  -0.36%
hashmap-instances:check           0.0629s   0.0626s  -0.52%
helloworld:check                  0.0443s   0.0439s  -0.69%
hyper:check                       0.3011s   0.3000s  -0.36%
regex:check                       1.1505s   1.1468s  -0.33%
syn:check                         1.6989s   1.6904s  -0.50%
syntex_syntax:check               6.8479s   6.8288s  -0.28%
winapi:check                      8.3437s   8.2967s  -0.56%

Total                            20.3979s  20.3108s  -0.43%
Summary                           4.0000s   3.9820s  -0.45%
```

This is the effect this has on compile time for my [HashMap compile time benchmark](#277 (comment)):
```
hashmap-instances:check           0.0635s   0.0632s  -0.33%
hashmap-instances:release        32.0928s  32.4440s  +1.09%
hashmap-instances:debug           7.2193s   7.2800s  +0.84%

Total                            39.3756s  39.7873s  +1.05%
Summary                           1.5000s   1.5080s  +0.54%
```

We saw a 1.6% improvement in rustc's build time for a -20% improvement on `hashmap-instances:release` on rust-lang/rust#87233. So I would expect around a 0.08% regression for rustc's build time from this PR.
@Zoxc (Author) commented Jul 22, 2021

I re-ran the benchmarks after #283 landed.

clap:check                        1.9603s   1.9501s  -0.52%
hashmap-instances:check           0.0632s   0.0626s  -0.95%
helloworld:check                  0.0437s   0.0435s  -0.46%
hyper:check                       0.2971s   0.2949s  -0.72%
regex:check                       1.1596s   1.1548s  -0.42%
syn:check                         1.7156s   1.7048s  -0.63%
syntex_syntax:check               6.8806s   6.8575s  -0.34%
winapi:check                      8.3276s   8.2277s  -1.20%

Total                            20.4478s  20.2959s  -0.74%
Summary                           4.0000s   3.9738s  -0.66%
hashmap-instances:check           0.0624s   0.0622s  -0.23%
hashmap-instances:release        32.3625s  31.2730s  -3.37%
hashmap-instances:debug           7.2383s   7.3376s  +1.37%

Total                            39.6632s  38.6728s  -2.50%
Summary                           1.5000s   1.4889s  -0.74%

@Zoxc (Author) commented Jul 23, 2021

New perf run results

@Amanieu (Member) commented Jul 24, 2021

The rustc-perf results are unfortunately rather disappointing and I don't feel comfortable merging this PR as it is. This PR muddies the water a bit since it involves two distinct changes: combining the insert/lookup and reducing LLVM IR generated.

It would be best to explore each separately so that we get a better understanding of the perf characteristics:

  • The normal insert function could be decomposed into a generic and non-generic part to see what effect that has on performance.
  • The new insertion algorithm could be extended to also apply to Entry. This should be possible since VacantEntry can hold the index into which to insert an element.

@Zoxc (Author) commented Jul 25, 2021

The rustc-perf results are unfortunately rather disappointing

I assume you're talking about the instruction counts, if so, may I ask why you seem to value those over wall time benchmarks?

It would be best to explore each separately so that we get a better understanding of the perf characteristics

I did verify that the code with dynamic dispatch optimizes to x86-64 assembly similar to the code without it. That does assume that LLVM's inlining plays along, however.

The normal insert function could be decomposed into a generic and non-generic part to see what effect that has on performance.

I'm not sure what you mean here. All 3 lookups in insert on master have, since #279, been in the less generic part. There doesn't seem to be a lot more to move out of it.

The new insertion algorithm could be extended to also apply to Entry. This should be possible since VacantEntry can hold the index into which to insert an element.

I do think this makes sense. insert is hotter for rustc, so I wouldn't expect this to alter benchmark results much.

@Zoxc (Author) commented Jul 25, 2021

I did an experiment where I put #[inline(never)] on HashMap::insert and built libstd with the inline-more feature. This is to make LLVM's inlining more consistent. I tried 3 variations on HashMap::insert:

  1. find; find_insert_slot; reserve; find_insert_slot (what's on master now)

  2. reserve; find_potential (this PR)

  3. find; reserve; find_insert_slot

These are the results. The numbers are in order and the percentages are relative to the first column.

clap:check                        1.9480s   1.9552s  +0.37%   1.9466s  -0.07%
hashmap-instances:check           0.0635s   0.0634s  -0.16%   0.0633s  -0.29%
helloworld:check                  0.0443s   0.0442s  -0.28%   0.0441s  -0.48%
hyper:check                       0.2958s   0.2961s  +0.11%   0.2955s  -0.10%
regex:check                       1.1421s   1.1431s  +0.09%   1.1423s  +0.02%
syn:check                         1.6760s   1.6764s  +0.02%   1.6727s  -0.20%
syntex_syntax:check               6.7950s   6.8020s  +0.10%   6.7909s  -0.06%
winapi:check                      8.2145s   8.1965s  -0.22%   8.1979s  -0.20%

Total                            20.1793s  20.1769s  -0.01%  20.1534s  -0.13%
Summary                           2.6667s   2.6668s  +0.00%   2.6621s  -0.17%

There doesn't seem to be a significant difference between them (for rustc at least). This data seems to support my hypothesis that the benefit of this PR is due to inlining, where the smaller size of insert is beneficial.

@Amanieu (Member) commented Jul 25, 2021

I assume you're talking about the instruction counts, if so, may I ask why you seem to value those over wall time benchmarks?

Actually I also looked at the CPU cycles on rustc-perf (cycles:u) and the results still don't look very good.

The normal insert function could be decomposed into a generic and non-generic part to see what effect that has on performance.

I'm not sure what do you mean here. All 3 lookups on insert on master has since #279 been in the less generic part. There doesn't seem to be a lot more to move out of it.

I'm referring to RawTable::insert. Most of it doesn't depend on T and could be moved to RawTableInner like you did with find.

There doesn't seem to be a significant difference between them (for rustc at least). This data seems to support my hypothesis that the benefit of this PR is due to inlining, where the smaller size of insert is beneficial.

In that case let's stick with the existing algorithm and just focus on the inlining optimizations.

@Zoxc (Author) commented Jul 25, 2021

Actually I also looked at the CPU cycles on rustc-perf (cycles:u) and the results still don't look very good.

I noticed that there are more crates that improve than regress on wall time, on both perf runs. I wonder if cycles:u actually diverges from wall time on perf here. I do wish there was a sum over all crates on perf, which would be more precise. It's also possible that the difference is due to Zen (my CPU) vs Zen 2 (the perf server).

I'm referring to RawTable::insert. Most of it doesn't depend on T and could be moved to RawTableInner like you did with find.

The reserve call in the middle is a bit annoying. I worked around it with some cheeky unsafe code though. I'll see how it performs.

In that case let's stick with the existing algorithm and just focusing on the inlining optimizations.

I meant that this PR has a smaller algorithm, so it inlines better. I haven't done any pure inlining optimizations here.

@Amanieu (Member) commented Jul 25, 2021

I noticed that there's more crates than improve than regresses on wall time, on both perf runs. I wonder if cycles:u actually diverge from wall time on perf here. I do wish there was a sum of all crates on perf, which would be more precise. It's also be possible that the difference is due to Zen (my CPU) vs Zen 2 (perf server).

Wall-time can be very noisy when multiple threads are involved (codegen units). task-clock is more accurate since it only measures time when a thread is performing actual work rather than being blocked waiting for another thread. cycles is even more accurate since it is not affected by changes in CPU frequency: an add instruction takes 1 cycle no matter what frequency the CPU is running at.

@Zoxc (Author) commented Jul 25, 2021

I tried a less generic insert variant of the current algorithm:

#[cfg_attr(feature = "inline-more", inline)]
pub fn insert(&mut self, hash: u64, value: T, hasher: impl Fn(&T) -> u64) -> Bucket<T> {
    unsafe {
        let index = self.table.mark_insert(hash, &move |table| {
            (*(table as *mut RawTableInner<A> as *mut Self)).reserve(1, &hasher)
        });
        let bucket = self.bucket(index);
        bucket.write(value);
        bucket
    }
}
#[inline]
unsafe fn mark_insert(&mut self, hash: u64, reserve_one: &dyn Fn(&mut Self)) -> usize {
    let mut index = self.find_insert_slot(hash);

    // We can avoid growing the table once we have reached our load
    // factor if we are replacing a tombstone. This works since the
    // number of EMPTY slots does not change in this case.
    let old_ctrl = *self.ctrl(index);
    if unlikely(self.growth_left == 0 && special_is_empty(old_ctrl)) {
        reserve_one(self);
        index = self.find_insert_slot(hash);
    }

    self.record_item_insert_at(index, old_ctrl, hash);

    index
}

Compile times take a hit:

hashmap-instances:check           0.0636s   0.0637s  +0.10%
hashmap-instances:release        32.5219s  39.9375s +22.80%
hashmap-instances:debug           7.2875s   8.8993s +22.12%

rustc performance does improve:

clap:check                        1.9578s   1.9493s  -0.44%
hashmap-instances:check           0.0631s   0.0631s  -0.06%
helloworld:check                  0.0441s   0.0440s  -0.31%
hyper:check                       0.2972s   0.2965s  -0.26%
regex:check                       1.1550s   1.1511s  -0.34%
syn:check                         1.6938s   1.6893s  -0.26%
syntex_syntax:check               6.9113s   6.9025s  -0.13%
winapi:check                      8.3605s   8.2682s  -1.10%

Total                            20.4829s  20.3640s  -0.58%
Summary                           4.0000s   3.9855s  -0.36%

@Zoxc (Author) commented Jul 26, 2021

Here's the result of the perf run with #[inline(never)] applied to HashMap::insert again. This shows the instruction count improvement I was expecting, due to the reduced number of lookups. So the earlier regression on instruction count seems to be caused by different inlining behavior.

@Amanieu (Member) commented Jul 30, 2021

I can see the improvement now. However the inlining is still a problem: we want rustc to have good performance but at the same time we want to limit inlining so that crates still compile reasonably fast.

I'm honestly not sure what the best approach to take here is.

@Zoxc (Author) commented Aug 7, 2021

The find; reserve; find_insert_slot variant could be worth considering. That would probably get rid of the 1% regression on hashmap-instances:debug, but the rustc performance improvement is reduced.

@Marwes (Contributor) commented Aug 30, 2021

An alternative could be to write find_potential in terms of an "iterator" (like ProbeSeq): https://github.com/marwes/hashbrown/tree/opt-insert . This reduces the LLVM line count by ~1% compared to the PR as it currently stands, while also removing the dynamic dispatch.
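For reference, the rough shape of such a probe-sequence iterator (illustrative names and constants, not the code in the linked branch):

```rust
const GROUP_WIDTH: usize = 16; // e.g. SSE2 group width; illustrative constant

// Sketch: the probe sequence is a tiny state machine over group start
// positions. Callers drive it with a plain loop, so nothing about the search
// order itself needs to be generic over the table's element type.
struct ProbeSeq {
    pos: usize,
    stride: usize,
    bucket_mask: usize, // table length minus one (power of two minus one)
}

impl Iterator for ProbeSeq {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        let current = self.pos;
        // Triangular probing: the stride grows by one group width each step,
        // wrapping around the table via the mask.
        self.stride += GROUP_WIDTH;
        self.pos = (self.pos + self.stride) & self.bucket_mask;
        Some(current)
    }
}
```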

@Zoxc (Author) commented Feb 5, 2023

I rebased and reran some rustc benchmarks. The results seem pretty similar, but hashmap-instances:release seems to have improved.

Benchmark                       Before      After
winapi:check                    7.5640s     7.5308s   -0.44%
clap:check                      1.8297s     1.8249s   -0.26%
hyper:check                     0.2645s     0.2645s   +0.03%
syntex_syntax:check             6.3546s     6.3379s   -0.26%
syn:check                       1.6337s     1.6208s   -0.79%
regex:check                     1.0244s     1.0182s   -0.61%

Total                          18.6709s    18.5972s   -0.39%
Summary                         1.0000s     0.9961s   -0.39%

Benchmark                       Before      After
hashmap-instances:check         0.0542s     0.0539s   -0.56%
hashmap-instances:release      21.4382s    19.4949s   -9.06%
hashmap-instances:debug         5.6909s     5.8824s   +3.37%

Total                          27.1833s    25.4313s   -6.45%
Summary                         1.0000s     0.9791s   -2.09%

@bors (Collaborator) commented Feb 23, 2023

☔ The latest upstream changes (presumably #405) made this pull request unmergeable. Please resolve the merge conflicts.

@JustForFun88 (Contributor) commented:
You've run into the same bug as I did. In fact, you cannot check for full buckets and for empty and deleted buckets at the same time. You should first check all full buckets up to the moment you find an empty bucket, and only then look for empty and deleted buckets. This is necessary since there may be collisions in the map.

@JustForFun88 (Contributor) commented:
Here is my last attempt: master...JustForFun88:hashbrown:one_lookup_other. It gives a performance increase, but a very small one, from 0 to 5 percent.

src/raw/mod.rs Outdated
.match_empty_or_deleted()
.lowest_set_bit_nonzero();
}
}
@Amanieu (Member) commented:
I think this check can be moved out of the loop in find_potential_inner so that it is only executed if we find out that the actual element doesn't exist.

@Zoxc (Author) replied:

I've changed it to do this test on loop exit instead.

// We didn't find the element we were looking for in the group, try to get an
// insertion slot from the group if we don't have one yet.
if likely(insert_slot.is_none()) {
    insert_slot = self.find_insert_slot_in_group(&group, &probe_seq);
@Amanieu (Member) commented:
I think you could add a fast path here to immediately continue the loop if the returned insert_slot is None.

@Zoxc (Author) replied:

That would add a conditional to the common case of finding an insertion slot though, and only save group.match_empty() which is cheap already.

@Amanieu (Member) commented Feb 23, 2023

You've run into the same bug as me. In fact, you cannot check for full buckets and empty and deleted buckets at the same time. You should first check all full buckets up to the moment you find an empty bucket, and only then look for empty and deleted buckets. This is necessary, since there may be collisions in the map

Can you expand on this? It's not clear to me why the code is incorrect. It's essentially merging both loops to run them at the same time.

@JustForFun88 (Contributor) commented:
Can you expand on this? It's not clear to me why the code is incorrect. It's essentially merging both loops to run them at the same time.

Oh, sorry. I did not study the code enough; I just looked at why rehash_in_place fails and thought it was the same as my first version of the code. It seems that this version has problems specifically with rehashing, due to the early reservation of additional space.

In principle, master...JustForFun88:hashbrown:one_lookup_other is an equivalent version that works correctly without early reservation of additional space. It gives a speedup of up to 10%.

We can combine both approaches.

Cargo bench

running 41 tests                                 Old insert                       New insert

test grow_insert_ahash_highbits  ... bench:      40,496 ns/iter (+/- 468)         38,005 ns/iter (+/- 7,215) 
test grow_insert_ahash_random    ... bench:      37,206 ns/iter (+/- 3,576)       36,579 ns/iter (+/- 5,114) 
test grow_insert_ahash_serial    ... bench:      37,302 ns/iter (+/- 1,749)       37,444 ns/iter (+/- 4,713) 
test grow_insert_std_highbits    ... bench:      59,918 ns/iter (+/- 8,144)       53,695 ns/iter (+/- 7,514) 
test grow_insert_std_random      ... bench:      53,260 ns/iter (+/- 6,954)       52,570 ns/iter (+/- 330)   
test grow_insert_std_serial      ... bench:      59,310 ns/iter (+/- 7,552)       53,242 ns/iter (+/- 232)   
test insert_ahash_highbits       ... bench:      33,941 ns/iter (+/- 402)         25,310 ns/iter (+/- 435)   
test insert_ahash_random         ... bench:      33,520 ns/iter (+/- 197)         27,669 ns/iter (+/- 6,837) 
test insert_ahash_serial         ... bench:      33,553 ns/iter (+/- 213)         25,194 ns/iter (+/- 162)   
test insert_erase_ahash_highbits ... bench:      31,639 ns/iter (+/- 8,427)       28,809 ns/iter (+/- 726)   
test insert_erase_ahash_random   ... bench:      30,538 ns/iter (+/- 3,425)       27,991 ns/iter (+/- 1,656) 
test insert_erase_ahash_serial   ... bench:      29,167 ns/iter (+/- 8,472)       27,278 ns/iter (+/- 6,051) 
test insert_erase_std_highbits   ... bench:      55,232 ns/iter (+/- 2,127)       54,485 ns/iter (+/- 20,457)
test insert_erase_std_random     ... bench:      55,272 ns/iter (+/- 1,277)       55,869 ns/iter (+/- 1,719) 
test insert_erase_std_serial     ... bench:      54,436 ns/iter (+/- 409)         54,708 ns/iter (+/- 533)   
test insert_std_highbits         ... bench:      49,024 ns/iter (+/- 617)         46,770 ns/iter (+/- 1,417) 
test insert_std_random           ... bench:      48,701 ns/iter (+/- 8,394)       44,607 ns/iter (+/- 443)   
test insert_std_serial           ... bench:      48,655 ns/iter (+/- 8,204)       44,600 ns/iter (+/- 365)   
test rehash_in_place             ... bench:     278,943 ns/iter (+/- 2,397)      277,346 ns/iter (+/- 1,913) 

running 2 tests

test insert                  ... bench:      10,316 ns/iter (+/- 151)             9,713 ns/iter (+/- 127)
test insert_unique_unchecked ... bench:       7,902 ns/iter (+/- 86)              7,896 ns/iter (+/- 106)

@Amanieu (Member) commented Feb 23, 2023

@JustForFun88 Isn't your implementation basically the same as what we already had before: first a lookup to find a matching entry, followed by a search for an empty slot if the lookup failed.

@JustForFun88 (Contributor) commented:
@JustForFun88 Isn't your implementation basically the same as what we already had before: first a lookup to find a matching entry, followed by a search for an empty slot if the lookup failed.

Yes, you are right. It all boils down to this one way or another. I tried to do something similar to this pull request (master...JustForFun88:hashbrown:one_lookup_fourth) but didn't get any performance improvement from it. I think this is due to the fact that an additional branch is added to the loop (if likely(insert_slot.is_none()) in the case of this pull request, or if likely(!found_empty_or_deleted_slot) in the case of my fourth attempt).
The few percent improvement specifically in master...JustForFun88:hashbrown:one_lookup_other I attribute rather to:

  1. A single read (Group::load) from the heap at the initial moment, which in most cases ends in success (that is, either we find an existing element immediately in the first group, or we find an empty or deleted element in the same group (master...JustForFun88:hashbrown:one_lookup_other#diff-655778b213c501f917b62cd79d54725b7fcdc321c37b2f471b5c77ae2d3d818eR1199-R1200):
        let mut group_insert = unsafe { Group::load(self.ctrl(probe_seq.pos)) };
        let mut group_find = group_insert;
  2. The fact that we are skeptical about the existence of an element in the table (master...JustForFun88:hashbrown:one_lookup_other#diff-655778b213c501f917b62cd79d54725b7fcdc321c37b2f471b5c77ae2d3d818eR850-R853):
        if unlikely(found) {
            // found = true
            return (self.bucket(index), found);
        }

But in general, the acceleration is not great, which is why I closed my PR.

@Amanieu (Member) commented Feb 26, 2023

Have you tried benchmarking your optimization on the rustc perf suite? It's much better to benchmark a real program (rustc itself makes heavy use of hash maps), and we've seen good improvements there that don't show up in microbenchmarks.

To do this you will need to build rust from source with hashbrown in the standard library changed to point to your branch.

@bors (Collaborator) commented Mar 29, 2023

☔ The latest upstream changes (presumably #411) made this pull request unmergeable. Please resolve the merge conflicts.

@Amanieu (Member) commented Mar 31, 2023

I just benchmarked this branch against master with rustc-perf and found almost no difference in perf.

Summary

                  Range             Mean     Count
Regressions       0.31%, 0.73%      0.42%    8
Improvements     -0.32%, -0.32%    -0.32%    1
All              -0.32%, 0.73%      0.34%    9

Primary benchmarks

Benchmark        Profile   Scenario                     % Change   Significance Factor
ripgrep-13.0.0   check     incr-unchanged               -0.32%     1.58x

Secondary benchmarks

Benchmark        Profile   Scenario                     % Change   Significance Factor
issue-58319      check     incr-full                     0.73%     3.65x
deep-vector      check     incr-patched: println         0.41%     2.03x
deep-vector      check     incr-full                     0.41%     2.03x
deep-vector      check     incr-patched: add vec item    0.41%     2.03x
ucd              check     incr-full                     0.39%     1.93x
ctfe-stress-5    check     incr-full                     0.39%     1.93x
tuple-stress     check     incr-patched: new row         0.34%     1.69x
tuple-stress     check     incr-full                     0.31%     1.54x

@Amanieu (Member) commented Mar 31, 2023

The benchmarks do look much better though:

 name                         orig ns/iter  new ns/iter  diff ns/iter   diff %  speedup 
 insert_ahash_highbits        26,869        18,092             -8,777  -32.67%   x 1.49 
 insert_ahash_random          26,973        18,349             -8,624  -31.97%   x 1.47 
 insert_ahash_serial          27,071        18,290             -8,781  -32.44%   x 1.48 
 insert_std_highbits          33,940        30,837             -3,103   -9.14%   x 1.10 
 insert_std_random            34,001        30,517             -3,484  -10.25%   x 1.11 
 insert_std_serial            33,894        30,478             -3,416  -10.08%   x 1.11 

@Amanieu (Member) commented Mar 31, 2023

I re-ran the benchmarks, this time accounting for inlining changes by force-inlining everything, and the results are still good with a ~5% speedup. I think this is good to merge!

Any suggestions on what to do with the rehash_in_place benchmark? That seems to benchmark reusing tombstones at exactly max capacity, which will expand with this PR. Should it be changed to use half capacity so it will trigger rehashes?

Just change that benchmark to only insert 223 times instead of 224 times. The issue is that this PR makes insert always grow instead of first checking whether it can reuse a tombstone, but this is fine in practice.
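For reference, a sketch of the benchmark's shape under that change (assumed structure, not the exact bench in the repository); 224 is the max capacity of a 256-slot table at the 7/8 load factor, so inserting 223 elements stays just below it:

```rust
use std::collections::HashSet;

// Assumed shape of the workload: fill the table to just under max capacity,
// then delete most entries so the next round of inserts hits tombstones and
// exercises in-place rehashing rather than table growth.
fn rehash_in_place_workload() {
    let mut set = HashSet::new();
    for _ in 0..10 {
        for i in 0..223 {
            set.insert(i);
        }
        // Leave a handful of live entries plus many tombstones behind.
        set.retain(|&k| k % 100 == 0);
    }
}
```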

@Zoxc (Author) commented Apr 1, 2023

I updated the benchmark.

@Amanieu (Member) commented Apr 1, 2023

@bors r+

@bors (Collaborator) commented Apr 1, 2023

📌 Commit b9d97a8 has been approved by Amanieu

It is now in the queue for this repository.

bors added a commit that referenced this pull request Apr 1, 2023
Optimize insertion to only use a single lookup

This changes the `insert` method to use a single lookup for insertion instead of 2 separate lookups.

This reduces runtime of rustc on check builds by an average of 0.5% on [local benchmarks](https://github.com/Zoxc/rcb/tree/93790089032fc3fb4a4d708fb0adee9551125916/benchs) (1% for `winapi`). The compiler size is reduced by 0.45%.

`unwrap_unchecked` is lifted from `libstd` since it's currently unstable and that probably requires some attribution.
@bors (Collaborator) commented Apr 1, 2023

⌛ Testing commit b9d97a8 with merge 19db643...

@bors (Collaborator) commented Apr 1, 2023

💔 Test failed - checks-actions

@Amanieu (Member) commented Apr 1, 2023

@bors retry

@bors (Collaborator) commented Apr 1, 2023

⌛ Testing commit b9d97a8 with merge 329f86a...

@bors (Collaborator) commented Apr 1, 2023

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing 329f86a to master...

@bors bors merged commit 329f86a into rust-lang:master Apr 1, 2023
24 checks passed