
Draft: Add BWoS-queue backend to tokio #5283

Closed
wants to merge 10 commits

Conversation

@jschwe (Contributor) commented Dec 10, 2022

Motivation

This is the PR related to #5240; the motivation is to provide an alternate work-stealing queue backend for the multithreaded runtime. The BWoS queue is based on the BBQ (Block-based Bounded Queue) and is specially designed for the work-stealing scenario. Based on the real-world observation that the "stealing" operation is rare and most operations are local enqueues and dequeues, this queue implementation offers a single Owner which can enqueue and dequeue without any heavy synchronization mechanisms on the fast path (intra-block) and thus achieves very high performance for these operations. Concurrent stealing is possible and does not slow down the Owner too much. The improved performance allows stealing policies that steal single items or in small batches, which improves load balancing. Cache contention is reduced by splitting the metadata into global metadata and block-local metadata.

Remarks about the current status of this PR

  • The microbenchmarks are in the bwosqueue directory and use criterion. Currently, the BWoS
    requires some LTO optimizations for the best performance, but that can be fixed before merging.
  • The Rust version is tested with loom, and the algorithm additionally with GenMC.
  • The queue API was changed to be trait based, and the worker uses dynamic dispatching on trait objects.
  • Evaluating performance changes on downstream projects (e.g. hyper) can be done by patching
    the downstream project to use this branch.

Evaluation scripts (using rust-web-benchmarks)

For easier evaluation of the queue changes in a hyper "hello-world" application scenario, feel free to use the bwos bench branch forked from rust-web-benchmarks.
The fork mainly differs in that applications which do not use the "rt-multithread" runtime were removed, and a metrics feature was added (which uses a forked version of tokio-metrics to expose both the number of stealing operations and total number of stolen tasks).

At the top level it has a bench_with_metrics.sh script, which should be inspected and modified (adjust which cores are bound and how many cores rewrk uses). It benchmarks 6 different Rust web frameworks, which all produce more or less similar results. The script can benchmark different branches of tokio; I created a number of such branches to investigate the influence of stealing strategies. I'll update this post with some insights later.

The github-actions bot added the R-loom (Run loom tests on this PR) label on Dec 10, 2022.
@Noah-Kennedy (Contributor)

This is really cool! I'll take a look in a bit.

@hawkw (Member) commented Dec 12, 2022

I would also really like to take a look at this change some time in the next couple of days!

@Darksonn added the A-tokio (Area: The main tokio crate) and M-runtime (Module: tokio/runtime) labels on Dec 12, 2022.
@Darksonn (Contributor)

Looks cool. I have no bandwidth this week, and next week is Christmas, but I will definitely take a look eventually.

@jschwe (Contributor, Author) commented Dec 20, 2022

The microbenchmarks are in the bwosqueue directory and use criterion. Currently, the BWoS
requires some LTO optimizations for the best performance, but that can be fixed before merging.

I investigated this, and it turns out that the loom wrapper type (for non-loom builds) adds significant overhead. LTO seems to mostly eliminate that overhead, but it would be preferable if non-LTO builds didn't lose more than 50% of queue performance on x86.
Since this PR does not use unsync_load() or any of the other special methods that tokio/loom exposes (at least on the atomic types), I would just add some #[cfg(loom)] statements in the module and use the standard atomic types directly.
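As a sketch, the cfg gating I have in mind looks roughly like this (illustrative Cursor type and method names, not the PR's actual code): under cfg(loom) the model-checked atomics are used, otherwise std's atomics are used directly so the non-loom build pays no wrapper cost.

```rust
// Select the atomic type at compile time: loom's instrumented atomics for
// model checking, std's atomics for normal builds (no wrapper overhead).
#[cfg(loom)]
use loom::sync::atomic::AtomicUsize;
#[cfg(not(loom))]
use std::sync::atomic::AtomicUsize;

use std::sync::atomic::Ordering::{Acquire, Release};

// Hypothetical example type using the selected atomic.
pub struct Cursor {
    idx: AtomicUsize,
}

impl Cursor {
    pub fn new() -> Self {
        Cursor { idx: AtomicUsize::new(0) }
    }

    // Advance the cursor by n, returning the previous value.
    pub fn advance(&self, n: usize) -> usize {
        self.idx.fetch_add(n, Release)
    }

    pub fn load(&self) -> usize {
        self.idx.load(Acquire)
    }
}
```

Since the code only calls methods that std and loom atomics share, the same source compiles under both cfgs.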

Benchmark results on commit 87cee67

Current version (without lto)

Simple Enqueue Dequeue/BWoS 1024 Elems per Block/8192 Total size                                                                            
                        time:   [72.400 µs 72.844 µs 73.512 µs]
                        thrpt:  [222.88 Melem/s 224.92 Melem/s 226.30 Melem/s]
                 change:
                        time:   [-0.2829% +0.0382% +0.4435%] (p = 0.87 > 0.05)
                        thrpt:  [-0.4416% -0.0382% +0.2837%]
                        No change in performance detected.

lto = "fat"

Simple Enqueue Dequeue/BWoS 1024 Elems per Block/8192 Total size                                                                             
                        time:   [28.159 µs 28.192 µs 28.226 µs]
                        thrpt:  [580.46 Melem/s 581.16 Melem/s 581.83 Melem/s]
                 change:
                        time:   [-61.403% -61.225% -61.065%] (p = 0.00 < 0.05)
                        thrpt:  [+156.84% +157.90% +159.09%]
                        Performance has improved.

lto = "thin"

Performance compared to lto = "fat"

Simple Enqueue Dequeue/BWoS 1024 Elems per Block/8192 Total size                                                                             
                        time:   [27.212 µs 27.434 µs 27.792 µs]
                        thrpt:  [589.52 Melem/s 597.21 Melem/s 602.09 Melem/s]
                 change:
                        time:   [-2.8780% -2.0530% -1.2238%] (p = 0.00 < 0.05)
                        thrpt:  [+1.2389% +2.0960% +2.9633%]
                        Performance has improved.

With std AtomicUsize type instead of loom wrapper type (no lto)

Simple Enqueue Dequeue/BWoS 1024 Elems per Block/8192 Total size                                                                             
                        time:   [27.171 µs 27.208 µs 27.250 µs]
                        thrpt:  [601.24 Melem/s 602.18 Melem/s 603.00 Melem/s]
                 change:
                        time:   [-62.575% -62.534% -62.492%] (p = 0.00 < 0.05)
                        thrpt:  [+166.61% +166.91% +167.20%]
                        Performance has improved.

With std AtomicUsize type instead of loom wrapper type (lto = "thin")

Simple Enqueue Dequeue/BWoS 1024 Elems per Block/8192 Total size                                                                             
                        time:   [26.958 µs 26.982 µs 27.015 µs]
                        thrpt:  [606.48 Melem/s 607.23 Melem/s 607.77 Melem/s]
                 change:
                        time:   [-0.7786% -0.6626% -0.5559%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5590% +0.6670% +0.7848%]
                        Change within noise threshold.

@taiki-e (Member) commented Dec 20, 2022

loom wrapper type (for non-loom builds) adds significant overhead

All the functions in that file (atomic_usize.rs) are non-generic and trivial, so try adding #[inline] attributes to them. (I have seen some cases in the past where the compiler could not inline such private functions. tokio-rs/valuable#40 (comment), KokaKiwi/rust-hex#62, KokaKiwi/rust-hex#64, etc.)
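The pattern being suggested, sketched on a simplified version of the mock wrapper (tokio's real atomic_usize.rs mock has more methods and an UnsafeCell-based unsync_load, omitted here):

```rust
use std::sync::atomic;

// Simplified sketch of a non-loom mock wrapper with #[inline] on its
// trivial accessors, so they cannot remain un-inlined private functions
// in non-LTO builds.
pub struct AtomicUsize {
    inner: atomic::AtomicUsize,
}

impl AtomicUsize {
    #[inline]
    pub fn new(v: usize) -> Self {
        AtomicUsize { inner: atomic::AtomicUsize::new(v) }
    }
}

impl std::ops::Deref for AtomicUsize {
    type Target = atomic::AtomicUsize;

    // With #[inline] the compiler can reduce this to a plain field
    // access instead of an opaque function call.
    #[inline]
    fn deref(&self) -> &Self::Target {
        &self.inner
    }
}
```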

@Noah-Kennedy (Contributor)

@jschwe did you find it adding this overhead in vanilla tokio or just in this fork?

@jschwe (Contributor, Author) commented Dec 30, 2022

@jschwe did you find it adding this overhead in vanilla tokio or just in this fork?

I didn't have time to investigate more before going on vacation, but in flamegraphs the deref operation of the tokio mock wrapper was quite visible in the microbenchmarks (bwosqueue/benches/bench.rs) of the queue (without lto). These microbenchmarks are currently standalone and the mock loom implementation was just copy-pasted from tokio (the bwosqueue folder in this draft pull request is basically an independent library added via git subtree), but should be identical (minus any changes in the last month).

All the functions in that file (atomic_usize.rs) are non-generic and trivial, so try adding #[inline] attributes to them.

Adding #[inline] on the mocked loom types did significantly improve the performance (from memory, to about 520 Mops/s on x86), which is much better than before but still short of the 600 Mops/s without the wrapper. I didn't have time to investigate further yet because of the holidays, but I'll have a look next week when I'm back at work.

Edit: I just reran the microbenchmarks and adding #[inline] onto deref seemed to eliminate the overhead (on nightly).

@Noah-Kennedy (Contributor)

I need to take a look through this again soon.

@jschwe (Contributor, Author) commented Feb 24, 2023

I need to take a look through this again soon.

That would be great! If it would help, I'd also be happy to discuss or walk through the queue in a call.

The diff of this PR is quite big, but the important parts are ~1200 lines, essentially:

  • The library itself (bwosqueue.rs, lib.rs and metadata.rs in bwosqueue/src/)

Most of the rest is tests or microbenchmarks, so reviewing and discussing those can probably be deferred until you have decided whether you are interested in merging.

@jschwe (Contributor, Author) commented May 20, 2023

I've updated this branch with a draft implementation of selecting the queue backend via the Builder at runtime.
Since I didn't want to duplicate all the multi-threaded runtime logic, I opted for putting a boxed trait object into the worker state. I'd guess that the performance overhead of the dynamic dispatch should be negligible compared to the cost of the queue operations. Do you have any objections to this approach, or other suggestions on how to select the queue backend at runtime?

CI failing is expected, as I haven't updated all the tests yet; I first wanted to get some feedback.
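The shape of what I mean, with hypothetical trait and backend names purely to show the dispatch mechanics (the PR's actual trait lives in the runtime and has a richer interface):

```rust
use std::collections::VecDeque;

// A queue-backend trait that the worker state can hold as a boxed trait
// object; each queue operation then costs one vtable indirection.
trait LocalQueue<T> {
    fn push(&mut self, task: T);
    fn pop(&mut self) -> Option<T>;
}

// Two trivial stand-in backends, only to demonstrate the dispatch.
struct Fifo<T>(VecDeque<T>);
struct Lifo<T>(Vec<T>);

impl<T> LocalQueue<T> for Fifo<T> {
    fn push(&mut self, task: T) { self.0.push_back(task) }
    fn pop(&mut self) -> Option<T> { self.0.pop_front() }
}

impl<T> LocalQueue<T> for Lifo<T> {
    fn push(&mut self, task: T) { self.0.push(task) }
    fn pop(&mut self) -> Option<T> { self.0.pop() }
}

enum Backend { Fifo, Lifo }

// Builder-style selection at runtime: the worker stores the returned
// Box<dyn LocalQueue<T>> and never needs to know the concrete type.
fn make_queue<T: 'static>(backend: Backend) -> Box<dyn LocalQueue<T>> {
    match backend {
        Backend::Fifo => Box::new(Fifo(VecDeque::new())),
        Backend::Lifo => Box::new(Lifo(Vec::new())),
    }
}
```

The alternative would be making the worker generic over the queue type, which would monomorphize (and thus duplicate) all the scheduler logic per backend.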

@jschwe force-pushed the bwos_master_lib branch 10 times, most recently from c510c2b to 574a77b on May 28, 2023 at 15:57.
jschwe and others added 4 commits May 29, 2023 09:46
This commit is a work-in-progress snapshot of BWoS for
tokio, with the intention to get early feedback.
Currently, the BWoS queue is just dropped in as a replacement,
with the intention to make benchmarking easier (just patch
downstream crates to use the modified version).

Before merging the queue should be integrated as an alternate
queue instead of replacing the current one.

The design of the BWoS queue was done by Jiawei Wang.

Co-authored-by: Jiawei Wang <jiawei.wang@huawei.com>
Designed-by: Jiawei Wang <jiawei.wang@huawei.com>
Signed-off-by: Jonathan Schwender <jonathan.schwender@huawei.com>
Signed-off-by: Jiawei Wang <jiawei.wang@huawei.com>
Signed-off-by: Ming Fu <ming.fu@huawei.com>
// From the statistics perspective we consider the reserved range to already be
// stolen, since it is not available for the consumer or other stealers anymore.
#[cfg(feature = "stats")]
self.queue.stats.increment_stolen(num_reserved);


Is it necessary to also adjust the value of stolen here? i.e.:
blk.stolen.fetch_add(num_reserved, Release);

@jschwe (Contributor, Author)


No, that would be wrong. At this point we have only reserved the entries, i.e. the consumer can't access them anymore, but the stealer has not yet finished copying the entries over to the stealer's queue. The Drop implementation for StealerBlockIter increments stolen once the (now empty) iterator is dropped.

The statistics feature is just an approximation, providing metrics on how many items are in the queue so that utilization can be tracked over time.
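The commit-on-drop pattern described here, as a standalone sketch (simplified hypothetical types; Vec::drain stands in for the reserved block entries, and a single counter stands in for the block metadata):

```rust
use std::sync::atomic::{AtomicUsize, Ordering::{Acquire, Release}};

// The stealer first reserves a range of entries (making them invisible
// to the consumer), moves them out via this iterator, and only when the
// iterator is dropped is `stolen` advanced, publishing the steal.
struct StealerBlockIter<'a, T> {
    slots: std::vec::Drain<'a, T>, // stand-in for the reserved entries
    stolen: &'a AtomicUsize,       // stand-in for the block's stolen counter
    num_reserved: usize,
}

impl<'a, T> Iterator for StealerBlockIter<'a, T> {
    type Item = T;
    fn next(&mut self) -> Option<T> {
        self.slots.next()
    }
}

impl<'a, T> Drop for StealerBlockIter<'a, T> {
    fn drop(&mut self) {
        // Publish the steal only after the entries were moved out.
        self.stolen.fetch_add(self.num_reserved, Release);
    }
}
```

Because the increment lives in Drop, the slots become reusable only after the stealer has finished moving the tasks, which is exactly why incrementing stolen at reservation time would be wrong.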

@Darksonn (Contributor)

Unfortunately, I don't think we will have the review bandwidth to review a change like this anytime soon.

@Darksonn closed this on Nov 25, 2023.