
Draft: Add BWoS-queue backend to tokio #5283

Closed
wants to merge 10 commits

Conversation

@jschwe (Contributor) commented Dec 10, 2022

Motivation

This is the PR related to #5240; the motivation is to provide an alternate work-stealing queue backend for the multithreaded runtime. The BWoS queue is based on the BBQ (Block-based Bounded Queue) and is specially designed for the work-stealing scenario. Based on the real-world observation that the "stealing" operation is rare and most operations are local enqueues and dequeues, this queue implementation offers a single Owner which can enqueue and dequeue without any heavy synchronization mechanisms on the fast path (intra-block) and thus achieves very high performance for these operations. Concurrent stealing is possible and does not slow down the Owner too much. The improved performance allows stealing policies that steal single items or in small batches, which improves load balancing. Cache contention is reduced by splitting the metadata into global metadata and block-local metadata.

Remarks about the current status of this PR

  • The microbenchmarks are in the bwosqueue directory and use criterion. Currently, the BWoS
    requires some LTO optimizations for the best performance, but that can be fixed before merging.
  • The Rust version is tested with loom, and the algorithm additionally with GenMC.
  • The queue API was changed to be trait based, and the worker uses dynamic dispatching on trait objects.
  • Evaluating performance changes on downstream projects (e.g. hyper) can be done by patching
    the downstream project to use this branch.

Evaluation scripts (using rust-web-benchmarks)

For easier evaluation of the queue changes in a hyper "hello-world" application scenario, feel free to use the bwos bench branch forked from rust-web-benchmarks.
The fork mainly differs in that applications which do not use the "rt-multithread" runtime were removed, and a metrics feature was added (which uses a forked version of tokio-metrics to expose both the number of stealing operations and total number of stolen tasks).

At the top level it has a bench_with_metrics.sh script, which should be inspected and modified (adjust which cores are bound and how many cores rewrk uses). It benchmarks 6 different Rust web frameworks, which all produce more or less similar results. The script can benchmark different branches of tokio; I created a number of such branches to investigate the influence of stealing strategies. I'll update this post with some insights later.

The github-actions bot added the R-loom (Run loom tests on this PR) label on Dec 10, 2022.
@Noah-Kennedy (Contributor)

This is really cool! I'll take a look in a bit.

@hawkw (Member) commented Dec 12, 2022

I would also really like to take a look at this change some time in the next couple of days!

@Darksonn added the A-tokio (Area: The main tokio crate) and M-runtime (Module: tokio/runtime) labels on Dec 12, 2022.
@Darksonn (Contributor)

Looks cool. I have no bandwidth this week, and next week is Christmas, but I will definitely take a look eventually.

@jschwe (Contributor, Author) commented Dec 20, 2022

The microbenchmarks are in the bwosqueue directory and use criterion. Currently, the BWoS
requires some LTO optimizations for the best performance, but that can be fixed before merging.

I investigated this, and it turns out that the loom wrapper type (for non-loom builds) adds significant overhead. LTO seems to mostly eliminate that overhead, but it would be preferable if non-LTO builds didn't lose more than 50% of queue performance on x86.
Since this PR does not use unsync_load() or any of the other special methods that tokio/loom exposes (at least on the atomic types), I would just add some #[cfg(loom)] statements in the module and use the standard atomic types directly.
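As a sketch, the cfg gating I have in mind looks roughly like this (illustrative Cursor type and method names, not the PR's actual code): under cfg(loom) the model-checked atomics are used, otherwise std's atomics are used directly so the non-loom build pays no wrapper cost.

```rust
// Select the atomic type at compile time: loom's instrumented atomics for
// model checking, std's atomics for normal builds (no wrapper overhead).
#[cfg(loom)]
use loom::sync::atomic::AtomicUsize;
#[cfg(not(loom))]
use std::sync::atomic::AtomicUsize;

use std::sync::atomic::Ordering::{Acquire, Release};

// Hypothetical example type using the selected atomic.
pub struct Cursor {
    idx: AtomicUsize,
}

impl Cursor {
    pub fn new() -> Self {
        Cursor { idx: AtomicUsize::new(0) }
    }

    // Advance the cursor by n, returning the previous value.
    pub fn advance(&self, n: usize) -> usize {
        self.idx.fetch_add(n, Release)
    }

    pub fn load(&self) -> usize {
        self.idx.load(Acquire)
    }
}
```

Since the code only calls methods that std and loom atomics share, the same source compiles under both cfgs.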

Benchmark results on commit 87cee67

Current version (without lto)

Simple Enqueue Dequeue/BWoS 1024 Elems per Block/8192 Total size                                                                            
                        time:   [72.400 µs 72.844 µs 73.512 µs]
                        thrpt:  [222.88 Melem/s 224.92 Melem/s 226.30 Melem/s]
                 change:
                        time:   [-0.2829% +0.0382% +0.4435%] (p = 0.87 > 0.05)
                        thrpt:  [-0.4416% -0.0382% +0.2837%]
                        No change in performance detected.

lto = "fat"

Simple Enqueue Dequeue/BWoS 1024 Elems per Block/8192 Total size                                                                             
                        time:   [28.159 µs 28.192 µs 28.226 µs]
                        thrpt:  [580.46 Melem/s 581.16 Melem/s 581.83 Melem/s]
                 change:
                        time:   [-61.403% -61.225% -61.065%] (p = 0.00 < 0.05)
                        thrpt:  [+156.84% +157.90% +159.09%]
                        Performance has improved.

lto = "thin"

Performance compared to lto = "fat"

Simple Enqueue Dequeue/BWoS 1024 Elems per Block/8192 Total size                                                                             
                        time:   [27.212 µs 27.434 µs 27.792 µs]
                        thrpt:  [589.52 Melem/s 597.21 Melem/s 602.09 Melem/s]
                 change:
                        time:   [-2.8780% -2.0530% -1.2238%] (p = 0.00 < 0.05)
                        thrpt:  [+1.2389% +2.0960% +2.9633%]
                        Performance has improved.

With std AtomicUsize type instead of loom wrapper type (no lto)

Simple Enqueue Dequeue/BWoS 1024 Elems per Block/8192 Total size                                                                             
                        time:   [27.171 µs 27.208 µs 27.250 µs]
                        thrpt:  [601.24 Melem/s 602.18 Melem/s 603.00 Melem/s]
                 change:
                        time:   [-62.575% -62.534% -62.492%] (p = 0.00 < 0.05)
                        thrpt:  [+166.61% +166.91% +167.20%]
                        Performance has improved.

With std AtomicUsize type instead of loom wrapper type (lto = "thin")

Simple Enqueue Dequeue/BWoS 1024 Elems per Block/8192 Total size                                                                             
                        time:   [26.958 µs 26.982 µs 27.015 µs]
                        thrpt:  [606.48 Melem/s 607.23 Melem/s 607.77 Melem/s]
                 change:
                        time:   [-0.7786% -0.6626% -0.5559%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5590% +0.6670% +0.7848%]
                        Change within noise threshold.

@taiki-e (Member) commented Dec 20, 2022

loom wrapper type (for non-loom builds) adds significant overhead

All the functions in that file (atomic_usize.rs) are non-generic and trivial, so try adding #[inline] attributes to them. (I have seen some cases in the past where the compiler could not inline such private functions. tokio-rs/valuable#40 (comment), KokaKiwi/rust-hex#62, KokaKiwi/rust-hex#64, etc.)
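The pattern being suggested, sketched on a simplified version of the mock wrapper (tokio's real atomic_usize.rs mock has more methods and an UnsafeCell-based unsync_load, omitted here):

```rust
use std::sync::atomic;

// Simplified sketch of a non-loom mock wrapper with #[inline] on its
// trivial accessors, so they cannot remain un-inlined private functions
// in non-LTO builds.
pub struct AtomicUsize {
    inner: atomic::AtomicUsize,
}

impl AtomicUsize {
    #[inline]
    pub fn new(v: usize) -> Self {
        AtomicUsize { inner: atomic::AtomicUsize::new(v) }
    }
}

impl std::ops::Deref for AtomicUsize {
    type Target = atomic::AtomicUsize;

    // With #[inline] the compiler can reduce this to a plain field
    // access instead of an opaque function call.
    #[inline]
    fn deref(&self) -> &Self::Target {
        &self.inner
    }
}
```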

@Noah-Kennedy (Contributor)

@jschwe did you find it adding this overhead in vanilla tokio or just in this fork?

@jschwe (Contributor, Author) commented Dec 30, 2022

@jschwe did you find it adding this overhead in vanilla tokio or just in this fork?

I didn't have time to investigate more before going on vacation, but in flamegraphs the deref operation of the tokio mock wrapper was quite visible in the microbenchmarks (bwosqueue/benches/bench.rs) of the queue (without lto). These microbenchmarks are currently standalone and the mock loom implementation was just copy-pasted from tokio (the bwosqueue folder in this draft pull request is basically an independent library added via git subtree), but should be identical (minus any changes in the last month).

All the functions in that file (atomic_usize.rs) are non-generic and trivial, so try adding #[inline] attributes to them.

Adding #[inline] on the mocked loom types did significantly improve the performance (from memory, to about 520 Mops/s on x86), which is much better than before but still short of the 600 Mops/s without the wrapper. I didn't have time to investigate further yet because of the holidays, but I'll have a look next week when I'm back at work.

Edit: I just reran the microbenchmarks and adding #[inline] onto deref seemed to eliminate the overhead (on nightly).

@Noah-Kennedy (Contributor)

I need to take a look through this again soon.

@jschwe (Contributor, Author) commented Feb 24, 2023

I need to take a look through this again soon.

That would be great! If it would help, I'd also be happy to discuss or walk through the queue in a call.

The diff of this PR is quite big, but the important parts are ~1200 lines, essentially:

  • The library itself (bwosqueue.rs, lib.rs and metadata.rs in bwosqueue/src/)

Most of the rest is tests or microbenchmarks, so reviewing and discussing those can probably be deferred until you have decided whether you are interested in merging.

@jschwe (Contributor, Author) commented May 20, 2023

I've updated this branch with a draft implementation of selecting the queue backend via the Builder at runtime.
Since I didn't want to duplicate all the multi-threaded runtime logic, I opted for putting a boxed trait object into the worker state. I'd guess that the performance overhead of the dynamic dispatch should be negligible compared to the cost of the queue operations. Do you have any objections to this approach, or other suggestions on how to select the queue backend at runtime?

CI failing is expected, as I haven't updated all the tests yet; I first wanted to get some feedback.
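The shape of what I mean, with hypothetical trait and backend names purely to show the dispatch mechanics (the PR's actual trait lives in the runtime and has a richer interface):

```rust
use std::collections::VecDeque;

// A queue-backend trait that the worker state can hold as a boxed trait
// object; each queue operation then costs one vtable indirection.
trait LocalQueue<T> {
    fn push(&mut self, task: T);
    fn pop(&mut self) -> Option<T>;
}

// Two trivial stand-in backends, only to demonstrate the dispatch.
struct Fifo<T>(VecDeque<T>);
struct Lifo<T>(Vec<T>);

impl<T> LocalQueue<T> for Fifo<T> {
    fn push(&mut self, task: T) { self.0.push_back(task) }
    fn pop(&mut self) -> Option<T> { self.0.pop_front() }
}

impl<T> LocalQueue<T> for Lifo<T> {
    fn push(&mut self, task: T) { self.0.push(task) }
    fn pop(&mut self) -> Option<T> { self.0.pop() }
}

enum Backend { Fifo, Lifo }

// Builder-style selection at runtime: the worker stores the returned
// Box<dyn LocalQueue<T>> and never needs to know the concrete type.
fn make_queue<T: 'static>(backend: Backend) -> Box<dyn LocalQueue<T>> {
    match backend {
        Backend::Fifo => Box::new(Fifo(VecDeque::new())),
        Backend::Lifo => Box::new(Lifo(Vec::new())),
    }
}
```

The alternative would be making the worker generic over the queue type, which would monomorphize (and thus duplicate) all the scheduler logic per backend.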

@jschwe force-pushed the bwos_master_lib branch 10 times, most recently from c510c2b to 574a77b on May 28, 2023 at 15:57.
jschwe and others added 4 commits May 29, 2023 09:46
This commit is a work-in-progress snapshot of BWoS for
tokio, with the intention to get early feedback.
Currently, the BWoS queue is just dropped in as a replacement,
with the intention to make benchmarking easier (just patch
downstream crates to use the modified version).

Before merging the queue should be integrated as an alternate
queue instead of replacing the current one.

The design of the BWoS queue was done by Jiawei Wang.

Co-authored-by: Jiawei Wang <jiawei.wang@huawei.com>
Designed-by: Jiawei Wang <jiawei.wang@huawei.com>
Signed-off-by: Jonathan Schwender <jonathan.schwender@huawei.com>
Signed-off-by: Jiawei Wang <jiawei.wang@huawei.com>
Signed-off-by: Ming Fu <ming.fu@huawei.com>
// From the statistics perspective we consider the reserved range to already be
// stolen, since it is not available for the consumer or other stealers anymore.
#[cfg(feature = "stats")]
self.queue.stats.increment_stolen(num_reserved);


Is it necessary to also adjust the value of stolen here? i.e.:
blk.stolen.fetch_add(num_reserved, Release);

@jschwe (Contributor, Author)


No, that would be wrong. At this point we have only reserved the entries, i.e. the consumer can't access them anymore, but the stealer has not yet finished copying the entries over to the stealer's queue. The Drop implementation for StealerBlockIter increments stolen once the (now empty) iterator is dropped.

The statistics feature is just an approximation, providing metrics on how many items are in the queue so that utilization can be tracked over time.
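The commit-on-drop pattern described here, as a standalone sketch (simplified hypothetical types; Vec::drain stands in for the reserved block entries, and a single counter stands in for the block metadata):

```rust
use std::sync::atomic::{AtomicUsize, Ordering::{Acquire, Release}};

// The stealer first reserves a range of entries (making them invisible
// to the consumer), moves them out via this iterator, and only when the
// iterator is dropped is `stolen` advanced, publishing the steal.
struct StealerBlockIter<'a, T> {
    slots: std::vec::Drain<'a, T>, // stand-in for the reserved entries
    stolen: &'a AtomicUsize,       // stand-in for the block's stolen counter
    num_reserved: usize,
}

impl<'a, T> Iterator for StealerBlockIter<'a, T> {
    type Item = T;
    fn next(&mut self) -> Option<T> {
        self.slots.next()
    }
}

impl<'a, T> Drop for StealerBlockIter<'a, T> {
    fn drop(&mut self) {
        // Publish the steal only after the entries were moved out.
        self.stolen.fetch_add(self.num_reserved, Release);
    }
}
```

Because the increment lives in Drop, the slots become reusable only after the stealer has finished moving the tasks, which is exactly why incrementing stolen at reservation time would be wrong.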

@Darksonn (Contributor)

Unfortunately, I don't think we will have the review bandwidth to review a change like this anytime soon.

@Darksonn closed this on Nov 25, 2023.