
Rust: Use rayon and better performance #203

Open

wants to merge 9 commits into master

Conversation


@nazar-pc nazar-pc commented Feb 9, 2024

This PR introduces a bunch of changes that target better performance and higher code quality.

Performance

Before:

aggregate_verify/aggregate_verify/8        time: [959.44 µs 973.55 µs 988.39 µs]
aggregate_verify/aggregate_verify/16       time: [1.0332 ms 1.0502 ms 1.0672 ms]
aggregate_verify/aggregate_verify/32       time: [1.0678 ms 1.0764 ms 1.0854 ms]
aggregate_verify/aggregate_verify/64       time: [1.6251 ms 1.6368 ms 1.6503 ms]
aggregate_verify/aggregate_verify/128      time: [3.1520 ms 3.1741 ms 3.1962 ms]
verify_multi_aggregate/verify_multi_aggregate/8    time: [1.4169 ms 1.4476 ms 1.4795 ms]
verify_multi_aggregate/verify_multi_aggregate/16   time: [1.4360 ms 1.4616 ms 1.4874 ms]
verify_multi_aggregate/verify_multi_aggregate/32   time: [1.4743 ms 1.4874 ms 1.5005 ms]
verify_multi_aggregate/verify_multi_aggregate/64   time: [2.4119 ms 2.4302 ms 2.4486 ms]
verify_multi_aggregate/verify_multi_aggregate/128  time: [4.1066 ms 4.1342 ms 4.1630 ms]

After:

aggregate_verify/aggregate_verify/8        time: [932.61 µs 942.15 µs 952.58 µs]
aggregate_verify/aggregate_verify/16       time: [959.34 µs 968.31 µs 978.62 µs]
aggregate_verify/aggregate_verify/32       time: [1.0789 ms 1.0993 ms 1.1215 ms]
aggregate_verify/aggregate_verify/64       time: [1.6040 ms 1.6177 ms 1.6329 ms]
aggregate_verify/aggregate_verify/128      time: [2.7126 ms 2.7322 ms 2.7533 ms]
verify_multi_aggregate/verify_multi_aggregate/8    time: [1.2968 ms 1.3164 ms 1.3372 ms]
verify_multi_aggregate/verify_multi_aggregate/16   time: [1.3326 ms 1.3475 ms 1.3630 ms]
verify_multi_aggregate/verify_multi_aggregate/32   time: [1.4790 ms 1.5010 ms 1.5250 ms]
verify_multi_aggregate/verify_multi_aggregate/64   time: [2.1851 ms 2.2054 ms 2.2271 ms]
verify_multi_aggregate/verify_multi_aggregate/128  time: [3.6696 ms 3.6972 ms 3.7270 ms]

The improvement most likely comes from higher parallelism and from not using channels in those operations.

Rayon

Rayon doesn't always result in better performance than manual thread management (see rayon-rs/rayon#1124), but the changes in this PR do result in higher performance.

There are many great benefits to rayon in the way it is used here:

  • A safe API. The previous thread pool usage was either unsound or sound by accident; comments claimed things like:

    // Bypass 'lifetime limitations by brute force. It works,
    // because we explicitly join the threads...

    This worked by accident, but the comment was a lie: nothing was actually joined in most cases (it worked in practice only because channels sent results back to the parent thread). It was also a fundamentally unsound API (a parent thread panic would leave worker threads accessing an undefined memory location) and generally a major footgun for anyone trying to change the code in any way.

  • Support for customizing the number of threads using environment variables (RAYON_NUM_THREADS).

  • Support for customizing the number of threads, and their pinning to CPU cores, depending on the thread pool from which operations are called; by default the global thread pool is used, and the default behavior is equivalent to the old code (see the sketch after this list).

  • No conflicts or overlap with other threads when called from within parallel code that already uses rayon.

  • No need to special-case platforms like WASM: if a specific WASM environment supports threading, blst will now enjoy it automatically.
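For example, here is a minimal hypothetical sketch (not code from this PR; square_sum stands in for any internally parallel blst operation) of how a caller controls parallelism by installing its own rayon pool:

use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

// Placeholder for a library function that uses rayon internally.
fn square_sum(values: &[u64]) -> u64 {
    values.par_iter().map(|x| x * x).sum()
}

fn main() {
    // By default rayon's global pool is used, sized from RAYON_NUM_THREADS
    // or the number of CPUs; a caller can instead build its own pool...
    let pool = ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .expect("failed to build thread pool");

    let values: Vec<u64> = (0..1_000).collect();

    // ...and everything executed inside `install`, including the parallel
    // work the library spawns internally, runs on that pool.
    let sum = pool.install(|| square_sum(&values));
    println!("{sum}");
}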

Various cleanups

I replaced all but one channel usage with cleaner code using Mutex, removed Arc where it was unnecessary, and cleaned up other bookkeeping after understanding what the code does and rewriting its intention in more idiomatic Rust (see fn from, for example). The only place where a channel remains is fn mult, but that code is quite involved and I didn't have the will to decipher what it is trying to do just to get rid of the channel; hopefully the examples in this PR will prompt someone to clean it up at some later point.
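As a rough sketch of the pattern (simplified and hypothetical, not the actual lib.rs code; process stands in for the per-item work), the channel-based round trip to the parent thread becomes a direct fold into shared state:

use rayon::prelude::*;
use std::sync::Mutex;

// Hypothetical stand-in for the per-item work done by each worker.
fn process(x: u64) -> u64 {
    x.wrapping_mul(31)
}

fn combine_all(inputs: &[u64]) -> u64 {
    let acc = Mutex::new(0u64);
    inputs.par_iter().for_each(|&x| {
        let r = process(x);
        // Each worker folds its result into the shared accumulator directly,
        // instead of sending it over a channel to the parent thread.
        let mut guard = acc.lock().unwrap();
        *guard = guard.wrapping_add(r);
    });
    acc.into_inner().unwrap()
}

fn main() {
    let inputs: Vec<u64> = (0..1_000).collect();
    println!("{}", combine_all(&inputs));
}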

Some unsafe blocks with vec.set_len() are moved to after the point where the memory is actually initialized, and some places that were questionable in terms of soundness were refactored so that they are now quite obviously sound. Some remaining places could also be fixed, but they require pointer arithmetic that might need newer versions of Rust, so I left TODOs there.
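A simplified illustration of the reordering (hypothetical; in the real code the memory is initialized by blst FFI calls, and init_buffer here is a stand-in):

// Hypothetical stand-in for an FFI call that fully initializes `n` bytes.
unsafe fn init_buffer(ptr: *mut u8, n: usize) {
    core::ptr::write_bytes(ptr, 0xAB, n);
}

fn make_initialized(n: usize) -> Vec<u8> {
    let mut v: Vec<u8> = Vec::with_capacity(n);
    unsafe {
        // Initialize the spare capacity first...
        init_buffer(v.as_mut_ptr(), n);
        // ...and only then claim the elements as initialized, so that
        // set_len comes after the invariant it asserts actually holds.
        v.set_len(n);
    }
    v
}

fn main() {
    assert!(make_initialized(16).iter().all(|&b| b == 0xAB));
}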

I also took the liberty of refactoring things like &q[0] to q.as_ptr(): while both result in the same pointer, the former is deceptive because it suggests to the reader that only the first element is used, while in practice the whole of q is used by the function being called (documentation is almost non-existent, so this wasn't obvious at first, and I hope the change improves readability).
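In isolation, the point looks like this (a standalone snippet, not code from the bindings):

fn main() {
    let q = [1u64, 2, 3];
    // Both expressions yield the same address, but &q[0] reads as "just the
    // first element" while q.as_ptr() reads as "base pointer of the whole
    // slice", which matches how the called functions actually consume q.
    assert_eq!(&q[0] as *const u64, q.as_ptr());
}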

no-threads

no-threads was a hack and non-idiomatic, because it violated the "additivity" assumption of Rust features. Thankfully it is no longer necessary and can be safely removed, but since this is a breaking change (the only one in this PR, to the best of my knowledge), I bumped the crate version from 0.3.11 to 0.4.0.
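To illustrate the additivity problem with a minimal hypothetical example (not blst's actual code): Cargo unifies features across the whole dependency graph, so if any crate anywhere enables no-threads, every user of the library is silently switched to the serial path:

// Additive features only *add* capability; `no-threads` is subtractive,
// removing threading for every crate in the graph once any crate enables it.

#[cfg(not(feature = "no-threads"))]
fn worker_count() -> usize {
    std::thread::available_parallelism().map_or(1, |n| n.get())
}

#[cfg(feature = "no-threads")]
fn worker_count() -> usize {
    1
}

fn main() {
    println!("running with {} worker(s)", worker_count());
}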

How to review this PR

This PR is split into self-contained, logically structured commits, and it is recommended to review the commits one by one instead of the complete diff; this makes it much easier to understand what was changed and why.

Let me know if there are any questions, happy to explain each line change in detail.

@mratsim
Contributor

mratsim commented Feb 10, 2024

The issue with Rayon, or any dependency, is that it brings a lot of new dependencies that need to be audited or at least vetted. This is very undesirable in a cryptographic product securing billions of assets. It is especially critical when supply chain attacks are increasing in appeal.

Furthermore, a simple threadpool written with cryptographic/security applications in mind and fully controlled by the developers is significantly easier to maintain and update if there is a security issue.

@nazar-pc
Author

> The issue with Rayon, or any dependency, is that it brings a lot of new dependencies that need to be audited or at least vetted. This is very undesirable in a cryptographic product securing billions of assets. It is especially critical when supply chain attacks are increasing in appeal.

rayon is a well-known, established library with a short list of mandatory dependencies, and it is used by rustc itself, so it is not like you're adding a lot of new stuff here.

> Furthermore, a simple threadpool written with cryptographic/security applications in mind and fully controlled by the developers is significantly easier to maintain and update if there is a security issue.

I generally agree with you on this one. However, as I have shown in the description, there were some questionable decisions around the thread pool from a soundness point of view, and performance was lower than it could have been.

You certainly can rewrite the original code and make it faster without rayon, but achieving rayon's features in a safe, misuse-resistant way is non-trivial, and I'd rather use something maintained by Rust experts for this purpose (check the top contributors to the rayon project: they literally work on Rust itself) than see every library invent its own slightly flawed version of the same features that doesn't integrate as well with the rest of the Rust ecosystem.

I would estimate the risk of relying on rayon as very low, but the decision is up to the blst maintainers, of course.

@@ -25,8 +25,7 @@ macro_rules! pippenger_test_mod {
     let mut scalars = Box::new([0u8; nbytes * npoints]);
     ChaCha20Rng::from_seed([0u8; 32]).fill_bytes(scalars.as_mut());

-    let mut points: Vec<$point> = Vec::with_capacity(npoints);
-    unsafe { points.set_len(points.capacity()) };
+    let mut points = vec![<$point>::default(); npoints];
Contributor (@mratsim):

ditto: avoid useless zero init

Author (@nazar-pc):

Left because it is test code

@@ -61,8 +60,7 @@ macro_rules! pippenger_test_mod {
     let mut scalars = Box::new([0u8; nbytes * npoints]);
     ChaCha20Rng::from_seed([0u8; 32]).fill_bytes(scalars.as_mut());

-    let mut points: Vec<$point> = Vec::with_capacity(npoints);
-    unsafe { points.set_len(points.capacity()) };
+    let mut points = vec![<$point>::default(); npoints];
Contributor (@mratsim):

ditto: avoid useless zero init

Author (@nazar-pc):

Left because it is test code

@nazar-pc nazar-pc left a comment

Thanks for the review @mratsim! If the rest of the stuff is fine, I can try to remove the zero initialization and see if it makes a performance difference; if it does, I'll look to refactor things in a way that makes the code easier to audit.

Having documentation at least in the C code would have been great, and ideally in the bindings as well. Right now, without expanding the C macros in my head, I have no idea that blst_p1s_to_affine initializes memory internally and is guaranteed never to read it before that, which to me is horrifying from a security point of view, especially when there is an easy way to avoid it.

@nazar-pc nazar-pc left a comment

I refactored blst_fp12::miller_loop_n into something that is close to the original implementation.

For zero-initialized vectors, I wrote this simple bench:

use criterion::Criterion;

fn bench_alloc(c: &mut Criterion) {
    let mut group = c.benchmark_group("allocation");

    // Allocate the buffer and claim it as initialized without writing to it.
    group.bench_function("alloc/set_size", |b| {
        b.iter(|| {
            let mut points =
                Vec::<u8>::with_capacity(criterion::black_box(1_000_000));

            unsafe { points.set_len(points.capacity()) };

            criterion::black_box(points);
        });
    });

    // Allocate and zero-initialize the whole buffer.
    group.bench_function("alloc/zeroed", |b| {
        b.iter(|| {
            let points = vec![0u8; criterion::black_box(1_000_000)];

            criterion::black_box(points);
        });
    });

    group.finish();
}

The results were like this:

allocation/alloc/set_size  time: [12.233 ns 12.276 ns 12.321 ns]
allocation/alloc/zeroed    time: [11.259 µs 11.263 µs 11.267 µs]

The difference seemed significant, so I decided to partially revert to set_len, but I moved some of those calls to after the function that initializes the memory, to make it clearer to the reader what is actually happening. I only left zero initialization in two places in tests, where performance doesn't matter as much.

I also applied relaxed ordering in the places where it was suggested.

@mratsim, I appreciate the careful review; hopefully you'll have time to check the diff.

@mratsim mratsim left a comment

Sorry, forgot to add my review.

On the technical side, LGTM.

Now on whether to use Rayon or keep BLST's own Threadpool, I'll leave the decision to the team.

@nazar-pc
Author

Just confirmed that the performance improvements here translate to performance improvements downstream in our usage of https://github.com/sifraitech/rust-kzg (which uses blst) in https://github.com/subspace/subspace, both of which also take advantage of Rayon.

@mratsim
Contributor

mratsim commented Feb 22, 2024

> Just confirmed that the performance improvements here translate to performance improvements downstream in our usage of https://github.com/sifraitech/rust-kzg (which uses blst) in https://github.com/subspace/subspace, both of which also take advantage of Rayon.

I don't know whether commitment, proof generation, or verification is the bottleneck for your use case, but @sauliusgrigaitis might be able to point you to the best way to use rust-kzg, i.e. either through native BLST or through their reimplemented+parallelized BLST (i.e. using their own multithreading strategy).

Also, they've been working on a new set of benchmarks. I have my own in mratsim/constantine#304 (comment), but I don't cover parallel BLST.

@nazar-pc
Author

nazar-pc commented Apr 3, 2024

@dot-asm, is there something I can do to interest you in reviewing this?
