
Rust: Use rayon and better performance #203

Open

wants to merge 9 commits into master

Conversation


@nazar-pc nazar-pc commented Feb 9, 2024

This PR introduces a bunch of changes that target better performance and higher code quality.

Performance

Before:

aggregate_verify/aggregate_verify/8        time: [959.44 µs 973.55 µs 988.39 µs]
aggregate_verify/aggregate_verify/16       time: [1.0332 ms 1.0502 ms 1.0672 ms]
aggregate_verify/aggregate_verify/32       time: [1.0678 ms 1.0764 ms 1.0854 ms]
aggregate_verify/aggregate_verify/64       time: [1.6251 ms 1.6368 ms 1.6503 ms]
aggregate_verify/aggregate_verify/128      time: [3.1520 ms 3.1741 ms 3.1962 ms]
verify_multi_aggregate/verify_multi_aggregate/8    time: [1.4169 ms 1.4476 ms 1.4795 ms]
verify_multi_aggregate/verify_multi_aggregate/16   time: [1.4360 ms 1.4616 ms 1.4874 ms]
verify_multi_aggregate/verify_multi_aggregate/32   time: [1.4743 ms 1.4874 ms 1.5005 ms]
verify_multi_aggregate/verify_multi_aggregate/64   time: [2.4119 ms 2.4302 ms 2.4486 ms]
verify_multi_aggregate/verify_multi_aggregate/128  time: [4.1066 ms 4.1342 ms 4.1630 ms]

After:

aggregate_verify/aggregate_verify/8        time: [932.61 µs 942.15 µs 952.58 µs]
aggregate_verify/aggregate_verify/16       time: [959.34 µs 968.31 µs 978.62 µs]
aggregate_verify/aggregate_verify/32       time: [1.0789 ms 1.0993 ms 1.1215 ms]
aggregate_verify/aggregate_verify/64       time: [1.6040 ms 1.6177 ms 1.6329 ms]
aggregate_verify/aggregate_verify/128      time: [2.7126 ms 2.7322 ms 2.7533 ms]
verify_multi_aggregate/verify_multi_aggregate/8    time: [1.2968 ms 1.3164 ms 1.3372 ms]
verify_multi_aggregate/verify_multi_aggregate/16   time: [1.3326 ms 1.3475 ms 1.3630 ms]
verify_multi_aggregate/verify_multi_aggregate/32   time: [1.4790 ms 1.5010 ms 1.5250 ms]
verify_multi_aggregate/verify_multi_aggregate/64   time: [2.1851 ms 2.2054 ms 2.2271 ms]
verify_multi_aggregate/verify_multi_aggregate/128  time: [3.6696 ms 3.6972 ms 3.7270 ms]

The improvement most likely comes from higher parallelism and from not using channels in those operations.

Rayon

Rayon doesn't always result in better performance than manual thread management (see rayon-rs/rayon#1124), but the changes in this PR do result in higher performance.

There are many great benefits to rayon in the way it is used here:

  • A safe API. The previous thread pool usage was either unsound or sound by accident; comments claimed things like:

    // Bypass 'lifetime limitations by brute force. It works,
    // because we explicitly join the threads...

    This worked by accident, but the comment was a lie: nothing was actually joined in most cases (it worked in practice only because channels sent results back to the parent thread). It was also a fundamentally unsound API (a parent thread panic would leave worker threads accessing an undefined memory location) and generally a major footgun for anyone trying to change the code in any way.

  • Support for customizing the number of threads using environment variables (RAYON_NUM_THREADS).

  • Support for customizing the number of threads, and their pinning to CPU cores, depending on the thread pool from which operations are called; by default the global thread pool is used, and the default behavior is equivalent to the old code (see the sketch after this list).

  • No conflicts or overlap with other threads when called from within parallel code that already uses rayon.

  • No need to special-case platforms like WASM: if a specific WASM environment supports threading, blst will now enjoy it automatically.
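For example, here is a minimal hypothetical sketch (not code from this PR; square_sum stands in for any internally parallel blst operation) of how a caller controls parallelism by installing its own rayon pool:

use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

// Placeholder for a library function that uses rayon internally.
fn square_sum(values: &[u64]) -> u64 {
    values.par_iter().map(|x| x * x).sum()
}

fn main() {
    // By default rayon's global pool is used, sized from RAYON_NUM_THREADS
    // or the number of CPUs; a caller can instead build its own pool...
    let pool = ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .expect("failed to build thread pool");

    let values: Vec<u64> = (0..1_000).collect();

    // ...and everything executed inside `install`, including the parallel
    // work the library spawns internally, runs on that pool.
    let sum = pool.install(|| square_sum(&values));
    println!("{sum}");
}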

Various cleanups

I replaced all but one channel usage with cleaner code using Mutex, removed Arc where it was unnecessary, and cleaned up other bookkeeping after understanding what the code does and rewriting its intention in more idiomatic Rust (see fn from, for example). The only place where a channel remains is fn mult, but that code is quite involved and I didn't have the will to decipher what it is trying to do just to get rid of the channel; hopefully the examples in this PR will prompt someone to clean it up at some later point.
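As a rough sketch of the pattern (simplified and hypothetical, not the actual lib.rs code; process stands in for the per-item work), the channel-based round trip to the parent thread becomes a direct fold into shared state:

use rayon::prelude::*;
use std::sync::Mutex;

// Hypothetical stand-in for the per-item work done by each worker.
fn process(x: u64) -> u64 {
    x.wrapping_mul(31)
}

fn combine_all(inputs: &[u64]) -> u64 {
    let acc = Mutex::new(0u64);
    inputs.par_iter().for_each(|&x| {
        let r = process(x);
        // Each worker folds its result into the shared accumulator directly,
        // instead of sending it over a channel to the parent thread.
        let mut guard = acc.lock().unwrap();
        *guard = guard.wrapping_add(r);
    });
    acc.into_inner().unwrap()
}

fn main() {
    let inputs: Vec<u64> = (0..1_000).collect();
    println!("{}", combine_all(&inputs));
}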

Some unsafe blocks with vec.set_len() are moved to after the point where the memory is actually initialized, and some places that were questionable in terms of soundness were refactored so that they are now quite obviously sound. Some remaining places could also be fixed, but they require pointer arithmetic that might need newer versions of Rust, so I left TODOs there.
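A simplified illustration of the reordering (hypothetical; in the real code the memory is initialized by blst FFI calls, and init_buffer here is a stand-in):

// Hypothetical stand-in for an FFI call that fully initializes `n` bytes.
unsafe fn init_buffer(ptr: *mut u8, n: usize) {
    core::ptr::write_bytes(ptr, 0xAB, n);
}

fn make_initialized(n: usize) -> Vec<u8> {
    let mut v: Vec<u8> = Vec::with_capacity(n);
    unsafe {
        // Initialize the spare capacity first...
        init_buffer(v.as_mut_ptr(), n);
        // ...and only then claim the elements as initialized, so that
        // set_len comes after the invariant it asserts actually holds.
        v.set_len(n);
    }
    v
}

fn main() {
    assert!(make_initialized(16).iter().all(|&b| b == 0xAB));
}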

I also took the liberty of refactoring things like &q[0] to q.as_ptr(): while both result in the same pointer, the former is deceptive because it suggests to the reader that only the first element is used, while in practice the whole of q is used by the function being called (documentation is almost non-existent, so this wasn't obvious at first, and I hope the change improves readability).
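In isolation, the point looks like this (a standalone snippet, not code from the bindings):

fn main() {
    let q = [1u64, 2, 3];
    // Both expressions yield the same address, but &q[0] reads as "just the
    // first element" while q.as_ptr() reads as "base pointer of the whole
    // slice", which matches how the called functions actually consume q.
    assert_eq!(&q[0] as *const u64, q.as_ptr());
}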

no-threads

no-threads was a hack and non-idiomatic, because it violated the "additivity" assumption of Rust features. Thankfully it is no longer necessary and can be safely removed, but since this is a breaking change (the only one in this PR, to the best of my knowledge), I bumped the crate version from 0.3.11 to 0.4.0.
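To illustrate the additivity problem with a minimal hypothetical example (not blst's actual code): Cargo unifies features across the whole dependency graph, so if any crate anywhere enables no-threads, every user of the library is silently switched to the serial path:

// Additive features only *add* capability; `no-threads` is subtractive,
// removing threading for every crate in the graph once any crate enables it.

#[cfg(not(feature = "no-threads"))]
fn worker_count() -> usize {
    std::thread::available_parallelism().map_or(1, |n| n.get())
}

#[cfg(feature = "no-threads")]
fn worker_count() -> usize {
    1
}

fn main() {
    println!("running with {} worker(s)", worker_count());
}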

How to review this PR

This PR is split into self-contained, logically structured commits, and it is recommended to review the commits one by one instead of the complete diff; this makes it much easier to understand what was changed and why.

Let me know if there are any questions, happy to explain each line change in detail.

@mratsim
Contributor

mratsim commented Feb 10, 2024

The issue with Rayon, or any dependency, is that it brings a lot of new dependencies that need to be audited or at least vetted. This is very undesirable in a cryptographic product securing billions of assets. It is especially critical when supply chain attacks are increasing in appeal.

Furthermore, a simple threadpool written with cryptographic/security applications in mind and fully controlled by the developers is significantly easier to maintain and update if there is a security issue.

@nazar-pc
Author

> The issue with Rayon, or any dependency, is that it brings a lot of new dependencies that need to be audited or at least vetted. This is very undesirable in a cryptographic product securing billions of assets. It is especially critical when supply chain attacks are increasing in appeal.

rayon is a well-known, established library with a short list of mandatory dependencies, and it is used by rustc itself, so it is not like you're adding a lot of new stuff here.

> Furthermore, a simple threadpool written with cryptographic/security applications in mind and fully controlled by the developers is significantly easier to maintain and update if there is a security issue.

I generally agree with you on this one. However, as I have shown in the description, there were some questionable decisions around the thread pool from a soundness point of view, and performance was lower than it could have been.

You certainly can rewrite the original code and make it faster without rayon, but achieving rayon's features in a safe, misuse-resistant way is non-trivial, and I'd rather use something maintained by Rust experts for this purpose (check the top contributors to the rayon project: they literally work on Rust itself) than see every library invent its own slightly flawed version of the same features that doesn't integrate as well with the rest of the Rust ecosystem.

I would estimate the risk of relying on rayon as very low, but the decision is up to the blst maintainers, of course.

@@ -25,8 +25,7 @@ macro_rules! pippenger_test_mod {
     let mut scalars = Box::new([0u8; nbytes * npoints]);
     ChaCha20Rng::from_seed([0u8; 32]).fill_bytes(scalars.as_mut());

-    let mut points: Vec<$point> = Vec::with_capacity(npoints);
-    unsafe { points.set_len(points.capacity()) };
+    let mut points = vec![<$point>::default(); npoints];
Contributor (@mratsim):

ditto: avoid useless zero init

Author (@nazar-pc):

Left because it is test code

@@ -61,8 +60,7 @@ macro_rules! pippenger_test_mod {
     let mut scalars = Box::new([0u8; nbytes * npoints]);
     ChaCha20Rng::from_seed([0u8; 32]).fill_bytes(scalars.as_mut());

-    let mut points: Vec<$point> = Vec::with_capacity(npoints);
-    unsafe { points.set_len(points.capacity()) };
+    let mut points = vec![<$point>::default(); npoints];
Contributor (@mratsim):

ditto: avoid useless zero init

Author (@nazar-pc):

Left because it is test code

@nazar-pc nazar-pc left a comment

Thanks for the review @mratsim! If the rest of the stuff is fine, I can try to remove the zero initialization and see if it makes a performance difference; if it does, I'll look to refactor things in a way that makes the code easier to audit.

Having documentation at least in the C code would have been great, and ideally in the bindings as well. Right now, without expanding the C macros in my head, I have no idea that blst_p1s_to_affine initializes memory internally and is guaranteed never to read it before that, which to me is horrifying from a security point of view, especially when there is an easy way to avoid it.

@nazar-pc nazar-pc left a comment

I refactored blst_fp12::miller_loop_n into something that is close to the original implementation.

For zero-initialized vectors, I wrote this simple bench:

use criterion::Criterion;

fn bench_alloc(c: &mut Criterion) {
    let mut group = c.benchmark_group("allocation");

    // Allocate the buffer and claim it as initialized without writing to it.
    group.bench_function("alloc/set_size", |b| {
        b.iter(|| {
            let mut points =
                Vec::<u8>::with_capacity(criterion::black_box(1_000_000));

            unsafe { points.set_len(points.capacity()) };

            criterion::black_box(points);
        });
    });

    // Allocate and zero-initialize the whole buffer.
    group.bench_function("alloc/zeroed", |b| {
        b.iter(|| {
            let points = vec![0u8; criterion::black_box(1_000_000)];

            criterion::black_box(points);
        });
    });

    group.finish();
}

The results were like this:

allocation/alloc/set_size  time: [12.233 ns 12.276 ns 12.321 ns]
allocation/alloc/zeroed    time: [11.259 µs 11.263 µs 11.267 µs]

The difference seemed significant, so I decided to partially revert to set_len, but I moved some of those calls to after the function that initializes the memory, to make it clearer to the reader what is actually happening. I only left zero initialization in two places in tests, where performance doesn't matter as much.

I also applied relaxed ordering in the places where it was suggested.

@mratsim, I appreciate the careful review; hopefully you'll have time to check the diff.

@mratsim mratsim left a comment

Sorry, forgot to add my review.

On the technical side, LGTM.

Now on whether to use Rayon or keep BLST's own Threadpool, I'll leave the decision to the team.

@nazar-pc
Author

Just confirmed that the performance improvements here translate to performance improvements downstream in our usage of https://github.com/sifraitech/rust-kzg (which uses blst) in https://github.com/subspace/subspace, both of which also take advantage of Rayon.

@mratsim
Contributor

mratsim commented Feb 22, 2024

> Just confirmed that the performance improvements here translate to performance improvements downstream in our usage of https://github.com/sifraitech/rust-kzg (which uses blst) in https://github.com/subspace/subspace, both of which also take advantage of Rayon.

I don't know whether commitment, proof generation, or verification is the bottleneck for your use case, but @sauliusgrigaitis might be able to point you to the best way to use rust-kzg, i.e. either through native BLST or through their reimplemented+parallelized BLST (i.e. using their own multithreading strategy).

Also, they've been working on a new set of benchmarks. I have my own in mratsim/constantine#304 (comment), but I don't cover parallel BLST.

@nazar-pc
Author

nazar-pc commented Apr 3, 2024

@dot-asm, is there something I can do to interest you in reviewing this?
