Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement parallel prefix sum / parallel scan #262

Open
mratsim opened this issue Feb 2, 2024 · 1 comment
Open

Implement parallel prefix sum / parallel scan #262

mratsim opened this issue Feb 2, 2024 · 1 comment

Comments

@mratsim
Copy link

mratsim commented Feb 2, 2024

Discovered by Axiom in the process of porting Scroll/Geometry LogUp (Multivariate Lookups) to their own lib and kindly pointed out to us when we're preparing to upstream the work.

Scroll+Geometry PRs:

Axiom / @jonathanpwang comment:

The MV Lookup PR assumes that parallelize chunks the iteration domain into same sized chunks.
This has been removed in #186 to fix load imbalance:

Currently if we take 40 items divided into 12 threads (AMD Ryzen 7800X, Apple M2 Pro or Intel i5-12600) the partitioning will lead to 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 7 = 3*11 + 7 = 40. The “remainder” thread will have 2.33x more work to do.

This can be quite extreme for example if we have 351 items to split on 32 cores, 351/32 = 10.96, rounded to 10 with integer division. 10*31 = 310, the last core needs to process 41 items, 4.1x more than the others.

Instead we need a load-balanced prefix sum algorithm / parallel scan.

This is a common parallelization problem due to being well constrained but difficult enough with several approaches to teach CS student about parallelization algorithms and also cache concerns.

There is an excellent overview here on 3 different algorithms https://github.com/matteomazza91/parallel-prefix-sum/blob/master/report.pdf

  1. Summing i and i+1, iterating i += 2 until N
  2. Summing i and 2i, iterating i +=1 until N/2
  3. Multi-stage processing that is a bit too complex to describe

An another:

And High-Performance Computing courses:

And GPU techniques (which are quite relevant now that CPUs have 100+ cores)


Also, unsure yet of how MV-lookups are implemented but it seems like currently the code builds a big table and do a prefix sum once everything is there. An alternative, streaming, approach could be to use Fenwick Trees so the prefix sum is updated on-the-fly. Note that this only make sense if the data isn't available at once but cumulated over many parts.

@mratsim
Copy link
Author

mratsim commented Feb 2, 2024

Looking into the rayon ecosystem and issues:

There is a solution via: https://crates.io/crates/rayon-scan
though a quick glance at it and I'm concerned at:

So it can be used as a stop-gap (but then again we can re-add the old parallelize as a stopgap as well) but if this is a bottleneck, we likely might still want an optimized implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant