Chunking constraints: a minimum chunk threshold of 32 points makes the batch inversion worthwhile.
Ideally, each chunk is as big as possible within 2¹⁸ = 262144 bytes of memory, so that it fits in most L1 caches.
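The two chunking constraints can be sketched as a small planning helper. This is a minimal illustration, not the code in `ec_shortweierstrass_batch_ops_parallel.nim`: the names `max_chunk_points` and `plan_chunks` are hypothetical, and the point size is passed in as an assumed parameter (for example 96 bytes for an affine point made of two 48-byte coordinates).

```rust
/// Temporary-memory budget per chunk, from the text: 2^18 bytes.
const MAX_TEMP_MEM_BYTES: usize = 1 << 18; // 262144
/// Minimum number of points per chunk for batch inversion to pay off.
const MIN_CHUNK_POINTS: usize = 32;

/// Hypothetical helper: largest chunk (in points) that fits the memory
/// budget, never going below the minimum threshold.
fn max_chunk_points(point_bytes: usize) -> usize {
    assert!(point_bytes > 0, "point size must be non-zero");
    (MAX_TEMP_MEM_BYTES / point_bytes).max(MIN_CHUNK_POINTS)
}

/// Hypothetical helper: split `num_points` into `(start, len)` chunks.
/// The last chunk may be shorter than the minimum threshold; a real
/// implementation would likely handle that tail separately.
fn plan_chunks(num_points: usize, point_bytes: usize) -> Vec<(usize, usize)> {
    let cap = max_chunk_points(point_bytes);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < num_points {
        let len = cap.min(num_points - start);
        chunks.push((start, len));
        start += len;
    }
    chunks
}
```

Under these assumptions, 96-byte points give chunks of at most 2730 points, so 6000 points would be planned as two full chunks plus a tail of 540.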
Currently parallel sum reduction uses 2 strategies depending on the number of points to be summed.
constantine/constantine/math/elliptic/ec_shortweierstrass_batch_ops_parallel.nim
Lines 110 to 123 in 0afccb4
constantine/constantine/math/elliptic/ec_shortweierstrass_batch_ops_parallel.nim
Lines 27 to 67 in 0afccb4
constantine/constantine/math/elliptic/ec_shortweierstrass_batch_ops_parallel.nim
Lines 69 to 108 in 0afccb4
The automated split uses the threadpool implementation of Lazy Binary Splitting
Tzannes, Caragea, Barua, Vishkin, 2010
https://terpconnect.umd.edu/~barua/ppopp164.pdf
This is 2x faster for a medium number of points.
Improving load balancing
Due to the complex chunking, there is a lot of overhead just to iterate by 32 indices and copy into temporary memory.
We may be able to improve load balancing by using a shared atomic to represent the current point index and giving each thread its own accumulator, so that each thread accumulates as many points as possible.
There will be more cache-line contention on this atomic, but LBS has contention as well when the work is mostly copying data (until we reach the accumulator threshold):
constantine/constantine/threadpool/threadpool.nim
Lines 499 to 522 in 0afccb4
That strategy is already used for parallel BLS signatures:
constantine/constantine/signatures/bls_signatures_parallel.nim
Lines 109 to 179 in 0afccb4
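The shared-atomic idea proposed above can be sketched as follows. This is only an illustration under stated assumptions, not the Constantine or BLS-signature implementation: `parallel_sum` and `BATCH` are hypothetical names, wrapping `u64` addition stands in for elliptic-curve point addition, and claiming a small batch of indices per atomic increment (rather than a single index) is an added assumption to limit contention on the counter. The accumulator threshold mentioned above is not modeled.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Indices claimed per atomic increment (assumed value, not from the source).
const BATCH: usize = 32;

/// Hypothetical sketch: sum `points` with `num_threads` workers that pull
/// work from one shared atomic index. `u64` wrapping addition is a stand-in
/// for elliptic-curve point addition.
fn parallel_sum(points: &[u64], num_threads: usize) -> u64 {
    assert!(num_threads > 0, "need at least one worker");
    let next = AtomicUsize::new(0); // index of the next unclaimed point

    let partials: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = (0..num_threads)
            .map(|_| {
                s.spawn(|| {
                    let mut acc = 0u64; // one accumulator per thread
                    loop {
                        // Atomically claim the next batch of indices.
                        let start = next.fetch_add(BATCH, Ordering::Relaxed);
                        if start >= points.len() {
                            break; // nothing left to claim
                        }
                        let end = (start + BATCH).min(points.len());
                        for &p in &points[start..end] {
                            acc = acc.wrapping_add(p);
                        }
                    }
                    acc
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect::<Vec<u64>>()
    });

    // Final sequential reduction of the per-thread accumulators.
    partials.into_iter().fold(0u64, |a, b| a.wrapping_add(b))
}
```

Because each worker keeps claiming batches until the index runs past the end, a fast thread naturally takes more points than a slow one, which is the load-balancing effect the proposal is after. The cost is that every claim touches the same cache line.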