Parallel Pixel Maps #602

Merged
merged 17 commits into image-rs:master from map_parallel
May 26, 2024
Conversation

@ripytide
Contributor

ripytide commented May 12, 2024

This is just a draft so I can get some feedback on an initial implementation of map_subpixels_parallel() before I add the remaining variations of the function: map_subpixels_parallel_mut(), map_pixels_parallel(), and map_pixels_parallel_mut().

This would address #429

@ripytide
Contributor Author

I've just realized this was attempted in #446 but was closed due to being slower than the original. After reading the PR, it looks like it used par_bridge, as image may not have had raw parallel iterators four years ago, which it now has. I'll run some benchmarks to compare my implementation against the non-parallel version.
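
For context, a minimal sketch (not code from either PR) of why that matters: par_bridge wraps an ordinary sequential iterator and hands items out one at a time behind synchronization, while a raw slice-based parallel iterator splits the work into index ranges up front.

use rayon::prelude::*;

// Sketch: par_bridge pulls items from the sequential iterator one at a time
// behind synchronization, so per-item overhead dominates for cheap work.
fn sum_via_par_bridge(data: &[u8]) -> u64 {
    data.iter().par_bridge().map(|&x| u64::from(x)).sum()
}

// Sketch: a raw slice parallel iterator splits the work by index ranges up
// front, so each rayon task processes a whole contiguous chunk.
fn sum_via_par_iter(data: &[u8]) -> u64 {
    data.par_iter().map(|&x| u64::from(x)).sum()
}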

@ripytide
Contributor Author

ripytide commented May 12, 2024

Here are the results:

running 4 tests
test map::benches::bench_map_subpixels_100           ... bench:       9,856 ns/iter (+/- 1,715)
test map::benches::bench_map_subpixels_1000              ... bench:   1,047,116 ns/iter (+/- 42,256)
test map::benches::bench_map_subpixels_parallel_100   ... bench:      26,220 ns/iter (+/- 14,109)
test map::benches::bench_map_subpixels_parallel_1000   ... bench:     375,886 ns/iter (+/- 170,319)

As can be seen, which function is faster depends on the size of the image: 100x100 images are faster with the non-parallel version, whereas 1000x1000 images are faster with the parallel version.

@Morgane55440
Contributor

My PC is slower than yours (probably all the other stuff I'm running), but I might have something to help with the inefficiency at small sizes.

Lots of time is lost on dispatching, so we need to dispatch less, by parallelizing on lines instead of pixels. Of course, there is no support for this from the image crate, so I had to improvise.

Here are my results.
Your impl on my PC:

test map::benches::bench_map_subpixels_100                                            ... bench:      16,226 ns/iter (+/- 189)
test map::benches::bench_map_subpixels_1000                                           ... bench:   2,021,782 ns/iter (+/- 138,602)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      31,478 ns/iter (+/- 3,454)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     685,178 ns/iter (+/- 105,519)

My impl on my PC:

test map::benches::bench_map_subpixels_100                                            ... bench:      16,320 ns/iter (+/- 214)
test map::benches::bench_map_subpixels_1000                                           ... bench:   2,001,935 ns/iter (+/- 114,785)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      15,008 ns/iter (+/- 2,176)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     634,393 ns/iter (+/- 158,002)

It is still much slower for very small images, so one could just default to sequential for images under 10k pixels (see the sketch after these numbers):

test map::benches::bench_map_subpixels_10                                             ... bench:         200 ns/iter (+/- 12)
test map::benches::bench_map_subpixels_100                                            ... bench:      16,187 ns/iter (+/- 278)
test map::benches::bench_map_subpixels_1000                                           ... bench:   2,009,675 ns/iter (+/- 115,502)
test map::benches::bench_map_subpixels_parallel_10                                    ... bench:       5,757 ns/iter (+/- 1,303)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      15,283 ns/iter (+/- 1,232)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     699,168 ns/iter (+/- 148,572)
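
A minimal sketch of that kind of size cutoff (the constant and helper name are hypothetical, not part of this PR):

// Hypothetical helper: fall back to the sequential version below a fixed
// pixel count, where rayon's dispatch overhead outweighs any parallel speedup.
const PARALLEL_THRESHOLD_PIXELS: u64 = 10_000;

fn should_parallelize(width: u32, height: u32) -> bool {
    u64::from(width) * u64::from(height) >= PARALLEL_THRESHOLD_PIXELS
}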

@Morgane55440
Contributor

Morgane55440 commented May 12, 2024

Here is my take on the function body:

{
    use itertools::Itertools;
    use rayon::iter::IndexedParallelIterator;
    use rayon::iter::ParallelIterator;
    use rayon::slice::{ParallelSlice, ParallelSliceMut};

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    // Parallelize across rows: each chunk holds one row's worth of subpixels.
    image
        .par_chunks(width as usize * <P as Pixel>::CHANNEL_COUNT as usize)
        .zip_eq(
            out.par_chunks_mut(
                width as usize * <ChannelMap<P, S> as Pixel>::CHANNEL_COUNT as usize,
            ),
        )
        .for_each(|(input_line, output_line)| {
            // Within each row, walk the input and output pixels in lockstep.
            input_line
                .chunks_exact(<P as Pixel>::CHANNEL_COUNT as usize)
                .map(|v| <P as Pixel>::from_slice(v))
                .zip_eq(
                    output_line
                        .chunks_exact_mut(<ChannelMap<P, S> as Pixel>::CHANNEL_COUNT as usize)
                        .map(|v| <ChannelMap<P, S> as Pixel>::from_slice_mut(v)),
                )
                .for_each(|(input_pixel, output_channels)| {
                    input_pixel
                        .channels()
                        .iter()
                        .zip(output_channels.channels_mut())
                        .for_each(|(input_subpixel, output_subpixel)| {
                            *output_subpixel = f(*input_subpixel)
                        })
                });
        });

    out
}

This is more efficient the cheaper the mapping function is, which is great when it's just a sum or a multiplication.

@ripytide
Contributor Author

Great work, line-based parallelism does seem like the best option for the CPU cache since all the memory would be right next to each other. I wonder how these scale at 10000x10000.

@Morgane55440
Contributor

I have a better iteration that is both faster for small sizes and more readable:

{
    use rayon::iter::IndexedParallelIterator;
    use rayon::iter::ParallelIterator;
    use rayon::slice::{ParallelSlice, ParallelSliceMut};

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    const SUBPIXEL_PER_CHUNK: usize = 512;
    image
        .par_chunks(SUBPIXEL_PER_CHUNK)
        .zip_eq(out.par_chunks_mut(SUBPIXEL_PER_CHUNK))
        .for_each(|(input_chunk, output_chunk)| {
            input_chunk.iter().zip(output_chunk.iter_mut()).for_each(
                |(input_subpixel, output_subpixel)| *output_subpixel = f(*input_subpixel),
            );
        });

    out
}

@Morgane55440
Contributor

Morgane55440 commented May 12, 2024

SUBPIXEL_PER_CHUNK is arbitrary, but 512 seems pretty good experimentally.

@ripytide
Contributor Author

I've added you to my fork in case you want to update the implementation as we find faster versions.

@Morgane55440
Contributor

Morgane55440 commented May 12, 2024

> Great work, line-based parallelism does seem like the best option for the CPU cache since all the memory would be right next to each other. I wonder how these scale at 10000x10000.

Sadly, for me it doesn't seem to improve much beyond 3x. Might be my PC's fault:

test map::benches::bench_map_subpixels_10                                             ... bench:         204 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100                                            ... bench:      16,439 ns/iter (+/- 195)
test map::benches::bench_map_subpixels_1000                                           ... bench:   2,079,892 ns/iter (+/- 395,519)
test map::benches::bench_map_subpixels_10_000                                         ... bench: 222,443,410 ns/iter (+/- 20,580,424)
test map::benches::bench_map_subpixels_parallel_10                                    ... bench:          66 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      12,425 ns/iter (+/- 3,846)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     700,052 ns/iter (+/- 121,973)
test map::benches::bench_map_subpixels_parallel_10_000                                ... bench:  61,087,340 ns/iter (+/- 5,226,359)

@Morgane55440
Contributor

The values I get from my tests are really weird; I think there might be something odd going on.

It'd be nice if you could test it on your machine.

@Morgane55440
Contributor

Morgane55440 commented May 12, 2024

It says it's 10x faster for a 30x30 image but only about 3x faster for a 1000x1000:

test map::benches::bench_map_subpixels_10                                             ... bench:         192 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_100                                            ... bench:      16,343 ns/iter (+/- 233)
test map::benches::bench_map_subpixels_1000                                           ... bench:   2,019,145 ns/iter (+/- 997,715)
test map::benches::bench_map_subpixels_30                                             ... bench:       1,425 ns/iter (+/- 47)
test map::benches::bench_map_subpixels_60                                             ... bench:       5,580 ns/iter (+/- 86)
test map::benches::bench_map_subpixels_parallel_10                                    ... bench:          65 ns/iter (+/- 0)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      13,283 ns/iter (+/- 1,725)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     728,398 ns/iter (+/- 118,186)
test map::benches::bench_map_subpixels_parallel_30                                    ... bench:         156 ns/iter (+/- 4)
test map::benches::bench_map_subpixels_parallel_60                                    ... bench:       2,558 ns/iter (+/- 1,018)

Which is extra weird, because for a 30x30 image there should only be one thread running, so there shouldn't be any significant improvement.

@Morgane55440
Contributor

Morgane55440 commented May 12, 2024

Small question: why does map_subpixels_parallel only support ImageBuffer while all the other functions support GenericImage?

@ripytide
Contributor Author

ripytide commented May 12, 2024

GenericImage doesn't require the image to be stored contiguously, whereas ImageBuffer does, since it implements Deref<Target = [P::Subpixel]>, which is needed for rayon methods like par_chunks(). Actually, on reading the source, it looks like it only implements that Deref when the Container also implements it, which is the case for Vec. Maybe we could expand it to GenericImage with some bound like Deref<[]>?
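
As a small illustration (a sketch with a concrete buffer type, not code from this PR), that Deref impl is what exposes the flat subpixel slice rayon's slice methods need:

use image::GrayImage;

// Sketch: ImageBuffer<Luma<u8>, Vec<u8>> derefs to [u8], so the whole image
// can be viewed as one contiguous subpixel slice and handed to rayon's slice
// methods (par_chunks and friends).
fn subpixels(image: &GrayImage) -> &[u8] {
    image // deref coercion: &GrayImage -> &[u8]
}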

@Morgane55440
Contributor

Well, I have bad news: all the speed increase is solely due to the fact that we're working with ImageBuffer and contiguous memory, and removing the parallel bits makes it all faster, including 1000x1000.

@ripytide
Contributor Author

ripytide commented May 12, 2024

The benches on my machine (6-core Ryzen 5 4500U):

test map::benches::bench_map_subpixels_10                                             ... bench:         207 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100                                            ... bench:      16,271 ns/iter (+/- 145)
test map::benches::bench_map_subpixels_1000                                           ... bench:   1,725,553 ns/iter (+/- 107,609)
test map::benches::bench_map_subpixels_30                                             ... bench:       1,467 ns/iter (+/- 12)
test map::benches::bench_map_subpixels_60                                             ... bench:       5,660 ns/iter (+/- 26)
test map::benches::bench_map_subpixels_parallel_10                                    ... bench:          81 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:       9,735 ns/iter (+/- 1,870)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     243,560 ns/iter (+/- 43,263)
test map::benches::bench_map_subpixels_parallel_30                                    ... bench:         168 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_60                                    ... bench:       7,178 ns/iter (+/- 1,279)

@Morgane55440
Contributor

What happens if you replace the body with

{
    use itertools::Itertools;

    let (width, height) = image.dimensions();
    let mut out: ImageBuffer<ChannelMap<P, S>, Vec<S>> = ImageBuffer::new(width, height);

    image.iter().zip_eq(out.iter_mut()).for_each(
        |(input_subpixel, output_subpixel)| *output_subpixel = f(*input_subpixel),
    );

    out
}

?

@ripytide
Contributor Author

test map::benches::bench_map_subpixels_10                                             ... bench:         208 ns/iter (+/- 2)
test map::benches::bench_map_subpixels_100                                            ... bench:      16,292 ns/iter (+/- 337)
test map::benches::bench_map_subpixels_1000                                           ... bench:   1,698,793 ns/iter (+/- 202,548)
test map::benches::bench_map_subpixels_30                                             ... bench:       1,464 ns/iter (+/- 19)
test map::benches::bench_map_subpixels_60                                             ... bench:       6,032 ns/iter (+/- 54)
test map::benches::bench_map_subpixels_parallel_10                                    ... bench:          58 ns/iter (+/- 0)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:       1,035 ns/iter (+/- 9)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     384,083 ns/iter (+/- 22,183)
test map::benches::bench_map_subpixels_parallel_30                                    ... bench:         137 ns/iter (+/- 1)
test map::benches::bench_map_subpixels_parallel_60                                    ... bench:         393 ns/iter (+/- 2)

@Morgane55440
Contributor

Well, it seems the parallel version is slightly faster for big images.

@ripytide
Contributor Author

Well, at least we've learnt that there is massive room for optimization via ImageBuffer over GenericImage, hehe.

@Morgane55440
Contributor

Still a bit perplexed at how inefficient the parallelization is; there might be something else to it I'm missing.

@ripytide
Contributor Author

ripytide commented May 12, 2024

From the cargo docs:

The --jobs argument affects the building of the benchmark executable but does not affect how many threads are used when running the benchmarks. The Rust test harness runs benchmarks serially in a single thread.

I think libtest may only be running benchmarks single-threaded? Oh wait, I think that just means they are run sequentially, not that each benchmark is limited to one thread while executing.

@Morgane55440
Contributor

The parallelization is happening for sure.

(image attachment: fishy_stuff)

This is all very fishy.

@Morgane55440
Contributor

Morgane55440 commented May 13, 2024

Here is what happens if I remove the with_min_len (dramatic slowdown for small images) but use a more expensive computation as the input function (|x| f64::from(x).sqrt() as u8 + 1):

test map::benches::bench_map_subpixels_parallel_10                                    ... bench:      15,389 ns/iter (+/- 2,866)
test map::benches::bench_map_subpixels_parallel_100                                   ... bench:      40,917 ns/iter (+/- 5,783)
test map::benches::bench_map_subpixels_parallel_1000                                  ... bench:     911,286 ns/iter (+/- 100,335)
test map::benches::bench_map_subpixels_parallel_10_000                                ... bench:  79,796,480 ns/iter (+/- 8,970,528)
test map::benches::bench_map_subpixels_parallel_30                                    ... bench:      26,007 ns/iter (+/- 7,937)
test map::benches::bench_map_subpixels_parallel_60                                    ... bench:      33,506 ns/iter (+/- 7,006)
test map::benches::bench_map_subpixels_sequential_10                                  ... bench:         401 ns/iter (+/- 6)
test map::benches::bench_map_subpixels_sequential_100                                 ... bench:      37,682 ns/iter (+/- 605)
test map::benches::bench_map_subpixels_sequential_1000                                ... bench:   4,110,240 ns/iter (+/- 190,140)
test map::benches::bench_map_subpixels_sequential_10_000                              ... bench: 427,149,130 ns/iter (+/- 326,256,736)
test map::benches::bench_map_subpixels_sequential_30                                  ... bench:       3,537 ns/iter (+/- 572)
test map::benches::bench_map_subpixels_sequential_60                                  ... bench:      13,524 ns/iter (+/- 326)

The good value of MIN_SUBPIXEL_PER_CHUNK completely depends on the input function; if we had a function that took 1ms to compute, then the parallel version would be much faster even for a 10x10 image.
Therefore, IMO, we should ship the parallel version without using with_min_len, explaining that it is slower than sequential for small images and fast functions, but can be significantly faster for large images (1 million pixels+) and slow functions.
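
For reference, a sketch of where with_min_len sits in the chain, written against plain slices rather than this PR's generic signature (the constant is illustrative only):

use rayon::prelude::*;

// Illustrative only: with_min_len stops rayon from splitting the work into
// pieces smaller than this many subpixels, trading parallelism on small
// inputs for lower dispatch overhead when the mapping function is cheap.
const MIN_SUBPIXELS_PER_JOB: usize = 512;

fn map_subpixels_flat(input: &[u8], output: &mut [u8], f: impl Fn(u8) -> u8 + Sync) {
    input
        .par_iter()
        .zip_eq(output.par_iter_mut())
        .with_min_len(MIN_SUBPIXELS_PER_JOB)
        .for_each(|(i, o)| *o = f(*i));
}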

@Morgane55440
Contributor

I have also come to realize that chunking manually is actually pointless, because that's already what rayon does.

@Morgane55440
Contributor

T_T

I thought I had run clippy on my machine.

@ripytide
Contributor Author

I've de-duplicated the benchmark mapping function into the macro and switched map_subpixels_parallel() to be more similar to the rest of the mapping functions, using GenericImage but with a Deref<[]> bound. I tested the benchmarks before and after the commits and it had no effect, as expected.

@ripytide
Contributor Author

The benchmarks take forever (200s) on my machine, mainly due to the 10_000 functions (especially the sequential one), so we'll probably have to remove those for the final PR.

@Morgane55440
Contributor

makes sense

@theotherphil
Contributor

Thanks for the PR. I'll wait until after the next release to merge this, which will probably be next weekend as I don't have much time on weekdays.

src/map.rs Outdated
S: std::marker::Send,
S: std::marker::Sync,
<P as WithChannel<S>>::Pixel: std::marker::Send,
<P as WithChannel<S>>::Pixel: std::marker::Sync,
Contributor

This where clause is horrific! Is there any way of shortening it?

Contributor

We could add use std::marker::{Send, Sync} at the top of the file, and then

    S: std::marker::Send,
    S: std::marker::Sync,

would become

    S: Send + Sync,

Contributor

The final result would be something like this:

pub fn map_subpixels_parallel<I, P, F, S>(image: &I, f: F) -> Image<ChannelMap<P, S>>
where
    I: GenericImage<Pixel = P> + Deref<Target = [P::Subpixel]>,
    P: WithChannel<S> + Send + Sync,
    S: Primitive + Send + Sync,
    F: Fn(P::Subpixel) -> S + Sync,
    P::Subpixel: Send + Sync,
    <P as WithChannel<S>>::Pixel: Send + Sync,

Contributor

Send and Sync are already in the Rust prelude.

@ripytide
Contributor Author

ripytide commented May 25, 2024

Okay, now that all the blockers have been resolved, I've fleshed out the rest of this PR.
Some other changes I made:

  • Renamed red_channel() to into_red_channel() since that is clearer about the operation being performed.
  • Renamed as_red_channel() to from_red_channel() for the same reason as above; also, as_ functions are supposed to be cheap operations according to the Rust API guidelines.
  • All signatures have changed from map_pixels<I, P>(image: &I) where I: GenericImage<P> to map_pixels<P>(image: &Image<P>), since this provides substantial performance improvements by enforcing contiguous storage (see the sketch after this list). This is more consistent with other imageproc functions, which use Image<P>, and is more in line with the project's principle of performance over maximum genericity.
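
A minimal sketch of the concrete-buffer style the last point refers to (illustrative only; the alias mirrors imageproc's Image<P> definition):

use image::{ImageBuffer, Pixel};

// Mirrors imageproc's alias: a pixel buffer backed by a contiguous Vec.
type Image<P> = ImageBuffer<P, Vec<<P as Pixel>::Subpixel>>;

// Sketch: taking &Image<P> instead of a generic I: GenericImage guarantees
// contiguous storage, which is what allows flat-slice (and rayon) access.
fn map_pixels_sketch<P, Q>(image: &Image<P>, f: impl Fn(P) -> Q) -> Image<Q>
where
    P: Pixel,
    Q: Pixel,
{
    let (width, height) = image.dimensions();
    let mut out = Image::<Q>::new(width, height);
    for (x, y, p) in image.enumerate_pixels() {
        out.put_pixel(x, y, f(*p));
    }
    out
}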

@ripytide
Contributor Author

@theotherphil this is now ready for review.

ripytide marked this pull request as ready for review May 25, 2024 13:24
@theotherphil
Contributor

The new code looks good and the signature change makes sense. I'm not convinced by the name changes, though. You're right that as_ should be cheap, but there's also a standard convention that into_ and from_ consume an owned value, which isn't the case here.

@ripytide
Contributor Author

That's true, but it is the lesser evil in my opinion, as users can see the signatures fairly easily from the docs. Unless you have an alternative suggestion?

@ripytide
Contributor Author

ripytide commented May 26, 2024

One alternative:
grayscale_from_red_channel() and rgb_from_red_grayscale()?

@theotherphil
Contributor

I’ll merge as is and think about what best to name these before the next release.

Thanks for the new functions, and it’s good to see your new doc macros already paying off.

theotherphil merged commit 161be33 into image-rs:master May 26, 2024
15 checks passed
ripytide deleted the map_parallel branch May 26, 2024 13:19