optimisation of grayscale morphology operators #597

i'd like to make a new PR focused on optimisation for the grayscale morphology operators, now that #581 has been merged, if that's ok.

Comments
Sounds good!

What kind of optimizations?

I think you have a lot of cache misses here because you are accessing image pixels in col-major order.

So you can safely change the order of iteration to row-major.
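For illustration, a minimal sketch of the reordering being suggested, assuming a row-major 8-bit buffer like the one backing `image::GrayImage`; the names `buf`, `width`, `height`, and the thresholding operation are placeholders, not from this thread:

```rust
// Hypothetical example: `buf` is a row-major grayscale buffer of
// `width * height` bytes, as in image::GrayImage.
fn threshold_in_place(buf: &mut [u8], width: usize, height: usize, t: u8) {
    // Cache-friendly order: `x` varies fastest, so the inner loop
    // walks contiguous memory instead of striding by `width`.
    for y in 0..height {
        for x in 0..width {
            let i = y * width + x;
            buf[i] = if buf[i] >= t { 255 } else { 0 };
        }
    }
}
```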
i'm mostly aiming to improve the …
example of the potential use of SIMD instructions (this might not work at all, but i thought it might be worth it for me to test):

- bitwise and -> 1 operation
- reduce_max -> 1 operation? the nightly rust SIMD api and other sources mention reduce_max for packed integers being a single cpu operation.

of course this is greatly idealized, and there are many issues i haven't mentioned. as SIMD is still nightly i can only nudge the compiler toward using these tools, but i know that in the right circumstances it is able to use SIMD instructions, so it might be worth looking into at least.
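As an illustration of those operation counts, here is a sketch using the nightly portable SIMD API mentioned above; the function and its inputs are made up for the example, and it requires a nightly toolchain:

```rust
#![feature(portable_simd)] // nightly-only, as noted in the thread
use std::simd::prelude::*;

// Hypothetical sketch: combine a mask with 16 pixels and take the
// maximum, mirroring the operation counts listed above.
fn masked_max(pixels: [u8; 16], mask: [u8; 16]) -> u8 {
    let p = u8x16::from_array(pixels);
    let m = u8x16::from_array(mask);
    let anded = p & m; // bitwise and -> 1 vector operation
    anded.reduce_max() // horizontal max over all 16 lanes
}
```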
@Morgane55440 is this issue complete now or are you still considering the other optimisation ideas you mentioned above?
i think i still would like to try the bit_mask technique.
SIMD is architecture dependent; currently imageproc is targeting cross-platform CPUs, I think.
We can remain cross-platform using runtime detection of SIMD support. I've no idea how practical this will end up being, but there's potentially a lot of speedup available for some functions.
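A sketch of what such runtime detection could look like on x86_64 (an assumption for illustration; the function names here are made up, and other architectures would need their own detection macros):

```rust
// Scalar fallback: portable everywhere.
fn min_rows_scalar(input: &[u8], output: &mut [u8]) {
    for (i, o) in input.iter().zip(output.iter_mut()) {
        *o = (*o).min(*i);
    }
}

// Same body, but compiled with AVX2 enabled, so the compiler may
// auto-vectorize it with 256-bit instructions.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn min_rows_avx2(input: &[u8], output: &mut [u8]) {
    min_rows_scalar(input, output);
}

fn min_rows(input: &[u8], output: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: guarded by the runtime feature check above.
            return unsafe { min_rows_avx2(input, output) };
        }
    }
    min_rows_scalar(input, output)
}
```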
i think so too, and i'm not actually telling the compiler to use any SIMD at the moment (the standard SIMD module is still unstable), but aside from completely platform-independent things like removing unnecessary instructions, i have to make assumptions to guide optimizations, like how the cache works and what SIMD instructions are available. my understanding is that, ultimately, i can only bench things on my machine; i'd love more targets but i'm not sure how to do that easily. btw i just checked (that was completely unnecessary), and all modules which mention SIMD in core::arch mention a version of …
> the simd module aims to be …

this would be great in the future when it is stable, but for now i'm relying on the compiler. the nightly API provides both versions of min, packed (simd_min) and reducing (reduce_min), but my guess is that on architectures that don't support them it would fall back to non-SIMD operations, which i would expect to be more common for reduce_min than simd_min. of course my knowledge of hardware instructions is very limited, so i could be completely wrong.
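To make that distinction concrete, here is a small sketch (again on the nightly API, so purely illustrative): simd_min is the lane-wise "packed" min of two vectors, while reduce_min folds one vector's lanes into a single scalar:

```rust
#![feature(portable_simd)] // nightly-only
use std::simd::prelude::*;

fn main() {
    let a = u8x16::splat(3);
    let b = u8x16::splat(5);

    // "packed" min: lane-wise over two vectors, typically a single
    // instruction on targets that have it (e.g. pminub on x86).
    let lanewise: u8x16 = a.simd_min(b);

    // "reduce" min: horizontal, folding 16 lanes into one scalar; this
    // may lower to a short shuffle-and-min sequence rather than one op.
    let smallest: u8 = lanewise.reduce_min();
    assert_eq!(smallest, 3);
}
```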
There is no …
if you reread my comments, you might notice that is what i've spent the last 2 comments saying. i'm really sorry if i appear aggressive, that is not my intention, but i spent a long time trying to be as clear as possible in my messages and it feels like you just didn't read them.
I don't believe there is a potential benefit. Also, runtime architecture detection is not cross-platform. In addition, difficulties with tests will begin …
Firstly: sorry for the delay, i'm swamped with deadlines atm and will continue to be for the near future. secondly: having taken a step back, i realize i have been making some passive-aggressive/outright aggressive comments.
i agree with that, i don't think detecting specific hardware would be super useful for a hardware-agnostic library. once simd goes stable (whenever that happens, it could take years for all i know), it might be worth looking into, but for now, we can simply write code that nudges the compiler into using SIMD. loops like

```rust
for (input, output) in u8_slice_1.iter().zip(u8_slice_2.iter_mut()) {
    *output = (*input).min(*output);
}
```

are very likely to be compiled to SIMD whenever the compiler allows it on the given architecture and optimization flags are on. for example, when i put the following code in a compiler explorer with "-O3" optimization (which as i understand it is our bench and release optimization level) (i'm not 100% sure it was targeting my architecture specifically, but it was some intel architecture), i found that it used the …

```rust
pub fn main() {
    let mut n = 20usize;
    std::hint::black_box(&mut n);
    let mut v = vec![0u8; n];
    let mut v2 = vec![0u8; n];
    std::hint::black_box(&mut v);
    std::hint::black_box(&mut v2);
    for (input, output) in v.iter().zip(v2.iter_mut()) {
        *output = (*output).min(*input);
    }
    let x = v.into_iter().min();
    std::hint::black_box(&x);
}
```

of course, to be absolutely clear, these things would impact different architectures differently, and it would be absolutely impossible to predict them all. but modifying code in ways that make it obvious to the compiler that SIMD could be used could potentially lead to significant speed increases. also, given that i'm gonna look into it on my own, it doesn't matter much if i'm wrong.
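One further nudge along these lines (an addition here for illustration, not something proposed in the thread) is to process fixed-size chunks, so the autovectorizer sees a constant trip count; `min_in_place` is a hypothetical name:

```rust
// Hypothetical variant: equal-length slices assumed. The chunked loop
// has a known width of 16, which autovectorizers handle well; the
// remainder is finished with a scalar tail.
fn min_in_place(input: &[u8], output: &mut [u8]) {
    assert_eq!(input.len(), output.len());
    let tail = input.len() - input.len() % 16;
    for (ic, oc) in input[..tail]
        .chunks_exact(16)
        .zip(output[..tail].chunks_exact_mut(16))
    {
        for (i, o) in ic.iter().zip(oc.iter_mut()) {
            *o = (*o).min(*i);
        }
    }
    // scalar tail for the last < 16 elements
    for (i, o) in input[tail..].iter().zip(output[tail..].iter_mut()) {
        *o = (*o).min(*i);
    }
}
```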
@cospectrum there surely are some performance benefits to be had from manual vectorisation, but I agree that handling multiple architectures and differing processor support would likely complicate the code significantly.

@Morgane55440 please don't feel any pressure to respond quickly to discussions on an open source library :-) And for what it's worth, your comments all come across as perfectly polite and reasonable to me.

As @Morgane55440 has landed some significant performance improvements already and it looks like they have no specific future changes in flight, I'll close this issue as complete.