New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SIMD: Use universal intrinsics to implement comparison functions #21483
Conversation
@seiko2plus thoughts? This significantly speeds up integer comparisons |
685eeeb
to
8da7100
Compare
Just noticed that I also have to add a test for array OP scalar (and scalar OP array). |
8da7100
to
0bb6b36
Compare
I already have the required tests. The PR is ready to be reviewed again. Thanks :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Necessary improvements to the x86 architecture
0bb6b36
to
1134be2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this test would lend itself very well to parametrizing across the comparison too, examples below.
02e9694
to
09b22a1
Compare
We have some errors in the CI but I am not sure if they are related to the latest changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, we're almost done here just few changes more.
See below the results updated: Benchmark/Performance:SSE
AVX2
AVX512F
AVX512BW
VSX3
Binary size:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well done!, just one thing left.
def time_less_than_binary(self, dtype): | ||
(self.x < self.y) | ||
|
||
def time_less_than_scalar1(self, dtype): | ||
(self.s < self.x) | ||
|
||
def time_less_than_scalar2(self, dtype): | ||
(self.d < 1) | ||
(self.x < self.s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattip, benchmarks here doesn't cover the rest of operations, not sure if its necessary to cover them or its better to save a room for the upcoming benchmarks since its the benchmark process start to became pretty slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for minimal effective benchmarks. I don't think there is a need to repeatedly hit the same code multiple times
@rafaelcfsousa, its better to strip the debugging symbols before reporting the binary size. also I think we need a release note. |
On the one hand, there are hints in 'numpy/doc/release/upcoming_changes'. On the other, I see we are not very diligent about adding release notes to other PRs for SIMD performance. |
This commit also applies some techniques to reduce the size of the binary generated from the source loops_comparison.dispatch.c.src
This commit also rewrite the tests andc, orc and xnor
04f6803
to
dc4a9e3
Compare
PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
The PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
1efb0b8
to
1969089
Compare
The PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
1969089
to
2aabbb8
Compare
The PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
2aabbb8
to
2701a5a
Compare
@seiko2plus :
|
@seiko2plus and @mattip : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!
Thanks @rafaelcfsousa and thanks @seiko2plus for the careful review |
The PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
This PR moves the comparison functions (eq, ne, lt, le, ge and gt) to a new dispatchable source file to make use of the universal intrinsics. This optimization benefits all architectures.
The following universal intrinsics were added in this PR:
Benchmark:
X86
CPU
OS
Benchmark
SSE
AVX2
AVX512F
AVX512BW
Power little-endian (Power9/VSX3)
CPU
OS
Benchmark
VSX3
AArch64
As I do not have access to an ARM processor, I didn't execute the benchmark to check performance. But I was able to test the NEON code I added in this PR on a small ARM processor.
cc: @mattip and @seiko2plus