
USIMD: Optimize the performance of np.einsum for all platforms #16641

Closed
wants to merge 256 commits

Conversation

@Qiyu8 (Member) commented Jun 20, 2020

Seven sub-modules of np.einsum are optimized using NEON intrinsics. The optimization resulted in a performance improvement of 10% to 42%.

System info:

  • Hardware: Kunpeng
  • Processor: ARMv8, 2.6 GHz, 8 processors
  • OS: Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64

The optimization effect:

| einsum sub-module triggered | master | einsum-neon | ratio |
|---|---|---|---|
| sum_of_products_stride0_contig_outstride0_two | 110±0.5μs | 100±0.5μs | 0.90 |
| sum_of_products_contig_stride0_outstride0_two | 110±0.5μs | 101±0.5μs | 0.90 |
| sum_of_products_contig_stride0_outcontig_two | 23.7±1ms | 20.2±0.3ms | 0.85 |
| sum_of_products_contig_two | 152±0.8μs | 140±2μs | 0.91 |
| sum_of_products_stride0_contig_outcontig_two | 946±10μs | 841±20μs | 0.89 |
| contig_contig_outstride0_two | 522±5μs | 353±7μs | 0.68 |
| sum_of_products_contig_outstride0_one | 871±10μs | 502±4μs | 0.58 |

@Qiyu8 changed the title from "SIMD: Optimize the performace of np.enisum in ARM-based machine." to "SIMD: Optimize the performace of np.einsum in ARM-based machine." Jun 20, 2020
@charris charris added 03 - Maintenance component: SIMD Issues in SIMD (fast instruction sets) code or machinery component: numpy._core labels Jun 21, 2020
Qiyu8 and others added 3 commits July 3, 2020 11:07
Merge branch 'master' of https://github.com/numpy/numpy into einsum-neon
These never worked to the best of my knowledge, so disable
it explicitly to hopefully simplify cleanup in other parts.
@mattip (Member) commented Jul 11, 2020

We discussed this briefly in the sprint. Correct me if I am wrong: this is a reimplementation of the SSE2 intrinsics in NEON. I wonder if we could already use the universal intrinsics to make this generic?

Floats besides float64 were being coerced to integers
and complex step sizes for the index trick classes
would fail for complex64 input. Fixes numpy#16466
@Qiyu8 (Member, Author) commented Jul 13, 2020

I'd be happy to change it to universal intrinsics. Is there any guidance document or example that I should read first? @mattip @seiko2plus

@Qiyu8 Qiyu8 marked this pull request as draft July 13, 2020 02:02
@seiko2plus (Member) commented:

> I'd be happy to change it to universal intrinsics. Is there any guidance document or example that I should read first?

Not yet, but sure, I can guide you.

Merge branch 'master' of https://github.com/numpy/numpy into einsum-neon
```c
/**begin repeat2
 * #i = 0, 2, 4, 6#
 */
a = vmulq_f64(vld1q_f64(data0+@i@), vld1q_f64(data1+@i@));
```
Member commented:

replace these intrinsics with:
vld1q_f64 -> npyv_load_f64
vmulq_f64 -> npyv_mul_f64
vaddq_f64 -> npyv_add_f64
vst1q_f64 -> npyv_store_f64

For more intrinsics, you have to search or dive into NPYV.

Member commented:

BTW, why didn't you use FMA3 here? Well, currently there's no mul-add intrinsic in NPYV, but I guess it will not be that hard to implement it.

Member Author commented:

I will make another PR for mul-add intrinsic in NPYV.

@Qiyu8 (Member, Author) commented Jul 14, 2020

@seiko2plus Thanks for your advice. I reconstructed einsum using Universal SIMD after reading simd-optimizations.rst; here is the routine:

1. Configuration: add a dispatch file to the corresponding module that needs to be optimized; remember to add it to setup.py for compilation purposes.
2. Rewrite: rewrite the original platform-specific intrinsics using NPY_SIMD and the npyv SIMD API.
3. Test case: add a SIMD dispatch test case to the caller module (in this case multiarraymodule.c), then add it to test_cpu_dispatcher in order to trigger the test case.

I just made the @name@_sum_of_products_contig_two function an example; correct me if I am wrong, then I will rewrite the other functions following this pattern.

PS: USIMD is really powerful, saved so much coding time. 😃
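For the configuration step in the routine above, a dispatch-able source in NumPy declares its build targets in a leading configuration comment that the build system parses, as described in simd-optimizations.rst. A minimal sketch is shown below; the file name and target list here are illustrative, not the ones used in this PR:

```c
/*
 * Hypothetical top of a dispatch-able source, e.g. einsum.dispatch.c.src.
 * `baseline` compiles the minimal feature set, `$maxopt` enables the
 * compiler's highest optimization level; the remaining names are the
 * extra CPU features to build dispatched variants for.
 */
/*@targets
 ** $maxopt baseline
 ** sse2 avx2
 ** neon asimd
 **/
```

At runtime the dispatcher then selects the best variant the host CPU supports, which is what makes the same source serve all platforms.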

@mattip (Member) left a comment:

nice to see use of the universal intrinsics.

.gitignore Outdated
numpy/core/src/umath/_umath_tests.dispatch.h
numpy/core/src/umath/_umath_tests.dispatch.sse41.c
numpy/core/src/*/*.dispatch.*.c
numpy/core/src/*/*.dispatch.h
Member commented:

@seiko2plus shouldn't we be generating files in build/* rather than in the source directories, or is this similar to cython-generated files?

Member commented:

Yes, they should be generated into build/src.*, but maybe @Qiyu8 used --inplace.

Member Author commented:

Yes, I used the -i option during the build.

Member commented:

I guess you used the in-place option -i because you received an error like:

: fatal error: einsum.dispatch.h: No such file or directory

Well, it's my bad; CCompilerOpt should include build/src.*/numpy/core/src/multiarray in the search dirs, even if it wasn't added by core/setup.py.

This issue should be fixed by #16858.

Member commented:

#16858 is merged.

Member commented:

thank you @mattip.

Member Author commented:

I think we should keep this ignore part; I build the source with the -i option, and the *.dispatch.*.c files will be generated.

Member commented:

By default people should not be using -i since it messes up the source directory, as you discovered. Why don't you want to use the build directory? Is there another problem that using -i fixes?

@Qiyu8 (Member, Author) commented Jul 27, 2020

The build directory is good; there is no problem if I don't use the -i option. I will remove these from the ignore list.

```python
self.c = np.arange(24000, dtype=dtype).reshape(20, 30, 40)
self.c1 = np.arange(1200, dtype=dtype).reshape(30, 40)
self.c2 = np.arange(480000, dtype=dtype)
self.c3 = np.arange(600, dtype=dtype)
self.d = np.arange(10000, dtype=dtype).reshape(10, 100, 10)
```
Member commented:

It would be good at some point to rename these with meaningful names. Is there a reason to have the different size variants? Perhaps a very small variant would be good as well.

Member Author commented:

@mattip I found that meaningless naming is pretty common in every bench module, should we open a separate rename PR? I want to focus on the universal intrinsics transform process in this PR.

Member commented:

ok. What about adding a very small variant as well?

Member Author commented:

I suspect that a small variant is not enough to evaluate the degree of optimization, but I will test with small sizes and provide a report anyway.
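For reference, a very small variant along the lines suggested above could be sketched like this (the class name, method names, and sizes are illustrative, not taken from the benchmark suite):

```python
import numpy as np


class EinsumSmall:
    """Hypothetical tiny-input variant of the einsum benchmarks.

    Small arrays mostly measure call overhead rather than the SIMD
    inner loops, which is why a separate small-size report is useful.
    """

    def setup(self, dtype=np.float64):
        self.a = np.arange(12, dtype=dtype).reshape(3, 4)
        self.b = np.arange(4, dtype=dtype)

    def time_einsum_mul_sum(self):
        # Sum of products over the last axis; result has shape (3,).
        np.einsum("ij,j->i", self.a, self.b)
```

A timing framework such as asv would call setup() once per round and then repeatedly invoke the time_* method.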

numpy/core/src/multiarray/multiarraymodule.c (outdated; resolved)
@seiko2plus (Member) commented:

@Qiyu8, you should dispatch the tables that hold the function pointers within PyArray_EinsteinSum(), and also there's no need to define tests. Could you please allow me to help you? I'm just going to initialize the dispatching process for you, and then you will handle the universal intrinsics part.

Qiyu8 and others added 28 commits August 11, 2020 10:59
Co-authored-by: Sayed Adel <seiko@imavr.com>
  - paves the way for integer optimization
  - fix memory overflow/bus errors
  - improve the unrolling
  - activate the unrolling when NPYV isn't available
  - add support for fma3
  - robust/cleanup
  - other minor fixes
Labels
03 - Maintenance component: numpy._core component: SIMD Issues in SIMD (fast instruction sets) code or machinery