
USIMD: Optimize the performance of np.einsum for all platforms #16641

Closed
wants to merge 256 commits

Conversation

@Qiyu8 (Member) commented Jun 20, 2020

Seven sub-modules of np.einsum are optimized using NEON intrinsics. The optimization resulted in a performance improvement of 10% to 42%.

System info:

  • Hardware: Kunpeng
  • Processor: ARMv8, 2.6 GHz, 8 processors
  • OS: Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64

The optimization effect:

| einsum sub-module triggered | master | einsum-neon | ratio |
|---|---|---|---|
| sum_of_products_stride0_contig_outstride0_two | 110±0.5μs | 100±0.5μs | 0.90 |
| sum_of_products_contig_stride0_outstride0_two | 110±0.5μs | 101±0.5μs | 0.90 |
| sum_of_products_contig_stride0_outcontig_two | 23.7±1ms | 20.2±0.3ms | 0.85 |
| sum_of_products_contig_two | 152±0.8μs | 140±2μs | 0.91 |
| sum_of_products_stride0_contig_outcontig_two | 946±10μs | 841±20μs | 0.89 |
| contig_contig_outstride0_two | 522±5μs | 353±7μs | 0.68 |
| sum_of_products_contig_outstride0_one | 871±10μs | 502±4μs | 0.58 |

@Qiyu8 changed the title from "SIMD: Optimize the performace of np.enisum in ARM-based machine." to "SIMD: Optimize the performace of np.einsum in ARM-based machine." Jun 20, 2020
@charris charris added 03 - Maintenance component: SIMD Issues in SIMD (fast instruction sets) code or machinery component: numpy._core labels Jun 21, 2020
Qiyu8 and others added 3 commits July 3, 2020 11:07
Merge branch 'master' of https://github.com/numpy/numpy into einsum-neon
These never worked to the best of my knowledge, so disable
it explicitly to hopefully simplify cleanup in other parts.
@mattip (Member) commented Jul 11, 2020

We discussed this briefly in the sprint. Correct me if I am wrong: this is a reimplementation of the SSE2 intrinsics in NEON. I wonder if we could already use the universal intrinsics to make this generic?

Floats besides float64 were being coerced to integers
and complex step sizes for the index trick classes
would fail for complex64 input. Fixes numpy#16466
@Qiyu8 (Member, Author) commented Jul 13, 2020

I'd be happy to change it to universal intrinsics. Is there any guidance document or example that I should read first? @mattip @seiko2plus

@Qiyu8 Qiyu8 marked this pull request as draft July 13, 2020 02:02
@seiko2plus (Member) commented:

> I'd be happy to change it to universal intrinsics. Is there any guidance document or example that I should read first?

Not yet, but sure, I can guide you.

Merge branch 'master' of https://github.com/numpy/numpy into einsum-neon
```c
/**begin repeat2
 * #i = 0, 2, 4, 6#
 */
a = vmulq_f64(vld1q_f64(data0+@i@), vld1q_f64(data1+@i@));
```
Member commented:

replace these intrinsics with:
vld1q_f64 -> npyv_load_f64
vmulq_f64 -> npyv_mul_f64
vaddq_f64 -> npyv_add_f64
vst1q_f64 -> npyv_store_f64

For more intrinsics, you have to search or dive into NPYV.

Member commented:

BTW, why didn't you use FMA3 here? Well, currently there's no mul-add intrinsic in NPYV, but I guess it will not be that hard to implement it.

Member Author commented:

I will make another PR for mul-add intrinsic in NPYV.

@Qiyu8 (Member, Author) commented Jul 14, 2020

@seiko2plus Thanks for your advice. I reconstructed einsum using Universal SIMD after reading simd-optimizations.rst; here is the routine:

1. Configuration: add a dispatch file to the corresponding module that needs to be optimized; remember to add it to setup.py for compilation purposes.
2. Rewrite: rewrite the original platform-specific intrinsics using NPY_SIMD and the npyv SIMD API.
3. Test case: add a SIMD dispatch test case to the caller module (in this case multiarraymodule.c), then add it to test_cpu_dispatcher in order to trigger the test case.

I just made the @name@_sum_of_products_contig_two function an example; correct me if I am wrong, then I will rewrite the other functions following this pattern.

PS: USIMD is really powerful, saved so much coding time. 😃
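For the configuration step in the routine above, a dispatch-able source in NumPy declares its build targets in a leading configuration comment that the build system parses, as described in simd-optimizations.rst. A minimal sketch is shown below; the file name and target list here are illustrative, not the ones used in this PR:

```c
/*
 * Hypothetical top of a dispatch-able source, e.g. einsum.dispatch.c.src.
 * `baseline` compiles the minimal feature set, `$maxopt` enables the
 * compiler's highest optimization level; the remaining names are the
 * extra CPU features to build dispatched variants for.
 */
/*@targets
 ** $maxopt baseline
 ** sse2 avx2
 ** neon asimd
 **/
```

At runtime the dispatcher then selects the best variant the host CPU supports, which is what makes the same source serve all platforms.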

@mattip (Member) left a comment:

nice to see use of the universal intrinsics.

.gitignore Outdated
numpy/core/src/umath/_umath_tests.dispatch.h
numpy/core/src/umath/_umath_tests.dispatch.sse41.c
numpy/core/src/*/*.dispatch.*.c
numpy/core/src/*/*.dispatch.h
Member commented:

@seiko2plus shouldn't we be generating files in build/* rather than in the source directories, or is this similar to cython-generated files?

Member commented:

Yes, they should be generated into build/src.*, but maybe @Qiyu8 used --inplace.

Member Author commented:

Yes, I used the -i option during the build.

Member commented:

I guess you used the in-place option -i because you received an error like:

: fatal error: einsum.dispatch.h: No such file or directory

Well, it's my bad; CCompilerOpt should include build/src.*/numpy/core/src/multiarray in the search dirs, even if it wasn't added by core/setup.py.

This issue should be fixed by #16858.

Member commented:

#16858 is merged.

Member commented:

thank you @mattip.

Member Author commented:

I think we should keep this ignore part; I build the source with the -i option, and the *.dispatch.*.c files will be generated.

Member commented:

By default people should not be using -i since it messes up the source directory, as you discovered. Why don't you want to use the build directory? Is there another problem that using -i fixes?

@Qiyu8 (Member, Author) commented Jul 27, 2020

The build directory is good; there is no problem if I don't use the -i option. I will remove these from the ignore list.

```python
self.c = np.arange(24000, dtype=dtype).reshape(20, 30, 40)
self.c1 = np.arange(1200, dtype=dtype).reshape(30, 40)
self.c2 = np.arange(480000, dtype=dtype)
self.c3 = np.arange(600, dtype=dtype)
self.d = np.arange(10000, dtype=dtype).reshape(10, 100, 10)
```
Member commented:

It would be good at some point to rename these with meaningful names. Is there a reason to have the different size variants? Perhaps a very small variant would be good as well.

Member Author commented:

@mattip I found that meaningless naming is pretty common in every bench module, should we open a separate rename PR? I want to focus on the universal intrinsics transform process in this PR.

Member commented:

ok. What about adding a very small variant as well?

Member Author commented:

I suspect that a small variant is not enough to evaluate the degree of optimization, but I will test with small sizes and provide a report anyway.
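For reference, a very small variant along the lines suggested above could be sketched like this (the class name, method names, and sizes are illustrative, not taken from the benchmark suite):

```python
import numpy as np


class EinsumSmall:
    """Hypothetical tiny-input variant of the einsum benchmarks.

    Small arrays mostly measure call overhead rather than the SIMD
    inner loops, which is why a separate small-size report is useful.
    """

    def setup(self, dtype=np.float64):
        self.a = np.arange(12, dtype=dtype).reshape(3, 4)
        self.b = np.arange(4, dtype=dtype)

    def time_einsum_mul_sum(self):
        # Sum of products over the last axis; result has shape (3,).
        np.einsum("ij,j->i", self.a, self.b)
```

A timing framework such as asv would call setup() once per round and then repeatedly invoke the time_* method.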

numpy/core/src/multiarray/multiarraymodule.c (outdated; resolved)
@seiko2plus (Member) commented:

@Qiyu8, you should dispatch the tables that hold the function pointers within PyArray_EinsteinSum(), and also there's no need to define tests. Could you please allow me to help you? I'm just going to initialize the dispatching process for you, and then you will handle the universal intrinsics part.

Qiyu8 and others added 28 commits August 11, 2020 10:59
Co-authored-by: Sayed Adel <seiko@imavr.com>
  - paves the way for integer optimization
  - fix memory overflow/bus errors
  - improve the unrolling
  - activate the unrolling when NPYV isn't available
  - add support for fma3
  - robust/cleanup
  - other minor fixes
Labels
03 - Maintenance component: numpy._core component: SIMD Issues in SIMD (fast instruction sets) code or machinery