SIMD: Use universal intrinsics to implement comparison functions #21483

Merged: 6 commits, on May 30, 2022


rafaelcfsousa
Contributor

This PR moves the comparison functions (eq, ne, lt, le, ge and gt) to a new dispatchable source file so that they use the universal intrinsics. Because the universal intrinsics are implemented for every supported SIMD target, the optimization benefits all architectures.

The following universal intrinsics were added in this PR:

npyv_b8 npyv_pack_b8_b16(npyv_b16 a, npyv_b16 b);
npyv_b8 npyv_pack_b8_b32(npyv_b32 a, npyv_b32 b, npyv_b32 c, npyv_b32 d);
npyv_b8 npyv_pack_b8_b64(npyv_b64 a, npyv_b64 b, npyv_b64 c, npyv_b64 d,
                         npyv_b64 e, npyv_b64 f, npyv_b64 g, npyv_b64 h);
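
To make the semantics of these intrinsics concrete, the sketch below models them in plain Python rather than C. It assumes that every boolean lane is either all ones or all zeros, and that the packed result holds the lanes of the inputs in argument order, each narrowed to 8 bits. The function name `pack_b8` and the list-based vector representation are hypothetical; the real implementations are per-architecture C intrinsics.

```python
def pack_b8(*vectors):
    """Scalar model: narrow the boolean lanes of several vectors into one b8 vector."""
    if len(vectors) not in (2, 4, 8):
        raise ValueError("expected 2 (b16), 4 (b32) or 8 (b64) input vectors")
    nlanes = len(vectors[0])
    if any(len(v) != nlanes for v in vectors):
        raise ValueError("all input vectors must have the same number of lanes")
    # Concatenate in argument order and keep the low byte of every lane:
    # an all-ones lane becomes 0xFF, an all-zeros lane stays 0x00.
    return [lane & 0xFF for vec in vectors for lane in vec]


# 128-bit example: two b16 vectors of 8 lanes each give one b8 vector of 16 lanes.
a = [0xFFFF, 0, 0xFFFF, 0, 0, 0, 0xFFFF, 0xFFFF]
b = [0, 0, 0, 0, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF]
packed = pack_b8(a, b)
```

Packing in this way lets a comparison on 16-, 32- or 64-bit elements produce one full vector of byte-sized boolean lanes, which is the element width of NumPy's `bool_` output.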

Benchmark:

X86

CPU
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz
Stepping:              7
CPU MHz:               2100.128
CPU max MHz:           2100.0000
CPU min MHz:           800.0000
BogoMIPS:              4200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              11264K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
OS
CentOS Linux 7 (Core)
gcc (Spack GCC) 9.4.0

Benchmark

SSE
export NPY_DISABLE_CPU_FEATURES="SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_SKX"
        before          after         ratio
+     10.1±0.04μs      11.6±0.05μs     1.15  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float32'>)
+     8.43±0.03μs      9.16±0.03μs     1.09  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint16'>)
+      41.3±0.1μs         43.7±1μs     1.06  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-      18.3±0.2μs      16.9±0.09μs     0.92  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
-      18.1±0.1μs      16.1±0.03μs     0.89  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
-      30.2±0.4μs       26.3±0.2μs     0.87  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
-     50.9±0.09μs       44.2±0.5μs     0.87  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
-     8.23±0.08μs      6.97±0.01μs     0.85  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
-      8.88±0.1μs      7.50±0.04μs     0.85  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint16'>)
-      50.8±0.2μs       42.1±0.2μs     0.83  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
-     18.1±0.02μs       14.9±0.3μs     0.82  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
-     8.15±0.09μs      6.47±0.04μs     0.79  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
-     50.5±0.05μs      39.3±0.05μs     0.78  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
-     10.2±0.02μs      7.80±0.03μs     0.77  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
-      50.7±0.1μs       38.6±0.1μs     0.76  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
-      20.3±0.2μs      14.7±0.03μs     0.72  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
-      19.9±0.3μs       12.3±0.1μs     0.62  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
-     19.3±0.04μs      11.7±0.07μs     0.60  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
-     19.0±0.01μs      10.1±0.01μs     0.53  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
-     19.1±0.02μs      9.69±0.04μs     0.51  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
-     77.7±0.05μs      6.92±0.01μs     0.09  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
-     82.7±0.05μs      6.79±0.02μs     0.08  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.bool_'>)
-     77.7±0.04μs      6.38±0.04μs     0.08  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)
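
The third numeric column in these tables (labelled `ratio` in the later tables) is the `after` time divided by the `before` time, so values below 1 are speedups and values above 1 are slowdowns. The short sketch below redoes that arithmetic for two rows of the SSE table; the helper name `ratio` is hypothetical and the timings are copied from the rows above.

```python
def ratio(before_us, after_us):
    """Return after/before, rounded to two decimals as in the tables."""
    return round(after_us / before_us, 2)


# time_less_than_scalar2 for numpy.bool_: 77.7 us before, 6.92 us after
bool_scalar2 = ratio(77.7, 6.92)

# time_less_than_scalar1 for numpy.float32: 10.1 us before, 11.6 us after
float32_scalar1 = ratio(10.1, 11.6)

# The same bool_ row expressed as a speedup factor (before/after).
speedup_bool = round(77.7 / 6.92, 1)
```

Read this way, the `bool_` rows correspond to roughly an order-of-magnitude speedup, while the `float32` row marked with `+` is a regression of about 15%.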

AVX2
export NPY_DISABLE_CPU_FEATURES="AVX512F AVX512_SKX"
        before          after         ratio
+     4.80±0.03μs      5.29±0.02μs     1.10  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint8'>)
-     9.95±0.01μs      9.44±0.01μs     0.95  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float32'>)
-     10.0±0.04μs      9.42±0.01μs     0.94  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float32'>)
-     14.6±0.08μs       13.1±0.2μs     0.90  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
-     7.84±0.09μs      6.98±0.04μs     0.89  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
-     14.8±0.07μs      13.0±0.05μs     0.88  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float32'>)
-     7.31±0.03μs      6.32±0.04μs     0.86  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
-      7.94±0.1μs      6.85±0.06μs     0.86  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint16'>)
-     7.29±0.02μs      6.28±0.08μs     0.86  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
-     18.2±0.08μs      15.1±0.04μs     0.83  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
-     18.6±0.07μs      15.2±0.07μs     0.82  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
-      29.3±0.2μs       23.8±0.3μs     0.81  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
-     11.4±0.04μs      9.26±0.04μs     0.81  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
-      31.0±0.4μs       24.7±0.1μs     0.79  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
-      11.7±0.2μs      9.25±0.06μs     0.79  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
-      34.6±0.8μs       27.1±0.2μs     0.78  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint64'>)
-      12.7±0.2μs      9.83±0.03μs     0.77  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
-      12.7±0.2μs      9.67±0.02μs     0.76  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
-     28.2±0.01μs      17.8±0.03μs     0.63  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
-      28.4±0.1μs      17.7±0.03μs     0.62  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-      28.4±0.1μs      17.7±0.02μs     0.62  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
-      28.3±0.1μs      17.6±0.02μs     0.62  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
-     82.7±0.03μs      6.03±0.01μs     0.07  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.bool_'>)
-     77.7±0.02μs      5.42±0.02μs     0.07  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)
-      77.8±0.2μs      5.37±0.06μs     0.07  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
AVX512F
export NPY_DISABLE_CPU_FEATURES="AVX512_SKX"
        before          after         ratio
+     18.5±0.09μs       19.5±0.2μs     1.06  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
-      7.19±0.1μs       6.83±0.2μs     0.95  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
-      5.02±0.1μs       4.72±0.2μs     0.94  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint8'>)
-      5.04±0.2μs       4.69±0.2μs     0.93  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint8'>)
-     4.99±0.08μs       4.63±0.1μs     0.93  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int8'>)
-     4.69±0.02μs      4.32±0.06μs     0.92  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint8'>)
-      7.39±0.1μs       6.73±0.2μs     0.91  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
-     7.85±0.06μs       7.01±0.3μs     0.89  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
-     8.26±0.02μs       7.26±0.1μs     0.88  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int16'>)
-      11.6±0.2μs       10.1±0.1μs     0.87  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
-      7.92±0.1μs       6.79±0.3μs     0.86  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint16'>)
-     12.1±0.07μs       10.2±0.1μs     0.84  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
-     8.36±0.03μs       7.00±0.2μs     0.84  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint16'>)
-      12.5±0.1μs       10.0±0.2μs     0.80  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
-      13.1±0.3μs       10.1±0.1μs     0.77  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
-      30.4±0.6μs       22.9±0.9μs     0.75  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
-      14.2±0.4μs      10.1±0.04μs     0.71  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
-      14.0±0.2μs      9.94±0.09μs     0.71  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
-      31.0±0.9μs       21.8±0.6μs     0.70  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
-      28.4±0.1μs       19.7±0.3μs     0.70  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
-      28.4±0.1μs       19.7±0.1μs     0.69  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
-      28.4±0.1μs       19.6±0.1μs     0.69  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
-      28.6±0.1μs       19.7±0.2μs     0.69  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-      14.8±0.1μs      10.0±0.06μs     0.68  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float32'>)
-      34.3±0.8μs       21.8±0.4μs     0.63  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint64'>)
-      82.5±0.2μs      6.08±0.09μs     0.07  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.bool_'>)
-     77.7±0.07μs       5.61±0.2μs     0.07  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
-      77.9±0.5μs       5.60±0.2μs     0.07  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)
AVX512BW
unset NPY_DISABLE_CPU_FEATURES
        before          after         ratio
-     4.87±0.04μs      4.54±0.08μs     0.93  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int8'>)
-     4.92±0.02μs      4.56±0.04μs     0.93  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint8'>)
-     4.88±0.01μs      4.52±0.05μs     0.93  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int8'>)
-     4.95±0.04μs      4.54±0.03μs     0.92  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint8'>)
-     4.72±0.01μs      4.21±0.07μs     0.89  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int8'>)
-     4.73±0.02μs      4.21±0.03μs     0.89  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint8'>)
-     7.09±0.07μs      5.82±0.05μs     0.82  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
-     7.04±0.06μs      5.62±0.07μs     0.80  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
-     8.26±0.03μs      6.24±0.04μs     0.76  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int16'>)
-     7.60±0.06μs      5.67±0.04μs     0.75  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint16'>)
-     9.94±0.03μs       7.38±0.4μs     0.74  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float32'>)
-     8.38±0.05μs      6.22±0.04μs     0.74  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint16'>)
-     7.64±0.07μs      5.62±0.05μs     0.74  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
-     10.1±0.05μs       7.32±0.4μs     0.73  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float32'>)
-     13.3±0.04μs       8.81±0.7μs     0.66  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
-     14.9±0.08μs       9.44±0.2μs     0.63  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float32'>)
-      11.8±0.3μs       7.36±0.3μs     0.62  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
-      11.7±0.2μs       7.27±0.4μs     0.62  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
-      18.2±0.3μs       11.3±0.2μs     0.62  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
-      29.9±0.5μs       18.4±0.6μs     0.62  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
-      31.6±0.6μs         19.3±1μs     0.61  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
-     18.5±0.02μs       11.1±0.2μs     0.60  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
-      14.2±0.3μs       8.43±0.8μs     0.59  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
-     12.6±0.08μs       7.41±0.5μs     0.59  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
-      33.2±0.2μs       19.5±0.6μs     0.59  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint64'>)
-     13.0±0.06μs       7.33±0.4μs     0.56  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
-      28.3±0.1μs       11.1±0.3μs     0.39  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
-     28.2±0.02μs       11.1±0.3μs     0.39  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
-      28.2±0.1μs       11.0±0.3μs     0.39  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
-     28.5±0.05μs       11.1±0.3μs     0.39  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-     77.7±0.04μs      4.52±0.07μs     0.06  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)
-     77.6±0.01μs      4.47±0.04μs     0.06  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
-      82.7±0.1μs      4.24±0.04μs     0.05  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.bool_'>)

Power little-endian (Power9/VSX3)

CPU
Machine
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          160
On-line CPU(s) list:             0-159
Thread(s) per core:              4
Core(s) per socket:              20
Socket(s):                       2
NUMA node(s):                    2
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9, altivec supported
Frequency boost:                 enabled
CPU max MHz:                     3800.0000
CPU min MHz:                     2166.0000
L1d cache:                       1.3 MiB
L1i cache:                       1.3 MiB
L2 cache:                        10 MiB
L3 cache:                        200 MiB
NUMA node0 CPU(s):               0-79
NUMA node8 CPU(s):               80-159
OS
gcc (GCC) 11.2.1 20210921
Ubuntu 20.04.3 LTS

Benchmark

VSX3
        before          after         ratio
-      17.2±0.4μs       14.9±0.5μs     0.86  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
-      11.9±0.2μs       9.93±0.3μs     0.83  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
-      18.2±0.5μs       14.6±0.3μs     0.80  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
-      12.6±0.4μs       9.72±0.3μs     0.77  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
-      20.5±0.3μs       15.7±0.4μs     0.76  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
-      18.6±0.2μs       14.0±0.1μs     0.75  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
-     20.7±0.05μs       15.2±0.3μs     0.73  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
-      30.0±0.2μs       21.7±0.2μs     0.72  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
-      30.4±0.4μs       22.0±0.4μs     0.72  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
-      12.8±0.1μs       9.29±0.4μs     0.72  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int16'>)
-      30.2±0.2μs       21.8±0.2μs     0.72  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-      35.0±0.3μs       25.0±0.6μs     0.72  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
-      30.4±0.5μs       21.8±0.2μs     0.72  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
-      10.3±0.3μs       6.96±0.2μs     0.68  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int8'>)
-        15.9±2μs       10.1±0.3μs     0.63  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
-        43.2±8μs      24.6±0.08μs     0.57  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint64'>)
-        28.2±6μs       14.3±0.7μs     0.51  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
-      56.5±0.8μs       24.6±0.3μs     0.44  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
-      57.0±0.3μs       22.1±0.5μs     0.39  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
-      57.3±0.7μs       21.7±0.2μs     0.38  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
-      70.8±0.5μs       15.3±0.2μs     0.22  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float32'>)
-      68.7±0.5μs       14.1±0.4μs     0.21  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float32'>)
-        112±30μs       14.1±0.5μs     0.13  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float32'>)
-        72.5±9μs       8.25±0.2μs     0.11  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.bool_'>)
-      74.2±0.2μs       8.04±0.1μs     0.11  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
-        75.2±1μs      7.78±0.07μs     0.10  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)


AArch64

I do not have access to an ARM machine suitable for benchmarking, so I did not run the benchmarks on AArch64. I was, however, able to test the NEON code added in this PR on a small ARM processor.

cc: @mattip and @seiko2plus

@mattip
Member

mattip commented May 19, 2022

@seiko2plus, thoughts? This significantly speeds up integer comparisons.

rafaelcfsousa force-pushed the simd_comparison branch 4 times, most recently from 685eeeb to 8da7100 (May 20, 2022 14:53)
@rafaelcfsousa
Contributor Author

Just noticed that I also have to add a test for array OP scalar (and scalar OP array).

@rafaelcfsousa
Contributor Author

I already have the required tests. The PR is ready to be reviewed again. Thanks :)
cc: @mattip and @seiko2plus

Member

@seiko2plus seiko2plus left a comment


Necessary improvements to the x86 architecture

Resolved review threads (outdated):
numpy/core/src/common/simd/avx2/conversion.h
numpy/core/src/common/simd/avx512/conversion.h
numpy/core/src/umath/loops_comparison.dispatch.c.src (2 threads)
Contributor

@h-vetinari h-vetinari left a comment


I feel this test would lend itself very well to being parametrized across the comparison operator too; examples are in the inline comments below.

Resolved review threads (outdated):
numpy/core/tests/test_umath.py (3 threads)
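
A rough illustration of the kind of parametrization suggested here, written without NumPy or pytest so that it stands alone: it loops over the six comparisons and checks `array OP scalar` and `scalar OP array` against an element-wise reference. The names `COMPARISONS`, `SWAPPED` and `compare` are hypothetical, and this is not the test that was added to `test_umath.py`; in a real test the loop would typically become a `pytest.mark.parametrize` decorator.

```python
import operator

# The six comparisons moved by the PR, keyed by their short names.
COMPARISONS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}
# Swapping the operands of a comparison mirrors the operator.
SWAPPED = {"eq": "eq", "ne": "ne", "lt": "gt", "le": "ge", "gt": "lt", "ge": "le"}


def compare(name, lhs, rhs):
    """Element-wise reference; either operand may be a scalar."""
    op = COMPARISONS[name]
    if not isinstance(lhs, list):
        return [op(lhs, r) for r in rhs]
    if not isinstance(rhs, list):
        return [op(l, rhs) for l in lhs]
    return [op(l, r) for l, r in zip(lhs, rhs)]


data = [-3, 0, 2, 2, 7]
scalar = 2
checked = []
for name in COMPARISONS:
    # array OP scalar must agree with scalar SWAPPED_OP array
    assert compare(name, data, scalar) == compare(SWAPPED[name], scalar, data)
    # array OP array (scalar repeated) must agree with array OP scalar
    assert compare(name, data, [scalar] * len(data)) == compare(name, data, scalar)
    checked.append(name)
```

Driving every operator through the same body keeps the array/scalar and scalar/array cases covered for all six comparisons without duplicating the test six times.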
@rafaelcfsousa
Contributor Author

We have some errors in the CI but I am not sure if they are related to the latest changes.

Member

@seiko2plus seiko2plus left a comment


Thank you, we're almost done here; just a few more changes.

Resolved review threads (outdated):
numpy/core/src/umath/loops_comparison.dispatch.c.src (2 threads)
numpy/core/src/common/simd/avx512/conversion.h (2 threads)
numpy/core/src/common/simd/avx2/operators.h (2 threads)
numpy/core/tests/test_simd.py
@rafaelcfsousa
Contributor Author

See the updated results below:

Benchmark/Performance:

SSE
export NPY_DISABLE_CPU_FEATURES="AVX F16C FMA3 AVX2 AVX512F AVX512CD AVX512_SKX"

       before           after         ratio
     [ae8b9ce9]       [04f68033]
     <main>           <simd_comparison>
+     8.24±0.07μs      8.92±0.02μs     1.08  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint16'>)
+     5.40±0.04μs      5.76±0.02μs     1.07  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint8'>)
+     5.37±0.02μs      5.70±0.07μs     1.06  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint8'>)
-      17.8±0.1μs       16.0±0.2μs     0.90  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
-     5.84±0.02μs      5.11±0.01μs     0.87  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint8'>)
-      50.8±0.1μs       43.6±0.2μs     0.86  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
-      8.92±0.1μs      7.61±0.03μs     0.85  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
-     8.72±0.09μs      7.37±0.04μs     0.85  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint16'>)
-      50.4±0.1μs       42.5±0.1μs     0.84  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-     7.70±0.07μs       6.48±0.2μs     0.84  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
-        30.0±2μs       25.1±0.6μs     0.84  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
-     17.9±0.03μs       14.8±0.1μs     0.83  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
-      50.4±0.3μs       40.0±0.4μs     0.79  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
-     50.5±0.05μs      39.0±0.04μs     0.77  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
-      50.4±0.1μs       38.1±0.1μs     0.76  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
-     20.1±0.04μs      14.8±0.05μs     0.73  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
-     9.81±0.02μs       6.83±0.2μs     0.70  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
-     20.2±0.09μs       13.0±0.1μs     0.64  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
-      19.0±0.1μs      11.6±0.07μs     0.61  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
-     18.8±0.04μs      11.3±0.02μs     0.60  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
-     18.7±0.03μs      9.46±0.08μs     0.50  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
-     77.5±0.05μs      5.76±0.03μs     0.07  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
-     77.6±0.02μs      5.40±0.04μs     0.07  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)
-     76.5±0.05μs      5.32±0.02μs     0.07  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.bool_'>)
AVX2
export NPY_DISABLE_CPU_FEATURES="AVX512F AVX512CD AVX512_SKX"

-      13.5±0.3μs      12.7±0.03μs     0.94  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
-      7.28±0.1μs       6.65±0.1μs     0.91  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint16'>)
-      6.72±0.1μs      6.11±0.07μs     0.91  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
-      7.38±0.2μs      6.67±0.05μs     0.90  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
-      6.77±0.1μs      6.04±0.06μs     0.89  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
-      14.5±0.2μs      12.8±0.05μs     0.89  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float32'>)
-      11.0±0.2μs      9.08±0.04μs     0.82  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
-      11.1±0.2μs      9.04±0.05μs     0.81  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
-     12.0±0.04μs      9.64±0.04μs     0.81  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
-      12.1±0.2μs      9.67±0.03μs     0.80  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
-      17.8±0.1μs       14.1±0.2μs     0.80  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
-     17.9±0.08μs      13.9±0.08μs     0.78  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
-      28.5±0.5μs       22.2±0.2μs     0.78  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
-      29.3±0.7μs       21.7±0.4μs     0.74  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
-      32.4±0.8μs       23.8±0.7μs     0.74  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint64'>)
-     27.8±0.07μs       16.5±0.2μs     0.60  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-     27.9±0.07μs      16.5±0.04μs     0.59  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
-     27.8±0.07μs      16.0±0.04μs     0.57  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
-     27.8±0.08μs      16.0±0.03μs     0.57  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
-      76.2±0.1μs         5.10±0μs     0.07  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.bool_'>)
-      77.2±0.2μs      4.69±0.04μs     0.06  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)
-      77.5±0.1μs      4.69±0.01μs     0.06  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
AVX512F
export NPY_DISABLE_CPU_FEATURES="AVX512CD AVX512_SKX"

-      6.79±0.1μs       6.37±0.2μs     0.94  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
-      6.84±0.1μs       6.33±0.2μs     0.92  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
-      7.30±0.1μs       6.71±0.3μs     0.92  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint16'>)
-      7.42±0.2μs       6.74±0.3μs     0.91  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
-     4.54±0.01μs      4.08±0.09μs     0.90  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int8'>)
-     4.57±0.03μs      4.05±0.02μs     0.88  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int8'>)
-     4.71±0.03μs      4.16±0.09μs     0.88  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint8'>)
-     4.58±0.04μs      4.00±0.04μs     0.87  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int8'>)
-     4.72±0.03μs      4.08±0.09μs     0.87  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint8'>)
-     4.60±0.01μs       3.91±0.1μs     0.85  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint8'>)
-     11.7±0.04μs      9.72±0.03μs     0.83  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
-     11.6±0.05μs      9.70±0.01μs     0.83  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
-     8.14±0.06μs      6.55±0.04μs     0.80  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint16'>)
-     8.07±0.04μs      6.47±0.06μs     0.80  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int16'>)
-     12.6±0.03μs      9.73±0.02μs     0.77  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
-     12.6±0.03μs       9.72±0.1μs     0.77  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
-      18.0±0.2μs      12.9±0.03μs     0.72  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
-     18.3±0.08μs      12.8±0.04μs     0.70  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
-     13.7±0.03μs      9.00±0.08μs     0.66  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
-      29.0±0.3μs         18.9±1μs     0.65  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
-     14.5±0.05μs      8.90±0.03μs     0.61  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
-      14.6±0.2μs      8.93±0.06μs     0.61  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float32'>)
-      32.1±0.3μs       18.1±0.8μs     0.56  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
-      34.2±0.2μs       18.1±0.9μs     0.53  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint64'>)
-     27.9±0.03μs       12.9±0.3μs     0.46  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
-     27.9±0.07μs      12.8±0.04μs     0.46  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
-      28.3±0.2μs      12.8±0.08μs     0.45  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-     28.2±0.07μs      12.8±0.02μs     0.45  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
-     76.4±0.02μs      4.12±0.06μs     0.05  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.bool_'>)
-     77.4±0.03μs      4.06±0.04μs     0.05  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
-     77.4±0.03μs      4.04±0.05μs     0.05  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)
AVX512BW
unset NPY_DISABLE_CPU_FEATURES

       before           after         ratio
     [ae8b9ce9]       [04f68033]
     <main>           <simd_comparison>
-     4.49±0.01μs      4.27±0.03μs     0.95  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int8'>)
-        4.49±0μs      4.21±0.05μs     0.94  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int8'>)
-     4.52±0.01μs      4.23±0.05μs     0.94  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint8'>)
-     4.56±0.04μs      4.24±0.05μs     0.93  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint8'>)
-     4.56±0.01μs      4.01±0.02μs     0.88  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint8'>)
-     4.54±0.02μs      3.96±0.02μs     0.87  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int8'>)
-      9.41±0.1μs      7.08±0.02μs     0.75  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float32'>)
-      13.0±0.2μs      9.71±0.05μs     0.75  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
-     6.91±0.04μs       5.06±0.2μs     0.73  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int16'>)
-     6.82±0.09μs       4.96±0.1μs     0.73  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
-     9.67±0.02μs      6.98±0.03μs     0.72  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float32'>)
-      28.1±0.1μs       20.3±0.4μs     0.72  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
-      13.5±0.3μs      9.69±0.06μs     0.72  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
-      28.8±0.2μs       20.3±0.8μs     0.71  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
-     14.6±0.09μs      9.71±0.02μs     0.67  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float32'>)
-     7.53±0.04μs       5.00±0.1μs     0.66  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint16'>)
-     8.03±0.02μs       5.26±0.4μs     0.66  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int16'>)
-     7.60±0.04μs       4.97±0.2μs     0.65  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
-      11.0±0.1μs      6.94±0.04μs     0.63  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
-     8.13±0.02μs       5.09±0.5μs     0.63  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint16'>)
-     11.1±0.02μs      6.95±0.02μs     0.62  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
-      17.6±0.2μs      10.8±0.03μs     0.62  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
-     18.2±0.08μs      11.0±0.05μs     0.60  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
-     31.7±0.07μs       19.1±0.4μs     0.60  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint64'>)
-     12.1±0.05μs      7.01±0.02μs     0.58  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
-     12.2±0.08μs      7.00±0.02μs     0.58  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
-     27.8±0.05μs      10.9±0.03μs     0.39  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int64'>)
-     27.8±0.04μs      10.9±0.08μs     0.39  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
-     27.8±0.04μs      10.8±0.04μs     0.39  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
-     27.8±0.03μs      10.8±0.03μs     0.39  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-     77.3±0.05μs      4.17±0.06μs     0.05  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
-      77.4±0.1μs      4.16±0.03μs     0.05  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)
-     76.4±0.02μs      3.97±0.01μs     0.05  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.bool_'>)
VSX3
-      11.4±0.8μs       8.90±0.9μs     0.78  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint16'>)
-        16.4±2μs         12.7±1μs     0.78  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint32'>)
-        28.2±2μs         21.5±1μs     0.76  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint64'>)
-        28.0±2μs         20.3±1μs     0.72  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.uint64'>)
-        16.8±2μs         12.1±1μs     0.72  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.int32'>)
-        16.8±2μs         12.1±1μs     0.72  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.uint32'>)
-        17.1±4μs         12.3±1μs     0.72  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int32'>)
-        19.6±6μs         14.0±1μs     0.71  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint32'>)
-        32.6±2μs         23.3±2μs     0.71  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint64'>)
-      11.9±0.8μs       8.47±0.7μs     0.71  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int16'>)
-        28.4±2μs         20.2±1μs     0.71  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int64'>)
-        19.6±2μs       13.9±0.9μs     0.71  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int32'>)
-        15.6±5μs       8.71±0.9μs     0.56  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.uint16'>)
-       42.6±10μs         23.6±2μs     0.55  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.int64'>)
-      50.3±0.5μs         22.9±1μs     0.46  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float64'>)
-       19.6±10μs       8.73±0.9μs     0.44  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.int16'>)
-        54.2±4μs         20.3±2μs     0.37  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float64'>)
-        55.3±5μs         20.3±1μs     0.37  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float64'>)
-        65.2±5μs      13.2±0.07μs     0.20  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.float32'>)
-        65.0±4μs         13.1±1μs     0.20  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.float32'>)
-        64.9±5μs         12.7±1μs     0.20  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.float32'>)
-        67.8±5μs       7.34±0.6μs     0.11  bench_ufunc.CustomComparison.time_less_than_binary(<class 'numpy.bool_'>)
-        70.3±6μs       6.30±0.5μs     0.09  bench_ufunc.CustomComparison.time_less_than_scalar2(<class 'numpy.bool_'>)
-       75.9±70μs         6.66±2μs     0.09  bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)
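
For reading the tables above: the `ratio` column is the `after` time divided by the `before` time, so values below 1 are improvements. A minimal Python sketch, assuming every row uses the five-column layout shown here and the same time unit in both columns (`parse_row` and `speedup` are hypothetical helper names, not part of asv or NumPy):

```python
def parse_row(row: str) -> tuple:
    """Split one comparison row into (before, after, ratio, benchmark).

    Assumes the layout used in this thread: a change marker, two
    "mean±error" timings in the same unit, the reported ratio, and the
    benchmark name (which may itself contain spaces).
    """
    _marker, before, after, ratio, name = row.split(None, 4)
    # Keep only the mean; everything from "±" onward (error and unit) is dropped.
    before_mean = float(before.split("±")[0])
    after_mean = float(after.split("±")[0])
    return before_mean, after_mean, float(ratio), name.strip()


def speedup(before: float, after: float) -> float:
    """How many times faster the new code is (inverse of the ratio column)."""
    return before / after


# One row copied from the AVX512F table above.
ROW = ("-     77.4±0.03μs      4.04±0.05μs     0.05  "
       "bench_ufunc.CustomComparison.time_less_than_scalar1(<class 'numpy.bool_'>)")

before, after, reported, name = parse_row(ROW)
print(f"{name}: ratio {after / before:.2f}, speedup {speedup(before, after):.1f}x")
```

For this `bool_` row the recomputed ratio matches the reported 0.05, which corresponds to roughly a 19x speedup.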

Binary size:

| File | main | simd | simd+binaryopt |
| --- | --- | --- | --- |
| _multiarray_umath.cpython-39-x86_64-linux-gnu.so | 5138K | 5603K | 5407K |
| _multiarray_umath.cpython-39-powerpc64le-linux-gnu.so | 30435K | 31712K | 30477K |

Member

@seiko2plus seiko2plus left a comment

Well done! Just one thing left.

numpy/core/src/umath/loops_comparison.dispatch.c.src (outdated, resolved)
Comment on lines +184 to +191
    def time_less_than_binary(self, dtype):
        (self.x < self.y)

    def time_less_than_scalar1(self, dtype):
        (self.s < self.x)

    def time_less_than_scalar2(self, dtype):
-       (self.d < 1)
+       (self.x < self.s)
Member

@mattip, the benchmarks here don't cover the rest of the operations. I'm not sure whether it's necessary to cover them or whether it's better to leave room for upcoming benchmarks, since the benchmark process is starting to become pretty slow.

Member

+1 for minimal effective benchmarks. I don't think there is a need to hit the same code multiple times.

@seiko2plus
Member

@rafaelcfsousa, it's better to strip the debugging symbols before reporting the binary size. Also, I think we need a release note.

@mattip
Member

mattip commented May 29, 2022

> I think we need a release note.

On the one hand, there are hints in 'numpy/doc/release/upcoming_changes'. On the other, I see we are not very diligent about adding release notes to other PRs for SIMD performance.

@mattip mattip added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label May 29, 2022
This commit also applies some techniques to reduce the size of the binary generated from the source loops_comparison.dispatch.c.src
This commit also rewrites the tests for andc, orc and xnor
rafaelcfsousa added a commit to rafaelcfsousa/numpy that referenced this pull request May 30, 2022
PR numpy#21483 improves the execution time of the comparison functions by using universal intrinsics
@rafaelcfsousa
Contributor Author

rafaelcfsousa commented May 30, 2022

> @rafaelcfsousa, it's better to strip the debugging symbols before reporting the binary size. Also, I think we need a release note.

@seiko2plus:
For some reason, the compiler I was using on Power9/VSX3 added the debug flag -g even though I did not pass the --debug flag when building NumPy. With a different compiler, I now see the following for Power9/VSX3:

| File | main | simd | simd+binaryopt |
| --- | --- | --- | --- |
| _multiarray_umath.cpython-39-powerpc64le-linux-gnu.so | 6661K | 6666K | 6603K |
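
To put the size figures in relative terms, here is a small Python sketch that computes the change of each build against `main`. The numbers are copied from the two size tables in this thread; `growth_percent` is a hypothetical helper name, and the first powerpc64le entry is the build that unintentionally included debug symbols:

```python
def growth_percent(baseline_k: int, candidate_k: int) -> float:
    """Relative size change of candidate versus baseline, in percent."""
    return 100.0 * (candidate_k - baseline_k) / baseline_k


# (main, simd, simd+binaryopt) sizes in K, as reported in this thread.
SIZES = {
    "x86_64": (5138, 5603, 5407),
    "powerpc64le (first report, built with -g)": (30435, 31712, 30477),
    "powerpc64le (second report, different compiler)": (6661, 6666, 6603),
}

for target, (main, simd, simd_opt) in SIZES.items():
    print(f"{target}: simd {growth_percent(main, simd):+.2f}%, "
          f"simd+binaryopt {growth_percent(main, simd_opt):+.2f}%")
```

With the second powerpc64le build, the SIMD version is about 0.08% larger than `main`, and the size-optimized version ends up slightly smaller than `main`.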

@rafaelcfsousa
Contributor Author

@seiko2plus and @mattip:
I added a release note.

Member

@seiko2plus seiko2plus left a comment

LGTM, thank you!

@mattip mattip merged commit c8de16e into numpy:main May 30, 2022
@mattip
Member

mattip commented May 30, 2022

Thanks @rafaelcfsousa and thanks @seiko2plus for the careful review
