ENH:Umath Replace raw SIMD of unary float point(32-64) with NPYV - g0 #16247

Merged (4 commits) on Nov 10, 2020

@seiko2plus commented May 15, 2020

This pull request:

  • replaces the raw SIMD of sqrt, absolute, square and reciprocal
    with the NumPy C SIMD vectorization interface (NPYV).

  • fixes the SIMD memory overlap check for aliasing (same pointer and stride).

  • unifies fp/domain errors for both scalars and vectors,
    which fixes the AVX test failures for 32-bit manylinux1 wheels (#17174).

  • improves float32 division precision on NEON/A32.

  • adds the new NPYV intrinsics sqrt, abs, recip and square.

  • reorders Python.h to suppress the warning 'declaration of 'struct timespec*'
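
To illustrate the aliasing case in the overlap check, here is a minimal sketch in plain C. It is not the code from this pull request: `byte_extent` and `simd_overlap_ok` are hypothetical names, strides are assumed to be in bytes, and the absolute stride is assumed to be at least the element size. The idea is that a unary SIMD loop is safe when the input and output byte ranges are disjoint, and also when they alias exactly (same pointer and same stride), because each vector is loaded before it is stored back to the same positions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a NumPy function): inclusive byte range [lo, hi]
 * touched by `len` elements of `size` bytes, starting at `ptr` and advancing
 * by `stride` bytes per element. The stride may be negative. */
static void byte_extent(const char *ptr, ptrdiff_t stride, size_t len,
                        size_t size, uintptr_t *lo, uintptr_t *hi)
{
    assert(len > 0 && size > 0);
    uintptr_t first = (uintptr_t)ptr;
    /* Unsigned wrap-around makes this correct for negative strides too. */
    uintptr_t last = first + (uintptr_t)(stride * (ptrdiff_t)(len - 1));
    *lo = first < last ? first : last;
    *hi = (first < last ? last : first) + size - 1;
}

/* Hypothetical check: true when a vectorized unary loop may read `src`
 * and write `dst` without one clobbering the other. */
static bool simd_overlap_ok(const char *src, ptrdiff_t src_stride,
                            const char *dst, ptrdiff_t dst_stride,
                            size_t len, size_t size)
{
    if (len == 0) {
        return true;
    }
    /* Exact aliasing: an in-place operation, element i maps to element i. */
    if (src == dst && src_stride == dst_stride) {
        return true;
    }
    uintptr_t slo, shi, dlo, dhi;
    byte_extent(src, src_stride, len, size, &slo, &shi);
    byte_extent(dst, dst_stride, len, size, &dlo, &dhi);
    /* Otherwise require fully disjoint byte ranges. */
    return shi < dlo || dhi < slo;
}
```

Under these assumptions, an in-place call such as `simd_overlap_ok(p, 8, p, 8, n, 8)` is accepted, while an output shifted by one element (`p + 8`) is rejected and would have to take a scalar path.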

merge after #17340
closes #17174
TODO:

  • put the old to rest
  • init the new
  • add new intrinsics for non-contiguous load/store (x86)
  • add new intrinsics for non-contiguous load/store (ppc64le)
  • add new intrinsics for non-contiguous load/store (arm)
  • add new intrinsics for the mentioned operations (x86)
  • add new intrinsics for the mentioned operations (ppc64le)
  • add new intrinsics for the mentioned operations (arm)
  • add the SIMD kernels
  • benchmarks
  • add test cases for the new intrinsics
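
As a rough picture of what non-contiguous load/store buys a unary kernel, the sketch below emulates it in plain C. It does not use the NPYV API: `loadn_f32`, `storen_f32`, `square_strided_f32` and the 4-lane width are assumptions made for illustration, and strides are counted in elements. A real implementation would map the gather and scatter helpers to architecture-specific intrinsics.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 }; /* assumed vector width: 4 x float32 in 128 bits */

/* Emulated non-contiguous load: gather LANES floats that sit `stride`
 * elements apart into a contiguous register-like buffer. */
static void loadn_f32(float out[LANES], const float *ptr, ptrdiff_t stride)
{
    for (int i = 0; i < LANES; ++i) {
        out[i] = ptr[i * stride];
    }
}

/* Emulated non-contiguous store: scatter LANES floats `stride` elements apart. */
static void storen_f32(float *ptr, ptrdiff_t stride, const float in[LANES])
{
    for (int i = 0; i < LANES; ++i) {
        ptr[i * stride] = in[i];
    }
}

/* Hypothetical kernel: dst[i * sdst] = src[i * ssrc]^2 for i in [0, len). */
static void square_strided_f32(const float *src, ptrdiff_t ssrc,
                               float *dst, ptrdiff_t sdst, size_t len)
{
    assert(src != NULL && dst != NULL);
    float v[LANES];
    size_t i = 0;
    /* Vector body: LANES elements per iteration. */
    for (; i + LANES <= len; i += LANES) {
        loadn_f32(v, src + (ptrdiff_t)i * ssrc, ssrc);
        for (int k = 0; k < LANES; ++k) {
            v[k] = v[k] * v[k];
        }
        storen_f32(dst + (ptrdiff_t)i * sdst, sdst, v);
    }
    /* Scalar tail for the remaining len % LANES elements. */
    for (; i < len; ++i) {
        float x = src[(ptrdiff_t)i * ssrc];
        dst[(ptrdiff_t)i * sdst] = x * x;
    }
}
```

For example, `square_strided_f32(src, 2, dst, 1, 6)` reads every second element of `src` and writes six contiguous results, which loosely corresponds to a strided-input, contiguous-output benchmark case.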

Performance tests

Args used within #15987

--filter "(absol*|recip*|sqrt|square).*[fd]::.*->" --strides 1 2 10 --msleep 1 --iteration 100

Note: --msleep 1 forces the running thread to sleep 1 millisecond before collecting each sample
so the CPU can recover from any frequency reduction, since throttling seems to affect wall time when AVX512F is enabled.
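
The sampling scheme described in the note can be sketched as follows. This is not the benchmark tool from #15987: `collect_samples`, `kernel_fn` and `counting_kernel` are hypothetical names, and the sketch assumes a POSIX system providing `clock_gettime` and `nanosleep`. The point is only that the pause happens outside the timed region, so each sample measures the kernel alone.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*kernel_fn)(void *ctx);

/* Monotonic wall clock in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Hypothetical sampler: runs `kernel` `iterations` times, sleeping `msleep`
 * milliseconds before each run, and stores each wall time (ms) in `samples`.
 * Returns the number of samples written. */
static size_t collect_samples(kernel_fn kernel, void *ctx, long msleep,
                              size_t iterations, double *samples)
{
    assert(kernel != NULL && samples != NULL && msleep >= 0);
    struct timespec pause = {
        .tv_sec = msleep / 1000,
        .tv_nsec = (msleep % 1000) * 1000000L,
    };
    for (size_t i = 0; i < iterations; ++i) {
        if (msleep > 0) {
            nanosleep(&pause, NULL); /* not included in the measurement */
        }
        double t0 = now_ms();
        kernel(ctx);
        samples[i] = now_ms() - t0;
    }
    return iterations;
}

/* Stand-in kernel for demonstration: counts how often it was invoked. */
static void counting_kernel(void *ctx)
{
    ++*(int *)ctx;
}
```

With `msleep = 1` and `iterations = 100` this mirrors the `--msleep 1 --iteration 100` arguments above; the collected samples would then be reduced to a single figure such as the geometric mean reported in the tables.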

X86

CPU
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:                        7
CPU MHz:                         3604.410
BogoMIPS:                        6000.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        2 MiB
L3 cache:                        35.8 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep
                                 mtrr pge mca cmov pat pse36 clflush mmx fxsr
                                 sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
                                 constant_tsc rep_good nopl xtopology
                                 nonstop_tsc cpuid aperfmperf tsc_known_freq
                                 pni pclmulqdq ssse3 fma cx16 pcid sse4_1
                                 sse4_2 x2apic movbe popcnt tsc_deadline_timer
                                 aes xsave avx f16c rdrand hypervisor lahf_lm
                                 abm 3dnowprefetch invpcid_single pti fsgsbase
                                 tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
                                 mpx avx512f avx512dq rdseed adx smap
                                 clflushopt clwb avx512cd avx512bw avx512vl
                                 xsaveopt xsavec xgetbv1 xsaves ida arat pku
                                 ospke
OS
Linux ip-172-31-28-146 5.4.0-1025-aws
gcc version 7.5.0 (Ubuntu 7.5.0-6ubuntu2)

Benchmark

AVX512F - Contiguous only

metric: gmean, units: ms

name of test before_avx512f after_sse3 after_sse3 vs before_avx512f
absolute::1024      d::1  ->  d::1 0.0004 0.0002 2.37
absolute::2048      d::1  ->  d::1 0.0007 0.0004 1.82
absolute::4096      d::1  ->  d::1 0.0013 0.0009 1.44
absolute::1024      f::1  ->  f::1 0.0002 0.0001 2.6
absolute::2048      f::1  ->  f::1 0.0004 0.0002 2.46
absolute::4096      f::1  ->  f::1 0.0007 0.0004 1.5
reciprocal::1024      d::1  ->  d::1 0.0008 0.0006 1.33
reciprocal::2048      d::1  ->  d::1 0.0014 0.0011 1.24
reciprocal::4096      d::1  ->  d::1 0.0027 0.0023 1.2
reciprocal::1024      f::1  ->  f::1 0.0003 0.0002 1.5
reciprocal::2048      f::1  ->  f::1 0.0007 0.0004 1.61
reciprocal::4096      f::1  ->  f::1 0.0011 0.0009 1.28
sqrt::1024      d::1  ->  d::1 0.0011 0.0009 1.26
sqrt::2048      d::1  ->  d::1 0.0021 0.0017 1.21
sqrt::4096      d::1  ->  d::1 0.0041 0.0034 1.19
sqrt::1024      f::1  ->  f::1 0.0003 0.0002 1.5
sqrt::2048      f::1  ->  f::1 0.0007 0.0004 1.62
sqrt::4096      f::1  ->  f::1 0.0012 0.0009 1.4
square::1024      d::1  ->  d::1 0.0004 0.0002 2.25
square::2048      d::1  ->  d::1 0.0007 0.0004 1.97
square::4096      d::1  ->  d::1 0.0013 0.0009 1.51
square::1024      f::1  ->  f::1 0.0002 0.0001 1.96
square::2048      f::1  ->  f::1 0.0004 0.0002 1.84
square::4096      f::1  ->  f::1 0.0007 0.0004 1.58
AVX512F

metric: gmean, units: ms

name of test before_avx512f after_sse3 after_sse3 vs before_avx512f
absolute::1024      d::1  ->  d::1 0.0004 0.0002 2.36
absolute::2048      d::1  ->  d::1 0.0007 0.0004 1.81
absolute::4096      d::1  ->  d::1 0.0013 0.0009 1.44
absolute::1024      d::1  ->  d::2 0.0007 0.0003 2.23
absolute::2048      d::1  ->  d::2 0.0013 0.0009 1.53
absolute::4096      d::1  ->  d::2 0.0027 0.0018 1.52
absolute::1024      d::1  ->  d::10 0.0010 0.0010 1.0
absolute::2048      d::1  ->  d::10 0.0021 0.0020 1.0
absolute::4096      d::1  ->  d::10 0.0041 0.0041 1.0
absolute::1024      d::2  ->  d::1 0.0004 0.0002 1.87
absolute::2048      d::2  ->  d::1 0.0010 0.0007 1.48
absolute::4096      d::2  ->  d::1 0.0018 0.0013 1.37
absolute::1024      d::2  ->  d::2 0.0008 0.0005 1.7
absolute::2048      d::2  ->  d::2 0.0016 0.0009 1.76
absolute::4096      d::2  ->  d::2 0.0029 0.0018 1.55
absolute::1024      d::2  ->  d::10 0.0011 0.0011 0.99
absolute::2048      d::2  ->  d::10 0.0022 0.0022 0.98
absolute::4096      d::2  ->  d::10 0.0043 0.0044 0.99
absolute::1024      d::10  ->  d::1 0.0008 0.0005 1.46
absolute::2048      d::10  ->  d::1 0.0013 0.0011 1.26
absolute::4096      d::10  ->  d::1 0.0026 0.0021 1.22
absolute::1024      d::10  ->  d::2 0.0008 0.0007 1.13
absolute::2048      d::10  ->  d::2 0.0015 0.0014 1.11
absolute::4096      d::10  ->  d::2 0.0031 0.0028 1.12
absolute::1024      d::10  ->  d::10 0.0015 0.0014 1.07
absolute::2048      d::10  ->  d::10 0.0029 0.0028 1.04
absolute::4096      d::10  ->  d::10 0.0059 0.0056 1.05
absolute::1024      f::1  ->  f::1 0.0002 0.0001 2.42
absolute::2048      f::1  ->  f::1 0.0004 0.0002 2.42
absolute::4096      f::1  ->  f::1 0.0007 0.0004 1.6
absolute::1024      f::1  ->  f::2 0.0007 0.0003 2.23
absolute::2048      f::1  ->  f::2 0.0013 0.0006 2.26
absolute::4096      f::1  ->  f::2 0.0027 0.0014 1.91
absolute::1024      f::1  ->  f::10 0.0008 0.0007 1.1
absolute::2048      f::1  ->  f::10 0.0015 0.0013 1.14
absolute::4096      f::1  ->  f::10 0.0030 0.0026 1.13
absolute::1024      f::2  ->  f::1 0.0003 0.0003 1.3
absolute::2048      f::2  ->  f::1 0.0007 0.0005 1.33
absolute::4096      f::2  ->  f::1 0.0014 0.0012 1.22
absolute::1024      f::2  ->  f::2 0.0008 0.0005 1.7
absolute::2048      f::2  ->  f::2 0.0015 0.0009 1.68
absolute::4096      f::2  ->  f::2 0.0027 0.0018 1.52
absolute::1024      f::2  ->  f::10 0.0008 0.0007 1.08
absolute::2048      f::2  ->  f::10 0.0016 0.0014 1.1
absolute::4096      f::2  ->  f::10 0.0031 0.0028 1.12
absolute::1024      f::10  ->  f::1 0.0006 0.0004 1.45
absolute::2048      f::10  ->  f::1 0.0010 0.0008 1.25
absolute::4096      f::10  ->  f::1 0.0019 0.0016 1.18
absolute::1024      f::10  ->  f::2 0.0008 0.0005 1.41
absolute::2048      f::10  ->  f::2 0.0015 0.0011 1.43
absolute::4096      f::10  ->  f::2 0.0030 0.0021 1.44
absolute::1024      f::10  ->  f::10 0.0008 0.0008 1.06
absolute::2048      f::10  ->  f::10 0.0017 0.0016 1.05
absolute::4096      f::10  ->  f::10 0.0033 0.0032 1.05
reciprocal::1024      d::1  ->  d::1 0.0007 0.0006 1.32
reciprocal::2048      d::1  ->  d::1 0.0014 0.0011 1.24
reciprocal::4096      d::1  ->  d::1 0.0027 0.0023 1.2
reciprocal::1024      d::1  ->  d::2 0.0011 0.0006 2.0
reciprocal::2048      d::1  ->  d::2 0.0023 0.0011 1.99
reciprocal::4096      d::1  ->  d::2 0.0046 0.0023 2.0
reciprocal::1024      d::1  ->  d::10 0.0011 0.0010 1.13
reciprocal::2048      d::1  ->  d::10 0.0023 0.0020 1.13
reciprocal::4096      d::1  ->  d::10 0.0046 0.0041 1.12
reciprocal::1024      d::2  ->  d::1 0.0008 0.0006 1.33
reciprocal::2048      d::2  ->  d::1 0.0014 0.0011 1.24
reciprocal::4096      d::2  ->  d::1 0.0027 0.0023 1.2
reciprocal::1024      d::2  ->  d::2 0.0011 0.0006 1.97
reciprocal::2048      d::2  ->  d::2 0.0023 0.0011 1.98
reciprocal::4096      d::2  ->  d::2 0.0046 0.0023 1.99
reciprocal::1024      d::2  ->  d::10 0.0011 0.0011 1.06
reciprocal::2048      d::2  ->  d::10 0.0023 0.0022 1.05
reciprocal::4096      d::2  ->  d::10 0.0046 0.0044 1.06
reciprocal::1024      d::10  ->  d::1 0.0008 0.0006 1.36
reciprocal::2048      d::10  ->  d::1 0.0014 0.0015 0.96
reciprocal::4096      d::10  ->  d::1 0.0028 0.0023 1.21
reciprocal::1024      d::10  ->  d::2 0.0011 0.0007 1.63
reciprocal::2048      d::10  ->  d::2 0.0023 0.0014 1.64
reciprocal::4096      d::10  ->  d::2 0.0046 0.0028 1.63
reciprocal::1024      d::10  ->  d::10 0.0015 0.0014 1.05
reciprocal::2048      d::10  ->  d::10 0.0029 0.0028 1.04
reciprocal::4096      d::10  ->  d::10 0.0058 0.0056 1.04
reciprocal::1024      f::1  ->  f::1 0.0003 0.0002 1.5
reciprocal::2048      f::1  ->  f::1 0.0007 0.0004 1.61
reciprocal::4096      f::1  ->  f::1 0.0011 0.0009 1.27
reciprocal::1024      f::1  ->  f::2 0.0009 0.0003 2.78
reciprocal::2048      f::1  ->  f::2 0.0017 0.0006 2.89
reciprocal::4096      f::1  ->  f::2 0.0034 0.0014 2.45
reciprocal::1024      f::1  ->  f::10 0.0009 0.0007 1.3
reciprocal::2048      f::1  ->  f::10 0.0017 0.0013 1.29
reciprocal::4096      f::1  ->  f::10 0.0034 0.0026 1.33
reciprocal::1024      f::2  ->  f::1 0.0003 0.0003 1.35
reciprocal::2048      f::2  ->  f::1 0.0007 0.0005 1.43
reciprocal::4096      f::2  ->  f::1 0.0015 0.0012 1.25
reciprocal::1024      f::2  ->  f::2 0.0009 0.0005 1.74
reciprocal::2048      f::2  ->  f::2 0.0017 0.0010 1.7
reciprocal::4096      f::2  ->  f::2 0.0034 0.0020 1.7
reciprocal::1024      f::2  ->  f::10 0.0009 0.0007 1.18
reciprocal::2048      f::2  ->  f::10 0.0017 0.0014 1.2
reciprocal::4096      f::2  ->  f::10 0.0034 0.0028 1.2
reciprocal::1024      f::10  ->  f::1 0.0005 0.0004 1.24
reciprocal::2048      f::10  ->  f::1 0.0010 0.0008 1.27
reciprocal::4096      f::10  ->  f::1 0.0019 0.0017 1.12
reciprocal::1024      f::10  ->  f::2 0.0009 0.0006 1.5
reciprocal::2048      f::10  ->  f::2 0.0017 0.0011 1.51
reciprocal::4096      f::10  ->  f::2 0.0034 0.0022 1.53
reciprocal::1024      f::10  ->  f::10 0.0009 0.0008 1.11
reciprocal::2048      f::10  ->  f::10 0.0017 0.0016 1.08
reciprocal::4096      f::10  ->  f::10 0.0034 0.0032 1.08
sqrt::1024      d::1  ->  d::1 0.0011 0.0009 1.27
sqrt::2048      d::1  ->  d::1 0.0021 0.0017 1.21
sqrt::4096      d::1  ->  d::1 0.0041 0.0034 1.19
sqrt::1024      d::1  ->  d::2 0.0054 0.0009 6.33
sqrt::2048      d::1  ->  d::2 0.0108 0.0017 6.32
sqrt::4096      d::1  ->  d::2 0.0217 0.0034 6.35
sqrt::1024      d::1  ->  d::10 0.0054 0.0010 5.15
sqrt::2048      d::1  ->  d::10 0.0108 0.0020 5.3
sqrt::4096      d::1  ->  d::10 0.0217 0.0041 5.33
sqrt::1024      d::2  ->  d::1 0.0011 0.0009 1.26
sqrt::2048      d::2  ->  d::1 0.0021 0.0017 1.21
sqrt::4096      d::2  ->  d::1 0.0041 0.0034 1.19
sqrt::1024      d::2  ->  d::2 0.0054 0.0009 6.33
sqrt::2048      d::2  ->  d::2 0.0108 0.0017 6.34
sqrt::4096      d::2  ->  d::2 0.0217 0.0034 6.35
sqrt::1024      d::2  ->  d::10 0.0054 0.0011 4.92
sqrt::2048      d::2  ->  d::10 0.0108 0.0022 4.97
sqrt::4096      d::2  ->  d::10 0.0217 0.0044 4.94
sqrt::1024      d::10  ->  d::1 0.0011 0.0009 1.26
sqrt::2048      d::10  ->  d::1 0.0021 0.0017 1.21
sqrt::4096      d::10  ->  d::1 0.0041 0.0034 1.19
sqrt::1024      d::10  ->  d::2 0.0054 0.0009 6.32
sqrt::2048      d::10  ->  d::2 0.0108 0.0017 6.34
sqrt::4096      d::10  ->  d::2 0.0217 0.0034 6.33
sqrt::1024      d::10  ->  d::10 0.0054 0.0014 3.86
sqrt::2048      d::10  ->  d::10 0.0108 0.0028 3.87
sqrt::4096      d::10  ->  d::10 0.0217 0.0056 3.86
sqrt::1024      f::1  ->  f::1 0.0003 0.0002 1.63
sqrt::2048      f::1  ->  f::1 0.0007 0.0004 1.61
sqrt::4096      f::1  ->  f::1 0.0012 0.0009 1.4
sqrt::1024      f::1  ->  f::2 0.0037 0.0003 12.14
sqrt::2048      f::1  ->  f::2 0.0074 0.0006 12.51
sqrt::4096      f::1  ->  f::2 0.0148 0.0014 10.57
sqrt::1024      f::1  ->  f::10 0.0037 0.0007 5.51
sqrt::2048      f::1  ->  f::10 0.0074 0.0013 5.62
sqrt::4096      f::1  ->  f::10 0.0148 0.0026 5.66
sqrt::1024      f::2  ->  f::1 0.0004 0.0003 1.5
sqrt::2048      f::2  ->  f::1 0.0008 0.0005 1.61
sqrt::4096      f::2  ->  f::1 0.0014 0.0012 1.24
sqrt::1024      f::2  ->  f::2 0.0037 0.0005 7.36
sqrt::2048      f::2  ->  f::2 0.0074 0.0010 7.32
sqrt::4096      f::2  ->  f::2 0.0148 0.0020 7.33
sqrt::1024      f::2  ->  f::10 0.0037 0.0007 5.12
sqrt::2048      f::2  ->  f::10 0.0074 0.0014 5.21
sqrt::4096      f::2  ->  f::10 0.0148 0.0028 5.35
sqrt::1024      f::10  ->  f::1 0.0006 0.0004 1.49
sqrt::2048      f::10  ->  f::1 0.0011 0.0008 1.31
sqrt::4096      f::10  ->  f::1 0.0019 0.0016 1.2
sqrt::1024      f::10  ->  f::2 0.0037 0.0006 6.42
sqrt::2048      f::10  ->  f::2 0.0074 0.0011 6.56
sqrt::4096      f::10  ->  f::2 0.0148 0.0023 6.47
sqrt::1024      f::10  ->  f::10 0.0037 0.0008 4.64
sqrt::2048      f::10  ->  f::10 0.0074 0.0016 4.66
sqrt::4096      f::10  ->  f::10 0.0148 0.0032 4.63
square::1024      d::1  ->  d::1 0.0004 0.0002 2.17
square::2048      d::1  ->  d::1 0.0007 0.0004 1.92
square::4096      d::1  ->  d::1 0.0012 0.0009 1.34
square::1024      d::1  ->  d::2 0.0006 0.0003 1.91
square::2048      d::1  ->  d::2 0.0011 0.0009 1.28
square::4096      d::1  ->  d::2 0.0023 0.0018 1.29
square::1024      d::1  ->  d::10 0.0010 0.0010 1.0
square::2048      d::1  ->  d::10 0.0020 0.0020 0.99
square::4096      d::1  ->  d::10 0.0041 0.0041 1.0
square::1024      d::2  ->  d::1 0.0004 0.0003 1.61
square::2048      d::2  ->  d::1 0.0010 0.0007 1.52
square::4096      d::2  ->  d::1 0.0018 0.0013 1.35
square::1024      d::2  ->  d::2 0.0006 0.0004 1.65
square::2048      d::2  ->  d::2 0.0012 0.0009 1.31
square::4096      d::2  ->  d::2 0.0023 0.0018 1.29
square::1024      d::2  ->  d::10 0.0011 0.0011 0.98
square::2048      d::2  ->  d::10 0.0022 0.0022 1.0
square::4096      d::2  ->  d::10 0.0043 0.0043 1.0
square::1024      d::10  ->  d::1 0.0008 0.0005 1.43
square::2048      d::10  ->  d::1 0.0013 0.0011 1.28
square::4096      d::10  ->  d::1 0.0025 0.0021 1.18
square::1024      d::10  ->  d::2 0.0007 0.0007 1.07
square::2048      d::10  ->  d::2 0.0014 0.0014 1.0
square::4096      d::10  ->  d::2 0.0028 0.0028 1.0
square::1024      d::10  ->  d::10 0.0014 0.0014 1.04
square::2048      d::10  ->  d::10 0.0029 0.0028 1.03
square::4096      d::10  ->  d::10 0.0058 0.0062 0.93
square::1024      f::1  ->  f::1 0.0002 0.0001 1.92
square::2048      f::1  ->  f::1 0.0004 0.0002 1.82
square::4096      f::1  ->  f::1 0.0007 0.0004 1.63
square::1024      f::1  ->  f::2 0.0005 0.0003 1.7
square::2048      f::1  ->  f::2 0.0010 0.0006 1.71
square::4096      f::1  ->  f::2 0.0020 0.0014 1.42
square::1024      f::1  ->  f::10 0.0007 0.0007 1.03
square::2048      f::1  ->  f::10 0.0013 0.0013 1.0
square::4096      f::1  ->  f::10 0.0026 0.0026 1.02
square::1024      f::2  ->  f::1 0.0003 0.0002 1.32
square::2048      f::2  ->  f::1 0.0007 0.0005 1.4
square::4096      f::2  ->  f::1 0.0014 0.0012 1.2
square::1024      f::2  ->  f::2 0.0006 0.0005 1.2
square::2048      f::2  ->  f::2 0.0011 0.0009 1.19
square::4096      f::2  ->  f::2 0.0020 0.0018 1.11
square::1024      f::2  ->  f::10 0.0007 0.0007 0.97
square::2048      f::2  ->  f::10 0.0014 0.0014 0.98
square::4096      f::2  ->  f::10 0.0028 0.0028 0.99
square::1024      f::10  ->  f::1 0.0006 0.0004 1.43
square::2048      f::10  ->  f::1 0.0010 0.0008 1.25
square::4096      f::10  ->  f::1 0.0019 0.0016 1.19
square::1024      f::10  ->  f::2 0.0005 0.0005 1.0
square::2048      f::10  ->  f::2 0.0011 0.0011 1.0
square::4096      f::10  ->  f::2 0.0022 0.0021 1.03
square::1024      f::10  ->  f::10 0.0008 0.0008 1.02
square::2048      f::10  ->  f::10 0.0016 0.0016 0.98
square::4096      f::10  ->  f::10 0.0032 0.0032 0.98
AVX2 - Contiguous only

metric: gmean, units: ms

name of test before_avx2 after_sse3 after_sse3 vs before_avx2
absolute::1024      d::1  ->  d::1 0.0004 0.0002 2.27
absolute::2048      d::1  ->  d::1 0.0006 0.0004 1.6
absolute::4096      d::1  ->  d::1 0.0011 0.0009 1.19
absolute::1024      f::1  ->  f::1 0.0003 0.0001 3.32
absolute::2048      f::1  ->  f::1 0.0004 0.0002 2.37
absolute::4096      f::1  ->  f::1 0.0006 0.0004 1.48
reciprocal::1024      d::1  ->  d::1 0.0006 0.0006 1.14
reciprocal::2048      d::1  ->  d::1 0.0012 0.0011 1.06
reciprocal::4096      d::1  ->  d::1 0.0023 0.0023 1.02
reciprocal::1024      f::1  ->  f::1 0.0003 0.0002 1.55
reciprocal::2048      f::1  ->  f::1 0.0005 0.0004 1.17
reciprocal::4096      f::1  ->  f::1 0.0008 0.0009 0.96
sqrt::1024      d::1  ->  d::1 0.0009 0.0009 1.03
sqrt::2048      d::1  ->  d::1 0.0017 0.0017 1.01
sqrt::4096      d::1  ->  d::1 0.0034 0.0034 1.0
sqrt::1024      f::1  ->  f::1 0.0003 0.0002 1.56
sqrt::2048      f::1  ->  f::1 0.0005 0.0004 1.24
sqrt::4096      f::1  ->  f::1 0.0009 0.0009 1.1
square::1024      d::1  ->  d::1 0.0004 0.0002 2.1
square::2048      d::1  ->  d::1 0.0006 0.0004 1.67
square::4096      d::1  ->  d::1 0.0011 0.0009 1.28
square::1024      f::1  ->  f::1 0.0004 0.0001 3.93
square::2048      f::1  ->  f::1 0.0005 0.0002 2.04
square::4096      f::1  ->  f::1 0.0008 0.0004 1.88
AVX2

metric: gmean, units: ms

name of test before_avx2 after_sse3 after_sse3 vs before_avx2
absolute::1024      d::1  ->  d::1 0.0004 0.0002 2.22
absolute::2048      d::1  ->  d::1 0.0006 0.0004 1.58
absolute::4096      d::1  ->  d::1 0.0011 0.0009 1.25
absolute::1024      d::1  ->  d::2 0.0012 0.0003 3.78
absolute::2048      d::1  ->  d::2 0.0023 0.0009 2.6
absolute::4096      d::1  ->  d::2 0.0046 0.0018 2.59
absolute::1024      d::1  ->  d::10 0.0012 0.0010 1.11
absolute::2048      d::1  ->  d::10 0.0023 0.0020 1.12
absolute::4096      d::1  ->  d::10 0.0046 0.0041 1.13
absolute::1024      d::2  ->  d::1 0.0006 0.0002 2.68
absolute::2048      d::2  ->  d::1 0.0011 0.0007 1.64
absolute::4096      d::2  ->  d::1 0.0021 0.0013 1.6
absolute::1024      d::2  ->  d::2 0.0012 0.0005 2.47
absolute::2048      d::2  ->  d::2 0.0023 0.0009 2.44
absolute::4096      d::2  ->  d::2 0.0046 0.0018 2.48
absolute::1024      d::2  ->  d::10 0.0012 0.0011 1.07
absolute::2048      d::2  ->  d::10 0.0024 0.0022 1.06
absolute::4096      d::2  ->  d::10 0.0048 0.0044 1.09
absolute::1024      d::10  ->  d::1 0.0007 0.0005 1.26
absolute::2048      d::10  ->  d::1 0.0012 0.0011 1.13
absolute::4096      d::10  ->  d::1 0.0023 0.0021 1.08
absolute::1024      d::10  ->  d::2 0.0012 0.0007 1.65
absolute::2048      d::10  ->  d::2 0.0023 0.0014 1.65
absolute::4096      d::10  ->  d::2 0.0046 0.0028 1.64
absolute::1024      d::10  ->  d::10 0.0015 0.0014 1.07
absolute::2048      d::10  ->  d::10 0.0029 0.0028 1.04
absolute::4096      d::10  ->  d::10 0.0057 0.0056 1.02
absolute::1024      f::1  ->  f::1 0.0003 0.0001 3.51
absolute::2048      f::1  ->  f::1 0.0004 0.0002 2.36
absolute::4096      f::1  ->  f::1 0.0006 0.0004 1.48
absolute::1024      f::1  ->  f::2 0.0009 0.0003 2.86
absolute::2048      f::1  ->  f::2 0.0017 0.0006 2.89
absolute::4096      f::1  ->  f::2 0.0034 0.0014 2.46
absolute::1024      f::1  ->  f::10 0.0009 0.0007 1.26
absolute::2048      f::1  ->  f::10 0.0017 0.0013 1.31
absolute::4096      f::1  ->  f::10 0.0034 0.0026 1.3
absolute::1024      f::2  ->  f::1 0.0004 0.0003 1.5
absolute::2048      f::2  ->  f::1 0.0006 0.0005 1.25
absolute::4096      f::2  ->  f::1 0.0013 0.0012 1.12
absolute::1024      f::2  ->  f::2 0.0009 0.0005 1.93
absolute::2048      f::2  ->  f::2 0.0017 0.0009 1.9
absolute::4096      f::2  ->  f::2 0.0034 0.0018 1.95
absolute::1024      f::2  ->  f::10 0.0009 0.0007 1.2
absolute::2048      f::2  ->  f::10 0.0017 0.0014 1.22
absolute::4096      f::2  ->  f::10 0.0034 0.0028 1.24
absolute::1024      f::10  ->  f::1 0.0005 0.0004 1.09
absolute::2048      f::10  ->  f::1 0.0008 0.0008 1.01
absolute::4096      f::10  ->  f::1 0.0015 0.0016 0.95
absolute::1024      f::10  ->  f::2 0.0009 0.0005 1.61
absolute::2048      f::10  ->  f::2 0.0017 0.0011 1.63
absolute::4096      f::10  ->  f::2 0.0034 0.0021 1.65
absolute::1024      f::10  ->  f::10 0.0009 0.0008 1.1
absolute::2048      f::10  ->  f::10 0.0017 0.0016 1.1
absolute::4096      f::10  ->  f::10 0.0035 0.0032 1.1
reciprocal::1024      d::1  ->  d::1 0.0006 0.0006 1.14
reciprocal::2048      d::1  ->  d::1 0.0012 0.0011 1.05
reciprocal::4096      d::1  ->  d::1 0.0023 0.0023 1.02
reciprocal::1024      d::1  ->  d::2 0.0011 0.0006 2.0
reciprocal::2048      d::1  ->  d::2 0.0023 0.0011 1.99
reciprocal::4096      d::1  ->  d::2 0.0046 0.0023 2.0
reciprocal::1024      d::1  ->  d::10 0.0011 0.0010 1.13
reciprocal::2048      d::1  ->  d::10 0.0023 0.0020 1.13
reciprocal::4096      d::1  ->  d::10 0.0046 0.0041 1.12
reciprocal::1024      d::2  ->  d::1 0.0007 0.0006 1.15
reciprocal::2048      d::2  ->  d::1 0.0012 0.0011 1.06
reciprocal::4096      d::2  ->  d::1 0.0023 0.0023 1.01
reciprocal::1024      d::2  ->  d::2 0.0011 0.0006 1.97
reciprocal::2048      d::2  ->  d::2 0.0023 0.0011 1.98
reciprocal::4096      d::2  ->  d::2 0.0046 0.0023 1.99
reciprocal::1024      d::2  ->  d::10 0.0011 0.0011 1.06
reciprocal::2048      d::2  ->  d::10 0.0023 0.0022 1.04
reciprocal::4096      d::2  ->  d::10 0.0046 0.0044 1.05
reciprocal::1024      d::10  ->  d::1 0.0007 0.0006 1.17
reciprocal::2048      d::10  ->  d::1 0.0012 0.0015 0.82
reciprocal::4096      d::10  ->  d::1 0.0023 0.0023 1.02
reciprocal::1024      d::10  ->  d::2 0.0011 0.0007 1.63
reciprocal::2048      d::10  ->  d::2 0.0023 0.0014 1.64
reciprocal::4096      d::10  ->  d::2 0.0046 0.0028 1.63
reciprocal::1024      d::10  ->  d::10 0.0014 0.0014 1.02
reciprocal::2048      d::10  ->  d::10 0.0029 0.0028 1.02
reciprocal::4096      d::10  ->  d::10 0.0057 0.0056 1.02
reciprocal::1024      f::1  ->  f::1 0.0003 0.0002 1.51
reciprocal::2048      f::1  ->  f::1 0.0005 0.0004 1.19
reciprocal::4096      f::1  ->  f::1 0.0013 0.0009 1.48
reciprocal::1024      f::1  ->  f::2 0.0009 0.0003 2.78
reciprocal::2048      f::1  ->  f::2 0.0017 0.0006 2.89
reciprocal::4096      f::1  ->  f::2 0.0034 0.0014 2.45
reciprocal::1024      f::1  ->  f::10 0.0009 0.0007 1.3
reciprocal::2048      f::1  ->  f::10 0.0017 0.0013 1.29
reciprocal::4096      f::1  ->  f::10 0.0034 0.0026 1.33
reciprocal::1024      f::2  ->  f::1 0.0004 0.0003 1.44
reciprocal::2048      f::2  ->  f::1 0.0006 0.0005 1.2
reciprocal::4096      f::2  ->  f::1 0.0013 0.0012 1.12
reciprocal::1024      f::2  ->  f::2 0.0009 0.0005 1.74
reciprocal::2048      f::2  ->  f::2 0.0017 0.0010 1.74
reciprocal::4096      f::2  ->  f::2 0.0034 0.0020 1.7
reciprocal::1024      f::2  ->  f::10 0.0009 0.0007 1.18
reciprocal::2048      f::2  ->  f::10 0.0017 0.0014 1.2
reciprocal::4096      f::2  ->  f::10 0.0034 0.0028 1.2
reciprocal::1024      f::10  ->  f::1 0.0004 0.0004 1.07
reciprocal::2048      f::10  ->  f::1 0.0008 0.0008 1.0
reciprocal::4096      f::10  ->  f::1 0.0015 0.0017 0.91
reciprocal::1024      f::10  ->  f::2 0.0009 0.0006 1.5
reciprocal::2048      f::10  ->  f::2 0.0017 0.0011 1.51
reciprocal::4096      f::10  ->  f::2 0.0034 0.0022 1.53
reciprocal::1024      f::10  ->  f::10 0.0009 0.0008 1.09
reciprocal::2048      f::10  ->  f::10 0.0017 0.0016 1.08
reciprocal::4096      f::10  ->  f::10 0.0034 0.0032 1.08
sqrt::1024      d::1  ->  d::1 0.0009 0.0009 1.03
sqrt::2048      d::1  ->  d::1 0.0017 0.0017 1.01
sqrt::4096      d::1  ->  d::1 0.0034 0.0034 1.0
sqrt::1024      d::1  ->  d::2 0.0054 0.0009 6.33
sqrt::2048      d::1  ->  d::2 0.0108 0.0017 6.32
sqrt::4096      d::1  ->  d::2 0.0217 0.0034 6.35
sqrt::1024      d::1  ->  d::10 0.0054 0.0010 5.15
sqrt::2048      d::1  ->  d::10 0.0108 0.0020 5.3
sqrt::4096      d::1  ->  d::10 0.0216 0.0041 5.33
sqrt::1024      d::2  ->  d::1 0.0009 0.0009 1.03
sqrt::2048      d::2  ->  d::1 0.0017 0.0017 1.01
sqrt::4096      d::2  ->  d::1 0.0034 0.0034 1.0
sqrt::1024      d::2  ->  d::2 0.0054 0.0009 6.33
sqrt::2048      d::2  ->  d::2 0.0108 0.0017 6.34
sqrt::4096      d::2  ->  d::2 0.0217 0.0034 6.35
sqrt::1024      d::2  ->  d::10 0.0054 0.0011 4.92
sqrt::2048      d::2  ->  d::10 0.0108 0.0022 4.97
sqrt::4096      d::2  ->  d::10 0.0217 0.0044 4.94
sqrt::1024      d::10  ->  d::1 0.0009 0.0009 1.03
sqrt::2048      d::10  ->  d::1 0.0017 0.0017 1.01
sqrt::4096      d::10  ->  d::1 0.0034 0.0034 1.0
sqrt::1024      d::10  ->  d::2 0.0054 0.0009 6.32
sqrt::2048      d::10  ->  d::2 0.0108 0.0017 6.34
sqrt::4096      d::10  ->  d::2 0.0216 0.0034 6.33
sqrt::1024      d::10  ->  d::10 0.0054 0.0014 3.86
sqrt::2048      d::10  ->  d::10 0.0108 0.0028 3.87
sqrt::4096      d::10  ->  d::10 0.0217 0.0056 3.87
sqrt::1024      f::1  ->  f::1 0.0003 0.0002 1.52
sqrt::2048      f::1  ->  f::1 0.0005 0.0004 1.26
sqrt::4096      f::1  ->  f::1 0.0009 0.0009 1.1
sqrt::1024      f::1  ->  f::2 0.0037 0.0003 12.14
sqrt::2048      f::1  ->  f::2 0.0074 0.0006 12.54
sqrt::4096      f::1  ->  f::2 0.0148 0.0014 10.56
sqrt::1024      f::1  ->  f::10 0.0037 0.0007 5.51
sqrt::2048      f::1  ->  f::10 0.0074 0.0013 5.62
sqrt::4096      f::1  ->  f::10 0.0148 0.0026 5.66
sqrt::1024      f::2  ->  f::1 0.0004 0.0003 1.5
sqrt::2048      f::2  ->  f::1 0.0006 0.0005 1.27
sqrt::4096      f::2  ->  f::1 0.0013 0.0012 1.12
sqrt::1024      f::2  ->  f::2 0.0037 0.0005 7.36
sqrt::2048      f::2  ->  f::2 0.0074 0.0010 7.32
sqrt::4096      f::2  ->  f::2 0.0148 0.0020 7.33
sqrt::1024      f::2  ->  f::10 0.0037 0.0007 5.12
sqrt::2048      f::2  ->  f::10 0.0074 0.0014 5.21
sqrt::4096      f::2  ->  f::10 0.0148 0.0028 5.35
sqrt::1024      f::10  ->  f::1 0.0004 0.0004 1.07
sqrt::2048      f::10  ->  f::1 0.0008 0.0008 0.98
sqrt::4096      f::10  ->  f::1 0.0015 0.0016 0.92
sqrt::1024      f::10  ->  f::2 0.0037 0.0006 6.42
sqrt::2048      f::10  ->  f::2 0.0074 0.0011 6.56
sqrt::4096      f::10  ->  f::2 0.0148 0.0023 6.47
sqrt::1024      f::10  ->  f::10 0.0037 0.0008 4.64
sqrt::2048      f::10  ->  f::10 0.0074 0.0016 4.66
sqrt::4096      f::10  ->  f::10 0.0148 0.0032 4.63
square::1024      d::1  ->  d::1 0.0004 0.0002 2.09
square::2048      d::1  ->  d::1 0.0006 0.0004 1.66
square::4096      d::1  ->  d::1 0.0011 0.0009 1.27
square::1024      d::1  ->  d::2 0.0009 0.0003 2.86
square::2048      d::1  ->  d::2 0.0017 0.0009 1.91
square::4096      d::1  ->  d::2 0.0034 0.0018 1.93
square::1024      d::1  ->  d::10 0.0010 0.0010 0.99
square::2048      d::1  ->  d::10 0.0020 0.0020 0.99
square::4096      d::1  ->  d::10 0.0040 0.0041 0.99
square::1024      d::2  ->  d::1 0.0006 0.0003 2.28
square::2048      d::2  ->  d::1 0.0011 0.0007 1.64
square::4096      d::2  ->  d::1 0.0020 0.0013 1.56
square::1024      d::2  ->  d::2 0.0009 0.0004 2.38
square::2048      d::2  ->  d::2 0.0017 0.0009 1.93
square::4096      d::2  ->  d::2 0.0034 0.0018 1.93
square::1024      d::2  ->  d::10 0.0011 0.0011 0.99
square::2048      d::2  ->  d::10 0.0022 0.0022 1.0
square::4096      d::2  ->  d::10 0.0044 0.0043 1.0
square::1024      d::10  ->  d::1 0.0006 0.0005 1.16
square::2048      d::10  ->  d::1 0.0012 0.0011 1.1
square::4096      d::10  ->  d::1 0.0022 0.0021 1.01
square::1024      d::10  ->  d::2 0.0009 0.0007 1.25
square::2048      d::10  ->  d::2 0.0017 0.0014 1.23
square::4096      d::10  ->  d::2 0.0034 0.0028 1.22
square::1024      d::10  ->  d::10 0.0014 0.0014 1.02
square::2048      d::10  ->  d::10 0.0029 0.0028 1.02
square::4096      d::10  ->  d::10 0.0058 0.0062 0.94
square::1024      f::1  ->  f::1 0.0003 0.0001 2.72
square::2048      f::1  ->  f::1 0.0005 0.0002 2.05
square::4096      f::1  ->  f::1 0.0008 0.0004 1.91
square::1024      f::1  ->  f::2 0.0009 0.0003 2.88
square::2048      f::1  ->  f::2 0.0017 0.0006 2.91
square::4096      f::1  ->  f::2 0.0034 0.0014 2.42
square::1024      f::1  ->  f::10 0.0009 0.0007 1.31
square::2048      f::1  ->  f::10 0.0017 0.0013 1.28
square::4096      f::1  ->  f::10 0.0034 0.0026 1.32
square::1024      f::2  ->  f::1 0.0004 0.0002 1.48
square::2048      f::2  ->  f::1 0.0006 0.0005 1.25
square::4096      f::2  ->  f::1 0.0013 0.0012 1.09
square::1024      f::2  ->  f::2 0.0009 0.0005 1.86
square::2048      f::2  ->  f::2 0.0017 0.0009 1.84
square::4096      f::2  ->  f::2 0.0034 0.0018 1.89
square::1024      f::2  ->  f::10 0.0009 0.0007 1.17
square::2048      f::2  ->  f::10 0.0017 0.0014 1.19
square::4096      f::2  ->  f::10 0.0034 0.0028 1.23
square::1024      f::10  ->  f::1 0.0004 0.0004 1.02
square::2048      f::10  ->  f::1 0.0008 0.0008 0.96
square::4096      f::10  ->  f::1 0.0015 0.0016 0.93
square::1024      f::10  ->  f::2 0.0009 0.0005 1.58
square::2048      f::10  ->  f::2 0.0017 0.0011 1.6
square::4096      f::10  ->  f::2 0.0034 0.0021 1.64
square::1024      f::10  ->  f::10 0.0009 0.0008 1.08
square::2048      f::10  ->  f::10 0.0017 0.0016 1.06
square::4096      f::10  ->  f::10 0.0034 0.0032 1.06
SSE3 - Contiguous only

metric: gmean, units: ms

name of test before_sse3 after_sse3 after_sse3 vs before_sse3
absolute::1024      d::1  ->  d::1 0.0004 0.0002 2.28
absolute::2048      d::1  ->  d::1 0.0006 0.0004 1.59
absolute::4096      d::1  ->  d::1 0.0011 0.0009 1.19
absolute::1024      f::1  ->  f::1 0.0003 0.0001 3.33
absolute::2048      f::1  ->  f::1 0.0004 0.0002 2.35
absolute::4096      f::1  ->  f::1 0.0006 0.0004 1.48
reciprocal::1024      d::1  ->  d::1 0.0006 0.0006 1.14
reciprocal::2048      d::1  ->  d::1 0.0012 0.0011 1.06
reciprocal::4096      d::1  ->  d::1 0.0023 0.0023 1.01
reciprocal::1024      f::1  ->  f::1 0.0003 0.0002 1.55
reciprocal::2048      f::1  ->  f::1 0.0005 0.0004 1.16
reciprocal::4096      f::1  ->  f::1 0.0010 0.0009 1.16
sqrt::1024      d::1  ->  d::1 0.0009 0.0009 1.03
sqrt::2048      d::1  ->  d::1 0.0017 0.0017 1.01
sqrt::4096      d::1  ->  d::1 0.0034 0.0034 1.0
sqrt::1024      f::1  ->  f::1 0.0003 0.0002 1.55
sqrt::2048      f::1  ->  f::1 0.0005 0.0004 1.26
sqrt::4096      f::1  ->  f::1 0.0009 0.0009 1.1
square::1024      d::1  ->  d::1 0.0004 0.0002 2.11
square::2048      d::1  ->  d::1 0.0006 0.0004 1.65
square::4096      d::1  ->  d::1 0.0011 0.0009 1.28
square::1024      f::1  ->  f::1 0.0003 0.0001 2.76
square::2048      f::1  ->  f::1 0.0005 0.0002 2.04
square::4096      f::1  ->  f::1 0.0008 0.0004 1.87
SSE3

metric: gmean, units: ms

name of test before_sse3 after_sse3 after_sse3 vs before_sse3
absolute::1024      d::1  ->  d::1 0.0004 0.0002 2.23
absolute::2048      d::1  ->  d::1 0.0006 0.0004 1.58
absolute::4096      d::1  ->  d::1 0.0011 0.0009 1.24
absolute::1024      d::1  ->  d::2 0.0012 0.0003 3.78
absolute::2048      d::1  ->  d::2 0.0023 0.0009 2.6
absolute::4096      d::1  ->  d::2 0.0046 0.0018 2.59
absolute::1024      d::1  ->  d::10 0.0012 0.0010 1.11
absolute::2048      d::1  ->  d::10 0.0023 0.0020 1.12
absolute::4096      d::1  ->  d::10 0.0046 0.0041 1.13
absolute::1024      d::2  ->  d::1 0.0006 0.0002 2.63
absolute::2048      d::2  ->  d::1 0.0011 0.0007 1.64
absolute::4096      d::2  ->  d::1 0.0021 0.0013 1.6
absolute::1024      d::2  ->  d::2 0.0012 0.0005 2.47
absolute::2048      d::2  ->  d::2 0.0023 0.0009 2.44
absolute::4096      d::2  ->  d::2 0.0046 0.0018 2.48
absolute::1024      d::2  ->  d::10 0.0012 0.0011 1.07
absolute::2048      d::2  ->  d::10 0.0024 0.0022 1.06
absolute::4096      d::2  ->  d::10 0.0048 0.0044 1.09
absolute::1024      d::10  ->  d::1 0.0006 0.0005 1.2
absolute::2048      d::10  ->  d::1 0.0011 0.0011 1.07
absolute::4096      d::10  ->  d::1 0.0022 0.0021 1.04
absolute::1024      d::10  ->  d::2 0.0012 0.0007 1.65
absolute::2048      d::10  ->  d::2 0.0023 0.0014 1.65
absolute::4096      d::10  ->  d::2 0.0046 0.0028 1.64
absolute::1024      d::10  ->  d::10 0.0015 0.0014 1.07
absolute::2048      d::10  ->  d::10 0.0029 0.0028 1.04
absolute::4096      d::10  ->  d::10 0.0058 0.0056 1.04
absolute::1024      f::1  ->  f::1 0.0003 0.0001 3.83
absolute::2048      f::1  ->  f::1 0.0004 0.0002 2.36
absolute::4096      f::1  ->  f::1 0.0006 0.0004 1.48
absolute::1024      f::1  ->  f::2 0.0009 0.0003 2.86
absolute::2048      f::1  ->  f::2 0.0017 0.0006 2.89
absolute::4096      f::1  ->  f::2 0.0034 0.0014 2.46
absolute::1024      f::1  ->  f::10 0.0009 0.0007 1.26
absolute::2048      f::1  ->  f::10 0.0017 0.0013 1.31
absolute::4096      f::1  ->  f::10 0.0034 0.0026 1.3
absolute::1024      f::2  ->  f::1 0.0004 0.0003 1.5
absolute::2048      f::2  ->  f::1 0.0006 0.0005 1.24
absolute::4096      f::2  ->  f::1 0.0013 0.0012 1.11
absolute::1024      f::2  ->  f::2 0.0009 0.0005 1.94
absolute::2048      f::2  ->  f::2 0.0017 0.0009 1.89
absolute::4096      f::2  ->  f::2 0.0034 0.0018 1.95
absolute::1024      f::2  ->  f::10 0.0009 0.0007 1.2
absolute::2048      f::2  ->  f::10 0.0017 0.0014 1.22
absolute::4096      f::2  ->  f::10 0.0034 0.0028 1.24
absolute::1024      f::10  ->  f::1 0.0005 0.0004 1.08
absolute::2048      f::10  ->  f::1 0.0008 0.0008 1.0
absolute::4096      f::10  ->  f::1 0.0015 0.0016 0.94
absolute::1024      f::10  ->  f::2 0.0009 0.0005 1.61
absolute::2048      f::10  ->  f::2 0.0017 0.0011 1.63
absolute::4096      f::10  ->  f::2 0.0034 0.0021 1.65
absolute::1024      f::10  ->  f::10 0.0009 0.0008 1.1
absolute::2048      f::10  ->  f::10 0.0017 0.0016 1.1
absolute::4096      f::10  ->  f::10 0.0035 0.0032 1.1
reciprocal::1024      d::1  ->  d::1 0.0006 0.0006 1.14
reciprocal::2048      d::1  ->  d::1 0.0012 0.0011 1.05
reciprocal::4096      d::1  ->  d::1 0.0023 0.0023 1.02
reciprocal::1024      d::1  ->  d::2 0.0011 0.0006 2.0
reciprocal::2048      d::1  ->  d::2 0.0023 0.0011 1.99
reciprocal::4096      d::1  ->  d::2 0.0046 0.0023 2.0
reciprocal::1024      d::1  ->  d::10 0.0011 0.0010 1.13
reciprocal::2048      d::1  ->  d::10 0.0023 0.0020 1.13
reciprocal::4096      d::1  ->  d::10 0.0046 0.0041 1.12
reciprocal::1024      d::2  ->  d::1 0.0007 0.0006 1.15
reciprocal::2048      d::2  ->  d::1 0.0012 0.0011 1.06
reciprocal::4096      d::2  ->  d::1 0.0023 0.0023 1.01
reciprocal::1024      d::2  ->  d::2 0.0011 0.0006 1.97
reciprocal::2048      d::2  ->  d::2 0.0023 0.0011 1.98
reciprocal::4096      d::2  ->  d::2 0.0046 0.0023 1.99
reciprocal::1024      d::2  ->  d::10 0.0011 0.0011 1.06
reciprocal::2048      d::2  ->  d::10 0.0023 0.0022 1.05
reciprocal::4096      d::2  ->  d::10 0.0046 0.0044 1.05
reciprocal::1024      d::10  ->  d::1 0.0007 0.0006 1.16
reciprocal::2048      d::10  ->  d::1 0.0012 0.0015 0.82
reciprocal::4096      d::10  ->  d::1 0.0024 0.0023 1.03
reciprocal::1024      d::10  ->  d::2 0.0011 0.0007 1.63
reciprocal::2048      d::10  ->  d::2 0.0023 0.0014 1.64
reciprocal::4096      d::10  ->  d::2 0.0046 0.0028 1.63
reciprocal::1024      d::10  ->  d::10 0.0014 0.0014 1.01
reciprocal::2048      d::10  ->  d::10 0.0028 0.0028 1.02
reciprocal::4096      d::10  ->  d::10 0.0059 0.0056 1.04
reciprocal::1024      f::1  ->  f::1 0.0003 0.0002 1.53
reciprocal::2048      f::1  ->  f::1 0.0005 0.0004 1.17
reciprocal::4096      f::1  ->  f::1 0.0010 0.0009 1.18
reciprocal::1024      f::1  ->  f::2 0.0009 0.0003 2.78
reciprocal::2048      f::1  ->  f::2 0.0017 0.0006 2.89
reciprocal::4096      f::1  ->  f::2 0.0034 0.0014 2.45
reciprocal::1024      f::1  ->  f::10 0.0009 0.0007 1.3
reciprocal::2048      f::1  ->  f::10 0.0017 0.0013 1.29
reciprocal::4096      f::1  ->  f::10 0.0034 0.0026 1.33
reciprocal::1024      f::2  ->  f::1 0.0004 0.0003 1.44
reciprocal::2048      f::2  ->  f::1 0.0006 0.0005 1.2
reciprocal::4096      f::2  ->  f::1 0.0013 0.0012 1.11
reciprocal::1024      f::2  ->  f::2 0.0009 0.0005 1.74
reciprocal::2048      f::2  ->  f::2 0.0017 0.0010 1.7
reciprocal::4096      f::2  ->  f::2 0.0034 0.0020 1.7
reciprocal::1024      f::2  ->  f::10 0.0009 0.0007 1.18
reciprocal::2048      f::2  ->  f::10 0.0017 0.0014 1.2
reciprocal::4096      f::2  ->  f::10 0.0034 0.0028 1.2
reciprocal::1024      f::10  ->  f::1 0.0004 0.0004 1.06
reciprocal::2048      f::10  ->  f::1 0.0008 0.0008 0.97
reciprocal::4096      f::10  ->  f::1 0.0015 0.0017 0.89
reciprocal::1024      f::10  ->  f::2 0.0009 0.0006 1.5
reciprocal::2048      f::10  ->  f::2 0.0017 0.0011 1.51
reciprocal::4096      f::10  ->  f::2 0.0034 0.0022 1.53
reciprocal::1024      f::10  ->  f::10 0.0009 0.0008 1.09
reciprocal::2048      f::10  ->  f::10 0.0017 0.0016 1.08
reciprocal::4096      f::10  ->  f::10 0.0034 0.0032 1.07
sqrt::1024      d::1  ->  d::1 0.0009 0.0009 1.03
sqrt::2048      d::1  ->  d::1 0.0017 0.0017 1.01
sqrt::4096      d::1  ->  d::1 0.0052 0.0034 1.53
sqrt::1024      d::1  ->  d::2 0.0054 0.0009 6.33
sqrt::2048      d::1  ->  d::2 0.0108 0.0017 6.32
sqrt::4096      d::1  ->  d::2 0.0217 0.0034 6.35
sqrt::1024      d::1  ->  d::10 0.0054 0.0010 5.15
sqrt::2048      d::1  ->  d::10 0.0108 0.0020 5.3
sqrt::4096      d::1  ->  d::10 0.0216 0.0041 5.33
sqrt::1024      d::2  ->  d::1 0.0009 0.0009 1.03
sqrt::2048      d::2  ->  d::1 0.0017 0.0017 1.01
sqrt::4096      d::2  ->  d::1 0.0034 0.0034 1.0
sqrt::1024      d::2  ->  d::2 0.0054 0.0009 6.33
sqrt::2048      d::2  ->  d::2 0.0131 0.0017 7.67
sqrt::4096      d::2  ->  d::2 0.0216 0.0034 6.34
sqrt::1024      d::2  ->  d::10 0.0054 0.0011 4.92
sqrt::2048      d::2  ->  d::10 0.0108 0.0022 4.97
sqrt::4096      d::2  ->  d::10 0.0217 0.0044 4.94
sqrt::1024      d::10  ->  d::1 0.0009 0.0009 1.03
sqrt::2048      d::10  ->  d::1 0.0017 0.0017 1.01
sqrt::4096      d::10  ->  d::1 0.0034 0.0034 1.0
sqrt::1024      d::10  ->  d::2 0.0054 0.0009 6.32
sqrt::2048      d::10  ->  d::2 0.0108 0.0017 6.34
sqrt::4096      d::10  ->  d::2 0.0216 0.0034 6.33
sqrt::1024      d::10  ->  d::10 0.0054 0.0014 3.86
sqrt::2048      d::10  ->  d::10 0.0108 0.0028 3.87
sqrt::4096      d::10  ->  d::10 0.0217 0.0056 3.87
sqrt::1024      f::1  ->  f::1 0.0003 0.0002 1.54
sqrt::2048      f::1  ->  f::1 0.0005 0.0004 1.25
sqrt::4096      f::1  ->  f::1 0.0009 0.0009 1.1
sqrt::1024      f::1  ->  f::2 0.0037 0.0003 12.14
sqrt::2048      f::1  ->  f::2 0.0074 0.0006 12.51
sqrt::4096      f::1  ->  f::2 0.0148 0.0014 10.57
sqrt::1024      f::1  ->  f::10 0.0037 0.0007 5.51
sqrt::2048      f::1  ->  f::10 0.0074 0.0013 5.62
sqrt::4096      f::1  ->  f::10 0.0148 0.0026 5.66
sqrt::1024      f::2  ->  f::1 0.0004 0.0003 1.5
sqrt::2048      f::2  ->  f::1 0.0006 0.0005 1.25
sqrt::4096      f::2  ->  f::1 0.0013 0.0012 1.12
sqrt::1024      f::2  ->  f::2 0.0037 0.0005 7.36
sqrt::2048      f::2  ->  f::2 0.0074 0.0010 7.32
sqrt::4096      f::2  ->  f::2 0.0148 0.0020 7.33
sqrt::1024      f::2  ->  f::10 0.0037 0.0007 5.12
sqrt::2048      f::2  ->  f::10 0.0074 0.0014 5.21
sqrt::4096      f::2  ->  f::10 0.0148 0.0028 5.35
sqrt::1024      f::10  ->  f::1 0.0004 0.0004 1.08
sqrt::2048      f::10  ->  f::1 0.0008 0.0008 0.98
sqrt::4096      f::10  ->  f::1 0.0015 0.0016 0.93
sqrt::1024      f::10  ->  f::2 0.0037 0.0006 6.42
sqrt::2048      f::10  ->  f::2 0.0074 0.0011 6.56
sqrt::4096      f::10  ->  f::2 0.0148 0.0023 6.47
sqrt::1024      f::10  ->  f::10 0.0037 0.0008 4.64
sqrt::2048      f::10  ->  f::10 0.0074 0.0016 4.66
sqrt::4096      f::10  ->  f::10 0.0148 0.0032 4.63
square::1024      d::1  ->  d::1 0.0004 0.0002 2.12
square::2048      d::1  ->  d::1 0.0006 0.0004 1.62
square::4096      d::1  ->  d::1 0.0011 0.0009 1.27
square::1024      d::1  ->  d::2 0.0009 0.0003 2.86
square::2048      d::1  ->  d::2 0.0017 0.0009 1.91
square::4096      d::1  ->  d::2 0.0034 0.0018 1.93
square::1024      d::1  ->  d::10 0.0010 0.0010 1.01
square::2048      d::1  ->  d::10 0.0020 0.0020 1.0
square::4096      d::1  ->  d::10 0.0041 0.0041 0.99
square::1024      d::2  ->  d::1 0.0006 0.0003 2.31
square::2048      d::2  ->  d::1 0.0011 0.0007 1.61
square::4096      d::2  ->  d::1 0.0021 0.0013 1.59
square::1024      d::2  ->  d::2 0.0009 0.0004 2.38
square::2048      d::2  ->  d::2 0.0017 0.0009 1.93
square::4096      d::2  ->  d::2 0.0034 0.0018 1.93
square::1024      d::2  ->  d::10 0.0011 0.0011 0.99
square::2048      d::2  ->  d::10 0.0022 0.0022 1.0
square::4096      d::2  ->  d::10 0.0043 0.0043 0.99
square::1024      d::10  ->  d::1 0.0006 0.0005 1.18
square::2048      d::10  ->  d::1 0.0011 0.0011 1.05
square::4096      d::10  ->  d::1 0.0022 0.0021 1.04
square::1024      d::10  ->  d::2 0.0009 0.0007 1.25
square::2048      d::10  ->  d::2 0.0017 0.0014 1.23
square::4096      d::10  ->  d::2 0.0034 0.0028 1.22
square::1024      d::10  ->  d::10 0.0014 0.0014 1.02
square::2048      d::10  ->  d::10 0.0029 0.0028 1.02
square::4096      d::10  ->  d::10 0.0058 0.0062 0.93
square::1024      f::1  ->  f::1 0.0003 0.0001 2.68
square::2048      f::1  ->  f::1 0.0005 0.0002 2.04
square::4096      f::1  ->  f::1 0.0008 0.0004 1.92
square::1024      f::1  ->  f::2 0.0009 0.0003 2.88
square::2048      f::1  ->  f::2 0.0017 0.0006 2.91
square::4096      f::1  ->  f::2 0.0034 0.0014 2.42
square::1024      f::1  ->  f::10 0.0009 0.0007 1.31
square::2048      f::1  ->  f::10 0.0017 0.0013 1.28
square::4096      f::1  ->  f::10 0.0034 0.0026 1.32
square::1024      f::2  ->  f::1 0.0004 0.0002 1.49
square::2048      f::2  ->  f::1 0.0006 0.0005 1.25
square::4096      f::2  ->  f::1 0.0013 0.0012 1.09
square::1024      f::2  ->  f::2 0.0009 0.0005 1.86
square::2048      f::2  ->  f::2 0.0017 0.0009 1.84
square::4096      f::2  ->  f::2 0.0034 0.0018 1.89
square::1024      f::2  ->  f::10 0.0009 0.0007 1.17
square::2048      f::2  ->  f::10 0.0017 0.0014 1.19
square::4096      f::2  ->  f::10 0.0034 0.0028 1.23
square::1024      f::10  ->  f::1 0.0004 0.0004 1.03
square::2048      f::10  ->  f::1 0.0008 0.0008 0.94
square::4096      f::10  ->  f::1 0.0015 0.0016 0.94
square::1024      f::10  ->  f::2 0.0009 0.0005 1.58
square::2048      f::10  ->  f::2 0.0017 0.0011 1.6
square::4096      f::10  ->  f::2 0.0034 0.0021 1.64
square::1024      f::10  ->  f::10 0.0009 0.0008 1.08
square::2048      f::10  ->  f::10 0.0017 0.0016 1.06
square::4096      f::10  ->  f::10 0.0034 0.0032 1.06
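
A note on reading the last column: the tables print `gmean` timings rounded to four decimals, so the ratio does not always equal the quotient of the two printed values (for example `0.0004 0.0002 2.28`). A minimal sketch of how such a row could be produced, assuming the ratio is computed as before/after from the unrounded geometric means; the sample timings and the `speedup` helper are hypothetical and not taken from the benchmark tool:

```python
from statistics import geometric_mean


def speedup(before_ms, after_ms):
    """Ratio of geometric means; values above 1.0 mean 'after' is faster."""
    return geometric_mean(before_ms) / geometric_mean(after_ms)


# Hypothetical per-sample wall times in ms (illustrative only).
before = [0.00040, 0.00044, 0.00042]
after = [0.00018, 0.00019, 0.00018]

ratio = speedup(before, after)

# Print one row in the same layout as the tables: before, after, ratio.
# The timing columns are rounded for display, the ratio is not derived from them.
print(f"{geometric_mean(before):.4f} {geometric_mean(after):.4f} {ratio:.2f}")
```

Under that assumption, two rows that show identical rounded timings can still carry different ratios.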

ARMv8 64-bit

CPU
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           1
Model name:                      Neoverse-N1
Stepping:                        r3p1
BogoMIPS:                        243.75
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        4 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

OS
Linux ip-172-31-6-63 5.4.0-1024-aws #24-Ubuntu SMP Sat Sep 5 06:17:48 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
gcc-7 (Ubuntu/Linaro 7.5.0-6ubuntu2) 7.5.0

Benchmark

ASIMD - Contiguous only

metric: gmean, units: ms

name of test before_contig after_contig after_contig vs before_contig
absolute::1024      d::1  ->  d::1 0.0011 0.0002 4.93
absolute::2048      d::1  ->  d::1 0.0023 0.0005 4.77
absolute::4096      d::1  ->  d::1 0.0045 0.0009 5.0
absolute::1024      f::1  ->  f::1 0.0011 0.0001 8.9
absolute::2048      f::1  ->  f::1 0.0023 0.0002 9.44
absolute::4096      f::1  ->  f::1 0.0045 0.0005 9.62
reciprocal::1024      d::1  ->  d::1 0.0020 0.0020 1.0
reciprocal::2048      d::1  ->  d::1 0.0041 0.0041 1.0
reciprocal::4096      d::1  ->  d::1 0.0082 0.0082 1.0
reciprocal::1024      f::1  ->  f::1 0.0006 0.0006 1.0
reciprocal::2048      f::1  ->  f::1 0.0012 0.0012 1.0
reciprocal::4096      f::1  ->  f::1 0.0025 0.0025 1.0
sqrt::1024      d::1  ->  d::1 0.0029 0.0029 1.0
sqrt::2048      d::1  ->  d::1 0.0057 0.0057 1.0
sqrt::4096      d::1  ->  d::1 0.0115 0.0115 1.0
sqrt::1024      f::1  ->  f::1 0.0010 0.0007 1.44
sqrt::2048      f::1  ->  f::1 0.0021 0.0014 1.44
sqrt::4096      f::1  ->  f::1 0.0041 0.0029 1.43
square::1024      d::1  ->  d::1 0.0004 0.0002 1.58
square::2048      d::1  ->  d::1 0.0007 0.0004 1.63
square::4096      d::1  ->  d::1 0.0014 0.0009 1.61
square::1024      f::1  ->  f::1 0.0002 0.0001 1.77
square::2048      f::1  ->  f::1 0.0004 0.0002 1.74
square::4096      f::1  ->  f::1 0.0008 0.0005 1.73
ASIMD

metric: gmean, units: ms

name of test before after after vs before
absolute::1024      d::1  ->  d::1 0.0011 0.0002 4.93
absolute::2048      d::1  ->  d::1 0.0023 0.0005 5.01
absolute::4096      d::1  ->  d::1 0.0045 0.0010 4.71
absolute::1024      d::1  ->  d::2 0.0011 0.0004 2.7
absolute::2048      d::1  ->  d::2 0.0023 0.0008 2.67
absolute::4096      d::1  ->  d::2 0.0047 0.0021 2.26
absolute::1024      d::1  ->  d::10 0.0015 0.0011 1.42
absolute::2048      d::1  ->  d::10 0.0029 0.0028 1.01
absolute::4096      d::1  ->  d::10 0.0056 0.0061 0.92
absolute::1024      d::2  ->  d::1 0.0011 0.0005 2.17
absolute::2048      d::2  ->  d::1 0.0023 0.0010 2.18
absolute::4096      d::2  ->  d::1 0.0046 0.0021 2.16
absolute::1024      d::2  ->  d::2 0.0012 0.0007 1.76
absolute::2048      d::2  ->  d::2 0.0023 0.0013 1.77
absolute::4096      d::2  ->  d::2 0.0047 0.0030 1.57
absolute::1024      d::2  ->  d::10 0.0015 0.0014 1.08
absolute::2048      d::2  ->  d::10 0.0030 0.0028 1.07
absolute::4096      d::2  ->  d::10 0.0060 0.0055 1.09
absolute::1024      d::10  ->  d::1 0.0015 0.0007 2.23
absolute::2048      d::10  ->  d::1 0.0030 0.0021 1.45
absolute::4096      d::10  ->  d::1 0.0064 0.0041 1.56
absolute::1024      d::10  ->  d::2 0.0017 0.0014 1.21
absolute::2048      d::10  ->  d::2 0.0033 0.0028 1.19
absolute::4096      d::10  ->  d::2 0.0066 0.0056 1.19
absolute::1024      d::10  ->  d::10 0.0024 0.0021 1.14
absolute::2048      d::10  ->  d::10 0.0047 0.0042 1.13
absolute::4096      d::10  ->  d::10 0.0101 0.0084 1.2
absolute::1024      f::1  ->  f::1 0.0011 0.0001 8.89
absolute::2048      f::1  ->  f::1 0.0023 0.0002 9.47
absolute::4096      f::1  ->  f::1 0.0045 0.0005 9.63
absolute::1024      f::1  ->  f::2 0.0011 0.0004 2.85
absolute::2048      f::1  ->  f::2 0.0023 0.0008 2.88
absolute::4096      f::1  ->  f::2 0.0045 0.0016 2.88
absolute::1024      f::1  ->  f::10 0.0011 0.0005 2.52
absolute::2048      f::1  ->  f::10 0.0027 0.0019 1.39
absolute::4096      f::1  ->  f::10 0.0049 0.0038 1.28
absolute::1024      f::2  ->  f::1 0.0011 0.0003 3.74
absolute::2048      f::2  ->  f::1 0.0023 0.0006 3.86
absolute::4096      f::2  ->  f::1 0.0045 0.0012 3.93
absolute::1024      f::2  ->  f::2 0.0011 0.0006 1.96
absolute::2048      f::2  ->  f::2 0.0023 0.0011 1.98
absolute::4096      f::2  ->  f::2 0.0046 0.0023 1.98
absolute::1024      f::2  ->  f::10 0.0012 0.0006 1.95
absolute::2048      f::2  ->  f::10 0.0025 0.0019 1.32
absolute::4096      f::2  ->  f::10 0.0050 0.0037 1.34
absolute::1024      f::10  ->  f::1 0.0011 0.0007 1.7
absolute::2048      f::10  ->  f::1 0.0024 0.0016 1.55
absolute::4096      f::10  ->  f::1 0.0048 0.0031 1.57
absolute::1024      f::10  ->  f::2 0.0011 0.0009 1.27
absolute::2048      f::10  ->  f::2 0.0025 0.0020 1.2
absolute::4096      f::10  ->  f::2 0.0049 0.0040 1.22
absolute::1024      f::10  ->  f::10 0.0014 0.0012 1.13
absolute::2048      f::10  ->  f::10 0.0029 0.0028 1.03
absolute::4096      f::10  ->  f::10 0.0058 0.0056 1.03
reciprocal::1024      d::1  ->  d::1 0.0020 0.0020 1.0
reciprocal::2048      d::1  ->  d::1 0.0041 0.0041 1.0
reciprocal::4096      d::1  ->  d::1 0.0082 0.0082 1.0
reciprocal::1024      d::1  ->  d::2 0.0020 0.0020 1.0
reciprocal::2048      d::1  ->  d::2 0.0041 0.0041 1.0
reciprocal::4096      d::1  ->  d::2 0.0082 0.0082 1.0
reciprocal::1024      d::1  ->  d::10 0.0020 0.0020 1.0
reciprocal::2048      d::1  ->  d::10 0.0041 0.0041 1.0
reciprocal::4096      d::1  ->  d::10 0.0082 0.0082 1.0
reciprocal::1024      d::2  ->  d::1 0.0020 0.0020 1.0
reciprocal::2048      d::2  ->  d::1 0.0041 0.0041 1.0
reciprocal::4096      d::2  ->  d::1 0.0082 0.0082 1.0
reciprocal::1024      d::2  ->  d::2 0.0020 0.0020 1.0
reciprocal::2048      d::2  ->  d::2 0.0041 0.0041 1.0
reciprocal::4096      d::2  ->  d::2 0.0082 0.0082 1.0
reciprocal::1024      d::2  ->  d::10 0.0021 0.0021 1.0
reciprocal::2048      d::2  ->  d::10 0.0041 0.0041 1.0
reciprocal::4096      d::2  ->  d::10 0.0082 0.0082 1.0
reciprocal::1024      d::10  ->  d::1 0.0020 0.0020 1.0
reciprocal::2048      d::10  ->  d::1 0.0041 0.0041 1.0
reciprocal::4096      d::10  ->  d::1 0.0082 0.0082 1.0
reciprocal::1024      d::10  ->  d::2 0.0020 0.0021 1.0
reciprocal::2048      d::10  ->  d::2 0.0041 0.0041 1.0
reciprocal::4096      d::10  ->  d::2 0.0082 0.0083 0.99
reciprocal::1024      d::10  ->  d::10 0.0022 0.0021 1.04
reciprocal::2048      d::10  ->  d::10 0.0044 0.0043 1.03
reciprocal::4096      d::10  ->  d::10 0.0097 0.0087 1.11
reciprocal::1024      f::1  ->  f::1 0.0006 0.0006 1.0
reciprocal::2048      f::1  ->  f::1 0.0012 0.0012 1.0
reciprocal::4096      f::1  ->  f::1 0.0025 0.0025 1.0
reciprocal::1024      f::1  ->  f::2 0.0008 0.0006 1.35
reciprocal::2048      f::1  ->  f::2 0.0016 0.0012 1.34
reciprocal::4096      f::1  ->  f::2 0.0033 0.0025 1.33
reciprocal::1024      f::1  ->  f::10 0.0008 0.0006 1.35
reciprocal::2048      f::1  ->  f::10 0.0019 0.0020 0.97
reciprocal::4096      f::1  ->  f::10 0.0036 0.0039 0.94
reciprocal::1024      f::2  ->  f::1 0.0008 0.0006 1.35
reciprocal::2048      f::2  ->  f::1 0.0016 0.0012 1.34
reciprocal::4096      f::2  ->  f::1 0.0033 0.0025 1.33
reciprocal::1024      f::2  ->  f::2 0.0008 0.0007 1.27
reciprocal::2048      f::2  ->  f::2 0.0016 0.0012 1.32
reciprocal::4096      f::2  ->  f::2 0.0033 0.0025 1.32
reciprocal::1024      f::2  ->  f::10 0.0008 0.0006 1.3
reciprocal::2048      f::2  ->  f::10 0.0019 0.0020 0.97
reciprocal::4096      f::2  ->  f::10 0.0038 0.0039 0.96
reciprocal::1024      f::10  ->  f::1 0.0008 0.0008 1.11
reciprocal::2048      f::10  ->  f::1 0.0017 0.0017 1.02
reciprocal::4096      f::10  ->  f::1 0.0034 0.0033 1.03
reciprocal::1024      f::10  ->  f::2 0.0008 0.0010 0.86
reciprocal::2048      f::10  ->  f::2 0.0018 0.0022 0.85
reciprocal::4096      f::10  ->  f::2 0.0037 0.0044 0.85
reciprocal::1024      f::10  ->  f::10 0.0012 0.0013 0.92
reciprocal::2048      f::10  ->  f::10 0.0028 0.0029 0.98
reciprocal::4096      f::10  ->  f::10 0.0056 0.0059 0.96
sqrt::1024      d::1  ->  d::1 0.0029 0.0029 1.0
sqrt::2048      d::1  ->  d::1 0.0057 0.0057 1.0
sqrt::4096      d::1  ->  d::1 0.0115 0.0115 1.0
sqrt::1024      d::1  ->  d::2 0.0029 0.0029 1.0
sqrt::2048      d::1  ->  d::2 0.0057 0.0057 1.0
sqrt::4096      d::1  ->  d::2 0.0115 0.0115 1.0
sqrt::1024      d::1  ->  d::10 0.0029 0.0029 1.0
sqrt::2048      d::1  ->  d::10 0.0057 0.0057 1.0
sqrt::4096      d::1  ->  d::10 0.0115 0.0115 1.0
sqrt::1024      d::2  ->  d::1 0.0029 0.0029 1.0
sqrt::2048      d::2  ->  d::1 0.0057 0.0057 1.0
sqrt::4096      d::2  ->  d::1 0.0115 0.0115 1.0
sqrt::1024      d::2  ->  d::2 0.0029 0.0029 1.0
sqrt::2048      d::2  ->  d::2 0.0057 0.0057 1.0
sqrt::4096      d::2  ->  d::2 0.0115 0.0115 1.0
sqrt::1024      d::2  ->  d::10 0.0029 0.0029 1.0
sqrt::2048      d::2  ->  d::10 0.0057 0.0057 1.0
sqrt::4096      d::2  ->  d::10 0.0115 0.0115 1.0
sqrt::1024      d::10  ->  d::1 0.0029 0.0029 1.0
sqrt::2048      d::10  ->  d::1 0.0057 0.0057 1.0
sqrt::4096      d::10  ->  d::1 0.0118 0.0116 1.02
sqrt::1024      d::10  ->  d::2 0.0029 0.0029 1.0
sqrt::2048      d::10  ->  d::2 0.0057 0.0057 1.0
sqrt::4096      d::10  ->  d::2 0.0115 0.0115 1.0
sqrt::1024      d::10  ->  d::10 0.0029 0.0029 1.0
sqrt::2048      d::10  ->  d::10 0.0057 0.0057 1.0
sqrt::4096      d::10  ->  d::10 0.0118 0.0118 1.0
sqrt::1024      f::1  ->  f::1 0.0010 0.0007 1.45
sqrt::2048      f::1  ->  f::1 0.0021 0.0014 1.44
sqrt::4096      f::1  ->  f::1 0.0041 0.0029 1.43
sqrt::1024      f::1  ->  f::2 0.0010 0.0007 1.44
sqrt::2048      f::1  ->  f::2 0.0021 0.0014 1.44
sqrt::4096      f::1  ->  f::2 0.0041 0.0029 1.43
sqrt::1024      f::1  ->  f::10 0.0010 0.0007 1.45
sqrt::2048      f::1  ->  f::10 0.0024 0.0019 1.25
sqrt::4096      f::1  ->  f::10 0.0043 0.0039 1.12
sqrt::1024      f::2  ->  f::1 0.0010 0.0007 1.45
sqrt::2048      f::2  ->  f::1 0.0021 0.0014 1.44
sqrt::4096      f::2  ->  f::1 0.0041 0.0029 1.43
sqrt::1024      f::2  ->  f::2 0.0010 0.0008 1.33
sqrt::2048      f::2  ->  f::2 0.0021 0.0015 1.33
sqrt::4096      f::2  ->  f::2 0.0041 0.0031 1.34
sqrt::1024      f::2  ->  f::10 0.0010 0.0008 1.34
sqrt::2048      f::2  ->  f::10 0.0022 0.0020 1.13
sqrt::4096      f::2  ->  f::10 0.0044 0.0039 1.13
sqrt::1024      f::10  ->  f::1 0.0010 0.0008 1.27
sqrt::2048      f::10  ->  f::1 0.0021 0.0017 1.22
sqrt::4096      f::10  ->  f::1 0.0042 0.0034 1.22
sqrt::1024      f::10  ->  f::2 0.0010 0.0010 1.04
sqrt::2048      f::10  ->  f::2 0.0021 0.0022 0.98
sqrt::4096      f::10  ->  f::2 0.0043 0.0044 0.97
sqrt::1024      f::10  ->  f::10 0.0013 0.0014 0.95
sqrt::2048      f::10  ->  f::10 0.0028 0.0029 0.99
sqrt::4096      f::10  ->  f::10 0.0057 0.0057 0.99
square::1024      d::1  ->  d::1 0.0004 0.0002 1.58
square::2048      d::1  ->  d::1 0.0007 0.0005 1.57
square::4096      d::1  ->  d::1 0.0014 0.0009 1.61
square::1024      d::1  ->  d::2 0.0008 0.0004 1.99
square::2048      d::1  ->  d::2 0.0017 0.0008 1.96
square::4096      d::1  ->  d::2 0.0033 0.0022 1.53
square::1024      d::1  ->  d::10 0.0011 0.0010 1.07
square::2048      d::1  ->  d::10 0.0025 0.0025 1.01
square::4096      d::1  ->  d::10 0.0051 0.0055 0.93
square::1024      d::2  ->  d::1 0.0008 0.0005 1.62
square::2048      d::2  ->  d::1 0.0017 0.0010 1.63
square::4096      d::2  ->  d::1 0.0033 0.0021 1.58
square::1024      d::2  ->  d::2 0.0008 0.0007 1.28
square::2048      d::2  ->  d::2 0.0017 0.0013 1.25
square::4096      d::2  ->  d::2 0.0033 0.0030 1.1
square::1024      d::2  ->  d::10 0.0013 0.0013 0.99
square::2048      d::2  ->  d::10 0.0027 0.0027 0.99
square::4096      d::2  ->  d::10 0.0053 0.0053 0.99
square::1024      d::10  ->  d::1 0.0010 0.0005 1.96
square::2048      d::10  ->  d::1 0.0024 0.0021 1.15
square::4096      d::10  ->  d::1 0.0046 0.0040 1.15
square::1024      d::10  ->  d::2 0.0015 0.0014 1.11
square::2048      d::10  ->  d::2 0.0030 0.0028 1.09
square::4096      d::10  ->  d::2 0.0062 0.0055 1.12
square::1024      d::10  ->  d::10 0.0022 0.0021 1.06
square::2048      d::10  ->  d::10 0.0044 0.0042 1.05
square::4096      d::10  ->  d::10 0.0096 0.0091 1.05
square::1024      f::1  ->  f::1 0.0002 0.0001 1.77
square::2048      f::1  ->  f::1 0.0004 0.0002 1.74
square::4096      f::1  ->  f::1 0.0008 0.0005 1.72
square::1024      f::1  ->  f::2 0.0008 0.0004 2.07
square::2048      f::1  ->  f::2 0.0017 0.0008 2.08
square::4096      f::1  ->  f::2 0.0033 0.0016 2.07
square::1024      f::1  ->  f::10 0.0008 0.0004 1.89
square::2048      f::1  ->  f::10 0.0018 0.0019 0.95
square::4096      f::1  ->  f::10 0.0036 0.0038 0.94
square::1024      f::2  ->  f::1 0.0008 0.0003 2.79
square::2048      f::2  ->  f::1 0.0017 0.0006 2.85
square::4096      f::2  ->  f::1 0.0033 0.0011 2.88
square::1024      f::2  ->  f::2 0.0008 0.0006 1.43
square::2048      f::2  ->  f::2 0.0017 0.0011 1.45
square::4096      f::2  ->  f::2 0.0033 0.0023 1.43
square::1024      f::2  ->  f::10 0.0008 0.0006 1.42
square::2048      f::2  ->  f::10 0.0019 0.0018 1.03
square::4096      f::2  ->  f::10 0.0037 0.0037 1.01
square::1024      f::10  ->  f::1 0.0008 0.0007 1.22
square::2048      f::10  ->  f::1 0.0017 0.0016 1.08
square::4096      f::10  ->  f::1 0.0033 0.0031 1.07
square::1024      f::10  ->  f::2 0.0008 0.0009 0.93
square::2048      f::10  ->  f::2 0.0018 0.0020 0.87
square::4096      f::10  ->  f::2 0.0035 0.0041 0.87
square::1024      f::10  ->  f::10 0.0012 0.0012 0.93
square::2048      f::10  ->  f::10 0.0028 0.0028 0.98
square::4096      f::10  ->  f::10 0.0055 0.0057 0.97
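
For readers unfamiliar with the test names: a row such as `sqrt::1024 f::2 -> f::10` is read here as 1024 float32 elements, with the input stepped by 2 elements and the output stepped by 10, which matches the `--strides 1 2 10` argument but is an interpretation rather than something the benchmark tool documents in this thread. A pure-Python sketch of that access pattern follows; `strided_unary` is a hypothetical helper for illustration and is not NumPy's inner loop:

```python
import math
from array import array


def strided_unary(op, src, src_step, dst, dst_step, count):
    """Apply op to count elements, stepping src and dst independently."""
    for i in range(count):
        dst[i * dst_step] = op(src[i * src_step])


count = 4
# 'f' is a C float (float32); sized so that every strided index stays in bounds.
src = array("f", [float(v) for v in range(count * 2)])
dst = array("f", [0.0] * (count * 10))

# Equivalent of "sqrt::4  f::2 -> f::10": read every 2nd element, write every 10th.
strided_unary(math.sqrt, src, 2, dst, 10, count)
```

The contiguous rows (`f::1 -> f::1`) are the special case where both steps are 1, which is the only case the "Contiguous only" tables cover.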

Power little-endian

CPU
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       8
NUMA node(s):                    1
Model:                           2.2 (pvr 004e 1202)
Model name:                      POWER9 (architected), altivec supported
L1d cache:                       256 KiB
L1i cache:                       256 KiB
NUMA node0 CPU(s):               0-7
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Mitigation; RFI Flush
Vulnerability Spec store bypass: Mitigation; Kernel entry/exit barrier (eieio)
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable

processor	: 7
cpu		: POWER9 (architected), altivec supported
clock		: 2200.000000MHz
revision	: 2.2 (pvr 004e 1202)

timebase	: 512000000
platform	: pSeries
model		: IBM pSeries (emulated by qemu)
machine		: CHRP IBM pSeries (emulated by qemu)
MMU		: Radix


OS
Linux 8b2db3b0dfac 4.19.0-2-powerpc64le
gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2) 

Benchmark

VSX2(ISA >= 2.07) - Contiguous only

metric: gmean, units: ms

name of test before_contig after_contig after_contig vs before_contig
absolute::1024      d::1  ->  d::1 0.0009 0.0003 3.11
absolute::2048      d::1  ->  d::1 0.0017 0.0006 2.88
absolute::4096      d::1  ->  d::1 0.0033 0.0011 2.98
absolute::1024      f::1  ->  f::1 0.0011 0.0002 6.5
absolute::2048      f::1  ->  f::1 0.0021 0.0003 6.96
absolute::4096      f::1  ->  f::1 0.0041 0.0006 6.92
reciprocal::1024      d::1  ->  d::1 0.0017 0.0016 1.05
reciprocal::2048      d::1  ->  d::1 0.0033 0.0033 1.02
reciprocal::4096      d::1  ->  d::1 0.0068 0.0065 1.04
reciprocal::1024      f::1  ->  f::1 0.0008 0.0007 1.05
reciprocal::2048      f::1  ->  f::1 0.0015 0.0014 1.08
reciprocal::4096      f::1  ->  f::1 0.0032 0.0029 1.13
sqrt::1024      d::1  ->  d::1 0.0028 0.0022 1.26
sqrt::2048      d::1  ->  d::1 0.0054 0.0045 1.21
sqrt::4096      d::1  ->  d::1 0.0109 0.0090 1.22
sqrt::1024      f::1  ->  f::1 0.0021 0.0008 2.55
sqrt::2048      f::1  ->  f::1 0.0042 0.0016 2.55
sqrt::4096      f::1  ->  f::1 0.0083 0.0033 2.54
square::1024      d::1  ->  d::1 0.0006 0.0003 1.89
square::2048      d::1  ->  d::1 0.0011 0.0006 1.83
square::4096      d::1  ->  d::1 0.0023 0.0012 1.87
square::1024      f::1  ->  f::1 0.0003 0.0002 1.76
square::2048      f::1  ->  f::1 0.0006 0.0003 1.92
square::4096      f::1  ->  f::1 0.0011 0.0006 1.84
VSX2(ISA >= 2.07)

metric: gmean, units: ms

name of test before after after vs before
absolute::1024      d::1  ->  d::1 0.0008 0.0003 2.9
absolute::2048      d::1  ->  d::1 0.0016 0.0006 2.81
absolute::4096      d::1  ->  d::1 0.0032 0.0011 2.79
absolute::1024      d::1  ->  d::2 0.0008 0.0004 1.87
absolute::2048      d::1  ->  d::2 0.0016 0.0009 1.88
absolute::4096      d::1  ->  d::2 0.0032 0.0017 1.82
absolute::1024      d::1  ->  d::10 0.0008 0.0004 1.86
absolute::2048      d::1  ->  d::10 0.0016 0.0009 1.88
absolute::4096      d::1  ->  d::10 0.0032 0.0017 1.82
absolute::1024      d::2  ->  d::1 0.0008 0.0004 2.06
absolute::2048      d::2  ->  d::1 0.0016 0.0008 1.97
absolute::4096      d::2  ->  d::1 0.0032 0.0016 2.04
absolute::1024      d::2  ->  d::2 0.0008 0.0007 1.24
absolute::2048      d::2  ->  d::2 0.0016 0.0013 1.22
absolute::4096      d::2  ->  d::2 0.0032 0.0025 1.26
absolute::1024      d::2  ->  d::10 0.0008 0.0007 1.28
absolute::2048      d::2  ->  d::10 0.0016 0.0013 1.23
absolute::4096      d::2  ->  d::10 0.0033 0.0025 1.29
absolute::1024      d::10  ->  d::1 0.0009 0.0006 1.53
absolute::2048      d::10  ->  d::1 0.0018 0.0012 1.54
absolute::4096      d::10  ->  d::1 0.0035 0.0023 1.53
absolute::1024      d::10  ->  d::2 0.0009 0.0008 1.16
absolute::2048      d::10  ->  d::2 0.0018 0.0016 1.1
absolute::4096      d::10  ->  d::2 0.0035 0.0031 1.1
absolute::1024      d::10  ->  d::10 0.0009 0.0008 1.11
absolute::2048      d::10  ->  d::10 0.0018 0.0016 1.1
absolute::4096      d::10  ->  d::10 0.0072 0.0072 1.0
absolute::1024      f::1  ->  f::1 0.0010 0.0002 6.01
absolute::2048      f::1  ->  f::1 0.0019 0.0003 6.4
absolute::4096      f::1  ->  f::1 0.0038 0.0006 6.37
absolute::1024      f::1  ->  f::2 0.0010 0.0004 2.62
absolute::2048      f::1  ->  f::2 0.0019 0.0007 2.69
absolute::4096      f::1  ->  f::2 0.0039 0.0014 2.85
absolute::1024      f::1  ->  f::10 0.0010 0.0004 2.66
absolute::2048      f::1  ->  f::10 0.0019 0.0007 2.77
absolute::4096      f::1  ->  f::10 0.0037 0.0013 2.82
absolute::1024      f::2  ->  f::1 0.0010 0.0004 2.7
absolute::2048      f::2  ->  f::1 0.0019 0.0007 2.81
absolute::4096      f::2  ->  f::1 0.0039 0.0014 2.83
absolute::1024      f::2  ->  f::2 0.0010 0.0006 1.56
absolute::2048      f::2  ->  f::2 0.0019 0.0012 1.63
absolute::4096      f::2  ->  f::2 0.0038 0.0023 1.62
absolute::1024      f::2  ->  f::10 0.0010 0.0006 1.61
absolute::2048      f::2  ->  f::10 0.0019 0.0012 1.63
absolute::4096      f::2  ->  f::10 0.0038 0.0023 1.62
absolute::1024      f::10  ->  f::1 0.0010 0.0005 2.04
absolute::2048      f::10  ->  f::1 0.0019 0.0009 2.16
absolute::4096      f::10  ->  f::1 0.0037 0.0017 2.2
absolute::1024      f::10  ->  f::2 0.0010 0.0006 1.54
absolute::2048      f::10  ->  f::2 0.0019 0.0012 1.58
absolute::4096      f::10  ->  f::2 0.0037 0.0024 1.57
absolute::1024      f::10  ->  f::10 0.0010 0.0006 1.54
absolute::2048      f::10  ->  f::10 0.0019 0.0012 1.56
absolute::4096      f::10  ->  f::10 0.0037 0.0024 1.57
reciprocal::1024      d::1  ->  d::1 0.0016 0.0016 1.0
reciprocal::2048      d::1  ->  d::1 0.0033 0.0033 1.0
reciprocal::4096      d::1  ->  d::1 0.0065 0.0065 1.0
reciprocal::1024      d::1  ->  d::2 0.0021 0.0016 1.26
reciprocal::2048      d::1  ->  d::2 0.0042 0.0033 1.27
reciprocal::4096      d::1  ->  d::2 0.0083 0.0066 1.27
reciprocal::1024      d::1  ->  d::10 0.0021 0.0016 1.26
reciprocal::2048      d::1  ->  d::10 0.0042 0.0033 1.27
reciprocal::4096      d::1  ->  d::10 0.0083 0.0066 1.27
reciprocal::1024      d::2  ->  d::1 0.0021 0.0016 1.28
reciprocal::2048      d::2  ->  d::1 0.0042 0.0033 1.28
reciprocal::4096      d::2  ->  d::1 0.0083 0.0068 1.23
reciprocal::1024      d::2  ->  d::2 0.0022 0.0017 1.31
reciprocal::2048      d::2  ->  d::2 0.0040 0.0033 1.2
reciprocal::4096      d::2  ->  d::2 0.0080 0.0067 1.2
reciprocal::1024      d::2  ->  d::10 0.0021 0.0017 1.24
reciprocal::2048      d::2  ->  d::10 0.0042 0.0033 1.25
reciprocal::4096      d::2  ->  d::10 0.0083 0.0067 1.25
reciprocal::1024      d::10  ->  d::1 0.0021 0.0016 1.28
reciprocal::2048      d::10  ->  d::1 0.0043 0.0033 1.32
reciprocal::4096      d::10  ->  d::1 0.0085 0.0065 1.29
reciprocal::1024      d::10  ->  d::2 0.0021 0.0017 1.24
reciprocal::2048      d::10  ->  d::2 0.0042 0.0034 1.24
reciprocal::4096      d::10  ->  d::2 0.0083 0.0067 1.25
reciprocal::1024      d::10  ->  d::10 0.0020 0.0017 1.2
reciprocal::2048      d::10  ->  d::10 0.0040 0.0033 1.19
reciprocal::4096      d::10  ->  d::10 0.0081 0.0071 1.15
reciprocal::1024      f::1  ->  f::1 0.0007 0.0007 1.02
reciprocal::2048      f::1  ->  f::1 0.0015 0.0014 1.03
reciprocal::4096      f::1  ->  f::1 0.0030 0.0029 1.05
reciprocal::1024      f::1  ->  f::2 0.0017 0.0008 2.06
reciprocal::2048      f::1  ->  f::2 0.0034 0.0016 2.07
reciprocal::4096      f::1  ->  f::2 0.0068 0.0033 2.08
reciprocal::1024      f::1  ->  f::10 0.0017 0.0008 2.06
reciprocal::2048      f::1  ->  f::10 0.0034 0.0018 1.91
reciprocal::4096      f::1  ->  f::10 0.0068 0.0033 2.05
reciprocal::1024      f::2  ->  f::1 0.0017 0.0007 2.38
reciprocal::2048      f::2  ->  f::1 0.0034 0.0014 2.38
reciprocal::4096      f::2  ->  f::1 0.0068 0.0029 2.38
reciprocal::1024      f::2  ->  f::2 0.0017 0.0009 1.87
reciprocal::2048      f::2  ->  f::2 0.0034 0.0018 1.88
reciprocal::4096      f::2  ->  f::2 0.0068 0.0036 1.88
reciprocal::1024      f::2  ->  f::10 0.0017 0.0009 1.87
reciprocal::2048      f::2  ->  f::10 0.0034 0.0018 1.9
reciprocal::4096      f::2  ->  f::10 0.0068 0.0036 1.88
reciprocal::1024      f::10  ->  f::1 0.0017 0.0007 2.37
reciprocal::2048      f::10  ->  f::1 0.0034 0.0014 2.38
reciprocal::4096      f::10  ->  f::1 0.0068 0.0029 2.38
reciprocal::1024      f::10  ->  f::2 0.0017 0.0009 1.83
reciprocal::2048      f::10  ->  f::2 0.0034 0.0019 1.84
reciprocal::4096      f::10  ->  f::2 0.0068 0.0037 1.86
reciprocal::1024      f::10  ->  f::10 0.0017 0.0009 1.83
reciprocal::2048      f::10  ->  f::10 0.0034 0.0019 1.85
reciprocal::4096      f::10  ->  f::10 0.0068 0.0037 1.86
sqrt::1024      d::1  ->  d::1 0.0026 0.0022 1.14
sqrt::2048      d::1  ->  d::1 0.0051 0.0045 1.14
sqrt::4096      d::1  ->  d::1 0.0101 0.0090 1.13
sqrt::1024      d::1  ->  d::2 0.0026 0.0023 1.11
sqrt::2048      d::1  ->  d::2 0.0051 0.0046 1.11
sqrt::4096      d::1  ->  d::2 0.0102 0.0092 1.11
sqrt::1024      d::1  ->  d::10 0.0026 0.0023 1.11
sqrt::2048      d::1  ->  d::10 0.0051 0.0046 1.1
sqrt::4096      d::1  ->  d::10 0.0102 0.0092 1.1
sqrt::1024      d::2  ->  d::1 0.0026 0.0023 1.13
sqrt::2048      d::2  ->  d::1 0.0051 0.0045 1.14
sqrt::4096      d::2  ->  d::1 0.0102 0.0090 1.13
sqrt::1024      d::2  ->  d::2 0.0026 0.0023 1.13
sqrt::2048      d::2  ->  d::2 0.0051 0.0045 1.13
sqrt::4096      d::2  ->  d::2 0.0102 0.0091 1.12
sqrt::1024      d::2  ->  d::10 0.0026 0.0023 1.13
sqrt::2048      d::2  ->  d::10 0.0051 0.0045 1.12
sqrt::4096      d::2  ->  d::10 0.0102 0.0091 1.12
sqrt::1024      d::10  ->  d::1 0.0026 0.0022 1.15
sqrt::2048      d::10  ->  d::1 0.0051 0.0045 1.14
sqrt::4096      d::10  ->  d::1 0.0102 0.0090 1.13
sqrt::1024      d::10  ->  d::2 0.0026 0.0023 1.13
sqrt::2048      d::10  ->  d::2 0.0051 0.0046 1.12
sqrt::4096      d::10  ->  d::2 0.0102 0.0091 1.12
sqrt::1024      d::10  ->  d::10 0.0026 0.0023 1.13
sqrt::2048      d::10  ->  d::10 0.0051 0.0048 1.07
sqrt::4096      d::10  ->  d::10 0.0102 0.0091 1.12
sqrt::1024      f::1  ->  f::1 0.0021 0.0008 2.53
sqrt::2048      f::1  ->  f::1 0.0041 0.0016 2.52
sqrt::4096      f::1  ->  f::1 0.0082 0.0033 2.51
sqrt::1024      f::1  ->  f::2 0.0021 0.0009 2.26
sqrt::2048      f::1  ->  f::2 0.0041 0.0018 2.27
sqrt::4096      f::1  ->  f::2 0.0082 0.0036 2.27
sqrt::1024      f::1  ->  f::10 0.0021 0.0009 2.28
sqrt::2048      f::1  ->  f::10 0.0041 0.0018 2.27
sqrt::4096      f::1  ->  f::10 0.0086 0.0036 2.39
sqrt::1024      f::2  ->  f::1 0.0021 0.0008 2.52
sqrt::2048      f::2  ->  f::1 0.0041 0.0016 2.51
sqrt::4096      f::2  ->  f::1 0.0082 0.0033 2.51
sqrt::1024      f::2  ->  f::2 0.0021 0.0010 2.06
sqrt::2048      f::2  ->  f::2 0.0041 0.0020 2.07
sqrt::4096      f::2  ->  f::2 0.0082 0.0041 2.0
sqrt::1024      f::2  ->  f::10 0.0021 0.0010 2.03
sqrt::2048      f::2  ->  f::10 0.0041 0.0020 2.07
sqrt::4096      f::2  ->  f::10 0.0082 0.0040 2.06
sqrt::1024      f::10  ->  f::1 0.0021 0.0008 2.52
sqrt::2048      f::10  ->  f::1 0.0041 0.0017 2.44
sqrt::4096      f::10  ->  f::1 0.0082 0.0033 2.48
sqrt::1024      f::10  ->  f::2 0.0021 0.0010 2.03
sqrt::2048      f::10  ->  f::2 0.0041 0.0020 2.04
sqrt::4096      f::10  ->  f::2 0.0082 0.0040 2.05
sqrt::1024      f::10  ->  f::10 0.0021 0.0011 1.94
sqrt::2048      f::10  ->  f::10 0.0041 0.0021 1.93
sqrt::4096      f::10  ->  f::10 0.0082 0.0041 2.03
square::1024      d::1  ->  d::1 0.0005 0.0003 1.8
square::2048      d::1  ->  d::1 0.0011 0.0006 1.75
square::4096      d::1  ->  d::1 0.0021 0.0012 1.76
square::1024      d::1  ->  d::2 0.0007 0.0005 1.43
square::2048      d::1  ->  d::2 0.0013 0.0009 1.44
square::4096      d::1  ->  d::2 0.0027 0.0019 1.44
square::1024      d::1  ->  d::10 0.0007 0.0005 1.43
square::2048      d::1  ->  d::10 0.0013 0.0009 1.44
square::4096      d::1  ->  d::10 0.0027 0.0019 1.4
square::1024      d::2  ->  d::1 0.0007 0.0004 1.73
square::2048      d::2  ->  d::1 0.0015 0.0008 1.84
square::4096      d::2  ->  d::1 0.0030 0.0016 1.89
square::1024      d::2  ->  d::2 0.0008 0.0007 1.09
square::2048      d::2  ->  d::2 0.0015 0.0014 1.06
square::4096      d::2  ->  d::2 0.0030 0.0027 1.1
square::1024      d::2  ->  d::10 0.0008 0.0007 1.09
square::2048      d::2  ->  d::10 0.0015 0.0014 1.09
square::4096      d::2  ->  d::10 0.0029 0.0028 1.05
square::1024      d::10  ->  d::1 0.0009 0.0006 1.47
square::2048      d::10  ->  d::1 0.0017 0.0012 1.48
square::4096      d::10  ->  d::1 0.0033 0.0023 1.45
square::1024      d::10  ->  d::2 0.0008 0.0008 0.99
square::2048      d::10  ->  d::2 0.0016 0.0016 0.98
square::4096      d::10  ->  d::2 0.0032 0.0033 0.99
square::1024      d::10  ->  d::10 0.0008 0.0008 0.98
square::2048      d::10  ->  d::10 0.0016 0.0016 0.98
square::4096      d::10  ->  d::10 0.0071 0.0071 1.0
square::1024      f::1  ->  f::1 0.0003 0.0002 1.74
square::2048      f::1  ->  f::1 0.0005 0.0003 1.77
square::4096      f::1  ->  f::1 0.0011 0.0006 1.82
square::1024      f::1  ->  f::2 0.0008 0.0004 2.14
square::2048      f::1  ->  f::2 0.0016 0.0007 2.18
square::4096      f::1  ->  f::2 0.0032 0.0014 2.2
square::1024      f::1  ->  f::10 0.0008 0.0004 2.15
square::2048      f::1  ->  f::10 0.0016 0.0007 2.19
square::4096      f::1  ->  f::10 0.0032 0.0014 2.21
square::1024      f::2  ->  f::1 0.0008 0.0003 2.32
square::2048      f::2  ->  f::1 0.0016 0.0007 2.39
square::4096      f::2  ->  f::1 0.0032 0.0013 2.41
square::1024      f::2  ->  f::2 0.0008 0.0006 1.27
square::2048      f::2  ->  f::2 0.0016 0.0013 1.28
square::4096      f::2  ->  f::2 0.0032 0.0025 1.29
square::1024      f::2  ->  f::10 0.0008 0.0006 1.27
square::2048      f::2  ->  f::10 0.0016 0.0013 1.28
square::4096      f::2  ->  f::10 0.0034 0.0025 1.34
square::1024      f::10  ->  f::1 0.0008 0.0004 1.93
square::2048      f::10  ->  f::1 0.0016 0.0008 1.94
square::4096      f::10  ->  f::1 0.0032 0.0017 1.95
square::1024      f::10  ->  f::2 0.0008 0.0007 1.23
square::2048      f::10  ->  f::2 0.0016 0.0013 1.25
square::4096      f::10  ->  f::2 0.0032 0.0026 1.25
square::1024      f::10  ->  f::10 0.0008 0.0007 1.23
square::2048      f::10  ->  f::10 0.0016 0.0013 1.26
square::4096      f::10  ->  f::10 0.0032 0.0026 1.25

Binary size of _multiarray_umath.cpython-ver-arch-linux-gnu.so in kbytes

Note: Debugging symbols are stripped

arch before after after vs before
x86 3564 3568 1.0011
ppc64le 4216 4224 1.0018
aarch64 3488 3500 1.0034

EDIT: left some notes

@seiko2plus seiko2plus force-pushed the to_npyv_unaryfp_g0 branch 2 times, most recently from 225ef80 to f1759d6 Compare May 17, 2020 07:23
@mattip mattip added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Aug 19, 2020
@mattip
Member

mattip commented Aug 19, 2020

@seiko2plus this needs a redo now that the infrastructure is in place.

@charris
Member

charris commented Sep 8, 2020

@seiko2plus Ping. Looks like the recently merged header changes require a rebase. Any possibility of breaking out the fixes required for the 32 bit wheel builds?

@seiko2plus
Member Author

@charris, I'm going to rebase it and push the local changes right after I finish #16782,
since this PR and #16960 both require implementing a lot of intrinsics that deal with non-contiguous memory access,
and honestly, I wouldn't trust my code without a proper unit-testing setup that works across all supported SIMD extensions.

@seiko2plus
Member Author

seiko2plus commented Sep 8, 2020

@charris, Any possibility of breaking out the fixes required for the 32 bit wheel builds?

No, I have to implement a SIMD kernel that handles scalars as well as vectors. Just give me a week.

@seiko2plus seiko2plus force-pushed the to_npyv_unaryfp_g0 branch 5 times, most recently from f128eca to 2688270 Compare September 21, 2020 14:03
@seiko2plus seiko2plus force-pushed the to_npyv_unaryfp_g0 branch 9 times, most recently from 9387b54 to a75caf1 Compare September 24, 2020 17:39
@seiko2plus seiko2plus force-pushed the to_npyv_unaryfp_g0 branch 3 times, most recently from bc8f69e to 5ca9489 Compare September 25, 2020 05:19
// a * (1/√a)
npyv_f32 sqrt = vmulq_f32(a, rsqrte);
// return zero if the a is zero
npyv_u32 bits = vbicq_u32(vreinterpretq_u32_f32(sqrt), vceqq_f32(a, zero));
Member

The OpenCV SIMD sqrt is here. But you are right: if we don't check for zero, the result becomes NaN.

Member Author

Thanks to #16782, I caught two issues here: positive infinity (same problem as zero) and precision.
Unfortunately, I had to add a third Newton-Raphson iteration to provide acceptable precision.

Member

Can you add a test for that failure?

Member Author

I added a check for all sqrt special cases, see:
https://github.com/numpy/numpy/blob/1f872658984b2f8b0fda7022e72ad333a62864f3/numpy/core/tests/test_simd.py#L184-L188

I made another fix to npyv_sqrt_f32() in the last push, which fixes the floating-point division-by-zero error
raised by vrsqrteq_f32(x) when x is zero.

Now all the tests pass on armhf.
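
For readers following this thread, here is a minimal sketch of why the zero and positive-infinity cases need special handling when sqrt is computed as a * (1/√a). It is plain double-precision Python, not the NEON code from this PR: `rsqrt_estimate`, `sqrt_unguarded` and `sqrt_guarded` are hypothetical names, the initial estimate is a crude stand-in for a hardware estimate instruction, and only non-negative inputs are assumed.

```python
import math

def rsqrt_estimate(a):
    """Crude stand-in for a hardware reciprocal-square-root estimate."""
    if a == 0.0:
        return math.inf          # 1/sqrt(0) is +inf
    if math.isinf(a):
        return 0.0               # 1/sqrt(+inf) is 0
    _, exp = math.frexp(a)       # a = m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (-exp / 2)     # ignores the mantissa, so only roughly right

def sqrt_unguarded(a, iterations):
    """sqrt(a) as a * (1/sqrt(a)) with no special-case handling."""
    y = rsqrt_estimate(a)
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * a * y * y)   # Newton-Raphson step for 1/sqrt(a)
    return a * y                          # 0 * inf and inf * 0 both give NaN

def sqrt_guarded(a, iterations):
    """Same computation, but zero and +inf are passed through unchanged."""
    if a == 0.0 or a == math.inf:
        return a
    return sqrt_unguarded(a, iterations)

for value in (0.0, math.inf, 2.0):
    print(value, sqrt_unguarded(value, 3), sqrt_guarded(value, 3))
```

Each extra Newton-Raphson step roughly squares the relative error, which is why adding one more iteration is a cheap way to reach the required precision when the starting estimate is coarse.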

TD(O, f='Py_square'),
),
'reciprocal':
Ufunc(1, 1, None,
docstrings.get('numpy.core.umath.reciprocal'),
None,
TD(ints+inexact, simd=[('avx2', ints), ('fma', 'fd'), ('avx512f','fd')]),
TD(ints+inexact, simd=[('avx2', ints)], dispatch=[('loops_unary_fp', 'fd')]),
Member

This will fix the failing 32-bit wheel build?

Member Author

This change activates the new dispatcher.
The 32-bit wheel build fails because of an aggressive gcc optimization that doesn't respect division by zero;
this issue also existed on 64-bit when AVX2 and AVX512F aren't enabled.

The fix is mainly here:
https://github.com/numpy/numpy/blob/1f872658984b2f8b0fda7022e72ad333a62864f3/numpy/core/src/umath/loops_unary_fp.dispatch.c.src#L129-L133

where the partial-load intrinsics npyv_load_till_* and npyv_loadn_till_* guarantee that the tail of the vector is filled with "one".

I also made a slight change in the last push so that overlapped arrays also go through the NPYV version, which guarantees the same precision on armhf.

https://github.com/numpy/numpy/blob/1f872658984b2f8b0fda7022e72ad333a62864f3/numpy/core/src/umath/loops_unary_fp.dispatch.c.src#L202-L206
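
The tail-filling idea described above can be sketched in plain Python. This is only an illustration under stated assumptions: `load_till` and `reciprocal_chunked` are hypothetical helpers, not the NPYV intrinsics, and Python's `ZeroDivisionError` stands in for the spurious floating-point divide-by-zero error that unused lanes could otherwise raise.

```python
def load_till(data, offset, nlanes, fill):
    """Load up to nlanes items starting at offset; pad missing lanes with fill."""
    chunk = data[offset:offset + nlanes]
    return chunk + [fill] * (nlanes - len(chunk))

def reciprocal_chunked(data, nlanes=4, fill=1.0):
    """Compute 1/x over a list in fixed-width chunks, like a SIMD loop would."""
    out = []
    for offset in range(0, len(data), nlanes):
        vec = load_till(data, offset, nlanes, fill)
        res = [1.0 / x for x in vec]           # every lane is computed, padding included
        out.extend(res[:len(data) - offset])   # store only the valid lanes
    return out

data = [2.0, 4.0, 8.0, 0.5, 0.25]
print(reciprocal_chunked(data))                # padding with 1.0 is harmless

try:
    reciprocal_chunked(data, fill=0.0)         # zero padding divides by zero in the tail
except ZeroDivisionError as exc:
    print("tail lanes raised:", exc)
```

Padding with one keeps the unused lanes well defined for reciprocal, so the only errors reported are those caused by real input elements.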

@seiko2plus seiko2plus force-pushed the to_npyv_unaryfp_g0 branch 2 times, most recently from ac9540f to 1f87265 Compare October 31, 2020 20:03
Comment on lines +1 to +11
/*@targets
** $maxopt baseline
** sse2 vsx2 neon
**/
/**
* Force use SSE only on x86, even if AVX2 or AVX512F are enabled
* through the baseline, since scatter(AVX512F) and gather very costly
* to handle non-contiguous memory access comparing with SSE for
* such small operations that this file covers.
*/
#define NPY_SIMD_FORCE_128
Member Author

@r-devulap, I only enabled SSE and dropped AVX2 and AVX512F since there's no performance gain for contiguous arrays.
Also, the emulated versions of the partial and non-contiguous memory load/store intrinsics show better performance
compared with the gather/scatter (AVX512F) intrinsics, especially when I unroll by x2/x4.

Member

@mattip mattip left a comment

I guess we should run benchmarks to compare this to the current code. I would expect x86_64 to be no slower, and arm64 to be faster. @Qiyu8 any thoughts?

#include <Python.h> // for PyObject
#include "numpy/numpyconfig.h" // for NPY_VISIBILITY_HIDDEN
Member

I am curious why the shift in order is needed, usually Python.h comes first

Member Author

To suppress the warning 'declaration of 'struct timespec*'; this compiler warning is raised when math.h gets included before
Python.h.

Member

Do you have some more info about this you can link to?

NPY_FINLINE npyv_f64 npyv_square_f64(npyv_f64 a)
{ return _mm256_mul_pd(a, a); }

#endif
Member

Do the new intrinsics need to be added to the _simd module explicitly or is it automatic?

Member Author

return !(nomemoverlap((char*)src, src_step*len, (char*)dst, dst_step*len));
}

#endif // _NPY_UMATH_LOOPS_UTILS_H_
Member

We have solve_may_share_memory, in common/mem_overlap.c, can we use this opportunity to refactor the code to use it? If not, then this function should be moved to that file.

Member Author

Apparently solve_may_share_memory() is used to resolve array overlap at the Python level,
while what we do here is avoid performing SIMD vector operations on overlapped arrays,
since the user expects scalar-by-scalar behaviour on overlap, which differs from what unrolled vector code produces.
Also, we use scatter/gather operations, which may lead to undefined behaviour on overlapped memory.

If not, then this function should be moved to that file.

I don't think this function is related to the content of common/mem_overlap.c.
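
To make the scalar-versus-unrolled difference concrete, here is a small sketch in plain Python. It is not the C helper from this PR: it works on list indices instead of byte addresses, all function names are hypothetical, and treating exact aliasing (same start and same step) as safe is my reading of the "same ptr & stride" note in the PR description.

```python
def span(start, step, n):
    """Lowest and highest index touched by n strided accesses."""
    last = start + step * (n - 1)
    return min(start, last), max(start, last)

def is_mem_overlap(src, src_step, dst, dst_step, n):
    """True when a vectorized loop could observe different data than a scalar one."""
    if n == 0:
        return False
    if src == dst and src_step == dst_step:
        return False                       # exact aliasing: lane i only depends on lane i
    s_lo, s_hi = span(src, src_step, n)
    d_lo, d_hi = span(dst, dst_step, n)
    return not (s_hi < d_lo or d_hi < s_lo)

def square_scalar(buf, src, dst, n):
    """Element-by-element loop: later reads see earlier writes."""
    for i in range(n):
        buf[dst + i] = buf[src + i] ** 2

def square_vector(buf, src, dst, n, nlanes=4):
    """Chunked loop: all lanes of a chunk are loaded before any is stored."""
    for off in range(0, n, nlanes):
        k = min(nlanes, n - off)
        vec = buf[src + off:src + off + k]
        for j in range(k):
            buf[dst + off + j] = vec[j] ** 2

a = [2, 0, 0, 0, 0]
b = list(a)
square_scalar(a, 0, 1, 4)   # output shifted by one element over the input
square_vector(b, 0, 1, 4)
print(a, b, is_mem_overlap(0, 1, 1, 1, 4))
```

The two loops disagree exactly in the case the overlap check flags, which is why such inputs are kept off the vectorized path.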

data_recip = self.load([1/x for x in data]) # load to truncate precision
recip = self.recip(vdata)
assert recip == data_recip

Member

nice, the power of _simd to show the intrinsics are correct is compelling.

Member Author

it should be very helpful for the upcoming architectures.

@seiko2plus
Member Author

seiko2plus commented Nov 1, 2020

@mattip,

I guess we should run benchmarks to compare this to the current code. I would expect x86_64 to be no slower, and arm64 to be faster. @Qiyu8 any thoughts?

I've already provided benchmarks in the description of this pull request based on #15987; the last changes I made
shouldn't affect the results. NOTE: armhf benchmarks aren't included, only aarch64 (ARMv8 64-bit).

@Qiyu8
Member

Qiyu8 commented Nov 2, 2020

IMO, a standalone benchmark script for universal intrinsics provides performance assurance from a technical perspective, but what @mattip cared about is the final performance that NumPy users will notice, which should be measured by the asv benchmark script.

@mattip
Member

mattip commented Nov 3, 2020

@seiko2plus this now has a merge conflict

what @mattip cared about is the final performance .. measured by asv

Yes, this is what we now need to verify these changes have not impacted x86_64 in a bad way. I would do it, but I don't seem to be able to stabilize my machine enough to get consistent results.

  this patch also improves division precision for NEON/A32
   - only covers sqrt, absolute, square and reciprocal
   - fix SIMD memory overlap check for aliasing(same ptr & stride)
   - unify fp/domain errors for both scalars and vectors
@seiko2plus
Member Author

seiko2plus commented Nov 3, 2020

@mattip,

Yes, this is what we now need to verify these changes have not impacted x86_64 in a bad way.

That's actually the exact reason behind #15987: it only compares the inner loops of the ufuncs, which reduces the number of outliers
that may be caused by the Python API.

I would do it, but I don't seem to be able to stabilize my machine enough to get consistent results.

Try to stabilize your system via the pyperf module:

sudo python -m pyperf system tune

You will need to compare against multiple dispatched targets, so you'll have to use the environment variable NPY_DISABLE_CPU_FEATURES.

for example:

# disable AVX512F to benchmark AVX2
export NPY_DISABLE_CPU_FEATURES="AVX512F"
# run asv or #15987
# disable AVX512F and AVX2 to benchmark SSE
export NPY_DISABLE_CPU_FEATURES="AVX2 AVX512F"
# run asv or #15987
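
If you prefer to drive the runs from a script, the same idea can be sketched in Python. Only the environment variable name and the feature names come from the comment above; `env_without_features` and the `TARGETS` mapping are hypothetical, and the benchmark command itself is left as a placeholder comment. As far as I know the variable has to be set before NumPy is imported, so each target should run in a fresh process.

```python
import os

def env_without_features(*features, base=None):
    """Return a copy of the environment with the given CPU features disabled."""
    env = dict(os.environ if base is None else base)
    env["NPY_DISABLE_CPU_FEATURES"] = " ".join(features)
    return env

# Highest remaining dispatch target -> features to switch off (assumed mapping).
TARGETS = {
    "AVX2": ("AVX512F",),
    "SSE": ("AVX2", "AVX512F"),
}

for name, disabled in TARGETS.items():
    env = env_without_features(*disabled)
    print(name, "->", env["NPY_DISABLE_CPU_FEATURES"])
    # Launch the benchmark in a child process here, passing env=env,
    # for example with subprocess.run(<your benchmark command>, env=env).
```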

@mattip
Member

mattip commented Nov 4, 2020

sudo python -m pyperf system tune

Doesn't do much on an AMD machine

@mattip mattip added the triage review Issue/PR to be discussed at the next triage meeting label Nov 4, 2020
@mattip
Member

mattip commented Nov 4, 2020

@hameerabbasi could you benchmark this?

@mattip mattip removed the triage review Issue/PR to be discussed at the next triage meeting label Nov 4, 2020
@hameerabbasi
Contributor

I originally posted this on #15987 by mistake:

I ran this PR on a live environment without a desktop (Ubuntu Server), using the method in the PR description. The noise was around 3% and this PR had a performance impact of ±5%, so not too much of a difference.

Member

@mattip mattip left a comment

Thanks @hameerabbasi. So it seems this is good to be merged: performance on x86_64 is unchanged, and this unlocks universal intrinsics for the other architectures (and should solve that pesky test failure on old gcc on 32-bit linux).

@mattip mattip merged commit a20eca2 into numpy:master Nov 10, 2020
@mattip
Member

mattip commented Nov 10, 2020

Thanks @seiko2plus

Labels
01 - Enhancement 25 - WIP component: SIMD Issues in SIMD (fast instruction sets) code or machinery
Development

Successfully merging this pull request may close these issues.

AVX test failures for 32 bit manylinux1 wheels
7 participants