
ENH: Vectorizing umath module using AVX-512 (open sourced from Intel Short Vector Math Library, SVML) #19478

Merged
merged 17 commits into numpy:main on Oct 10, 2021

Conversation

@r-devulap (Member)

This patch integrates the Intel Short Vector Math Library (SVML) into NumPy. SVML provides AVX-512 implementations of 44 math functions: exp, exp2, log, log2, log10, expm1, log1p, cbrt, pow, sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, and atanh (each in single and double precision). Some key points to note:

  1. On a Skylake i9-7900X CPU, the library provides a speedup of up to 55x (for the cbrt function). The average speedup from this patch is 32x for single precision and 14x for double precision. More detailed numbers are provided below.
  2. The maximum ULP error across all the implementations is 4 ULP.
  3. This patch increases the multiarray shared object file size from 29226320 bytes to 30147792 bytes, an increase of 899.875 KiB (compiled on Ubuntu 20.04 with gcc-11).
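The size delta quoted in point 3 can be sanity-checked with a couple of lines of arithmetic:

```python
# sanity-check the size delta quoted in point 3
before, after = 29_226_320, 30_147_792
delta = after - before
print(delta, "bytes =", delta / 1024, "KiB")  # 921472 bytes = 899.875 KiB
```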

Detailed benchmarking numbers:

   before           after         ratio
     [fe3d7170]       [6049dc65]
     <main>           <svml>
-      459±0.05μs         61.2±1μs     0.13  bench_ufunc_strides.Unary.time_ufunc('exp2', 'd')
-        2.05±0ms          212±4μs     0.10  bench_ufunc_strides.Unary.time_ufunc('tanh', 'd')
-       871±0.8μs         86.4±3μs     0.10  bench_ufunc_strides.Unary.time_ufunc('cos', 'd')
-       856±0.3μs         76.6±8μs     0.09  bench_ufunc_strides.Unary.time_ufunc('log2', 'd')
-        2.66±0ms        238±0.1μs     0.09  bench_ufunc_strides.Unary.time_ufunc('arccosh', 'd')
-        1.33±0ms         107±10μs     0.08  bench_ufunc_strides.Unary.time_ufunc('log1p', 'd')
-     2.98±0.01ms         229±10μs     0.08  bench_ufunc_strides.Unary.time_ufunc('arcsinh', 'd')
-        1.05±0ms         78.5±5μs     0.07  bench_ufunc_strides.Unary.time_ufunc('sin', 'd')
-        1.11±0ms         78.1±6μs     0.07  bench_ufunc_strides.Unary.time_ufunc('log10', 'd')
-        1.66±0ms         115±10μs     0.07  bench_ufunc_strides.Unary.time_ufunc('arccos', 'd')
-        1.24±0ms         79.6±4μs     0.06  bench_ufunc_strides.Unary.time_ufunc('expm1', 'd')
-        1.72±0ms          107±8μs     0.06  bench_ufunc_strides.Unary.time_ufunc('arcsin', 'd')
-        1.33±0ms         81.6±6μs     0.06  bench_ufunc_strides.Unary.time_ufunc('cosh', 'd')
-        1.94±0ms          118±7μs     0.06  bench_ufunc_strides.Unary.time_ufunc('sinh', 'd')
-        2.69±0ms         158±10μs     0.06  bench_ufunc_strides.Unary.time_ufunc('arctanh', 'd')
-       366±0.1μs         20.7±2μs     0.06  bench_ufunc_strides.Unary.time_ufunc('exp2', 'f')
-        2.29±0ms         121±10μs     0.05  bench_ufunc_strides.Unary.time_ufunc('arctan', 'd')
-        2.16±0ms          111±6μs     0.05  bench_ufunc_strides.Unary.time_ufunc('cbrt', 'd')
-      427±0.08μs         21.7±3μs     0.05  bench_ufunc_strides.Unary.time_ufunc('log2', 'f')
-        1.05±0ms         46.4±6μs     0.04  bench_ufunc_strides.Unary.time_ufunc('arccos', 'f')
-        2.36±0ms          103±7μs     0.04  bench_ufunc_strides.Unary.time_ufunc('tan', 'd')
-        1.24±0ms         52.9±6μs     0.04  bench_ufunc_strides.Unary.time_ufunc('log1p', 'f')
-     1.18±0.01ms         48.9±7μs     0.04  bench_ufunc_strides.Unary.time_ufunc('arctan', 'f')
-         974±3μs         38.9±6μs     0.04  bench_ufunc_strides.Unary.time_ufunc('arcsin', 'f')
-        2.32±0ms       89.9±0.5μs     0.04  bench_ufunc_strides.Unary.time_ufunc('arccosh', 'f')
-        2.58±0ms        86.3±10μs     0.03  bench_ufunc_strides.Unary.time_ufunc('arcsinh', 'f')
-        1.59±0ms         50.2±7μs     0.03  bench_ufunc_strides.Unary.time_ufunc('tan', 'f')
-       826±0.1μs         21.8±3μs     0.03  bench_ufunc_strides.Unary.time_ufunc('log10', 'f')
-        1.17±0ms         30.8±4μs     0.03  bench_ufunc_strides.Unary.time_ufunc('expm1', 'f')
-     1.94±0.01ms         49.6±4μs     0.03  bench_ufunc_strides.Unary.time_ufunc('tanh', 'f')
-        1.33±0ms         33.5±4μs     0.03  bench_ufunc_strides.Unary.time_ufunc('cosh', 'f')
-        2.45±0ms         57.3±9μs     0.02  bench_ufunc_strides.Unary.time_ufunc('arctanh', 'f')
-        1.98±0ms         42.1±5μs     0.02  bench_ufunc_strides.Unary.time_ufunc('sinh', 'f')
-        2.09±0ms         38.3±5μs     0.02  bench_ufunc_strides.Unary.time_ufunc('cbrt', 'f')
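The asv numbers above can be roughly reproduced with `timeit` (array size and iteration count below are my own choices, not asv's; absolute times will of course differ by machine and by whether AVX-512 SVML loops are in use):

```python
# Rough re-run of one of the micro-benchmarks above, outside of asv.
import timeit
import numpy as np

x = np.linspace(0.1, 10.0, 100_000, dtype=np.float64)
out = np.empty_like(x)

t = timeit.timeit(lambda: np.exp2(x, out=out), number=200)
print(f"exp2 (float64, contiguous): {t / 200 * 1e6:.1f} us per call")

# strided variant, closer in spirit to bench_ufunc_strides
xs, outs = x[::2], out[::2]
ts = timeit.timeit(lambda: np.exp2(xs, out=outs), number=200)
print(f"exp2 (float64, stride 2):   {ts / 200 * 1e6:.1f} us per call")
```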

@mattip (Member) commented Jul 15, 2021

There is a failing test for x**p.

While all the x86 wheels will get the 1MB binary size increase, only AVX512_SKX-capable processors (gen11?) will be able to use it. On the other hand, an order of magnitude speedup or more for these traditionally heavy operations is great news.

I see in the library description there is an option to get 1ULP precision. How much does that affect speed? What is the default for other consumers of this library?

@mattip (Member) commented Jul 15, 2021

Maybe worth splitting the validator data and benchmark commits into a separate PR

@thiagomacieira

There is a failing test for x**p.

While all the x86 wheels will get the 1MB binary size increase, only AVX512_SKX-capable processors (gen11?) will be able to use it. On the other hand, an order of magnitude speedup or more for these traditionally heavy operations is great news.

Quick correction on terms:

  • AVX512 has been present on processors since the Skylake microarchitecture, first made available in the Xeon Scalable products and later in the 7th Generation Core i9 Extreme (the X suffix). That's the one this was tested on (i9-7900X).
  • AVX512 is also present in laptops since the Ice Lake processors, which launched as "10th Generation Core". The 11th Generation is Tiger Lake.
  • Intel Graphics also uses "gen" for its non-marketing names, but the numbering is slightly off: the GPU in Skylake was gen9; the one in Ice Lake is gen11 and the one in Tiger Lake is gen12.

Therefore, there are a lot of processors out there, especially in the cloud, where the new routines will be beneficial. For those of us testing on our own laptops, AVX512 has been available since 2020.

@r-devulap (Member, Author)

Maybe worth splitting the validator data and benchmark commits into a separate PR

Split them into a separate PR: see #19485

@r-devulap (Member, Author)

There is a failing test for x**p.

Looking into it..

I see in the library description there is an option to get 1ULP precision. How much does that affect speed? What is the default for other consumers of this library?

I haven't benchmarked those, but I would expect them to be considerably slower than the 4 ULP implementations.

@seberg seberg added the triage review Issue/PR to be discussed at the next triage meeting label Jul 15, 2021
@seberg (Member) commented Jul 15, 2021

The maximum ULP error across all the implementations is 4 ULP.

Is that so? The table sounds like the actual ULP error for most functions is much lower. Or do these tables list "typical" ULP rather than maximum ULP errors?

It is maybe a bit knee-jerk, but the sheer number of lines makes me wonder if there is a way to avoid vendoring all of the files?

@r-devulap (Member, Author)

The maximum ULP error across all the implementations is 4 ULP.

Is that so? The table sounds like the actual ULP error for most functions is much lower. Or do these tables list "typical" ULP rather than maximum ULP errors?

The table does list the maximum ULP error and not an average. And the maximum ULP error for the low accuracy (LA) SVML is 4 ULP. Based on the table, double precision tan seems to be the worst offender with a max ULP error of 3.53. But yes, for most other functions it seems to be less than 4.
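For intuition, a maximum ULP error like the one discussed here can be estimated directly: compare a single-precision evaluation against a double-precision reference and divide by the size of one float32 ULP at the reference value. This is a sketch of mine, not NumPy's own accuracy test harness:

```python
import numpy as np

def max_ulp_error(func, ref, x):
    """Max ULP distance between a ufunc evaluated in float32 and a
    float64 reference (a sketch, not NumPy's own test harness)."""
    y = func(x.astype(np.float32)).astype(np.float64)
    r = ref(x.astype(np.float64))
    # size of one float32 ULP at the reference value's magnitude
    ulp = np.spacing(np.abs(r).astype(np.float32)).astype(np.float64)
    return float(np.max(np.abs(y - r) / ulp))

x = np.linspace(-0.99, 0.99, 10_001)
print("arcsin float32 max ULP error:", max_ulp_error(np.arcsin, np.arcsin, x))
```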

It is maybe a bit knee-jerk, but the sheer number of lines makes me wonder if there is a way to avoid vendoring all of the files?

Most of the code in this patch comes from the SVML library, which has been in use by its customers for a good few years now (since Skylake-X was introduced). I expect the library itself to be relatively low maintenance. Additionally, PR #19485 adds a lot of relevant tests to ensure good test coverage for the umath module.

@rgommers (Member) commented Jul 16, 2021

Interesting, thanks @r-devulap! It's good to see SVML code under a BSD-3 license.

Performance gains sound relevant. This is a lot of code, so I think we'll have to do some higher-level benchmarking too, not just micro-benchmarks.

I think there'll be many maintainers who, like me, aren't fully up to speed on what the SVML story is. It's a vectorized version of (parts of) libm; parts are also upstreamed into GCC (libmvec), I believe, and it's similar to SLEEF and Yeppp (?). The difference between VML and SVML is (?). It's a bit hard to find a clear and comprehensive overview; I found this paper somewhat helpful. If you have some good resources to help educate us, that'd be great.

EDIT: the last (only?) time we discussed this on the mailing list was March 2015, in this thread: https://mail.python.org/pipermail/numpy-discussion/2015-March/072406.html

@rgommers (Member)

It is maybe a bit knee-jerk, but the sheer number of lines makes me wonder if there is a way to avoid vendoring all of the files?

Most of the code in this patch comes from the SVML library, which has been in use by its customers for a good few years now (since Skylake-X was introduced). I expect the library itself to be relatively low maintenance. Additionally, PR #19485 adds a lot of relevant tests to ensure good test coverage for the umath module.

I think @seberg's question may be more about where the files live, not about quality or test coverage. Perhaps taking all of numpy/core/src/umath/svml/, putting it in a separate repo (e.g. github.com/numpy/svml) and then using that as a git submodule would not be unreasonable.

@seberg (Member) commented Jul 16, 2021

Adding it to .gitattributes as numpy/core/.../svml linguist-vendored would possibly be good enough, and we should do that anyway. (I am also considering naming it svml_vendored or so, just to make it obvious when browsing the source.) That might even help hide most of the diff in the review, I think.

Also should this be in LICENSES_bundled.txt or is retaining the per-file copyright sufficient?

Overlap with SLEEF, etc. would indeed be interesting. And yeah, I don't think anyone has really thought about it seriously for a long time; Julian probably did at some point.

@rgommers (Member)

Also should this be in LICENSES_bundled.txt or is retaining the per-file copyright sufficient?

Given that there is no separate license file and it's the same BSD-3 license as the rest of NumPy, I don't see a compelling reason to list it in LICENSES_bundled.txt.

@akolesov-intel

I think there'll be many maintainers who, like me, aren't fully up to speed on what the SVML story is. It's a vectorized version of (parts of) libm,

Generally, yes. But SVML is not exactly the same as a vectorized LIBM. In most cases SVML algorithms are different from those in LIBM, and the function coverage may differ too. SVML supports several accuracy flavors: high (maximum errors up to 1 ulp), medium (up to 4 ulp), low (about half of the result's significand bits are correct), and bitwise-reproducible versions. LIBM provides only high-accuracy implementations (more accurate than SVML's, with maximum errors of up to 0.6 ulp).

parts are also upstreamed into GCC (libmvec)

Exactly: some SSE4.2, AVX, and AVX-512 SVML implementations were contributed to glibc's libmvec several years ago as open-source assembly code.

and it's similar to SLEEF and Yeppp (?)

Correct, though Yeppp only provides a small number of transcendental math functions. Our analysis a few years ago showed that SVML provides the best overall implementations for Intel Architecture processors.

Difference between VML and SVML is (?).

SVML is an internal library of the Intel compilers. VML (officially VM, or Vector Math) is part of the Intel oneAPI Math Kernel Library (oneMKL). The SVML API operates on SIMD input registers, while the VM API accepts pointers to vectors and the vector length as parameters. In general, SVML is for 'short vectors', while VM is most efficient on 'long vectors':

SVML: __m256d __svml_exp4(__m256d arg);
VM: vdExp(int length, double* arg, double* res);

@r-devulap (Member, Author)

SVML currently works only on Linux. It will need some modifications to enable it on Windows, which I think might be easier to deal with in a separate PR later.

@r-devulap (Member, Author)

Adding it to .gitattributes as numpy/core/.../svml linguist-vendored would possibly be just good enough and we should do that anyway.

Could you elaborate the steps for how to do this?

@seberg (Member) commented Jul 21, 2021

Pretty high up in that file there are some lines:

# Mark some files as vendored
numpy/linalg/lapack_lite/f2c.c linguist-vendored
numpy/linalg/lapack_lite/f2c.h linguist-vendored
tools/npy_tempita/* linguist-vendored
numpy/core/include/numpy/libdivide/* linguist-vendored

just appending one more line will do.
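The effect of the appended line can be checked with `git check-attr` in a scratch repository (the `svml_exp.s` file name below is hypothetical, just to have something to test against):

```shell
# scratch repo to verify the linguist-vendored attribute takes effect
cd "$(mktemp -d)"
git init -q .
mkdir -p numpy/core/src/umath/svml
printf 'numpy/core/src/umath/svml/* linguist-vendored\n' >> .gitattributes
touch numpy/core/src/umath/svml/svml_exp.s   # hypothetical file name
git check-attr linguist-vendored -- numpy/core/src/umath/svml/svml_exp.s
```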

@matthew-brett (Contributor)

Does SVML not work correctly on macOS?

@r-devulap (Member, Author)

Does SVML not work correctly on macOS?

Not the way it is right now. There are a few differences in assembler directives between macOS and Linux (see https://stackoverflow.com/questions/19720084/what-is-the-difference-between-assembly-on-mac-and-assembly-on-linux) and the SVML sources need an update to accommodate them. It's definitely doable, but might be easier to do in a separate patch at a later time.

@mattip mattip added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jul 25, 2021
@seiko2plus (Member) left a review comment


Just two points I have here:

  1. The old dispatching mechanism should not be used anymore; the new dispatcher is more efficient, supports all kinds of compilers, and can be managed or disabled at both build time and runtime.

  2. Why not mix SVML with universal intrinsics? It would make the code more robust and leave the door open for other architectures.

The following suggestions are putting the above points in action without losing performance:

Review suggestions (all since resolved) on: numpy/core/code_generators/generate_umath.py, numpy/core/src/umath/simd.inc.src, numpy/core/src/umath/loops.c.src
@seiko2plus seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jul 25, 2021
@r-devulap (Member, Author)

@seiko2plus Thanks for the review. This patch uses universal intrinsics now. You didn't have to write out the code for me :)

@r-devulap (Member, Author)

One of the things PR #19485 was supposed to do was add test coverage for raising FP exceptions for the ufuncs SVML vectorizes. I created the table below to ensure full coverage. It would be great if someone could review it and let me know if I am missing anything or got something wrong.

Ufunc    | overflow      | underflow      | invalid              | divide by zero
---------|---------------|----------------|----------------------|---------------
sin      | -             | -              | x = +/-np.inf        | -
cos      | -             | -              | x = +/-np.inf        | -
tan      | -             | -              | x = +/-np.inf        | -
arcsin   | -             | -              | x < -1.0, x > 1.0    | -
arccos   | -             | -              | x < -1.0, x > 1.0    | -
arctan   | -             | -              | -                    | -
sinh     | large numbers | -              | -                    | -
cosh     | large numbers | -              | -                    | -
tanh     | -             | -              | -                    | -
arcsinh  | -             | -              | -                    | -
arccosh  | -             | -              | x < 1.0              | -
arctanh  | -             | -              | x < -1.0, x > 1.0    | x = +/-1.0
cbrt     | -             | -              | -                    | -
exp2     | large numbers | some denormals | -                    | -
expm1    | large numbers | -              | -                    | -
log10    | -             | -              | x < 0                | x = 0
log1p    | -             | -              | x < -1.0             | x = -1
log2     | -             | -              | x < 0                | x = 0
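Rows of the table above can be spot-checked with `np.errstate`, which turns the FP-exception flags into Python exceptions:

```python
import numpy as np

# spot-check two rows of the table above
with np.errstate(all='raise'):
    try:
        np.arccos(np.array([2.0]))   # invalid: x > 1.0
    except FloatingPointError as e:
        print("arccos:", e)
    try:
        np.log2(np.array([0.0]))     # divide by zero: x = 0
    except FloatingPointError as e:
        print("log2:", e)
```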

@seberg (Member) commented Aug 4, 2021

and let me know if I am missing anything or if I got something wrong.

The table looks right to me (and I bet at least macOS gets it wrong, so if we want to add a test we have to deal with that). I like how tan can't reach infinity, since you can never have an input close enough to pi/2 :).

The rules are straightforward here:

  • If NaN is returned, but the input was not a NaN, it is an invalid-value error.
  • If the result should be finite and not 0, but is out of range for floating point, it is an over/underflow error.
  • If the result is inf but not an overflow and the input is not inf, it is a divide-by-zero error (it can't happen here that an inf is created from a NaN, so no need to kill brain cells thinking about it).
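The rules above can be encoded as a small classifier. This is a sketch of mine with one known limitation: without knowing the mathematically exact result, it cannot separate overflow from divide-by-zero.

```python
import numpy as np

def classify_fpe(func, x):
    """Classify which FP exception the rules above predict for a scalar
    input (a sketch: it cannot tell overflow from divide-by-zero without
    knowing the mathematically exact result)."""
    with np.errstate(all='ignore'):
        y = float(func(np.array([x], dtype=np.float64))[0])
    if np.isnan(y) and not np.isnan(x):
        return "invalid"
    if np.isinf(y) and np.isfinite(x):
        return "overflow or divide-by-zero"
    return None

print(classify_fpe(np.arccos, 2.0))   # invalid
print(classify_fpe(np.log2, 0.0))     # overflow or divide-by-zero
print(classify_fpe(np.sin, 1.0))      # None
```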

@r-devulap (Member, Author) commented Aug 4, 2021

I ran the SciPy test suite against this branch of NumPy and saw 4 test failures. All failures are related to error tolerance; the tests pass when I slightly bump the tolerance. I am not sure whether increasing the tolerance is acceptable to SciPy, but here are the changes to the tests that would require it: https://github.com/r-devulap/scipy/pull/1/files
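The kind of tolerance bump described here looks like this in practice (the numbers below are hypothetical, not taken from the SciPy tests): when an implementation change moves only the low-order bits of a result, an `assert_allclose` with a slightly looser `rtol` still passes.

```python
import numpy as np
from numpy.testing import assert_allclose

# Hypothetical values: a result whose low-order bits moved by a few ULPs,
# as can happen when switching between libm and SVML implementations.
expected = 0.123456789012345
got = expected * (1 + 4e-15)

assert_allclose(got, expected, rtol=1e-13)    # passes with a slightly looser rtol
# assert_allclose(got, expected, rtol=1e-16)  # would fail: tighter than the drift
```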

@r-devulap (Member, Author)

ping

@r-devulap (Member, Author)

Friendly ping. Any thoughts/concerns with this patch?

@rgommers (Member) commented Oct 2, 2021

Friendly ping. Any thoughts/concerns with this patch?

Not really any concerns anymore; I think it's more a "who dares to do the final review and push the green button, because there's a lot of code here and something is likely to break" kind of situation.

@r-devulap (Member, Author)

Not really any concerns anymore; I think it's more a "who dares to do the final review and push the green button, because there's a lot of code here and something is likely to break" kind of situation.

I will be happy to discuss how we can reinforce PR #19485 with more test coverage, if that helps.

@mattip mattip merged commit 1eff1c5 into numpy:main Oct 10, 2021
@mattip (Member) commented Oct 10, 2021

Thanks @r-devulap. We have some time before the release to let this percolate through downstream users who are early adopters. Hopefully that will flush out any concerns about precision.

mhvk added a commit to mhvk/astropy that referenced this pull request Jan 1, 2022
Most likely as a consequence of numpy now supporting the faster but
slightly less precise Intel Short Vector Math Library (SVML), some of
our tests are failing, in ways that are not really interesting.  So,
up the tolerance slightly.

See numpy/numpy#19478
mhvk added a commit to mhvk/astropy that referenced this pull request Jan 2, 2022
pllim pushed a commit to mhvk/astropy that referenced this pull request Jan 3, 2022
pllim pushed a commit to mhvk/astropy that referenced this pull request Jan 3, 2022

Temporarily pin numpy for matplotlib tests

Ensure theta of Gaussian2D is initialized
jobovy added a commit to jobovy/galpy that referenced this pull request Jan 6, 2022
…y#19478), but only pass for exact 1.22 so we can re-evaluate next version
jobovy added a commit to jobovy/galpy that referenced this pull request Jan 7, 2022
…y#19478), but only pass for exact 1.22 so we can re-evaluate next version; also includes better re-implementation of XYZ_to_lbd
Labels
01 - Enhancement 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes component: SIMD Issues in SIMD (fast instruction sets) code or machinery triage review Issue/PR to be discussed at the next triage meeting

9 participants