
ENH: Vectorizing umath module using AVX-512 (open sourced from Intel Short Vector Math Library, SVML) #19478

Merged
merged 17 commits into numpy:main on Oct 10, 2021

Conversation

@r-devulap (Member)

This patch integrates the Intel Short Vector Math Library (SVML) into NumPy. SVML provides AVX-512 implementations of 44 math functions: exp, exp2, log, log2, log10, expm1, log1p, cbrt, pow, sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, and atanh (each in single and double precision). Some key points to note:

  1. On a Skylake i9-7900X CPU, the library provides a speedup of up to 55x (for the cbrt function). The average speedup from this patch is 32x for single precision and 14x for double precision. More detailed numbers are provided below.
  2. The maximum ULP error across all the implementations is 4 ULP.
  3. This patch increases the multiarray shared object file size from 29226320 bytes to 30147792 bytes, an increase of 899.875 KiB (compiled on Ubuntu 20.04 with gcc-11).
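The size delta quoted in point 3 can be sanity-checked with a couple of lines of arithmetic:

```python
# sanity-check the size delta quoted in point 3
before, after = 29_226_320, 30_147_792
delta = after - before
print(delta, "bytes =", delta / 1024, "KiB")  # 921472 bytes = 899.875 KiB
```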

Detailed benchmarking numbers:

   before           after         ratio
     [fe3d7170]       [6049dc65]
     <main>           <svml>
-      459±0.05μs         61.2±1μs     0.13  bench_ufunc_strides.Unary.time_ufunc('exp2', 'd')
-        2.05±0ms          212±4μs     0.10  bench_ufunc_strides.Unary.time_ufunc('tanh', 'd')
-       871±0.8μs         86.4±3μs     0.10  bench_ufunc_strides.Unary.time_ufunc('cos', 'd')
-       856±0.3μs         76.6±8μs     0.09  bench_ufunc_strides.Unary.time_ufunc('log2', 'd')
-        2.66±0ms        238±0.1μs     0.09  bench_ufunc_strides.Unary.time_ufunc('arccosh', 'd')
-        1.33±0ms         107±10μs     0.08  bench_ufunc_strides.Unary.time_ufunc('log1p', 'd')
-     2.98±0.01ms         229±10μs     0.08  bench_ufunc_strides.Unary.time_ufunc('arcsinh', 'd')
-        1.05±0ms         78.5±5μs     0.07  bench_ufunc_strides.Unary.time_ufunc('sin', 'd')
-        1.11±0ms         78.1±6μs     0.07  bench_ufunc_strides.Unary.time_ufunc('log10', 'd')
-        1.66±0ms         115±10μs     0.07  bench_ufunc_strides.Unary.time_ufunc('arccos', 'd')
-        1.24±0ms         79.6±4μs     0.06  bench_ufunc_strides.Unary.time_ufunc('expm1', 'd')
-        1.72±0ms          107±8μs     0.06  bench_ufunc_strides.Unary.time_ufunc('arcsin', 'd')
-        1.33±0ms         81.6±6μs     0.06  bench_ufunc_strides.Unary.time_ufunc('cosh', 'd')
-        1.94±0ms          118±7μs     0.06  bench_ufunc_strides.Unary.time_ufunc('sinh', 'd')
-        2.69±0ms         158±10μs     0.06  bench_ufunc_strides.Unary.time_ufunc('arctanh', 'd')
-       366±0.1μs         20.7±2μs     0.06  bench_ufunc_strides.Unary.time_ufunc('exp2', 'f')
-        2.29±0ms         121±10μs     0.05  bench_ufunc_strides.Unary.time_ufunc('arctan', 'd')
-        2.16±0ms          111±6μs     0.05  bench_ufunc_strides.Unary.time_ufunc('cbrt', 'd')
-      427±0.08μs         21.7±3μs     0.05  bench_ufunc_strides.Unary.time_ufunc('log2', 'f')
-        1.05±0ms         46.4±6μs     0.04  bench_ufunc_strides.Unary.time_ufunc('arccos', 'f')
-        2.36±0ms          103±7μs     0.04  bench_ufunc_strides.Unary.time_ufunc('tan', 'd')
-        1.24±0ms         52.9±6μs     0.04  bench_ufunc_strides.Unary.time_ufunc('log1p', 'f')
-     1.18±0.01ms         48.9±7μs     0.04  bench_ufunc_strides.Unary.time_ufunc('arctan', 'f')
-         974±3μs         38.9±6μs     0.04  bench_ufunc_strides.Unary.time_ufunc('arcsin', 'f')
-        2.32±0ms       89.9±0.5μs     0.04  bench_ufunc_strides.Unary.time_ufunc('arccosh', 'f')
-        2.58±0ms        86.3±10μs     0.03  bench_ufunc_strides.Unary.time_ufunc('arcsinh', 'f')
-        1.59±0ms         50.2±7μs     0.03  bench_ufunc_strides.Unary.time_ufunc('tan', 'f')
-       826±0.1μs         21.8±3μs     0.03  bench_ufunc_strides.Unary.time_ufunc('log10', 'f')
-        1.17±0ms         30.8±4μs     0.03  bench_ufunc_strides.Unary.time_ufunc('expm1', 'f')
-     1.94±0.01ms         49.6±4μs     0.03  bench_ufunc_strides.Unary.time_ufunc('tanh', 'f')
-        1.33±0ms         33.5±4μs     0.03  bench_ufunc_strides.Unary.time_ufunc('cosh', 'f')
-        2.45±0ms         57.3±9μs     0.02  bench_ufunc_strides.Unary.time_ufunc('arctanh', 'f')
-        1.98±0ms         42.1±5μs     0.02  bench_ufunc_strides.Unary.time_ufunc('sinh', 'f')
-        2.09±0ms         38.3±5μs     0.02  bench_ufunc_strides.Unary.time_ufunc('cbrt', 'f')
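The asv numbers above can be roughly reproduced with `timeit` (array size and iteration count below are my own choices, not asv's; absolute times will of course differ by machine and by whether AVX-512 SVML loops are in use):

```python
# Rough re-run of one of the micro-benchmarks above, outside of asv.
import timeit
import numpy as np

x = np.linspace(0.1, 10.0, 100_000, dtype=np.float64)
out = np.empty_like(x)

t = timeit.timeit(lambda: np.exp2(x, out=out), number=200)
print(f"exp2 (float64, contiguous): {t / 200 * 1e6:.1f} us per call")

# strided variant, closer in spirit to bench_ufunc_strides
xs, outs = x[::2], out[::2]
ts = timeit.timeit(lambda: np.exp2(xs, out=outs), number=200)
print(f"exp2 (float64, stride 2):   {ts / 200 * 1e6:.1f} us per call")
```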

@mattip (Member) commented Jul 15, 2021

There is a failing test for x**p.

While all the x86 wheels will get the 1MB binary size increase, only AVX512_SKX-capable processors (gen11?) will be able to use it. On the other hand, an order of magnitude speedup or more for these traditionally heavy operations is great news.

I see in the library description there is an option to get 1ULP precision. How much does that affect speed? What is the default for other consumers of this library?

@mattip (Member) commented Jul 15, 2021

Maybe worth splitting the validator data and benchmark commits into a separate PR

@thiagomacieira

There is a failing test for x**p.

While all the x86 wheels will get the 1MB binary size increase, only AVX512_SKX-capable processors (gen11?) will be able to use it. On the other hand, an order of magnitude speedup or more for these traditionally heavy operations is great news.

Quick correction on terms:

  • AVX512 has been present on processors since the Skylake microarchitecture, first made available in the Xeon Scalable products and later in the 7th Generation Core i9 Extreme (the X suffix). That's the one this was tested on (i9-7900X).
  • AVX512 is also present in laptops since the Ice Lake processors, which launched as "10th Generation Core". The 11th Generation is Tiger Lake.
  • Intel Graphics also uses "gen" for its non-marketing names, but the numbering is slightly off: the GPU in Skylake was gen9; the one in Ice Lake is gen11 and the one in Tiger Lake is gen12.

Therefore, there are a lot of processors out there, especially in the cloud, where the new routines will be beneficial. For those of us testing on our own laptops, AVX512 has been available since 2020.

@r-devulap (Member, Author)

Maybe worth splitting the validator data and benchmark commits into a separate PR

Split them into a separate PR: see #19485

@r-devulap (Member, Author)

There is a failing test for x**p.

Looking into it..

I see in the library description there is an option to get 1ULP precision. How much does that affect speed? What is the default for other consumers of this library?

I haven't benchmarked those, but I would expect them to be considerably slower than the 4 ULP implementations.

@seberg seberg added the triage review Issue/PR to be discussed at the next triage meeting label Jul 15, 2021
@seberg (Member) commented Jul 15, 2021

The maximum ULP error across all the implementations is 4 ULP.

Is that so? The table sounds like the actual ULP error for most functions is much lower. Or do these tables list "typical" ULP rather than maximum ULP errors?

It is maybe a bit knee-jerk, but the sheer number of lines makes me wonder if there is a way to avoid vendoring all of the files?

@r-devulap (Member, Author)

The maximum ULP error across all the implementations is 4 ULP.

Is that so? The table sounds like the actual ULP error for most functions is much lower. Or do these tables list "typical" ULP rather than maximum ULP errors?

The table does list the maximum ULP error and not an average. And the maximum ULP error for the low accuracy (LA) SVML is 4 ULP. Based on the table, double precision tan seems to be the worst offender with a max ULP error of 3.53. But yes, for most other functions it seems to be less than 4.
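For intuition, a maximum ULP error like the one discussed here can be estimated directly: compare a single-precision evaluation against a double-precision reference and divide by the size of one float32 ULP at the reference value. This is a sketch of mine, not NumPy's own accuracy test harness:

```python
import numpy as np

def max_ulp_error(func, ref, x):
    """Max ULP distance between a ufunc evaluated in float32 and a
    float64 reference (a sketch, not NumPy's own test harness)."""
    y = func(x.astype(np.float32)).astype(np.float64)
    r = ref(x.astype(np.float64))
    # size of one float32 ULP at the reference value's magnitude
    ulp = np.spacing(np.abs(r).astype(np.float32)).astype(np.float64)
    return float(np.max(np.abs(y - r) / ulp))

x = np.linspace(-0.99, 0.99, 10_001)
print("arcsin float32 max ULP error:", max_ulp_error(np.arcsin, np.arcsin, x))
```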

It is maybe a bit knee-jerk, but the sheer number of lines makes me wonder if there is a way to avoid vendoring all of the files?

Most of the code in this patch comes from the SVML library, which has been in use by its customers for a good few years now (since Skylake-X was introduced). I expect the library itself to be relatively low maintenance. Additionally, PR #19485 adds a lot of relevant tests to ensure good test coverage for the umath module.

@rgommers (Member) commented Jul 16, 2021

Interesting, thanks @r-devulap! It's good to see SVML code under a BSD-3 license.

Performance gains sound relevant. This is a lot of code, so I think we'll have to do some higher-level benchmarking too, not just micro-benchmarks.

I think there'll be many maintainers who, like me, aren't fully up to speed on what the SVML story is. It's a vectorized version of (parts of) libm; parts are also upstreamed into GCC (libmvec), I believe, and it's similar to SLEEF and Yeppp (?). The difference between VML and SVML is (?). It's a bit hard to find a clear and comprehensive overview; I found this paper somewhat helpful. If you have some good resources to help educate us, that'd be great.

EDIT: the last (only?) time we discussed this on the mailing list was March 2015, in this thread: https://mail.python.org/pipermail/numpy-discussion/2015-March/072406.html

@rgommers (Member)

It is maybe a bit knee-jerk, but the sheer number of lines makes me wonder if there is a way to avoid vendoring all of the files?

Most of the code in this patch comes from the SVML library, which has been in use by its customers for a good few years now (since Skylake-X was introduced). I expect the library itself to be relatively low maintenance. Additionally, PR #19485 adds a lot of relevant tests to ensure good test coverage for the umath module.

I think @seberg's question may be more about where the files live, not about quality or test coverage. Perhaps taking all of numpy/core/src/umath/svml/, putting it in a separate repo (e.g. github.com/numpy/svml) and then using that as a git submodule would not be unreasonable.

@seberg (Member) commented Jul 16, 2021

Adding it to .gitattributes as numpy/core/.../svml linguist-vendored would possibly be good enough, and we should do that anyway. (I am also considering naming it svml_vendored or so, just to make it obvious when browsing the source.) That might even help hide most of the diff in the review, I think.

Also should this be in LICENSES_bundled.txt or is retaining the per-file copyright sufficient?

Overlap with SLEEF, etc. would indeed be interesting. And yeah, I don't think anyone has really thought about it seriously for a long time; Julian probably did at some point.

@rgommers (Member)

Also should this be in LICENSES_bundled.txt or is retaining the per-file copyright sufficient?

Given that there is no separate license file and it's the same BSD-3 license as the rest of NumPy, I don't see a compelling reason to list it in LICENSES_bundled.txt.

@akolesov-intel

I think there'll be many maintainers who, like me, aren't fully up to speed on what the SVML story is. It's a vectorized version of (parts of) libm,

Generally, yes. But SVML is not exactly the same as a vectorized LIBM. In most cases SVML algorithms are different from those in LIBM, and the function coverage may differ too. SVML supports several accuracy flavors: high (maximum errors up to 1 ulp), medium (up to 4 ulp), low (about half of the result's significand bits are correct), and bitwise-reproducible versions. LIBM provides only high-accuracy implementations (more accurate than SVML's, with maximum errors of up to 0.6 ulp).

parts are also upstreamed into GCC (libmvec)

Exactly: some SSE4.2, AVX, and AVX-512 SVML implementations were contributed to glibc's libmvec several years ago as open-source assembly code.

and it's similar to SLEEF and Yeppp (?)

Correct, though Yeppp only provides a small number of transcendental math functions. Our analysis a few years ago showed that SVML provides the best overall implementations for Intel Architecture processors.

Difference between VML and SVML is (?).

SVML is an internal library of the Intel compilers. VML (officially VM, or Vector Math) is part of the Intel oneAPI Math Kernel Library (oneMKL). The SVML API operates on SIMD input registers, while the VM API accepts pointers to vectors and the vector length as parameters. In general, SVML is for 'short vectors', while VM is most efficient on 'long vectors':

SVML: __m256d __svml_exp4(__m256d arg);
VM: vdExp(int length, double* arg, double* res);

@r-devulap (Member, Author)

SVML currently works only on Linux. It will need some modifications to enable it on Windows, which I think might be easier to deal with in a separate PR later.

@r-devulap (Member, Author)

Adding it to .gitattributes as numpy/core/.../svml linguist-vendored would possibly be just good enough and we should do that anyway.

Could you elaborate the steps for how to do this?

@seberg (Member) commented Jul 21, 2021

Pretty high up in that file there are some lines:

# Mark some files as vendored
numpy/linalg/lapack_lite/f2c.c linguist-vendored
numpy/linalg/lapack_lite/f2c.h linguist-vendored
tools/npy_tempita/* linguist-vendored
numpy/core/include/numpy/libdivide/* linguist-vendored

just appending one more line will do.
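The effect of the appended line can be checked with `git check-attr` in a scratch repository (the `svml_exp.s` file name below is hypothetical, just to have something to test against):

```shell
# scratch repo to verify the linguist-vendored attribute takes effect
cd "$(mktemp -d)"
git init -q .
mkdir -p numpy/core/src/umath/svml
printf 'numpy/core/src/umath/svml/* linguist-vendored\n' >> .gitattributes
touch numpy/core/src/umath/svml/svml_exp.s   # hypothetical file name
git check-attr linguist-vendored -- numpy/core/src/umath/svml/svml_exp.s
```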

@matthew-brett (Contributor)

Does SVML not work correctly on macOS?

@r-devulap (Member, Author)

Does SVML not work correctly on macOS?

Not the way it is right now. There are a few differences in assembler directives between macOS and Linux (see https://stackoverflow.com/questions/19720084/what-is-the-difference-between-assembly-on-mac-and-assembly-on-linux) and the SVML sources need an update to accommodate them. It's definitely doable, but might be easier to do in a separate patch at a later time.

@mattip mattip added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jul 25, 2021
@seiko2plus (Member) left a review comment


Just two points I have here:

  1. The old dispatching mechanism should not be used anymore; the new dispatcher is more efficient, supports all kinds of compilers, and can be managed or disabled at both build time and runtime.

  2. Why not mix SVML with universal intrinsics? It would make the code more robust and leave the door open for other architectures.

The following suggestions are putting the above points in action without losing performance:

Review suggestions (all since resolved) on: numpy/core/code_generators/generate_umath.py, numpy/core/src/umath/simd.inc.src, numpy/core/src/umath/loops.c.src
@seiko2plus seiko2plus added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Jul 25, 2021
@r-devulap (Member, Author)

@seiko2plus Thanks for the review. This patch uses universal intrinsics now. You didn't have to write out the code for me :)

@r-devulap (Member, Author)

One of the things PR #19485 was supposed to do was add test coverage for raising FP exceptions for the ufuncs SVML vectorizes. I created the table below to ensure full coverage. It would be great if someone could review it and let me know if I am missing anything or got something wrong.

Ufunc    | overflow      | underflow      | invalid              | divide by zero
---------|---------------|----------------|----------------------|---------------
sin      | -             | -              | x = +/-np.inf        | -
cos      | -             | -              | x = +/-np.inf        | -
tan      | -             | -              | x = +/-np.inf        | -
arcsin   | -             | -              | x < -1.0, x > 1.0    | -
arccos   | -             | -              | x < -1.0, x > 1.0    | -
arctan   | -             | -              | -                    | -
sinh     | large numbers | -              | -                    | -
cosh     | large numbers | -              | -                    | -
tanh     | -             | -              | -                    | -
arcsinh  | -             | -              | -                    | -
arccosh  | -             | -              | x < 1.0              | -
arctanh  | -             | -              | x < -1.0, x > 1.0    | x = +/-1.0
cbrt     | -             | -              | -                    | -
exp2     | large numbers | some denormals | -                    | -
expm1    | large numbers | -              | -                    | -
log10    | -             | -              | x < 0                | x = 0
log1p    | -             | -              | x < -1.0             | x = -1
log2     | -             | -              | x < 0                | x = 0
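Rows of the table above can be spot-checked with `np.errstate`, which turns the FP-exception flags into Python exceptions:

```python
import numpy as np

# spot-check two rows of the table above
with np.errstate(all='raise'):
    try:
        np.arccos(np.array([2.0]))   # invalid: x > 1.0
    except FloatingPointError as e:
        print("arccos:", e)
    try:
        np.log2(np.array([0.0]))     # divide by zero: x = 0
    except FloatingPointError as e:
        print("log2:", e)
```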

@seberg (Member) commented Aug 4, 2021

and let me know if I am missing anything or if I got something wrong.

The table looks right to me (and I bet at least macOS gets it wrong, so if we want to add a test we have to deal with that). I like how tan can't reach infinity, since you can never have an input close enough to pi/2 :).

The rules are straightforward here:

  • If NaN is returned, but the input was not a NaN, it is an invalid-value error.
  • If the result should be finite and not 0, but is out of range for floating point, it is an over/underflow error.
  • If the result is inf but not an overflow and the input is not inf, it is a divide-by-zero error (it can't happen here that an inf is created from a NaN, so no need to kill brain cells thinking about it).
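The rules above can be encoded as a small classifier. This is a sketch of mine with one known limitation: without knowing the mathematically exact result, it cannot separate overflow from divide-by-zero.

```python
import numpy as np

def classify_fpe(func, x):
    """Classify which FP exception the rules above predict for a scalar
    input (a sketch: it cannot tell overflow from divide-by-zero without
    knowing the mathematically exact result)."""
    with np.errstate(all='ignore'):
        y = float(func(np.array([x], dtype=np.float64))[0])
    if np.isnan(y) and not np.isnan(x):
        return "invalid"
    if np.isinf(y) and np.isfinite(x):
        return "overflow or divide-by-zero"
    return None

print(classify_fpe(np.arccos, 2.0))   # invalid
print(classify_fpe(np.log2, 0.0))     # overflow or divide-by-zero
print(classify_fpe(np.sin, 1.0))      # None
```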

@r-devulap (Member, Author) commented Aug 4, 2021

I ran the SciPy test suite against this branch of NumPy and saw 4 test failures. All failures are related to error tolerance; the tests pass when I slightly bump the tolerance. I am not sure whether increasing the tolerance is acceptable to SciPy, but here are the changes to the tests that would require it: https://github.com/r-devulap/scipy/pull/1/files
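The kind of tolerance bump described here looks like this in practice (the numbers below are hypothetical, not taken from the SciPy tests): when an implementation change moves only the low-order bits of a result, an `assert_allclose` with a slightly looser `rtol` still passes.

```python
import numpy as np
from numpy.testing import assert_allclose

# Hypothetical values: a result whose low-order bits moved by a few ULPs,
# as can happen when switching between libm and SVML implementations.
expected = 0.123456789012345
got = expected * (1 + 4e-15)

assert_allclose(got, expected, rtol=1e-13)    # passes with a slightly looser rtol
# assert_allclose(got, expected, rtol=1e-16)  # would fail: tighter than the drift
```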

@r-devulap (Member, Author)

ping

@r-devulap (Member, Author)

Friendly ping. Any thoughts/concerns with this patch?

@rgommers (Member) commented Oct 2, 2021

Friendly ping. Any thoughts/concerns with this patch?

Not really any concerns anymore; I think it's more a "who dares to do the final review and push the green button, because there's a lot of code here and something is likely to break" kind of situation.

@r-devulap (Member, Author)

Not really any concerns anymore; I think it's more a "who dares to do the final review and push the green button, because there's a lot of code here and something is likely to break" kind of situation.

I will be happy to discuss how we can reinforce PR #19485 with more test coverage, if that helps.

@mattip mattip merged commit 1eff1c5 into numpy:main Oct 10, 2021
@mattip (Member) commented Oct 10, 2021

Thanks @r-devulap. We have some time before the release to let this percolate through downstream users who are early adopters. Hopefully that will flush out any concerns about precision.

mhvk added a commit to mhvk/astropy that referenced this pull request Jan 1, 2022
Most likely as a consequence of numpy now supporting the faster but
slightly less precise Intel Short Vector Math Library (SVML), some of
our tests are failing, in ways that are not really interesting.  So,
up the tolerance slightly.

See numpy/numpy#19478
mhvk added a commit to mhvk/astropy that referenced this pull request Jan 2, 2022
pllim pushed a commit to mhvk/astropy that referenced this pull request Jan 3, 2022
pllim pushed a commit to mhvk/astropy that referenced this pull request Jan 3, 2022

Temporarily pin numpy for matplotlib tests

Ensure theta of Gaussian2D is initialized
jobovy added a commit to jobovy/galpy that referenced this pull request Jan 6, 2022
…y#19478), but only pass for exact 1.22 so we can re-evaluate next version
jobovy added a commit to jobovy/galpy that referenced this pull request Jan 7, 2022
…y#19478), but only pass for exact 1.22 so we can re-evaluate next version; also includes better re-implementation of XYZ_to_lbd
Labels
01 - Enhancement 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes component: SIMD Issues in SIMD (fast instruction sets) code or machinery triage review Issue/PR to be discussed at the next triage meeting

9 participants