ENH: Vectorizing umath module using AVX-512 (open sourced from Intel Short Vector Math Library, SVML) #19478
Conversation
There is a failing test as well. While all the x86 wheels will get the 1MB binary size increase, only AVX512_SKX-capable processors (gen11?) will be able to use it. On the other hand, an order of magnitude speedup or more for these traditionally heavy operations is great news. I see in the library description there is an option to get 1 ULP precision. How much does that affect speed? What is the default for other consumers of this library?
Maybe worth splitting the validator data and benchmark commits into a separate PR.
Quick correction on terms:
Therefore, there are a lot of processors out there, especially in the cloud, where the new routines will be beneficial. For those of us testing on our own laptops, AVX512 has been available since 2020.
Split them into separate PRs: see #19485.
Looking into it..
I haven't benchmarked these but I would think they would be considerably slower than 4 ULP implementations.
Is that so? The table sounds a lot like the actual ULP errors for most functions are much lower. Or do these tables list "typical" ULP rather than maximum ULP errors? It is maybe a bit knee-jerk, but the sheer number of lines makes me wonder if there is a way to avoid vendoring all of these files.
The table does list the maximum ULP error and not an average, and the maximum ULP error for the low accuracy (LA) SVML is 4 ULP. Based on the table, double precision …
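For readers unfamiliar with the unit, a ULP error can be measured directly. The sketch below (a hypothetical helper using only the Python standard library, not code from this PR) counts the error of an approximation in units in the last place of the exact value:

```python
import math

def ulp_error(approx: float, exact: float) -> float:
    """Error of `approx` relative to `exact`, measured in units in
    the last place (ULP) of the exact value."""
    if math.isnan(approx) or math.isnan(exact):
        return math.inf
    if exact == 0.0:
        return 0.0 if approx == 0.0 else math.inf
    return abs(approx - exact) / math.ulp(exact)

# A result one representable double above 1.0 is off by exactly 1 ULP.
one_up = math.nextafter(1.0, 2.0)
print(ulp_error(one_up, 1.0))  # 1.0
```

(`math.ulp` and `math.nextafter` require Python 3.9+.) A "maximum ULP error of 4" then means this quantity never exceeds 4 over the function's whole domain.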
Most of the code in this patch comes from the SVML library which has been in use by its customers for a good few years now (since SkylakeX was introduced). I expect the library itself would be relatively low maintenance. Additionally, PR #19485 adds a lot of relevant tests to ensure good test coverage for the umath module.
Interesting, thanks @r-devulap! It's good to see SVML code under a BSD-3 license. Performance gains sound relevant. This is a lot of code, so we'll have to do some higher-level benchmarking too I think, not just micro-benchmarks. I think there'll be many maintainers who, like me, aren't fully up to speed on what the SVML story is. It's a vectorized version of (parts of) libm, correct? EDIT: the last (only?) time we discussed this on the mailing list was March 2015, in this thread: https://mail.python.org/pipermail/numpy-discussion/2015-March/072406.html
I think @seberg's question may be more about where the files live, not about quality or test coverage. Perhaps taking all of …
Adding it to … Also, should this be in …? Overlap with SLEEF, etc. would indeed be interesting. And yeah, I don't think anyone really thought about it seriously for a long time; Julian probably did at some point.
Given that there is no separate license file and it's the same BSD-3 license as the rest of NumPy, I don't see a compelling reason to list it in …
Generally, yes. But SVML is not exactly the same as a vectorized LIBM. In most cases SVML algorithms are different from those in LIBM, and the function coverage may be different too. SVML supports several accuracy flavors: high (maximum errors up to 1 ulp), medium (up to 4 ulp), low (about half of the result’s significand bits are correct), and bitwise reproducible versions. LIBM provides only high accuracy implementations (more accurate than in SVML, with maximum errors of up to 0.6 ulp).
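In relative-error terms for IEEE double precision (52 explicit significand bits), those accuracy tiers work out roughly as follows — a back-of-the-envelope sketch, not figures taken from the SVML documentation:

```python
# 1 ULP at values near 1.0 in double precision is 2**-52.
ulp = 2.0 ** -52

high   = 1 * ulp      # "high" accuracy: max error up to 1 ulp   (~2.2e-16)
medium = 4 * ulp      # "medium" accuracy: up to 4 ulp           (~8.9e-16)
low    = 2.0 ** -26   # "low": ~half the significand bits correct (~1.5e-8)

print(f"{high:.1e} {medium:.1e} {low:.1e}")
```

So the medium (4 ULP) flavor NumPy would use is still within a factor of a few of the best achievable double-precision relative error, while the "low" flavor is a very different accuracy regime.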
Exactly, some SSE4.2, AVX and AVX-512 SVML implementations were provided several years ago to GLIBC in libmvec, as Open Source assembly code.
Correct, though Yeppp only provides a small number of transcendental math functions. Our analysis from a few years ago showed that SVML provides the best overall implementations for Intel Architecture processors.
SVML is an internal library of the Intel compilers. VML (officially VM, or Vector Math) is part of the Intel oneAPI Math Kernel Library (oneMKL). The SVML API operates on SIMD input registers, while the VM API accepts pointers to vectors and the vector length as parameters. In general SVML is for ‘short vectors’, while VM is most efficient on ‘long vectors’. For example: SVML: `__m256 __svml_exp4(__m256 arg);`
SVML currently works only on Linux. It will need some modifications to be enabled on Windows, which I think might be easier to deal with in a separate PR later.
Could you elaborate on the steps for how to do this?
Pretty high in that file there are some lines:
just appending one will do. |
Does SVML not work correctly on macOS?
Not the way it is right now. There are a few differences in assembler directives between macOS and Linux (see https://stackoverflow.com/questions/19720084/what-is-the-difference-between-assembly-on-mac-and-assembly-on-linux) and the SVML sources need an update to accommodate these. It's definitely doable, but might be easier to do in a separate patch at a later time.
Just two points I have here:
- The old dispatching mechanism should not be used anymore; the new dispatcher is more efficient, supports all kinds of compilers, and can be disabled or managed at runtime and build time.
- Why not mix SVML with universal intrinsics? It would make the code more robust and leave the door open for other architectures.
The following suggestions put the above points into action without losing performance:
@seiko2plus Thanks for the review. This patch uses universal intrinsics now. You didn't have to write out the code for me :)
One of the things PR #19485 was supposed to do was add test coverage for raising FP exceptions for the ufuncs SVML vectorizes. I had created this table to ensure full coverage. It would be great if someone could review the table below and let me know if I am missing anything or if I got something wrong.
The table looks right to me (and I bet at least macOS gets it wrong, so if we would want to add a test we have to deal with that). I like how … The rules are straightforward here:
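As an illustration of what such a test could look like (a hypothetical sketch, not the actual tests from PR #19485): NumPy's `errstate` turns the FP-status flags a ufunc sets into Python exceptions, and a long enough input makes it plausible that a SIMD code path is exercised:

```python
import numpy as np

# log of a negative value must set the "invalid" FP flag, whether the
# scalar loop or a vectorized (e.g. SVML) loop handled the input.
x = np.full(1024, -1.0)  # long enough to plausibly take a SIMD path

with np.errstate(invalid="raise"):
    try:
        np.log(x)
        raised = False
    except FloatingPointError:
        raised = True

print(raised)  # True
```

The same pattern applies to the other flag/ufunc pairs (e.g. `divide` for `log(0)`, `overflow` for `exp` of large inputs).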
I ran the SciPy test suite built against this branch of NumPy and saw 4 test failures. All failures are related to error tolerance; the tests pass when I slightly bump the tolerance. I am not sure whether increasing the tolerance is acceptable to SciPy, but here are the changes to the tests that would require it: https://github.com/r-devulap/scipy/pull/1/files
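For context on how small such a tolerance bump can be: a 4 ULP bound in double precision corresponds to a tiny relative error, so for a value produced by a single vectorized call an `rtol` on the order of 1e-15 already absorbs it (a rough sketch; real test failures can accumulate more error through longer computations):

```python
# Maximum relative error permitted by a 4 ULP bound in double precision.
max_rel_err = 4 * 2.0 ** -52
print(f"{max_rel_err:.2e}")  # 8.88e-16

# So for a result of a single SVML call, rtol=1e-15 is already enough.
assert max_rel_err < 1e-15
```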
ping |
Friendly ping. Any thoughts/concerns with this patch? |
Not any concerns anymore, really. I think it's more of a "who dares to do the final review and push the green button, because there's a lot of code here and something is likely to break" kind of situation.
will be happy to discuss how we can reinforce PR #19485 with more test coverage, if that helps.. |
Thanks @r-devulap. We have some time before the release to get this to percolate through downstream users who are early adopters. Hopefully that will flush out any concerns about precision. |
Most likely as a consequence of numpy now supporting the faster but slightly less precise Intel Short Vector Math Library (SVML), some of our tests are failing, in ways that are not really interesting. So, up the tolerance slightly. See numpy/numpy#19478
Temporarily pin numpy for matplotlib tests
Ensure theta of Gaussian2D is initialized
…y#19478), but only pass for exact 1.22 so we can re-evaluate next version
…y#19478), but only pass for exact 1.22 so we can re-evaluate next version; also includes better re-implementation of XYZ_to_lbd
This patch integrates Intel Short Vector Math Library (SVML) into NumPy. SVML provides AVX-512 implementations of 44 math functions:
exp, exp2, log, log2, log10, expm1, log1p, cbrt, pow, sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh and atanh
(both single and double precision). Some key points to note:

Detailed benchmarking numbers:
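A minimal micro-benchmark along these lines can be run with only NumPy and the standard library (a sketch: the array size and repeat count are arbitrary choices, and whether the SVML path is actually taken depends on the CPU and the build):

```python
import timeit
import numpy as np

# Inputs in [0.5, 1.5) so exp, log and sin are all well defined.
a = np.random.default_rng(0).random(100_000) + 0.5

for fn in (np.exp, np.log, np.sin):
    t = timeit.timeit(lambda: fn(a), number=200) / 200
    print(f"{fn.__name__}: {t * 1e6:.1f} us per call")
```

Comparing these timings on an AVX512_SKX machine against a build without SVML enabled is the simplest way to reproduce the speedups claimed here.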