MAINT: refactor _Distutils out of CCompiler_Opt #46
@mattip, Can we discuss it here? I guess yes :)
1. The number of loops is (the number of additional targeted optimizations) + the baseline. For example (pseudocode for a ufunc):

```c
void ufunc_inner_loop_AVX512F(args) {
    // AVX512F code
}
void ufunc_inner_loop_AVX2(args) {
    // AVX2 code
}
void ufunc_inner_loop(args) {
    // we tend not to add a suffix to the baseline function,
    // so it stays friendly to static assignment
}
```

Now, dispatch the highest-priority target at runtime during the load of the umath module:

```c
static ufunc_inner_pointer = ufunc_inner_loop; // baseline

void umath_module_load()
{
    // check the machine support and cache the result in memory
    npy_cpu_init();
    // ... other code
    // runtime dispatch
    if (NPY_CPU_HAVE(AVX512F)) {
        ufunc_inner_pointer = ufunc_inner_loop_AVX512F;
    } else if (NPY_CPU_HAVE(AVX2)) {
        ufunc_inner_pointer = ufunc_inner_loop_AVX2;
    }
    // ... other code
}
```
2. What's NPY__CPU_DISPATCH_CALL?

```c
// this is how it looks if we are targeting AVX512F and AVX2
#define NPY__CPU_DISPATCH_CALL(CHK, CB, ...) \
    NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX512F)), AVX512F, __VA_ARGS__)) \
    NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX)&&CHK(F16C)&&CHK(AVX2)), AVX2, __VA_ARGS__))
```

The args are explained as follows:

3. What's the use of NPY__CPU_DISPATCH_CALL? Currently, we use it for two things: A) in declaring a prototype, see: B) in dispatching a call or an assignment, see
4. Yes, we already do this with the ufunc, as I explained in the first point. See:

```c
void simd_packbits_AVX512F(args) {
    // AVX512F code
}
void simd_packbits_AVX2(args) {
    // AVX2 code
}
void simd_packbits(args) {
    // baseline
}
```

Check and call:

```c
void the_callee(args)
{
#if 0
    // check the CPU support and cache the result in memory;
    // we call it one time during the load of the umath module
    npy_cpu_init();
#endif
    // runtime dispatch
    if (NPY_CPU_HAVE(AVX512F)) {
        simd_packbits_AVX512F(some_args);
    } else if (NPY_CPU_HAVE(AVX2)) {
        simd_packbits_AVX2(some_args);
    } else {
        simd_packbits(some_args);
    }
}
```

It will look bad only if you are going to deal with many targets, since we check the CPU support only one time. However, that's not an issue for the dispatcher; it gives you the ability to write either:
```c
void the_callee(args)
{
    // ....
    NPY_CPU_DISPATCH_CALL(the_pointer = simd_packbits)
}
```

or:

```c
void the_callee(args)
{
    // ....
    NPY_CPU_DISPATCH_CALL(simd_packbits, (args))
}
```

depending on your situation and the code cost.
I believe we currently have the most stable, flexible, and advanced infrastructure for SIMD optimizations compared to the famous alternatives. Please ask me more questions; that will help us improve the documentation.
It is not a matter for documentation: this is a problem with the framework we have developed. It is too complicated for anyone but the most dedicated reviewers.
That may be true, but if we cannot review PRs that use it, and each PR for the new infrastructure brings with it new tools and dozens of new files, we will never be able to merge them.
I understand that. I am suggesting we only generate 2 loops to simplify the framework for now.
I am suggesting we only do dispatching with the ufunc and not with any other mechanism for now. Of course we can always go back and extend the optimizations in the future, but for now I think the current approach is too ambitious for NumPy. It is too hard to review PRs that not only add dozens of new intrinsic overrides but also new frameworks such as templates and complicated C macro systems. Let's focus on doing this one step at a time. Putting all the required intrinsic types into Once we have refactored the ufunc loops to use universal intrinsics, we can think about expanding the loops to use more than two loops, and expanding the use of intrinsics to other functions like packbits and einsum.
@mattip, That will reduce the level of complexity, since there will be no need for moving a large number of functions into In that case, the contributor must provide a benchmark based on a baseline build for each targeted optimization:

```
python setup.py build --cpu-baseline="avx512_skx" install
# benchmark avx512_skx
python setup.py build --cpu-baseline="avx2 fma3" install
# benchmark avx2 with fma3
python setup.py install
# benchmark the default baseline, e.g. x86_64 (sse sse2 sse3)
```

Well, that sounds good to me; I totally agree, and as you said, later we can dispatch them in a separate process.
I agree, but numpy#16782 is designed to reduce this issue, because any contributor must implement a testing unit for each new intrinsic; then the module will build and run any new intrinsic under a set of several SIMD extensions, e.g. on x86 (avx2, avx512f, avx512_skx), which allows testing the universal intrinsics that only work under the domain of the baseline. I want to see numpy#16782 get merged, but I guess you are right about what you said in the last meeting between us. Well, the new template engine (currently under experiment) will easily allow us to improve numpy#16782 in the way of
I agree with you. From my side, numpy#16960 is going to add a massive set of intrinsics that will cover all the important ones, especially the non-contiguous, de/interleaving, and half-precision operations.
I do not want to add a new template engine. If you really want to push that forward, it needs to be a completely separate effort, via a new PR (maybe even a different repo) with documentation and tests.
It's just a template engine that simplifies the use of Python; it doesn't bring any new syntax, just a few tags.
Sure, I will do that. I'm just testing it right now on PR numpy#16960 because I need to see it on a big example to imagine
I think we have moved beyond the discussion here, closing.
As part of PR numpy#13516, we added a CCompilerOpt class in numpy/distutils/ccompiler_opt.py. The class implements the compiler side of the NEP0038 SIMD optimizations, which handles the --cpu-baseline and --cpu-dispatch command-line arguments and generates #define macros. The code as-is should be refactored to reuse more of the distutils infrastructure where possible. In particular, the _Distutils class re-implements logging and compilation: tasks that should be delegated.