MAINT: refactor _Distutils out of CCompiler_Opt #46
@mattip, Can we discuss it here? I guess yes :)
1. The number of loops is (the number of additional targeted optimizations) + the baseline. For example (pseudocode for a ufunc):

```c
void ufunc_inner_loop_AVX512F(args) {
    // AVX512F code
}
void ufunc_inner_loop_AVX2(args) {
    // AVX2 code
}
void ufunc_inner_loop(args) {
    // we tend not to add a suffix to the baseline function,
    // so it stays friendly to static assignment
}
```

Now, dispatch the highest-priority target at runtime during the load of the umath module:

```c
static ufunc_inner_pointer = ufunc_inner_loop; // baseline

void umath_module_load()
{
    // check the machine support and cache the result in memory
    npy_cpu_init();
    // ... other code
    // runtime dispatch
    if (NPY_CPU_HAVE(AVX512F)) {
        ufunc_inner_pointer = ufunc_inner_loop_AVX512F;
    } else if (NPY_CPU_HAVE(AVX2)) {
        ufunc_inner_pointer = ufunc_inner_loop_AVX2;
    }
    // ... other code
}
```
2. What's NPY__CPU_DISPATCH_CALL?

```c
// this is how it looks if we are targeting AVX512F and AVX2
#define NPY__CPU_DISPATCH_CALL(CHK, CB, ...) \
    NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX512F)), AVX512F, __VA_ARGS__)) \
    NPY__CPU_DISPATCH_EXPAND_(CB((CHK(AVX)&&CHK(F16C)&&CHK(AVX2)), AVX2, __VA_ARGS__))
```

The args are explained as follows:

3. What's the use of NPY__CPU_DISPATCH_CALL? Currently, we use it for two things: A) in declaring a prototype, see: B) in dispatching a call or an assignment, see
4. Yes, we already do this with the ufunc, as I explained in the first point. See:

```c
void simd_packbits_AVX512F(args) {
    // AVX512F code
}
void simd_packbits_AVX2(args) {
    // AVX2 code
}
void simd_packbits(args) {
    // baseline
}
```

Check and call:

```c
void the_callee(args)
{
#if 0
    // check the CPU support and cache the result in memory;
    // we call it one time during the load of the umath module
    npy_cpu_init();
#endif
    // runtime dispatch
    if (NPY_CPU_HAVE(AVX512F)) {
        simd_packbits_AVX512F(some_args);
    } else if (NPY_CPU_HAVE(AVX2)) {
        simd_packbits_AVX2(some_args);
    } else {
        simd_packbits(some_args);
    }
}
```

It will look bad only if you are going to deal with many targets, since we check the CPU support only one time. However, that's not an issue for the dispatcher; it gives you the ability to write either:
```c
void the_callee(args)
{
    // ....
    NPY_CPU_DISPATCH_CALL(the_pointer = simd_packbits)
}
```

or:

```c
void the_callee(args)
{
    // ....
    NPY_CPU_DISPATCH_CALL(simd_packbits, (args))
}
```

depending on your situation and the code cost.
I believe we currently have the most stable, flexible, and advanced infrastructure for SIMD optimizations compared to the famous alternatives. Please ask me more questions; that will help us improve the documentation.
It is not a matter for documentation: this is a problem with the framework we have developed. It is too complicated for anyone but the most dedicated reviewers.
That may be true, but if we cannot review PRs that use it, and each PR for the new infrastructure brings with it new tools and dozens of new files, we will never be able to merge them.
I understand that. I am suggesting we only generate 2 loops to simplify the framework for now.
I am suggesting we only do dispatching with the ufunc and not with any other mechanism for now. Of course we can always go back and extend the optimizations in the future, but for now I think the current approach is too ambitious for NumPy. It is too hard to review PRs that not only add dozens of new intrinsic overrides but also new frameworks such as templates and complicated C macro systems. Let's focus on doing this one step at a time. Putting all the required intrinsic types into Once we have refactored the ufunc loops to use universal intrinsics, we can think about expanding the loops to use more than two loops, and expanding the use of intrinsics to other functions like packbits and einsum.
@mattip, That will reduce the level of complexity, since there will be no need for moving a large number of functions into In that case, the contributor must provide a benchmark based on a baseline build for each targeted optimization:

```
python setup.py build --cpu-baseline="avx512_skx" install
# benchmark avx512_skx
python setup.py build --cpu-baseline="avx2 fma3" install
# benchmark avx2 with fma3
python setup.py install
# benchmark the default baseline, e.g. x86_64 (sse sse2 sse3)
```

Well, that sounds good to me; I totally agree, and as you said, later we can dispatch them in a separate process.
I agree, but numpy#16782 is designed to reduce this issue, because any contributor must implement a testing unit for each new intrinsic; then the module will build and run any new intrinsic under a set of several SIMD extensions, e.g. on x86 (avx2, avx512f, avx512_skx), which allows testing the universal intrinsics that only work under the domain of the baseline. I want to see numpy#16782 get merged, but I guess you are right about what you said in the last meeting between us. Well, the new template engine (currently under experiment) will easily allow us to improve numpy#16782 in the way of
I agree with you. From my side, numpy#16960 is going to add a massive set of intrinsics that will cover all the important ones, especially the non-contiguous, de/interleaving, and half-precision operations.
I do not want to add a new template engine. If you really want to push that forward, it needs to be a completely separate effort, via a new PR (maybe even a different repo) with documentation and tests.
It's just a template engine that simplifies the use of Python; it doesn't bring any new syntax, just a few tags.
Sure, I will do that. I'm just testing it right now on PR numpy#16960 because I need to see it on a big example to imagine
I think we have moved beyond the discussion here, closing.
As part of PR numpy#13516, we added a CCompilerOpt class in numpy/distutils/ccompiler_opt.py. The class implements the compiler side of the NEP0038 SIMD optimizations, which handles the --cpu-baseline and --cpu-dispatch command-line arguments and generates #define macros. The code as-is should be refactored to reuse more of the distutils infrastructure where possible. In particular, the _Distutils class re-implements logging and compilation: tasks that should be delegated.