
ENH: Reduce overhead of configurable data allocation strategy (NEP49) #21531

Closed
wants to merge 3 commits

Conversation

eendebakpt
Contributor

This PR reduces the overhead introduced in #17582 when the default allocation strategy is used. Approaches to reducing the overhead are discussed in #21488.

Fixes #21488
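For context, a minimal sketch of the fast-path idea under discussion (illustrative only, not the actual diff in this PR; default_allocator_policy, current_handler, and get_handler_sketch are assumed/illustrative names):

#include <Python.h>

extern PyObject *PyDataMem_DefaultHandler;  /* NumPy's default handler capsule */
extern PyObject *current_handler;           /* NEP 49 context variable (name assumed) */

/* set at startup, cleared whenever a custom handler is installed */
static int default_allocator_policy = 1;

static PyObject *
get_handler_sketch(void)
{
    if (default_allocator_policy) {
        /* common case: the allocator was never changed, so return the
           default handler without the PyContextVar_Get lookup */
        Py_INCREF(PyDataMem_DefaultHandler);
        return PyDataMem_DefaultHandler;
    }
    /* slow path: fetch the handler stored in the context variable */
    PyObject *handler;
    if (PyContextVar_Get(current_handler, PyDataMem_DefaultHandler, &handler) < 0) {
        return NULL;
    }
    return handler;  /* new reference */
}

The benchmarks later in the thread measure exactly this: how much skipping PyContextVar_Get saves per call in the default configuration.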

@eendebakpt
Contributor Author

@seberg Only the documentation build is failing. I cannot figure out what is wrong (I am trying to add a reference to NEP 49).

@seberg
Member

seberg commented May 24, 2022

Hmmm, it looks right to me. The issue is just that we use intersphinx, so the link is broken until the NEPs get deployed.
@rossbar correct me if I am wrong, but I think we can just deploy with the test failure and it will sort itself out (at least with whatever PR gets merged after this one).

@seberg
Member

seberg commented May 24, 2022

The alternative would be to make a tiny PR that just adds the .. _NEP49: label. All NEPs should have that link in any case.
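(Concretely, that fix is a reStructuredText label such as

.. _NEP49:

followed by a blank line, placed directly above the title in the NEP 49 source file. With intersphinx configured, other docs can then cross-reference it, e.g. via :ref:`NEP49`; the exact reference syntax here is illustrative.)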

_PyDataMem_GetHandler_Internal(PyObject *mem_handler)
{
#if (!defined(PYPY_VERSION_NUM) || PYPY_VERSION_NUM >= 0x07030600)
    /* fast path: the default allocation policy is still active, so the
       context-variable lookup can be skipped */
    if (default_allocator_policy)
Member

Would it make sense to do PyDataMem_DefaultHandler == mem_handler here instead? That way the fast path keeps working even if the default has been changed (and restored) at some point.

Otherwise, I am OK with this (modulo possible style nitpicks), so long as @mattip doesn't mind the added complexity.
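A sketch of that suggested variant (illustrative; only the condition changes relative to the excerpt above):

    /* compare against the default-handler capsule instead of a separate flag,
       so the fast path also applies after the handler was changed and then
       restored to the default */
    if (mem_handler == PyDataMem_DefaultHandler) {
        /* take the fast path */
    }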

Contributor Author

@mattip I rebased to main. Could you review?

Member

@seberg left a comment

I think it's fine. I will be happy if anyone comes along to remove the micro-optimization again, but I am happy to trust that it is worthwhile for now.
I made two suggestions for long lines, and expanded the comment to explicitly note that it is a micro-optimization.

If no-one else comments, I will apply the suggestions and merge tomorrow (or so).

doc/source/reference/c-api/data_memory.rst (outdated)
numpy/core/src/multiarray/alloc.c (outdated)
eendebakpt and others added 2 commits May 30, 2022 21:25
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@seberg
Member

seberg commented May 30, 2022

Sorry to come back to this after all. Could you check that these are truly worth it (e.g. show the benchmarks)?
While I don't hate the added complexity, we also have to draw the line somewhere, and right now I am not sure on which side of the line we actually stand. I was just trying to be a bit optimistic.

@eendebakpt
Contributor Author

Sorry to come back to this after all. Could you check that these are truly worth it (e.g. show the benchmarks)? While I don't hate the added complexity, we also have to draw the line somewhere, and right now I am not sure on which side of the line we actually stand. I was just trying to be a bit optimistic.

Fair enough. Some numbers are in issue #21488. I recall that the improvement was a little bit different between Windows and Linux, so I will try to run the benchmarks on both.

@eendebakpt
Contributor Author

@seberg Benchmarks on a Linux system for np.sqrt (other ufuncs are similar). I compare main to this branch (v2) and to the alternative option (v1, main...eendebakpt:default_allocator_path_v1).

np.sqrt on float
================

Mean +- std dev: [main] 608 ns +- 30 ns -> [v1] 593 ns +- 38 ns: 1.03x faster
Mean +- std dev: [main] 608 ns +- 30 ns -> [v2] 576 ns +- 31 ns: 1.06x faster

np.sqrt on array
================

Mean +- std dev: [main] 388 ns +- 9 ns -> [v1] 379 ns +- 14 ns: 1.02x faster
Mean +- std dev: [main] 388 ns +- 9 ns -> [v2] 366 ns +- 5 ns: 1.06x faster

Geometric mean
==============

v1: 1.02x faster
v2: 1.06x faster

Test script:

import pyperf

runner = pyperf.Runner()

# `setup` builds the namespace in which `stmt` is timed
runner.timeit(name="np.sqrt on float", stmt="np.sqrt(v)",
              setup="import numpy as np; v = 1.1")
runner.timeit(name="np.sqrt on array", stmt="np.sqrt(v)",
              setup="import numpy as np; v = np.array([1.1])")

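For reference, pyperf can produce the comparison above directly: run the script once per branch with -o to write JSON results, then compare (script and file names illustrative):

python bench_sqrt.py -o main.json    # on the main branch
python bench_sqrt.py -o v2.json      # on this branch
python -m pyperf compare_to main.json v2.json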
I also ran the numpy ufunc benchmarks. They show both speedups and slowdowns, but I believe these are just random fluctuations (I see them every time I run the benchmarks).
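(These ASV comparisons can be generated with numpy's benchmark runner, e.g. python runtests.py --bench-compare main bench_ufunc; treat the exact invocation as illustrative.)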

Benchmark results for ufunc
       before           after         ratio
     [c8de16e4]       [34e5e024]
     <main>           <default_allocator_path_v2>
+      72.0±0.2μs         82.6±2μs     1.15  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'H')
+      77.6±0.4μs       88.6±0.3μs     1.14  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'H')
+      72.4±0.3μs       82.5±0.9μs     1.14  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'h')
+      86.6±0.7μs       98.4±0.8μs     1.14  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'i')
+        92.8±1μs        105±0.8μs     1.13  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'l')
+         604±5μs         679±20μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log'>, 4, 2, 'd')
+      99.2±0.7μs          110±2μs     1.11  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'L')
+      71.7±0.3μs       79.5±0.3μs     1.11  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'h')
+         449±3μs          497±4μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 4, 1, 'd')
+        95.3±1μs          105±1μs     1.10  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'Q')
+     1.36±0.02μs      1.49±0.07μs     1.09  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0))
+     1.42±0.03μs      1.54±0.08μs     1.08  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), axis=0, dtype=None))
+         153±3μs          165±4μs     1.08  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'L')
+     1.39±0.01μs      1.49±0.06μs     1.08  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0, None))
+      44.8±0.1μs       48.3±0.6μs     1.08  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 1, 'F')
+      45.0±0.2μs       48.3±0.5μs     1.07  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 2, 'F')
+         160±1μs          172±4μs     1.07  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 4, 'f')
+      45.5±0.2μs       48.6±0.7μs     1.07  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 4, 'F')
+      71.8±0.2μs         76.4±4μs     1.06  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'h')
+     1.36±0.03μs      1.44±0.05μs     1.06  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.])))
-         131±2μs          125±1μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'L')
-      82.4±0.3μs       78.5±0.9μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 1, 'd')
-     17.3±0.03μs       16.5±0.1μs     0.95  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 1, 'D')
-     1.11±0.02μs      1.06±0.01μs     0.95  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 100)
-         132±2μs        125±0.6μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'q')
-      74.4±0.4μs       70.7±0.3μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'b')
-        74.5±4μs       70.8±0.1μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'B')
-     1.02±0.02μs          972±3ns     0.95  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 100)
-         879±7ns         835±10ns     0.95  bench_ufunc.UFuncSmall.time_ufunc_python_float('cos')
-         301±6μs          286±8μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 4, 'f')
-     1.74±0.03μs      1.65±0.03μs     0.95  bench_ufunc.Custom.time_and_bool
-        74.2±3μs       70.3±0.2μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'b')
-        832±10ns          786±6ns     0.94  bench_ufunc.Scalar.time_add_scalar_conv_complex
-        75.5±4μs       71.2±0.1μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'h')
-         867±5ns          817±9ns     0.94  bench_ufunc.UFuncSmall.time_ufunc_python_float('abs')
-        75.1±3μs       70.8±0.3μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'b')
-         227±3μs          214±2μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 2, 1, 'f')
-     3.27±0.06μs      3.08±0.01μs     0.94  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
-        75.6±4μs       71.1±0.1μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'h')
-         509±8ns        478±0.6ns     0.94  bench_ufunc.Custom.time_and_bool_small
-     1.09±0.02μs      1.03±0.01μs     0.94  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 100)
-        787±20ns          739±1ns     0.94  bench_ufunc.Scalar.time_add_scalar_conv
-         173±3μs          162±4μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 2, 'd')
-     1.18±0.06ms      1.11±0.01ms     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'expm1'>, 2, 1, 'f')
-        76.7±4μs       71.8±0.2μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'h')
-        75.7±4μs       70.7±0.3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'b')
-         125±1μs          117±3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'i')
-         217±7μs        203±0.9μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'L')
-         847±9ns          791±2ns     0.93  bench_ufunc.UFuncSmall.time_ufunc_python_float('sqrt')
-      90.0±0.8μs         83.9±5μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'B')
-        596±20ns          555±3ns     0.93  bench_ufunc.UFuncSmall.time_ufunc_small_int_array('abs')
-        84.6±2μs         78.7±1μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 4, 'f')
-      76.1±0.7μs       70.5±0.3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'b')
-        76.8±4μs       71.2±0.3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'b')
-      76.6±0.1μs       70.9±0.5μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'b')
-         742±8μs          686±9μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 4, 1, 'd')
-      96.0±0.7μs       88.3±0.9μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'i')
-         111±1μs        102±0.4μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'i')
-         959±9ns          876±7ns     0.91  bench_ufunc.UFuncSmall.time_ufunc_small_int_array('sqrt')
-       296±0.6μs        270±0.9μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 4, 'f')
-         117±1μs        107±0.3μs     0.91  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 4, 'f')
-         222±9μs          203±2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'Q')
-         295±2μs          269±1μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 4, 'f')
-         183±1μs          167±2μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 1, 'd')
-         102±5μs       93.3±0.4μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'i')
-         219±5μs          199±2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'l')
-        91.3±7μs       83.0±0.9μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 1, 2, 'd')
-         113±4μs          102±2μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'f')
-         679±6ns          612±4ns     0.90  bench_ufunc.UFuncSmall.time_ufunc_small_array('cos')
-        279±10μs          251±6μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 4, 'd')
-     3.47±0.05μs      3.11±0.04μs     0.90  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
-         222±9μs          199±3μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'L')
-      80.6±0.4μs       72.0±0.2μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'H')
-         246±7μs          219±5μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 2, 'd')
-        175±10μs          155±3μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 1, 1, 'f')
-         166±1μs          146±1μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 4, 'f')
-         109±6μs         95.9±6μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'Q')
-       116±0.4μs        101±0.6μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 1, 'f')
-      92.0±0.8μs       78.1±0.2μs     0.85  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'H')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
Benchmarks on my Windows system are too unstable to report.

@rossbar
Contributor

rossbar commented May 31, 2022

In both pyperf cases, there doesn't seem to be conclusive evidence that the changes improve performance, since the improvements are less than one standard deviation of the measured times. Given this, I'm -0.5 on adding extra complexity for what is seemingly no significant performance gain. OTOH I don't feel strongly about it, so if anyone else feels strongly that this is an improvement, I'm +1 as long as the comments pointing out the micro-optimization are clear (as @seberg mentioned).

@seberg
Member

seberg commented May 31, 2022

Just posting for posterity: the "scalar" benchmarks are probably the clearest measure of the best-case speedups some operations can get. (The rest tend to have too much noise.)

       before           after         ratio
     [c8de16e4]       [129340f9]
     <main>           <default_allocator_path_v2>
-      8.62±0.2μs       8.19±0.2μs     0.95  bench_scalar.ScalarMath.time_add_int32arr_and_other('complex256')
-     26.9±0.05μs       25.5±0.1μs     0.95  bench_scalar.ScalarMath.time_power_of_two('float32')
-      6.02±0.2μs      5.71±0.09μs     0.95  bench_scalar.ScalarMath.time_add_other_and_int32arr('int32')
-     13.3±0.03μs      12.6±0.08μs     0.95  bench_scalar.ScalarMath.time_add_int32_other('complex64')
-     13.3±0.06μs      12.6±0.06μs     0.95  bench_scalar.ScalarMath.time_add_int32_other('float32')
-     8.45±0.02μs       7.98±0.2μs     0.94  bench_scalar.ScalarMath.time_add_int32arr_and_other('int64')
-     3.79±0.01μs      3.58±0.01μs     0.94  bench_scalar.ScalarMath.time_addition_pyint('float16')
-      5.92±0.2μs      5.58±0.08μs     0.94  bench_scalar.ScalarMath.time_add_int32arr_and_other('int32')
-     10.8±0.08μs      10.2±0.06μs     0.94  bench_scalar.ScalarMath.time_add_int32arr_and_other('complex64')
-     11.2±0.05μs      10.5±0.09μs     0.94  bench_scalar.ScalarMath.time_add_other_and_int32arr('float16')
-      22.0±0.7μs       20.7±0.3μs     0.94  bench_scalar.ScalarMath.time_power_of_two('int32')
-     3.85±0.01μs      3.61±0.03μs     0.94  bench_scalar.ScalarMath.time_addition_pyint('float32')
-     3.87±0.02μs      3.63±0.02μs     0.94  bench_scalar.ScalarMath.time_addition_pyint('complex64')
-     8.53±0.03μs       8.00±0.2μs     0.94  bench_scalar.ScalarMath.time_add_other_and_int32arr('int64')
-     10.8±0.04μs       10.1±0.1μs     0.94  bench_scalar.ScalarMath.time_add_int32arr_and_other('float16')
-     3.26±0.01μs      3.06±0.01μs     0.94  bench_scalar.ScalarMath.time_addition_pyint('int16')
-        10.8±0μs       10.1±0.1μs     0.93  bench_scalar.ScalarMath.time_add_int32arr_and_other('float32')
-      8.91±0.1μs      8.28±0.01μs     0.93  bench_scalar.ScalarMath.time_add_other_and_int32arr('complex128')
-     13.3±0.05μs       12.3±0.2μs     0.93  bench_scalar.ScalarMath.time_add_int32_other('float16')
-     8.62±0.08μs       7.98±0.1μs     0.93  bench_scalar.ScalarMath.time_add_int32arr_and_other('complex128')
-     8.77±0.04μs       8.09±0.1μs     0.92  bench_scalar.ScalarMath.time_add_other_and_int32arr('float64')
-      11.3±0.2μs      10.4±0.09μs     0.92  bench_scalar.ScalarMath.time_add_other_and_int32arr('complex64')
-      21.9±0.3μs       20.2±0.2μs     0.92  bench_scalar.ScalarMath.time_power_of_two('int16')
-      27.0±0.4μs      24.9±0.05μs     0.92  bench_scalar.ScalarMath.time_power_of_two('complex64')
-     8.54±0.02μs       7.77±0.1μs     0.91  bench_scalar.ScalarMath.time_add_int32arr_and_other('float64')
-     8.67±0.04μs      7.74±0.04μs     0.89  bench_scalar.ScalarMath.time_add_int32arr_and_other('longfloat')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@charris
Member

charris commented May 31, 2022

I agree with @rossbar here; the gain looks to be marginal. I suppose scalars could be a bottleneck if used in a loop, but perhaps that could be fixed by refactoring the loop, as in the example below.
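To illustrate that last point (example code not from this PR): a Python-level loop over NumPy scalars pays the fixed per-call overhead, including the handler lookup, once per element, whereas a single vectorized call pays it once:

import numpy as np

values = np.linspace(0.0, 1.0, 10_000)

# scalar bottleneck: one ufunc call, with its fixed overhead, per element
slow = [np.sqrt(v) for v in values]

# refactored: a single vectorized call amortizes the fixed overhead
fast = np.sqrt(values)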

@seberg
Member

seberg commented May 31, 2022

OK, sounds like we should draw the line here for now then, so closing. I have to admit, at least hacking around the PyContextVar_Get also feels a bit much to me.
I hope you don't mind @eendebakpt; I was curious how far these micro-optimizations can be worthwhile and add up to real improvements. Maybe this is as far as we get easily for now.

@seberg closed this May 31, 2022
@eendebakpt
Contributor Author

@seberg The scalar benchmarks you posted earlier show real improvements of about 5% for scalar operations (and somewhat less for small arrays). But 5% is not a lot, so I am fine with closing this for now.
