
ENH: Reduce overhead of configurable data allocation strategy (NEP49) #21531

Closed
wants to merge 3 commits

Conversation

eendebakpt
Contributor

This PR reduces the overhead introduced in #17582 when the default allocation strategy is used. Approaches to reducing the overhead are discussed in #21488.

Fixes #21488
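For context, a minimal sketch of the fast-path idea under discussion (illustrative only, not the actual diff in this PR; default_allocator_policy, current_handler, and get_handler_sketch are assumed/illustrative names):

#include <Python.h>

extern PyObject *PyDataMem_DefaultHandler;  /* NumPy's default handler capsule */
extern PyObject *current_handler;           /* NEP 49 context variable (name assumed) */

/* set at startup, cleared whenever a custom handler is installed */
static int default_allocator_policy = 1;

static PyObject *
get_handler_sketch(void)
{
    if (default_allocator_policy) {
        /* common case: the allocator was never changed, so return the
           default handler without the PyContextVar_Get lookup */
        Py_INCREF(PyDataMem_DefaultHandler);
        return PyDataMem_DefaultHandler;
    }
    /* slow path: fetch the handler stored in the context variable */
    PyObject *handler;
    if (PyContextVar_Get(current_handler, PyDataMem_DefaultHandler, &handler) < 0) {
        return NULL;
    }
    return handler;  /* new reference */
}

The benchmarks later in the thread measure exactly this: how much skipping PyContextVar_Get saves per call in the default configuration.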

@eendebakpt
Contributor Author

@seberg Only the documentation build is failing. I cannot figure out what is wrong (I am trying to add a reference to NEP 49).

@seberg
Member

seberg commented May 24, 2022

Hmmm, it looks right to me. The issue is just that we use intersphinx, so the link is broken until the NEPs get deployed.
@rossbar correct me if I am wrong, but I think we can just deploy with the test failure and it will sort itself out (at least with whatever PR gets merged after this one).

@seberg
Member

seberg commented May 24, 2022

The alternative would be to make a tiny PR that just adds the .. _NEP49: label. All NEPs should have that link in any case.
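(Concretely, that fix is a reStructuredText label such as

.. _NEP49:

followed by a blank line, placed directly above the title in the NEP 49 source file. With intersphinx configured, other docs can then cross-reference it, e.g. via :ref:`NEP49`; the exact reference syntax here is illustrative.)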

_PyDataMem_GetHandler_Internal(PyObject *mem_handler)
{
#if (!defined(PYPY_VERSION_NUM) || PYPY_VERSION_NUM >= 0x07030600)
    /* fast path: the default allocation policy is still active, so the
       context-variable lookup can be skipped */
    if (default_allocator_policy)
Member

Would it make sense to do PyDataMem_DefaultHandler == mem_handler here instead? That way the fast path keeps working even if the default has been changed (and restored) at some point.

Otherwise, I am OK with this (modulo possible style nitpicks), so long as @mattip doesn't mind the added complexity.
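A sketch of that suggested variant (illustrative; only the condition changes relative to the excerpt above):

    /* compare against the default-handler capsule instead of a separate flag,
       so the fast path also applies after the handler was changed and then
       restored to the default */
    if (mem_handler == PyDataMem_DefaultHandler) {
        /* take the fast path */
    }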

Contributor Author

@mattip I rebased to main. Could you review?

Member

@seberg left a comment

I think it's fine. I will be happy if anyone comes along to remove the micro-optimization again, but I am happy to trust that it is worthwhile for now.
I made two suggestions for long lines, and expanded the comment to explicitly note that it is a micro-optimization.

If no-one else comments, I will apply the suggestions and merge tomorrow (or so).

doc/source/reference/c-api/data_memory.rst (outdated)
numpy/core/src/multiarray/alloc.c (outdated)
eendebakpt and others added 2 commits May 30, 2022 21:25
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
@seberg
Member

seberg commented May 30, 2022

Sorry to come back to this after all. Could you check that these are truly worth it (e.g. show the benchmarks)?
While I don't hate the added complexity, we also have to draw the line somewhere, and right now I am not sure on which side of the line we actually stand. I was just trying to be a bit optimistic.

@eendebakpt
Contributor Author

Sorry to come back to this after all. Could you check that these are truly worth it (e.g. show the benchmarks)? While I don't hate the added complexity, we also have to draw the line somewhere, and right now I am not sure on which side of the line we actually stand. I was just trying to be a bit optimistic.

Fair enough. Some numbers are in issue #21488. I recall that the improvement was a little bit different between Windows and Linux, so I will try to run the benchmarks on both.

@eendebakpt
Contributor Author

@seberg Benchmarks on a Linux system for np.sqrt (other ufuncs are similar). I compare main to this branch (v2) and to the alternative option (v1, main...eendebakpt:default_allocator_path_v1).

np.sqrt on float
================

Mean +- std dev: [main] 608 ns +- 30 ns -> [v1] 593 ns +- 38 ns: 1.03x faster
Mean +- std dev: [main] 608 ns +- 30 ns -> [v2] 576 ns +- 31 ns: 1.06x faster

np.sqrt on array
================

Mean +- std dev: [main] 388 ns +- 9 ns -> [v1] 379 ns +- 14 ns: 1.02x faster
Mean +- std dev: [main] 388 ns +- 9 ns -> [v2] 366 ns +- 5 ns: 1.06x faster

Geometric mean
==============

v1: 1.02x faster
v2: 1.06x faster

Test script:

import pyperf

runner = pyperf.Runner()

# `setup` builds the namespace in which `stmt` is timed
runner.timeit(name="np.sqrt on float", stmt="np.sqrt(v)",
              setup="import numpy as np; v = 1.1")
runner.timeit(name="np.sqrt on array", stmt="np.sqrt(v)",
              setup="import numpy as np; v = np.array([1.1])")

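For reference, pyperf can produce the comparison above directly: run the script once per branch with -o to write JSON results, then compare (script and file names illustrative):

python bench_sqrt.py -o main.json    # on the main branch
python bench_sqrt.py -o v2.json      # on this branch
python -m pyperf compare_to main.json v2.json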
I also ran the numpy ufunc benchmarks. They show both speedups and slowdowns, but I believe these are just random fluctuations (I see them every time I run the benchmarks).
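(These ASV comparisons can be generated with numpy's benchmark runner, e.g. python runtests.py --bench-compare main bench_ufunc; treat the exact invocation as illustrative.)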

Benchmark results for ufunc
       before           after         ratio
     [c8de16e4]       [34e5e024]
     <main>           <default_allocator_path_v2>
+      72.0±0.2μs         82.6±2μs     1.15  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'H')
+      77.6±0.4μs       88.6±0.3μs     1.14  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'H')
+      72.4±0.3μs       82.5±0.9μs     1.14  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'h')
+      86.6±0.7μs       98.4±0.8μs     1.14  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'i')
+        92.8±1μs        105±0.8μs     1.13  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'l')
+         604±5μs         679±20μs     1.12  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'log'>, 4, 2, 'd')
+      99.2±0.7μs          110±2μs     1.11  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 2, 1, 'L')
+      71.7±0.3μs       79.5±0.3μs     1.11  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 1, 'h')
+         449±3μs          497±4μs     1.11  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp2'>, 4, 1, 'd')
+        95.3±1μs          105±1μs     1.10  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 1, 'Q')
+     1.36±0.02μs      1.49±0.07μs     1.09  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0))
+     1.42±0.03μs      1.54±0.08μs     1.08  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), axis=0, dtype=None))
+         153±3μs          165±4μs     1.08  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 1, 'L')
+     1.39±0.01μs      1.49±0.06μs     1.08  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.]), 0, None))
+      44.8±0.1μs       48.3±0.6μs     1.08  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 1, 'F')
+      45.0±0.2μs       48.3±0.5μs     1.07  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 2, 'F')
+         160±1μs          172±4μs     1.07  bench_ufunc_strides.Binary.time_ufunc('fmax', 2, 4, 4, 'f')
+      45.5±0.2μs       48.6±0.7μs     1.07  bench_ufunc_strides.AVX_cmplx_funcs.time_ufunc('absolute', 4, 'F')
+      71.8±0.2μs         76.4±4μs     1.06  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 1, 'h')
+     1.36±0.03μs      1.44±0.05μs     1.06  bench_ufunc.ArgParsingReduce.time_add_reduce_arg_parsing((array([0., 1.])))
-         131±2μs          125±1μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'L')
-      82.4±0.3μs       78.5±0.9μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 2, 1, 'd')
-     17.3±0.03μs       16.5±0.1μs     0.95  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 1, 'D')
-     1.11±0.02μs      1.06±0.01μs     0.95  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 100)
-         132±2μs        125±0.6μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 2, 'q')
-      74.4±0.4μs       70.7±0.3μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 1, 2, 'b')
-        74.5±4μs       70.8±0.1μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'B')
-     1.02±0.02μs          972±3ns     0.95  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.uint16'>, 100)
-         879±7ns         835±10ns     0.95  bench_ufunc.UFuncSmall.time_ufunc_python_float('cos')
-         301±6μs          286±8μs     0.95  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 4, 4, 'f')
-     1.74±0.03μs      1.65±0.03μs     0.95  bench_ufunc.Custom.time_and_bool
-        74.2±3μs       70.3±0.2μs     0.95  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'b')
-        832±10ns          786±6ns     0.94  bench_ufunc.Scalar.time_add_scalar_conv_complex
-        75.5±4μs       71.2±0.1μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'h')
-         867±5ns          817±9ns     0.94  bench_ufunc.UFuncSmall.time_ufunc_python_float('abs')
-        75.1±3μs       70.8±0.3μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 4, 2, 'b')
-         227±3μs          214±2μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 2, 1, 'f')
-     3.27±0.06μs      3.08±0.01μs     0.94  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)
-        75.6±4μs       71.1±0.1μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 2, 'h')
-         509±8ns        478±0.6ns     0.94  bench_ufunc.Custom.time_and_bool_small
-     1.09±0.02μs      1.03±0.01μs     0.94  bench_ufunc.CustomArrayFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 100)
-        787±20ns          739±1ns     0.94  bench_ufunc.Scalar.time_add_scalar_conv
-         173±3μs          162±4μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 2, 'd')
-     1.18±0.06ms      1.11±0.01ms     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'expm1'>, 2, 1, 'f')
-        76.7±4μs       71.8±0.2μs     0.94  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 4, 'h')
-        75.7±4μs       70.7±0.3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 1, 1, 'b')
-         125±1μs          117±3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 4, 2, 'i')
-         217±7μs        203±0.9μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'L')
-         847±9ns          791±2ns     0.93  bench_ufunc.UFuncSmall.time_ufunc_python_float('sqrt')
-      90.0±0.8μs         83.9±5μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 2, 1, 'B')
-        596±20ns          555±3ns     0.93  bench_ufunc.UFuncSmall.time_ufunc_small_int_array('abs')
-        84.6±2μs         78.7±1μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 4, 'f')
-      76.1±0.7μs       70.5±0.3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 2, 'b')
-        76.8±4μs       71.2±0.3μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 4, 2, 2, 'b')
-      76.6±0.1μs       70.9±0.5μs     0.93  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 1, 2, 'b')
-         742±8μs          686±9μs     0.92  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 4, 1, 'd')
-      96.0±0.7μs       88.3±0.9μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 1, 4, 2, 'i')
-         111±1μs        102±0.4μs     0.92  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 2, 4, 'i')
-         959±9ns          876±7ns     0.91  bench_ufunc.UFuncSmall.time_ufunc_small_int_array('sqrt')
-       296±0.6μs        270±0.9μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'cos'>, 2, 4, 'f')
-         117±1μs        107±0.3μs     0.91  bench_ufunc_strides.Binary.time_ufunc('fmin', 2, 1, 4, 'f')
-         222±9μs          203±2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 4, 1, 2, 'Q')
-         295±2μs          269±1μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 2, 4, 'f')
-         183±1μs          167±2μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 1, 'd')
-         102±5μs       93.3±0.4μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 2, 'i')
-         219±5μs          199±2μs     0.91  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'l')
-        91.3±7μs       83.0±0.9μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 1, 2, 'd')
-         113±4μs          102±2μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'f')
-         679±6ns          612±4ns     0.90  bench_ufunc.UFuncSmall.time_ufunc_small_array('cos')
-        279±10μs          251±6μs     0.90  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 4, 'd')
-     3.47±0.05μs      3.11±0.04μs     0.90  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
-         222±9μs          199±3μs     0.90  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 4, 2, 'L')
-      80.6±0.4μs       72.0±0.2μs     0.89  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 2, 1, 4, 'H')
-         246±7μs          219±5μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 4, 2, 'd')
-        175±10μs          155±3μs     0.89  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'exp'>, 1, 1, 'f')
-         166±1μs          146±1μs     0.88  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 4, 'f')
-         109±6μs         95.9±6μs     0.88  bench_ufunc_strides.BinaryInt.time_ufunc('maximum', 1, 2, 1, 'Q')
-       116±0.4μs        101±0.6μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'logical_not'>, 2, 1, 'f')
-      92.0±0.8μs       78.1±0.2μs     0.85  bench_ufunc_strides.BinaryInt.time_ufunc('minimum', 2, 4, 1, 'H')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
Benchmarks on my Windows system are too unstable to report.

@rossbar
Contributor

rossbar commented May 31, 2022

In both pyperf cases, there doesn't seem to be conclusive evidence that the changes improve performance, since the improvements are less than one standard deviation of the measured times. Given this, I'm -0.5 on adding extra complexity for what is seemingly no significant performance gain. OTOH I don't feel strongly about it, so if anyone else feels strongly that this is an improvement, I'm +1 as long as the comments pointing out the micro-optimization are clear (as @seberg mentioned).

@seberg
Member

seberg commented May 31, 2022

Just posting for posterity: the "scalar" benchmarks are probably the clearest measure of the best-case speedups some operations can get. (The rest tend to have too much noise.)

       before           after         ratio
     [c8de16e4]       [129340f9]
     <main>           <default_allocator_path_v2>
-      8.62±0.2μs       8.19±0.2μs     0.95  bench_scalar.ScalarMath.time_add_int32arr_and_other('complex256')
-     26.9±0.05μs       25.5±0.1μs     0.95  bench_scalar.ScalarMath.time_power_of_two('float32')
-      6.02±0.2μs      5.71±0.09μs     0.95  bench_scalar.ScalarMath.time_add_other_and_int32arr('int32')
-     13.3±0.03μs      12.6±0.08μs     0.95  bench_scalar.ScalarMath.time_add_int32_other('complex64')
-     13.3±0.06μs      12.6±0.06μs     0.95  bench_scalar.ScalarMath.time_add_int32_other('float32')
-     8.45±0.02μs       7.98±0.2μs     0.94  bench_scalar.ScalarMath.time_add_int32arr_and_other('int64')
-     3.79±0.01μs      3.58±0.01μs     0.94  bench_scalar.ScalarMath.time_addition_pyint('float16')
-      5.92±0.2μs      5.58±0.08μs     0.94  bench_scalar.ScalarMath.time_add_int32arr_and_other('int32')
-     10.8±0.08μs      10.2±0.06μs     0.94  bench_scalar.ScalarMath.time_add_int32arr_and_other('complex64')
-     11.2±0.05μs      10.5±0.09μs     0.94  bench_scalar.ScalarMath.time_add_other_and_int32arr('float16')
-      22.0±0.7μs       20.7±0.3μs     0.94  bench_scalar.ScalarMath.time_power_of_two('int32')
-     3.85±0.01μs      3.61±0.03μs     0.94  bench_scalar.ScalarMath.time_addition_pyint('float32')
-     3.87±0.02μs      3.63±0.02μs     0.94  bench_scalar.ScalarMath.time_addition_pyint('complex64')
-     8.53±0.03μs       8.00±0.2μs     0.94  bench_scalar.ScalarMath.time_add_other_and_int32arr('int64')
-     10.8±0.04μs       10.1±0.1μs     0.94  bench_scalar.ScalarMath.time_add_int32arr_and_other('float16')
-     3.26±0.01μs      3.06±0.01μs     0.94  bench_scalar.ScalarMath.time_addition_pyint('int16')
-        10.8±0μs       10.1±0.1μs     0.93  bench_scalar.ScalarMath.time_add_int32arr_and_other('float32')
-      8.91±0.1μs      8.28±0.01μs     0.93  bench_scalar.ScalarMath.time_add_other_and_int32arr('complex128')
-     13.3±0.05μs       12.3±0.2μs     0.93  bench_scalar.ScalarMath.time_add_int32_other('float16')
-     8.62±0.08μs       7.98±0.1μs     0.93  bench_scalar.ScalarMath.time_add_int32arr_and_other('complex128')
-     8.77±0.04μs       8.09±0.1μs     0.92  bench_scalar.ScalarMath.time_add_other_and_int32arr('float64')
-      11.3±0.2μs      10.4±0.09μs     0.92  bench_scalar.ScalarMath.time_add_other_and_int32arr('complex64')
-      21.9±0.3μs       20.2±0.2μs     0.92  bench_scalar.ScalarMath.time_power_of_two('int16')
-      27.0±0.4μs      24.9±0.05μs     0.92  bench_scalar.ScalarMath.time_power_of_two('complex64')
-     8.54±0.02μs       7.77±0.1μs     0.91  bench_scalar.ScalarMath.time_add_int32arr_and_other('float64')
-     8.67±0.04μs      7.74±0.04μs     0.89  bench_scalar.ScalarMath.time_add_int32arr_and_other('longfloat')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@charris
Member

charris commented May 31, 2022

I agree with @rossbar here; the gain looks to be marginal. I suppose scalars could be a bottleneck if used in a loop, but perhaps that could be fixed by refactoring the loop, as in the example below.
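To illustrate that last point (example code not from this PR): a Python-level loop over NumPy scalars pays the fixed per-call overhead, including the handler lookup, once per element, whereas a single vectorized call pays it once:

import numpy as np

values = np.linspace(0.0, 1.0, 10_000)

# scalar bottleneck: one ufunc call, with its fixed overhead, per element
slow = [np.sqrt(v) for v in values]

# refactored: a single vectorized call amortizes the fixed overhead
fast = np.sqrt(values)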

@seberg
Member

seberg commented May 31, 2022

OK, sounds like we should draw the line here for now then, so closing. I have to admit, at least hacking around the PyContextVar_Get also feels a bit much to me.
I hope you don't mind @eendebakpt; I was curious how far these micro-optimizations can be worthwhile and add up to real improvements. Maybe this is as far as we get easily for now.

@seberg closed this May 31, 2022
@eendebakpt
Contributor Author

@seberg The scalar benchmarks you posted earlier show real improvements of about 5% for scalar operations (and somewhat less for small arrays). But 5% is not a lot, so I am fine with closing this for now.
