
ENH: [RFC] Improve performance of numpy operations involving scalars #24345

Open · eendebakpt opened this issue Aug 5, 2023 · 8 comments

@eendebakpt (Contributor)

Proposed new feature or change:

The performance of numpy operations involving scalars can be improved by adding a simple cache for some of the arrays generated in convert_ufunc_arguments. A prototype implementation is at main...eendebakpt:scalar_no_alloc_proto

A benchmark of the prototype (i is an integer, f a float, and x a small numpy array):

i + x: Mean +- std dev: [main] 688 ns +- 36 ns -> [pr] 522 ns +- 25 ns: 1.32x faster
f + x: Mean +- std dev: [main] 706 ns +- 57 ns -> [pr] 517 ns +- 21 ns: 1.36x faster
np.sin(2*np.pi*x + phi): Mean +- std dev: [main] 2.06 us +- 0.11 us -> [pr] 1.61 us +- 0.05 us: 1.28x faster
s + x: Mean +- std dev: [main] 723 ns +- 47 ns -> [pr] 742 ns +- 44 ns: 1.03x slower

Benchmark hidden because not significant (2): x + x, x**2

Geometric mean: 1.14x faster
Benchmark script
import pyperf
import numpy as np
#print(np, np.__version__)

np._set_promotion_state('weak')

setup="""
import numpy as np
np._set_promotion_state('weak')

x = np.array([1., 2., 3.])

i = 3
f = 1.1

s = np.float64(20.)
phi = np.pi/4
"""


runner = pyperf.Runner()
runner.timeit(name="i + x", stmt="_ = i + x", setup=setup)
runner.timeit(name="f + x", stmt="_ = f + x", setup=setup)
runner.timeit(name="np.sin(2*np.pi*x + phi)", stmt="_ = np.sin(2*np.pi*x + phi)", setup=setup)
runner.timeit(name="s + x", stmt="_ = s + x", setup=setup)
runner.timeit(name="x + x", stmt="_ = x + x", setup=setup)
runner.timeit(name="x**2", stmt="_ = x**2", setup=setup)

The basic ideas for this approach can be found in a mailing list thread: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/DPATA4IOVBSXC5VNY7IBYIKLXL2ZGI36/#DPATA4IOVBSXC5VNY7IBYIKLXL2ZGI36

Notes:

  • The performance gain from the prototype comes from avoiding the allocation and deallocation of a numpy array. When a ufunc receives a scalar argument such as a Python float, the scalar is converted internally
    to a 0-d numpy array with PyArray_FromAny in convert_ufunc_arguments. Allocating that temporary array is expensive, so the prototype keeps track of the generated array and re-uses it when possible (a
    pure-Python sketch of the idea follows this list).
  • Profiling 1.1 + x with valgrind (without this PR, and with the NEP 50 promotion state set to weak) gives:

[profiling screenshot: scalar_alloc]

Without this PR, about 25% of the runtime is spent allocating and deallocating a temporary numpy array.

  • The performance of PyArray_FromAny could perhaps be improved by using freelists of generated arrays. This is more complex, as it is unclear when generated arrays become available again, so one has to keep a longer list of arrays and check reference counts. In the prototype from this PR there is a single array in the freelist, and it becomes free after the ufunc execution is complete.
    An advantage of improving PyArray_FromAny is that for larger arrays x, y, z in the expression np.sin((x+y)+z) the intermediate result for x+y becomes available quickly, so one can avoid the allocation and deallocation of larger arrays. The prototype here only deals with scalars.
  • The prototype is not complete yet. Some open items are proper initialization, thread safety, and a fast path for numpy scalars.
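To make the idea concrete, here is a rough pure-Python analogue of the single-slot cache (the real prototype lives in C inside convert_ufunc_arguments; the names and the refcount check below are illustrative only, not the actual implementation):

import sys
import numpy as np

# One preallocated 0-d array that is re-used for scalar ufunc arguments.
_cached_scalar_array = np.empty((), dtype=np.float64)

def scalar_to_array(value):
    # getrefcount returns 2 when only the module-level name and the call
    # argument reference the cache, i.e. the previous user has released it.
    if sys.getrefcount(_cached_scalar_array) == 2:
        _cached_scalar_array[()] = value  # overwrite in place, no allocation
        return _cached_scalar_array
    # Cache still in use elsewhere: fall back to a fresh allocation.
    return np.asarray(value, dtype=np.float64)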
@eendebakpt eendebakpt changed the title ENH: Improve performance of numpy operations involving scalars ENH: [RFC] Improve performance of numpy operations involving scalars Aug 5, 2023
@mattip (Member) commented Aug 6, 2023

With no-GIL Python on the horizon, we should be very careful about multithreading and race conditions when creating caches.

@mhvk (Contributor) commented Aug 6, 2023

This is an interesting idea, but like @mattip I'd worry greatly about doing this in a multi-threading environment. I was not part of the mailing list discussion, but there would seem to be more general options. E.g.,

  1. Ensure a fast path that calls the ufunc inner loop with appropriate ndim directly for the scalar-only case.
  2. Have a special fast path that calls the corresponding scalar math function (these already exist! which is why f64 + f64 is far faster than np.add(f64, f64); see below). I think this would be harder (not possible?) as it would require adjustments to the ufunc to carry a pointer to the scalar math function.
  3. Ensure that at the C level the scalars look like an array sufficiently so that they can be used inside ufuncs like arrays (effectively, they'd need to be an array subclass for that; no idea how hard that is, but they clearly share a lot of machinery already; and really we should just have array scalars...).

Note that there is more than 25% to be gained:

f64 = np.float64(1.)
%timeit f64 + f64
55.1 ns ± 0.155 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%timeit np.add(f64, f64)
793 ns ± 2.27 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@seberg (Member) commented Aug 6, 2023

A free list for (simple!) scalar arrays does seem a bit clearer to me, though, if it helps a lot. PyArray_FromAny definitely has potential for improvement also in the exact-array input path; the scalar input path is pretty bad and there is a lot of room for improvement.

There are three cases:

  1. All inputs are scalars, you could in principle call the ufunc inner-loop directly on that.
  2. Some inputs are scalars, but the iteration is "trivial". This can also work, but you need to drag around the "this is a scalar" information.
  3. General case: uses NpyIter, so you definitely need an array object.

We could use scalar functions, but we don't really have them for everything and we need to tag them optionally on the ufunc/ArrayMethod to make that reasonable. Otherwise you would need some logic like: this ufunc is actually an operator, if all inputs are scalars call the operator instead. (I doubt this is fun to implement, but yeah maybe possible.)

In general ufuncs have a lot of things to do (array-ufunc dispatching, dtype dispatching/resolution, array-wrap, array-prepare, and probably some more). Scalars have practically nothing to do; they just check that the other scalar is known (it is here) and otherwise defer to the ufunc.
So I would agree that a = np.array(1.); %timeit a + a can probably be improved, but if we get the speed to be basically identical for scalars, that would already be pretty good!
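For anyone wanting to reproduce this comparison locally, here is a minimal timing snippet (numbers will of course vary by machine and numpy version):

import timeit
import numpy as np

a = np.array(1.)    # 0-d array: goes through the full ufunc machinery
f = np.float64(1.)  # numpy scalar: has a dedicated scalarmath fast path

n = 1_000_000
print("a + a:", timeit.timeit("a + a", globals=globals(), number=n) / n)
print("f + f:", timeit.timeit("f + f", globals=globals(), number=n) / n)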

@mhvk (Contributor) commented Aug 6, 2023

The nice thing about improving PyArray_FromAny is that it benefits everything else in numpy too...

On the ufunc side, I agree that most benefit is probably to be had from ensuring that anything that can be handled trivially goes through a fast path. (I also dislike connecting to scalarmath; the reverse, using ufunc inner loops also for scalar math, would seem nicer; the overhead of doing an inner loop of size 1 should not be much compared to going in and out of Python -- but I may well be wrong!)

@seberg (Member) commented Aug 8, 2023

the reverse, using ufunc inner loops also for scalar math would seem nicer

Yeah, agreed, although it might be a bit tricky to get the full speed. You could just re-use the math core (after custom promotion, which is OK; for most operators it's simple!); I suspect the slowdown would be small enough, but one tricky thing is that we actually have special integer loops to warn about overflows.
We could also re-use the ufunc promotion, and for re-using ufunc loops directly that might be good. But it would need some smart caching...

Which brings me to: "can we do super-fast scalars in ufuncs"? Maybe, although it would add another layer, and we may need to use some care to improve the caching lookup if we keep adding these.
The point being that we could add a caching layer based only on input types (not dtypes) to decide:

  • Which type to dispatch to for __array_ufunc__
  • Whether to use a scalar-only path.
    (The two things are really very closely related, since you could see a scalar special path just as an __array_ufunc__ implementation for scalars.)

With such a caching layer (sketched below), you can actually have a true scalar fast path. The overhead of that should in practice be less than the __array_ufunc__ overhead, TBH.

EDIT: However, there is still the point that Python operators need to defer, and the other issues. So I am not sure how useful the scalar fast path would be as a replacement for the scalar operators.
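A minimal sketch of what such a type-keyed caching layer could look like in pure Python (hypothetical names, not NumPy API; the real thing would live in the C dispatch code):

import numpy as np

# Cache the dispatch decision per tuple of input *types* (not dtypes):
# the decision is computed once per type combination, then re-used.
_dispatch_cache = {}

def choose_path(*inputs):
    key = tuple(type(x) for x in inputs)
    path = _dispatch_cache.get(key)
    if path is None:
        if all(type(x) in (int, float, complex) or isinstance(x, np.generic)
               for x in inputs):
            path = "scalar"    # scalar-only fast path
        elif any(hasattr(type(x), "__array_ufunc__")
                 and not issubclass(type(x), np.ndarray) for x in inputs):
            path = "override"  # defer to the type's __array_ufunc__
        else:
            path = "generic"   # full ufunc machinery
        _dispatch_cache[key] = path
    return path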

@mhvk (Contributor) commented Aug 8, 2023

I like the caching based on type! Conceptually, it is then similar to the scalars having __array_ufunc__ themselves, and it speeds up overrides for everything else too.

@eendebakpt (Contributor, Author)

Thanks for the feedback. I cannot entirely follow the discussion on scalarmath and ufuncs, so I will focus on the idea of caching and freelists for now. Some more thoughts:

  • We cannot use freelists of large arrays because of the extra memory they would require, but we can use them for scalar arrays. Does it make sense to add freelists for "small" arrays as well? And if so, what would be a threshold for the size: 10, 100, 1000 elements?
  • If we use freelists, we can avoid memory allocation and deallocation, and perhaps object initialization. Currently, for PyArray_FromAny on a float this is the profiling result:

[profiling screenshot: py_array_from_any]

So to make the approach really fast, we need to address packing and dtype discovery as well. This can be done similarly to this PR, I guess.

  • A numpy scalar array is not immutable. We can change the dtype (and perhaps more?):
import numpy as np

y = np.array(1.1, dtype=np.float64)
print(f'{y}, y.itemsize {y.itemsize}, y.size {y.size}, y.dtype {y.dtype}')

# reinterpret the underlying 8 bytes as int64 (dtype assignment mutates in place)
y.dtype = np.dtype('int')
print(f'{y}, y.itemsize {y.itemsize}, y.size {y.size}, y.dtype {y.dtype}')

# write a new value into the existing buffer, viewed as int
y.setfield(2, np.dtype('int'))
print(f'{y}, y.itemsize {y.itemsize}, y.size {y.size}, y.dtype {y.dtype}')

# write a float back into the same buffer
y.setfield(2.5, np.dtype('float64'))
print(f'{y}, y.itemsize {y.itemsize}, y.size {y.size}, y.dtype {y.dtype}')

So if we generate a scalar array with a certain dtype and add it to the freelist, we can only use it later if we either check that the dtype is still correct, or update the dtype.
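In pure-Python pseudocode the reuse check could look roughly like this (hypothetical names; the prototype does this in C):

import numpy as np

_freelist = []  # cached 0-d scalar arrays

def array_from_freelist(dtype):
    if _freelist:
        arr = _freelist.pop()
        # The dtype may have been changed since the array was cached,
        # so either verify that it is still correct...
        if arr.dtype == dtype:
            return arr
        # ...or rewrite it in place when the buffer size still matches.
        if arr.dtype.itemsize == dtype.itemsize:
            arr.dtype = dtype
            return arr
    return np.empty((), dtype=dtype)  # freelist empty or unusable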

@eendebakpt (Contributor, Author)

@mattip @mhvk @seberg Thanks for the feedback on the first prototype. Here is an update on the tests with more general freelists I performed.

In scalar_freelists_clean there is a prototype of the freelist approach to increasing the performance of scalar and small-array operations. The performance improvement over main is good:

np.sin(f): Mean +- std dev: [main] 672 ns +- 44 ns -> [pr3] 431 ns +- 23 ns: 1.56x faster
f + x: Mean +- std dev: [main] 700 ns +- 65 ns -> [pr3] 534 ns +- 20 ns: 1.31x faster
np.sin(s): Mean +- std dev: [main] 688 ns +- 52 ns -> [pr3] 515 ns +- 42 ns: 1.34x faster
np.sin(2*np.pi*x + phi): Mean +- std dev: [main] 2.03 us +- 0.12 us -> [pr3] 1.73 us +- 0.08 us: 1.18x faster
s + x: Mean +- std dev: [main] 726 ns +- 50 ns -> [pr3] 652 ns +- 46 ns: 1.11x faster
x + x: Mean +- std dev: [main] 728 ns +- 45 ns -> [pr3] 650 ns +- 42 ns: 1.12x faster
f + x1000: Mean +- std dev: [main] 2.36 us +- 0.16 us -> [pr3] 2.01 us +- 0.08 us: 1.18x faster
i + x: Mean +- std dev: [main] 698 ns +- 41 ns -> [pr3] 678 ns +- 63 ns: 1.03x faster

Benchmark hidden because not significant (1): x**2

Geometric mean: 1.19x faster

Benchmark script
import pyperf
import numpy as np
#print(np, np.__version__)

np._set_promotion_state('weak')

setup="""
import numpy as np
np._set_promotion_state('weak')

x = np.array([1., 2., 3.])
x1000 = np.arange(1000)
i = 3
f = 1.1

s = np.float64(20.)
phi = np.pi/4
"""

runner = pyperf.Runner()
runner.timeit(name="f + x", stmt="_ = f + x", setup=setup)
runner.timeit(name="np.sin(f)", stmt="_ = np.sin(f)", setup=setup)
runner.timeit(name="np.sin(s)", stmt="_ = np.sin(s)", setup=setup)
runner.timeit(name="i + x", stmt="_ = i + x", setup=setup)
runner.timeit(name="np.sin(2*np.pi*x + phi)", stmt="_ = np.sin(2*np.pi*x + phi)", setup=setup)
runner.timeit(name="s + x", stmt="_ = s + x", setup=setup)
runner.timeit(name="x + x", stmt="_ = x + x", setup=setup)
runner.timeit(name="x**2", stmt="_ = x**2", setup=setup)
runner.timeit(name="f + x1000", stmt="_ = f + x1000", setup=setup)

The performance is not yet at the level of scalar math operations, but it is significantly faster than current main. If we can also make improvements to the general "ufunc machinery", the relative gains of this approach would become even larger.

Some notes:

  • Compared to main...eendebakpt:scalar_no_alloc_proto this approach is a bit more complicated, but the performance gain is larger because not only the ufunc inputs are covered, but also part of the output.

  • The prototype is not thread safe, and the implementation should be improved (in particular: static allocation of some structures). I will probably wait for a generic approach to threading and static allocations that might have to be developed for numpy with multiple interpreters and (possibly) no-GIL in the near future.

  • The entry point for the freelist is PyArray_NewFromDescrAndBase. This covers array allocation (including size 0), but does not cover the allocation in PyArray_Scalar, as that uses a direct call to tp_alloc. For PyArray_Scalar we could perhaps do something similar, but the overhead there is smaller, so the possible gains are smaller as well.

  • With this approach other elements of the computation become the bottleneck. Some profiling shows the computation costs are spread over about 7 different components, each taking a similar amount of time (try_trivial_single_output_loop is the only one that is really bigger):

[profiling screenshot: freelist_np_sin_of_float]

  • In this prototype we add elements to the freelist at allocation time and check whether the reference count is 1 before re-using an element. Another approach is to add elements to the freelist from the destructor of the numpy array; I believe the performance gain would be similar.

  • The microbenchmarks only tell part of the story, as in them a free array can be used for almost 100% of the allocations. For real-world applications there could be scenarios where the fraction of successful allocations from the freelist is lower.

  • The performance gain in the prototype is due to two elements: fast paths for float and int in dtype discovery (e.g. skipping PyArray_DiscoverDTypeAndShape), and less allocation and deallocation. The allocation itself could perhaps also be made faster. Profiling results on main (section with the call to PyArray_FromAny):
    PyType_GenericAlloc is a direct call into CPython, so there is probably not much to be gained there. For PyDataMem_GetHandler there was a rejected PR, ENH: Reduce overhead of configurable data allocation strategy (NEP49) #21531, with a small speedup. PyDataMem_UserNEW spends 40% of its time finding the right handler via PyCapsule_GetPointer; the remainder is in allocations. PyArray_DiscoverDTypeAndShape can be specialized for float and int (done in the prototype; a Python sketch of such a fast path follows below). PyArray_Pack is quite general; for common types (e.g. float) we could probably specialize it. Profiling:

[profiling screenshot: main_scalar_plus_small_array]
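As an illustration of the kind of specialization meant here, a Python sketch of a discovery fast path (exact-type checks, hypothetical names; the prototype does this in C before falling back to the general code):

import numpy as np

def discover_scalar_dtype(obj):
    # Fast path: exact type checks resolve the dtype (and 0-d shape)
    # immediately for the common Python scalar types, skipping the
    # generic recursive discovery.
    if type(obj) is float:
        return np.dtype(np.float64)
    if type(obj) is int:
        return np.dtype(int)  # the platform default integer
    # Fall back to the general machinery (PyArray_DiscoverDTypeAndShape).
    return None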
