ENH: [RFC] Improve performance of numpy operations involving scalars #24345
With a no-GIL Python on the horizon, we should be very careful about multithreading and race conditions when creating caches.
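To make this concern concrete: a cache shared between threads needs synchronization. Below is a minimal Python sketch of a lock-guarded freelist; all names here are hypothetical, and a real NumPy implementation would live in C.

```python
import threading

class ScalarArrayFreelist:
    """Hypothetical freelist guarded by a lock, so that under
    free-threaded (no-GIL) Python two threads cannot pop or push
    the same cached buffer at the same time."""

    def __init__(self, maxsize=4):
        self._items = []
        self._maxsize = maxsize
        self._lock = threading.Lock()

    def acquire(self):
        # Return a cached buffer if one is available, else None
        # (the caller then allocates a fresh array).
        with self._lock:
            return self._items.pop() if self._items else None

    def release(self, buf):
        # Hand a buffer back; drop it if the freelist is full.
        with self._lock:
            if len(self._items) < self._maxsize:
                self._items.append(buf)
```

Without the lock, two threads could pop the same cached array and then write different scalar values into the same buffer.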
This is an interesting idea, but like @mattip I'd worry greatly about doing this in a multi-threaded environment. I was not part of the mailing list discussion, but there would seem to be more general options. E.g.:

Note that there is more than 25% to be gained:
A free list for (simple!) scalar arrays does seem a bit clearer to me, though, if it helps a lot. There are three cases:
We could use scalar functions, but we don't really have them for everything, and we would need to tag them optionally on the ufunc/ArrayMethod to make that reasonable. Otherwise you would need some logic like: this ufunc is actually an operator, so if all inputs are scalars, call the operator instead. (I doubt this is fun to implement, but yeah, maybe possible.) In general ufuncs have a lot of things to do (array-ufunc dispatching, dtype dispatching/resolution, array-wrap, array-prepare, and probably some more). Scalars actually have practically nothing to do; they just check that the other scalar is known (it is here) and otherwise defer to the ufunc.
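The difference is visible from Python: NumPy scalar-with-scalar arithmetic stays in the lightweight scalarmath path and returns a scalar, while mixing in an array defers to the full ufunc machinery. This is only an illustrative check, not tied to this PR:

```python
import numpy as np

s = np.float64(1.5)
t = np.float64(2.5)

r1 = s + t                      # scalar + scalar: handled by scalarmath
r2 = s + np.array([1.0, 2.0])   # scalar + array: defers to np.add

print(type(r1))   # a numpy scalar type
print(type(r2))   # ndarray, produced by the ufunc
```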
The nice thing about improving On the
Yeah, agreed, although it might be a bit tricky to get the full speed. You could just re-use the math core (after custom promotion, which is OK; for most operators it's simple!); I suspect the slowdown would be small enough, but one tricky thing is that we actually have special integer loops to warn about overflows. Which brings me to: "can we do super-fast scalars in ufuncs?" Maybe, although it would add another layer, and we may need to use some care to improve the caching lookup if we keep adding these.
With such a caching layer, you actually can have a true scalar fast-path. The overhead of that should in practice be less than the

EDIT: However, there is still the point that Python operators need to defer, and the other issues. So not sure how useful the scalar fast path would be as a replacement for the scalar operators.
I like the caching based on type! Conceptually, it is then similar to the scalars having
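A minimal Python sketch of such a type-keyed cache (hypothetical names; the real thing would be C code inside the ufunc machinery, with the thread-safety and refcount caveats raised earlier in the thread):

```python
import numpy as np

_conversion_cache = {}   # maps type(scalar) -> cached 0-d array

def cached_convert(scalar):
    """Convert a scalar to a 0-d array, reusing a per-type cached
    array so dtype discovery and allocation happen only once per type.
    Sketch only: not thread-safe, and unsafe if a reference escapes."""
    key = type(scalar)
    arr = _conversion_cache.get(key)
    if arr is None:
        arr = np.asarray(scalar)        # full dtype discovery + allocation
        _conversion_cache[key] = arr
    else:
        arr[...] = scalar               # reuse buffer, dtype already known
    return arr
```

On a second call with the same Python type, only the cheap value assignment remains.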
Thanks for the feedback. I cannot entirely follow the discussion on the scalarmath and ufuncs, so I will focus on the idea of caching and freelists for now. Some more thoughts:
So to make the approach really fast, we also need to address the packing and dtype discovery as well. This can be done similarly to this PR, I guess.
So if we generate a scalar array with a certain dtype and add it to the freelist, we can only use it later if we either check that the dtype is still good, or update the dtype.
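That check could look like the hypothetical helper below, which only reuses a cached array when the dtype discovered for the new scalar matches (here `np.result_type` stands in for the real dtype discovery):

```python
import numpy as np

def try_reuse(cached, scalar):
    """Reuse `cached` for `scalar` only if the discovered dtype matches;
    otherwise fall back to a fresh conversion (hypothetical helper)."""
    wanted = np.result_type(scalar)
    if cached is not None and cached.dtype == wanted:
        cached[...] = scalar
        return cached, True             # reused from the freelist
    return np.asarray(scalar), False    # fresh allocation needed
```

A Python `float` hits the cached `float64` array, while an `int` scalar forces a new conversion.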
@mattip @mhvk @seberg Thanks for the feedback on the first prototype. Here is an update on the tests I performed with more general freelists. In scalar_freelists_clean there is a prototype of the freelist approach to increasing the performance of scalar and small-array operations. The performance improvement over main is good:
Benchmark script
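The collapsed benchmark script is not preserved in this excerpt; a minimal benchmark in the same spirit (assumed, not the author's exact script) could look like:

```python
import timeit
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # small numpy array
f = 1.1                          # Python float scalar
n = 100_000

# Time scalar + small-array and small-array + small-array operations,
# the cases the freelist prototype targets.
for label, stmt in [("f + x", "f + x"), ("x + x", "x + x")]:
    t = timeit.timeit(stmt, globals={"f": f, "x": x}, number=n)
    print(f"{label}: {t / n * 1e6:.3f} us per call")
```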
The performance is not yet at the level of scalar math operations, but it is significantly faster than current main. If we can also make improvements to the general "ufunc machinery" the relative gains of this approach would become even larger. Some notes:
Proposed new feature or change:
The performance of numpy operations involving scalars can be improved by adding a simple cache for some of the arrays generated in `convert_ufunc_arguments`. A prototype implementation is main...eendebakpt:scalar_no_alloc_proto. A benchmark of the prototype (`i` an integer, `f` a float and `x` a small numpy array):

Benchmark script
The basic ideas for this approach can be found in a mailing list thread: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/DPATA4IOVBSXC5VNY7IBYIKLXL2ZGI36/#DPATA4IOVBSXC5VNY7IBYIKLXL2ZGI36
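The conversion the proposal targets can be mimicked at the Python level; the temporary 0-d array below is exactly the allocation the prototype wants to cache (illustrative only, the real conversion happens in C):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# `1.1 + x` dispatches to np.add; internally the Python float is
# first packed into a temporary numpy array, roughly like this:
tmp = np.asarray(1.1)        # the expensive temporary allocation
result = np.add(tmp, x)

assert tmp.ndim == 0
assert np.array_equal(result, 1.1 + x)
```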
Notes:

- For a Python scalar such as a `float`, the scalar argument is converted internally to a numpy array with `PyArray_FromAny` in `convert_ufunc_arguments`. The allocation of this temporary numpy array is expensive, so in the prototype we keep track of the generated array and re-use it if possible.
- Running `1.1 + x` under valgrind (without this PR, and with the promotion state from NEP 50 set to `weak`) shows that about 25% of the runtime is spent in allocating and deallocating a temporary numpy array.
- `PyArray_FromAny` could perhaps be improved by using freelists of generated arrays. This is more complex, as it is unclear when generated arrays will become available again, so one has to keep a longer list of arrays and check reference counts. In the prototype from this PR there is a single array in the freelist, and it becomes free again after the ufunc execution is complete.
- An advantage of improving `PyArray_FromAny` is that for larger arrays `x, y, z` in the expression `np.sin((x + y) + z)` the intermediate result for `x + y` will be available again quickly, so one can avoid allocation and deallocation of larger arrays. The prototype here only deals with scalars.