ENH: Reduce overhead of configurable data allocation strategy (NEP49) #21488
Link to the mailing list discussion: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/YZ3PNTXZUT27B6ITFAD3WRSM3T3SRVK4/
My 2¢, here.
@seberg @mattip Thanks for the extensive responses. I must admit I still do not fully understand what the contexts are used for, but from your responses I gather they are needed and we would rather not change the public interface. I created two branches that reduce the overhead as long as the default allocator is used; if the user sets a custom allocator, the overhead penalty remains.
Benchmark on main:
v1:
v2:
The performance improvements seem larger than on my other system (Linux, Python 3.8), but on both systems the improvement is measurable and applies to all ufuncs. In principle we could also combine the two approaches. Is either of them acceptable?
Going to close this issue for now; it seems we have settled on not worrying about this. If anyone ever comes back here even though it's closed, maybe that will be a reason to reconsider ;).
Proposed new feature or change:
In NEP 49 a configurable allocator has been introduced in numpy (implemented in #17582). This mechanism introduces some overhead for operations on small arrays and scalars. A benchmark with `np.sqrt` shows that the overhead can be in the 5-10% range.

Benchmark details
We compare `fast_handler_test_compare` (numpy main with two performance-related PRs included) with `fast_handler_test` (the same, but with a hard-coded allocator).

Benchmark

Results of `fast_handler_test_compare`

Results of `fast_handler_test`
(allocator overhead removed)

The allocator is retrieved for every numpy array or scalar that is constructed, which matters for small arrays and scalars. The overhead arises in two places (a sketch of this hot path follows the list):

- In `PyDataMem_UserNEW` the allocator is retrieved via a `PyCapsule`, which performs some run-time checks.
- In `PyDataMem_GetHandler` there is a call to `PyContextVar_Get`, which is expensive.
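To make the first cost concrete, here is a minimal sketch of what the capsule-based lookup amounts to on each allocation; the struct and function names are simplified stand-ins, not the actual numpy source:

```c
#include <Python.h>

/* Simplified stand-in for the allocator part of PyDataMem_Handler;
 * the real struct lives in numpy/ndarraytypes.h. */
typedef struct {
    void *ctx;
    void *(*malloc)(void *ctx, size_t size);
} allocator_sketch;

/* Sketch of the slow path: every allocation must first unwrap the
 * handler from the PyCapsule stored on the array, and
 * PyCapsule_GetPointer re-validates the capsule name each time. */
static void *
user_new_sketch(PyObject *mem_handler_capsule, size_t nbytes)
{
    allocator_sketch *a = (allocator_sketch *)PyCapsule_GetPointer(
            mem_handler_capsule, "mem_handler");
    if (a == NULL) {
        return NULL;  /* name mismatch: error is already set */
    }
    return a->malloc(a->ctx, nbytes);
}
```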
The first item can be addressed by replacing the attribute `PyObject *mem_handler` in `PyArrayObject_fields` (which is currently a `PyCapsule`) by a `PyDataMem_Handler *` (unless this is exposed to the public API).
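A rough sketch of that replacement, assuming the `PyDataMem_Handler`/`PyDataMemAllocator` layout from `numpy/ndarraytypes.h`; the struct and function names here are illustrative, not the real `PyArrayObject_fields` definition:

```c
#include <Python.h>
#include <numpy/ndarraytypes.h>  /* PyDataMem_Handler, PyDataMemAllocator */

/* Illustrative array struct: the mem_handler slot holds the raw
 * handler pointer instead of a PyCapsule wrapping it. */
typedef struct {
    /* ... the other PyArrayObject_fields members ... */
    PyDataMem_Handler *mem_handler;  /* was: PyObject * (a PyCapsule) */
} array_fields_sketch;

/* Allocation becomes a direct indirect call: no capsule unwrapping,
 * no run-time name check. */
static void *
user_new_direct(array_fields_sketch *arr, size_t nbytes)
{
    PyDataMemAllocator *a = &arr->mem_handler->allocator;
    return a->malloc(a->ctx, nbytes);
}
```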
About the second item: `PyContextVar_Get` calls `_PyThreadState_GET` internally. So perhaps the allocator can depend on the thread? Maybe we can introduce a mechanism that skips this lookup if there is only a single allocator (e.g. when `PyDataMem_SetHandler` has never been called).
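One possible shape of such a fast path, sketched under the assumption that a module-level flag is flipped the first time `PyDataMem_SetHandler` installs a custom handler; all names below are hypothetical, not numpy's actual implementation:

```c
#include <Python.h>

/* Hypothetical module state; in this sketch PyDataMem_SetHandler
 * would set custom_handler_ever_set = 1 before installing anything. */
static int custom_handler_ever_set = 0;
static PyObject *default_handler_capsule;  /* created at module init */
static PyObject *handler_contextvar;       /* the NEP 49 context var */

static PyObject *
get_handler_sketch(void)
{
    if (!custom_handler_ever_set) {
        /* Fast path: PyDataMem_SetHandler was never called, so the
         * context variable can only hold the default handler and the
         * expensive PyContextVar_Get is skipped entirely. */
        Py_INCREF(default_handler_capsule);
        return default_handler_capsule;
    }
    /* Slow path: look up the per-context handler as today. */
    PyObject *handler = NULL;
    if (PyContextVar_Get(handler_contextvar, default_handler_capsule,
                         &handler) < 0) {
        return NULL;
    }
    return handler;
}
```

A subtlety of this design is that once any custom handler has been set, the flag stays on for the rest of the process, so it only helps the (common) case where nobody ever touches the allocator.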
@mattip As the author of NEP 49, can you comment on this?