
ENH: Reduce overhead of configurable data allocation strategy (NEP49) #21488

Closed
eendebakpt opened this issue May 10, 2022 · 5 comments

@eendebakpt
Contributor

Proposed new feature or change:

NEP 49 introduced a configurable data allocator in numpy (implemented in #17582). This mechanism adds some overhead to operations on small arrays and scalars; a benchmark with np.sqrt shows that the overhead can be in the 5-10% range.

Benchmark details

We compare fast_handler_test_compare (numpy main with two performance-related PRs included) with fast_handler_test (the same, but with a hard-coded allocator).

Benchmark

import numpy as np
import time
from numpy import sqrt

print(np.__version__)

w = np.float64(1.1)        # numpy scalar
wf = 1.1                   # python float
array = np.random.rand(2)  # small array

niter = 1_200_000

for kk in range(3):
    # time sqrt on a numpy scalar
    t0 = time.perf_counter()
    for ii in range(niter):
        _ = sqrt(w)
    dt = time.perf_counter() - t0
    # time sqrt on a python float
    t0 = time.perf_counter()
    for ii in range(niter):
        _ = sqrt(wf)
    dt2 = time.perf_counter() - t0
    # time sqrt on a small array
    t0 = time.perf_counter()
    for ii in range(niter):
        _ = sqrt(array)
    dt3 = time.perf_counter() - t0
    print(f'loop {kk}: {dt} {dt2} {dt3}')

Results of fast_handler_test_compare

1.23.0.dev0+1185.gf16125e86
loop 0: 0.7580233269982273 0.7543466200004332 0.5045701469971391
loop 1: 0.7591422369987413 0.7547550320014125 0.5020621660005418
loop 2: 0.7476994270000432 0.7537849910004297 0.5018936799970106

Results of fast_handler_test (allocator overhead removed)

1.23.0.dev0+1186.gbb76538a1
loop 0: 0.6839246829986223 0.6962255100006587 0.4676538419989811
loop 1: 0.6820040509992396 0.6967140100023244 0.468011064996972
loop 2: 0.6811004699993646 0.6971791299984034 0.4678809920005733

The allocator is retrieved for every numpy array or scalar constructed, which matters for small arrays and scalars. The overhead arises in two places:

  1. In methods like PyDataMem_UserNEW the allocator is retrieved via a PyCapsule, which performs some run-time checks.
  2. In PyDataMem_GetHandler there is a call to PyContextVar_Get, which is expensive.

The first item could be addressed by replacing the attribute PyObject *mem_handler in PyArrayObject_fields (currently a PyCapsule) with a PyDataMem_Handler* (unless this attribute is exposed in the public API).

About the second item: PyContextVar_Get calls _PyThreadState_GET internally, so perhaps the allocator could be made to depend on the thread? Alternatively, we could introduce a mechanism that skips this part if there is only a single allocator (e.g. when PyDataMem_SetHandler has never been called).
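To make that concrete, here is a minimal sketch of such a skip mechanism, assuming a module-level flag that PyDataMem_SetHandler would set the first time a custom handler is installed; all names (handler_ever_set, default_handler_capsule, current_handler_ctxvar, get_handler_with_fast_path) are illustrative, not NumPy's actual implementation:

#include <Python.h>

/* Illustrative globals (not NumPy's actual names). */
static int handler_ever_set = 0;            /* flipped by PyDataMem_SetHandler */
static PyObject *default_handler_capsule;   /* capsule wrapping the default PyDataMem_Handler */
static PyObject *current_handler_ctxvar;    /* the NEP 49 context variable */

static PyObject *
get_handler_with_fast_path(void)
{
    if (!handler_ever_set) {
        /* Fast path: no custom allocator was ever installed, so the
         * context variable can only hold the default handler. */
        Py_INCREF(default_handler_capsule);
        return default_handler_capsule;
    }
    /* Slow path: consult the context variable as NEP 49 prescribes. */
    PyObject *handler = NULL;
    if (PyContextVar_Get(current_handler_ctxvar, default_handler_capsule,
                         &handler) < 0) {
        return NULL;
    }
    return handler;
}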

@mattip As the author of NEP 49, can you comment on this?

@mattip
Member

mattip commented May 11, 2022

  1. The capsule is exposed via PyDataMem_GetHandler and PyDataMem_SetHandler (a small usage sketch follows after this list). We could contrive a way to reduce the overhead, at the expense of making the code more complicated.

  2. The need for PyContextVar_Get was discussed on the mailing list and summarized in this comment to the PR.
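For reference, a minimal sketch of what that exposure looks like from a C extension; the function name swap_handler_example and the custom_handler argument are illustrative, and the calls assume the NumPy C API has been initialized with import_array():

#include <Python.h>
#include <numpy/arrayobject.h>

/* Illustrative only: fetch the current handler capsule, then install a
 * caller-provided capsule (custom_handler) as the new allocation policy. */
static int
swap_handler_example(PyObject *custom_handler)
{
    /* PyDataMem_GetHandler returns a new reference to the current capsule. */
    PyObject *current = PyDataMem_GetHandler();
    if (current == NULL) {
        return -1;
    }
    Py_DECREF(current);

    /* PyDataMem_SetHandler installs the new policy and returns the
     * previously active capsule (or NULL on error). */
    PyObject *old = PyDataMem_SetHandler(custom_handler);
    if (old == NULL) {
        return -1;
    }
    Py_DECREF(old);
    return 0;
}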

@eendebakpt
Contributor Author

@seberg
Member

seberg commented May 11, 2022

My 2¢, here.

  1. About getting the struct:

    • IIRC the ABI is somewhat fixed on the array, so the array must store a capsule. But I might be OK with introducing new API and telling people that direct access is forbidden.
    • There was a reason for using a capsule: it is an object, and an object has reference counting. There was a thought that this might be helpful for cleanup. In principle it wouldn't have to be a capsule, but we have to consider the ABI if we change it now.
    • The slow part seems to be comparing the name mem_handler in the capsule? Maybe Python could add an == check on the pointer; that would remove most of that ;).
    • We could in principle add a fast path for handler == numpy_capsule, if it makes a reasonably sized dent on some benchmarks (a rough sketch of this idea follows after this list).
  2. I do think PyContextVar_Get is absolutely necessary. It should be possible to require that changes to the handler go through our API, so you probably could detect if a handler definitely was never used. Not sure how worthwhile that is. OTOH, probably effectively nobody uses custom handlers right now, so it may well be worth it.
    There might be better solutions, but I can't really think of them; maybe Python contexts could grow a "version" to quickly check that definitely nothing changed. But I doubt they have that right now, and I am not sure how that would work.
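A rough sketch of that pointer-identity fast path; default_handler, default_handler_capsule and handler_from_capsule are assumed names, and the capsule name "mem_handler" is also an assumption rather than a quote of NumPy's code:

#include <Python.h>
#include <numpy/arrayobject.h>   /* for PyDataMem_Handler */

/* Illustrative globals: the built-in handler and the capsule wrapping it. */
static PyDataMem_Handler default_handler;
static PyObject *default_handler_capsule;

static inline PyDataMem_Handler *
handler_from_capsule(PyObject *mem_handler)
{
    if (mem_handler == default_handler_capsule) {
        /* Fast path: pointer identity, avoids the string comparison of the
         * capsule name that PyCapsule_GetPointer performs. */
        return &default_handler;
    }
    /* Slow path: unpack the capsule, which validates its name. */
    return (PyDataMem_Handler *)PyCapsule_GetPointer(mem_handler, "mem_handler");
}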

@eendebakpt
Contributor Author

@seberg @mattip Thanks for the extensive responses. I must admit I still do not fully understand what the contexts are used for, but from your responses I gather they are needed and we would rather not change the public interface.

I created two branches that reduce the overhead as long as the default allocator is used. If the user sets a new allocator, the overhead penalty is still there.

  1. main...eendebakpt:default_allocator_path_v1 This is basically the suggestion from Sebastian to create a fast path for the default case.
  2. main...eendebakpt:default_allocator_path_v2 A more aggressive optimization that also skips checking the context if possible.

Benchmark on np.sqrt (not very accurate; this is on a Windows laptop):

main:

1.43758969999908
1.4361847000109265
1.307078099998762

v1:

1.2066571999894222
1.1520389999932377
1.2282870999915758

v2:

1.1243611999962013
1.082299100002274
1.0827838999975938

import time
import numpy as np

print(np)  # show which numpy build is being used

niter = 2_000_000
x = 1.1

for kk in range(3):
    t0 = time.perf_counter()
    for ii in range(niter):
        np.sqrt(x)
    dt = time.perf_counter() - t0
    print(dt)

The performance improvements seem larger here than on my other system (Linux, Python 3.8), but on both systems the improvement is measurable and applies to all ufuncs.

In principle we could also combine v1 and v2. That would give the performance of v2, with a smaller performance penalty if a user sets a new allocator in some (but not all) contexts. This seems a rare case to me, so I would prefer to continue with either v1 or v2.

Is either v1 or v2 a good approach to address the overhead? If so, I will make a PR with cleanups and updated documentation.

@seberg
Member

seberg commented Jun 11, 2022

Going to close this issue for now; it seems we have settled on not worrying about this for the moment. If anyone ever comes back here even though it's closed, maybe that will be a reason to reconsider ;).

@seberg seberg closed this as completed Jun 11, 2022