
ENH: Configurable allocator #17582

Merged
merged 83 commits into numpy:main from mattip:configurable_allocator on Oct 25, 2021

Conversation


@mattip mattip commented Oct 19, 2020

Fixes gh-17467. Adds a public struct, PyDataMem_Handler, to hold memory manipulation routines, plus two new API functions: PyDataMem_SetHandler to replace the current routines with new ones, and PyDataMem_GetHandlerName to get the string name of the current routines (either globally or for a specific ndarray object). This also changes the size of the ndarray object so it can hold the PyDataMem_Handler that was active when it was created, so that subsequent actions on its data memory remain consistent.
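For orientation, here is a minimal sketch of how such a handler might be defined and installed. The field order follows the struct quoted later in this thread; the function signatures, the header, and the exact PyDataMem_SetHandler prototype are assumptions and may differ from what is finally merged.

#include <stdlib.h>
#include <numpy/arrayobject.h>   /* assumed to declare PyDataMem_Handler in this PR */

/* Hypothetical replacement routines; the signatures are assumptions. */
static void *my_alloc(size_t size)                  { return malloc(size); }
static void *my_zeroed_alloc(size_t n, size_t sz)   { return calloc(n, sz); }
static void  my_free(void *ptr, size_t size)        { (void)size; free(ptr); }
static void *my_realloc(void *ptr, size_t new_size) { return realloc(ptr, new_size); }

static PyDataMem_Handler my_handler = {
    "my_handler",       /* name reported by PyDataMem_GetHandlerName */
    my_alloc,
    my_zeroed_alloc,
    my_free,
    my_realloc,
    /* the memcpy-style fields stay NULL: they are not implemented in this PR */
};

static void install_handler(void)
{
    /* Replace the global routines; keep the old handler so it can be restored. */
    const PyDataMem_Handler *old = PyDataMem_SetHandler(&my_handler);
    /* ... arrays created from here on allocate their data via my_handler ... */
    PyDataMem_SetHandler(old);
}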

Tests and documentation are included. Along the way, I found some places in the code where the current policy is inconsistent (all data memory handling should have gone through npy_*_cache, not PyDataMem_*), so even if this is rejected it might improve the cache handling.

The PyDataMem_Handler has fields to override memcpy, but these are currently not implemented: memcpy in the code base is untouched. I think this PR is invasive enough as-is; if desired, memcpy can be handled in a follow-up PR.

Once the general idea undergoes some review, this should hit the mailing list.

@mattip changed the title from "Configurable allocator" to "ENH: Configurable allocator" on Oct 19, 2020
@@ -384,7 +384,8 @@ array_data_set(PyArrayObject *self, PyObject *op)
}
if (PyArray_FLAGS(self) & NPY_ARRAY_OWNDATA) {
PyArray_XDECREF(self);
PyDataMem_FREE(PyArray_DATA(self));
Member Author (mattip):
This should have been npy_free_cache

Member:

I think this raises a possible risk in these changes - as I understand it, it didn't matter if the npy_*_cache functions were used for the allocation and not the free, or vice versa; everything would behave correctly, it would just affect the cache capacity.

With the changes in this PR, either:

  • We make the same promise to user hooks: the memory you allocate may be freed by someone else, or vice versa
  • We need to audit every free in our code (and likely accept bug reports for a couple of releases for segfaults due to the ones we miss)

I suppose as long as our default allocator is happy to accept raw mallocs and frees, this won't actually result in a regression.

numpy/core/src/multiarray/item_selection.c (resolved)
@@ -1991,7 +1991,8 @@ array_setstate(PyArrayObject *self, PyObject *args)
}

if ((PyArray_FLAGS(self) & NPY_ARRAY_OWNDATA)) {
PyDataMem_FREE(PyArray_DATA(self));
Member Author (mattip):

The memory handling in this file ignored npy_*_cache

Member:

Eh, github showed my comment twice, but it was only there once...

After lots of debugging (first with valgrind, but that only showed me that size 1 (or 0) allocations were added to the cache as larger ones), and then some annoyingly slow and careful stepping through with gdb:

At this point the size of the array is already modified, so freeing (and adding to the cache) just doesn't work unless you calculate the size earlier (and maybe move the free to an earlier point). This function is a bit weird; maybe it makes sense to NULL data right away? It's likely still buggy, but at least we can't run into a double free then.
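A hypothetical sketch of that reordering (simplified, not the actual array_setstate code): compute the byte count while the old shape is still intact, free through the matching cache routine, and NULL the data pointer right away so a later error path cannot free it twice.

/* Simplified illustration only -- not the real array_setstate body. */
if (PyArray_FLAGS(self) & NPY_ARRAY_OWNDATA) {
    /* Compute the byte count *before* the shape/descr are overwritten,
       so the cache bookkeeping sees the size that was actually allocated. */
    npy_intp nbytes = PyArray_NBYTES(self);
    npy_free_cache(PyArray_DATA(self), nbytes);
    /* NULL the data pointer immediately to rule out a double free on
       any later error path. */
    ((PyArrayObject_fields *)self)->data = NULL;
}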

# so this "if" will be false, preventing infinite recursion
#
# if needed, debug this by setting extra_argv=['-vvx']
np.core.test(verbose=0, extra_argv=[])
Member Author (mattip):

This call will run all the np.core tests with the new memory routines. Since the new routine allocates memory but returns a pointer shifted by 64 bytes, glibc will segfault on freeing the wrong address if the alloc/free calls are mismatched.
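For illustration, a debug allocator along those lines might look like this sketch (not the exact test code from the PR): every allocation is padded and the returned pointer is offset, so a free that did not go through the matching routine hands glibc an address it never returned.

#include <stdlib.h>
#include <string.h>

#define SHIFT 64   /* pad every allocation and hand out an offset pointer */

static void *shift_alloc(size_t size)
{
    char *real = malloc(size + SHIFT);
    if (real == NULL) {
        return NULL;
    }
    memcpy(real, "shifted", 8);   /* marker visible in a debugger */
    return real + SHIFT;          /* the caller only ever sees this pointer */
}

static void shift_free(void *ptr, size_t size)
{
    (void)size;
    if (ptr != NULL) {
        /* Only valid if ptr came from shift_alloc; a plain free(ptr)
           elsewhere passes glibc an address it never returned and
           crashes, which is exactly what the test is meant to catch. */
        free((char *)ptr - SHIFT);
    }
}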

numpy/testing/_private/extbuild.py (resolved)
numpy/core/include/numpy/ndarraytypes.h (resolved)

.. code-block:: c

typedef struct {
Member:

It would be good to reference the builtin PyMemAllocatorEx somewhere in this section and compare the two (https://docs.python.org/3/c-api/memory.html#customize-memory-allocators).
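For comparison, CPython's hook is a single PyMemAllocatorEx struct with a ctx pointer, installed per domain rather than attached to individual data buffers. A minimal example using that (real, documented) API, forwarding the raw domain's malloc as a place to add counting or tracing:

#include <Python.h>

static PyMemAllocatorEx orig_raw;   /* the allocator we are wrapping */

static void *forwarding_malloc(void *ctx, size_t size)
{
    (void)ctx;   /* the original ctx is forwarded through orig_raw below */
    /* counting/tracing could go here */
    return orig_raw.malloc(orig_raw.ctx, size);
}

static void install_hook(void)
{
    PyMem_GetAllocator(PYMEM_DOMAIN_RAW, &orig_raw);
    PyMemAllocatorEx hook = orig_raw;   /* keep ctx, calloc, realloc, free */
    hook.malloc = forwarding_malloc;    /* intercept only malloc here */
    PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &hook);
}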

Member Author (mattip):

Will add. The original npy_*_cache set of functions uses the PyDataMem_* functions, which it seems predate these more sophisticated interfaces and directly use malloc/calloc/free.

Member Author (mattip):

Documented. The Python ones do not use our non-documented-but-public PyDataMem_EventHookFunc callbacks. So I could see a path where we deprecate PyDataMem_EventHookFunc and move to the Python memory management strategies (a sketch of the existing hook API follows the list below), although that would mean:

  • if someone by chance implemented a PyDataMem_EventHookFunc callback it would no longer work.
  • in order to override data allocations, a user would also have to override other, unrelated memory allocations
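For reference, the existing hook mentioned above looks roughly like the following; the calling convention is documented in the NumPy C-API, and the tracing hook here is only illustrative.

#include <stdio.h>
#include <numpy/arrayobject.h>

/* Called around PyDataMem_NEW/FREE/RENEW; inp/outp are the old/new pointers. */
static void trace_hook(void *inp, void *outp, size_t size, void *user_data)
{
    (void)user_data;
    fprintf(stderr, "data event: %p -> %p (%zu bytes)\n", inp, outp, size);
}

static void install_trace_hook(void)
{
    void *old_data = NULL;
    PyDataMem_EventHookFunc *old_hook =
        PyDataMem_SetEventHook(trace_hook, NULL, &old_data);
    (void)old_hook;   /* keep these if the previous hook should be restored */
}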

@mattip force-pushed the configurable_allocator branch 2 times, most recently from f71fd5b to 811b9e4, on October 19, 2020 15:19

mattip commented Oct 19, 2020

rebased and force-pushed to clear a merge conflict

@seberg seberg left a comment

A fairly quick pass, looks nice! And the addition of a test-run helper that builds test modules from strings is probably overdue :).

In addition to the inline comments, I think I would like an NPY_USE_DEBUG_ALLOCATOR=1 test run, so that we run the full test suite with an allocator that would blow up if there are incompatibilities/code paths that were forgotten. I guess all that would be needed is to always allocate 32 or so extra bytes and store the shifted pointer (it probably must remain aligned, or the test suite will go down in flames).

I don't know the needs of those who would like such an allocator well, though, so it would be interesting to know whether we should anticipate the API growing.

numpy/core/src/multiarray/ctors.c (outdated, resolved)
numpy/core/include/numpy/ndarraytypes.h (resolved)
numpy/core/include/numpy/ndarraytypes.h (outdated, resolved)
numpy/core/include/numpy/ndarraytypes.h (resolved)
numpy/core/include/numpy/ndarraytypes.h (outdated, resolved)

mattip commented Oct 19, 2020

I guess all that would be needed is to always allocate 32 or so extra bytes and store the shifted pointer (it probably must remain aligned, or the test suite will go down in flames).

Moving the pointer is what the test I wrote does; either it is not documented well enough or I didn't understand the comment.


seberg commented Oct 19, 2020

Sorry @mattip, I have not read the tests carefully yet. My main point was that it would be nice to run the full test suite with a custom allocator set, and I don't think you are doing that yet?


mattip commented Oct 19, 2020

run the full test suite

Correct. I am only running the tests under numpy.core, and even that only under the slow decorator. Running all the tests would double the test suite running time. I did run the complete test suite locally (by changing np.core.test() to np.test()) and it passed.


seberg commented Oct 19, 2020

@mattip I was thinking we could run a single CI job with it? That would test it thoroughly also in the future without really adding much time to the whole process.


tdimitri commented Oct 20, 2020

Thanks Matti. Just downloaded the version and took a look... here are my thoughts.

  1. There should be a PyDataMem_GetHandler that just returns the current handler, as opposed to trying to return it from PyDataMem_SetHandler(NULL), which has the side effect of resetting it. This would be the first function I would call anyway, as I might map some allocator functions that I do not support yet to the existing ones.

  2. I would change the struct PyDataMem_Handler (is this allowed? is this the first time it is being exposed as public?)

This is what I currently see.

typedef struct {
    char name[128];  /* multiple of 64 to keep the struct aligned */
    PyDataMem_AllocFunc *alloc;
    PyDataMem_ZeroedAllocFunc *zeroed_alloc;
    PyDataMem_FreeFunc *free;
    PyDataMem_ReallocFunc *realloc;
    PyDataMem_CopyFunc *host2obj;  /* copy from the host python */
    PyDataMem_CopyFunc *obj2host;  /* copy to the host python */
    PyDataMem_CopyFunc *obj2obj;  /* copy between two objects */
} PyDataMem_Handler;

I would try to plumb this structure for the future: I would put a struct version (high/low) in it, and I would also include FEATURE bits for the memory allocation. I imagine this structure will change in the future as new features become possible.
The current allocator understands huge pages on Linux and can cache small sizes, but it does not guarantee byte alignment. So to me, those are three feature bits right there:

  • Understands Huge/Large Pages
  • Can Cache (or Can Cache small sizes)
  • Guarantees Byte Alignment

For small vs. large it would be nice to know what the cutoff sizes for both are... and maybe they become settable in the future.
For guaranteed byte alignment, what byte alignment is guaranteed?
What about a thread-safe allocator? The current numpy allocator is not thread safe for small sizes due to its caching of small memory (the cache part is not thread safe). Why care whether the allocator is thread safe? Because the numpy extension we are writing adds threads, and sometimes those threads allocate temporary buffers while doing work. It would be nice to use a thread-safe caching allocator so that threads could get memory quickly.

Another feature might be GPU aware or NUMA node aware.

Then, if each allocator chains to the next (see the next pointer below), that would be a nice feature.

So the struct might end up looking something like this...

typedef struct {
    PyDataMem_Handler *next;   /* next in chain, NULL terminated */
    int16_t versionhi;         /* current version of PyDataMem_Handler */
    int16_t versionlow;
    int32_t structsize;        /* size of this struct */
    int64_t feature_bits;      /* 64 feature bits: has large pages, NUMA aware,
                                  thread safe, can cache, etc. */
    char name[128];            /* multiple of 64 to keep the struct aligned */
    /* ... */
    int64_t cachesmallsize;    /* up to what size small memory is cached */
    int64_t cachelargesize;    /* at what size large memory is allocated */
    int64_t bytealignment;     /* what byte alignment is guaranteed */
    /* ... */
} PyDataMem_Handler;

  3. Then I don't know if PyDataMem_SetHandler can fail, but it seems possible. So is NULL returned on failure?


mattip commented Oct 20, 2020

@tdimitri thanks for the review.

Maybe PyDataMem_SetHandler should not return anything. I am not comfortable with the user overriding only some of the functions with no internal checks. Perhaps another idea would be to force the user to set at least alloc/free/realloc, and if the others are NULL fill them in with defaults.
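A sketch of that validation idea (a hypothetical helper, not code from this PR), using the field names of the struct quoted above:

#include <Python.h>
#include <numpy/arrayobject.h>   /* assumed to declare PyDataMem_Handler */

/* Hypothetical check inside PyDataMem_SetHandler: alloc/free/realloc are
   mandatory; optional slots left NULL are filled from the default handler. */
static int
validate_handler(PyDataMem_Handler *h, const PyDataMem_Handler *defaults)
{
    if (h->alloc == NULL || h->free == NULL || h->realloc == NULL) {
        PyErr_SetString(PyExc_ValueError,
                        "handler must provide alloc, free and realloc");
        return -1;
    }
    if (h->zeroed_alloc == NULL) {
        h->zeroed_alloc = defaults->zeroed_alloc;
    }
    return 0;
}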

Yes, this is the first time this struct is being exposed. We use the NumPy-wide C_API_VERSION to specify the version. The idea of versioning public structures has come up before, see this discussion. We should set an across-the-board policy for whether it matters and if so how to deal with it.

I am concerned about "feature creep" when adding bits that describe aspects of the functions. Since PyDataMem_Handler is a public structure, changing it will require recompiling the entire scientific Python ecosystem. If downstream users think the struct should be more flexible, we should make it private and access it with a different API that only exposes it as a void *handle. Any functionality can be coded into the name. I am curious what you expect users to do with information like cachesmallsize or bytealignment.


tdimitri commented Oct 20, 2020

I agree that any memory handler has to follow some rules, like chaining to the previous handler and filling in any NULL function pointers with the previous values (or, as you point out, requiring alloc/free/realloc).

In both the Linux and NT kernels, it is common to chain handlers using the first member of the struct, and there are macros in both kernels to do this. I think this is a good practice, and one I have used myself.

As for feature creep, the flip side is having no or few new features. How often is this really going to change and force a stack recompile?

As for cachesmallsize: I might defer to the existing memory allocator for small sizes, but I would need to compare sizes to know when to call it.

Information like this (and more) could be returned in an info call. As for byte alignment: say the current allocator already does 32-byte alignment and I have AVX2 (256-bit), then there is no need to change anything. But if we then detect AVX-512 capability on another computer, I need to replace the handler with one that guarantees 64-byte alignment.
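As a concrete example of the alignment point (a sketch, not part of this PR): a handler targeting 512-bit loads could guarantee 64-byte alignment with something like this, while staying compatible with a plain free.

#include <stdlib.h>

#define ALIGNMENT 64   /* e.g. for 512-bit SIMD loads and stores */

static void *aligned_alloc64(size_t size)
{
    void *p = NULL;
    /* POSIX; the result may be released with plain free(). */
    if (posix_memalign(&p, ALIGNMENT, size ? size : ALIGNMENT) != 0) {
        return NULL;
    }
    return p;
}

static void aligned_free64(void *ptr, size_t size)
{
    (void)size;
    free(ptr);
}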

What happens in the future when a version 1.0 memory handler calls a SetHandler that expects version 2.0? That might fail, which is why I thought it could fail.

On another note, I am not sure why char name[128]; /* multiple of 64 to keep the struct aligned */ is used instead of const char *name. If the struct holds permanent function pointers, then it can hold a pointer to a string as well.

For riptable, we can also flush the cache or return statistics on the cache. For instance, we return the hit/miss ratio. We also have a timeout for the garbage collector that can be changed.

In numpy, code could be written to flush the cache, but there is currently no mechanism for it.

Another possible addition to the struct is a reserved field (void *ownerdefined) for the struct owner.

Then, if memory handlers are chained, we might need a way to remove a "name" from the chain (or move it to the front).


seberg commented Oct 20, 2020

Sounds like there is enough reason that we may want to grow the struct/feature list in the future. So it probably does make sense to make the struct opaque and maybe version it. I suppose going with a FromSpec API would work as well, but that is maybe more than necessary.
Some things, like cache hits/misses, probably don't need to be defined in NumPy, but I wonder if we could extend things to expose not just a name but, for example, a Python object?


leofang commented Oct 20, 2020

If downstream users think the struct should be more flexible, we should make it private and access it with a different API that only exposes it as a void *handle.

Another possible addition to the Struct is a field reserved (void * ownerdefined) for the Struct owner.

I think having a void* in the struct would be useful for downstream libraries to hack in additional fields. For example, in the GPU world the memory allocation/deallocation could be stream-ordered, so it seems useful to me if I could attach a stream pointer to the malloc/free routines. I haven't put much thought into this PR, so I am just throwing out a few bits off the top of my head 😅
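Sketching that idea with purely hypothetical names (this is not the struct from this PR): a void * context slot carried next to the function pointers would let a GPU library thread a stream handle through to its allocation routines.

#include <stddef.h>

/* Hypothetical owner-defined context attached to a handler via a void* field. */
typedef struct {
    void  *stream;          /* e.g. a cudaStream_t this library allocates on */
    size_t bytes_in_use;    /* any extra bookkeeping the owner wants */
} MyAllocCtx;

/* Hypothetical signature where the handler's context is passed back in,
   so the allocation can be stream-ordered. */
static void *stream_ordered_alloc(void *ctx, size_t size)
{
    MyAllocCtx *my = (MyAllocCtx *)ctx;
    my->bytes_in_use += size;
    /* real code would call something like cudaMallocAsync(..., my->stream) */
    return NULL;   /* placeholder only */
}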

@tdimitri

Any solution that 1) provides PyDataMem_GetHandler, 2) can chain and lets me walk the chain, 3) can remove a handler from the chain, and 4) indicates the version of the memory handler (so that, I suppose, too old a version can be rejected in the future) is one I am OK with and will write code to use. If it were up to me, I would not make the struct opaque. Thank you.

@mattip added the "triage review" label (Issue/PR to be discussed at the next triage meeting) on Oct 21, 2020

mattip commented Oct 21, 2020

Could you give a concrete use case for walking the chain of methods? What problem does this solve?

@tdimitri

If there are only two possible memory handlers at any point (the new one and the default one), then chaining might not matter. But I believe there can be more than two, as any module can try to set the memory handler, especially because arrays can stick around for the life of the process and only the memory handler that allocated the data can properly delete it.

As for walking the chain -- aside from general debugging to know who has hooked what (which I would prefer to be exposed, not hidden), I could walk the chain, get to the last member of the chain, or search for the "default" handler name. Then I would make note of the default alloc handler.

Then when a small memory request came in, I might call the default handler since I know it already has a cache for small memory < 1024 in size.

In general, any time I wanted to "punt" to the default routine, I could.
Or I could see that I was already in the chain, but not first; that might help me understand why I am not seeing memory allocations.

Or there could be two useful memory handlers in the chain -- perhaps one that is NUMA aware, or one that on Windows can allocate large pages (numpy does not currently support this). Maybe a module in the future does support it; I might search for and call that memory handler for large objects (see: Windows large page support).

From my experience, it is common to be able to call the next hook; see CallNextHookEx.

Or here is an example of chaining signal handlers in Linux (chain signal handler) -- but what if the previous memory handler is removed? Then what was returned by SetMemoryHandler() (the previous one) is now gone, but I still have a pointer to it.

If I wanted to punt the alloc, I might just call if (next) next->alloc(...), since next is always up to date.
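A sketch of that punting pattern (hypothetical: it assumes a saved pointer to the previously installed handler and the single-argument alloc signature used earlier in this thread):

#include <stdlib.h>
#include <numpy/arrayobject.h>   /* assumed to declare PyDataMem_Handler */

/* Hypothetical: the previously installed handler, saved (or reached through
   a 'next' field) when this handler registered itself. */
static PyDataMem_Handler *next_handler;

static void *my_large_alloc(size_t size)
{
    return malloc(size);   /* stand-in for a huge-page / NUMA-aware path */
}

static void *chained_alloc(size_t size)
{
    if (size < 1024 && next_handler != NULL && next_handler->alloc != NULL) {
        /* Punt small requests to the default handler, which already
           caches small blocks. */
        return next_handler->alloc(size);
    }
    return my_large_alloc(size);
}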


mattip commented Oct 22, 2020

It seems this is complicated enough that a short NEP is needed.


mattip commented Oct 26, 2020

Since this changes the PyArrayObject_fields struct, it will hit the same problems that gh-16938 hit in this long discussion: it turns out that we cannot modify C_ABI_VERSION without breaking other projects.

@mattip removed the "triage review" label (Issue/PR to be discussed at the next triage meeting) on Nov 4, 2020

leofang commented Nov 24, 2020

Just curious about the status of this PR?


mattip commented Oct 25, 2021

Can we get to a point where if we got NULL here, we use the normal free for deallocating and give a warning that mentions the base-object workaround/encourages to open a NumPy issue for support if necessary?

@seberg I added a warning in e81eff4.

@jakirkham

Sorry, random question that I've been thinking about: would it be possible to use this with shared memory?

@seberg seberg left a comment

Thanks Matti and Elias, great work keeping up with all the API request changes and introducing the context var!
As far as I am aware the public API is de facto settled and this has gotten some pretty thorough review. The last real concern I had was the OWNDATA "transfer", which is addressed now.

I think at this point, progress on the PR would give diminishing returns and it seems stable enough to ship.

@numpy/numpy-developers If anyone has API concerns or finds any fixup requests, please do not hesitate to note them; I expect we still have some wiggle room (but personally I would be surprised by any serious change). But the next release is also soon ;).


Notes, to move to a new issue:

  • Release note is missing?
  • I would like a bit of "hand holding" for the arr->flags |= NPY_OWNDATA user. I.e. tell users in some place (maybe the flag documentation?), linked from the warning: please use a PyCapsule as the array base (this is simple, but you have to know about PyCapsule). (I could look into that myself; a sketch of the pattern follows this list.)
  • PyCapsule_GetPointer technically can set an error and may need the GIL? I am not worried enough about this to delay, but the functions look like they are meant to be callable without the GIL (I doubt we use it, and I doubt it really can error/need the GIL, but maybe we can make it a bit nicer anyway).
  • The void functions use (normally) small allocation: I think it would make more sense to just use the Python allocators rather than a (possibly slow) aligned allocator.
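A minimal sketch of the PyCapsule-as-base pattern referred to in the list above, using the documented PyArray_SimpleNewFromData, PyArray_SetBaseObject and PyCapsule APIs (the buffer and its size are illustrative):

#include <stdlib.h>
#include <Python.h>
#include <numpy/arrayobject.h>   /* import_array() must have been called */

/* Destructor runs when the last reference to the array (and capsule) dies. */
static void free_buffer(PyObject *capsule)
{
    free(PyCapsule_GetPointer(capsule, NULL));
}

static PyObject *wrap_buffer(void)
{
    npy_intp dims[1] = {100};
    double *buf = malloc(100 * sizeof(double));
    if (buf == NULL) {
        return PyErr_NoMemory();
    }
    PyObject *arr = PyArray_SimpleNewFromData(1, dims, NPY_DOUBLE, buf);
    if (arr == NULL) {
        free(buf);
        return NULL;
    }
    /* Instead of setting NPY_ARRAY_OWNDATA, hand ownership to a capsule. */
    PyObject *capsule = PyCapsule_New(buf, NULL, free_buffer);
    if (capsule == NULL) {
        Py_DECREF(arr);
        free(buf);
        return NULL;
    }
    /* Steals the reference to capsule; the capsule now owns the buffer. */
    if (PyArray_SetBaseObject((PyArrayObject *)arr, capsule) < 0) {
        Py_DECREF(arr);
        return NULL;
    }
    return arr;
}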

doc/neps/nep-0049.rst (outdated, resolved)
@seberg seberg merged commit 84e0707 into numpy:main Oct 25, 2021

seberg commented Oct 25, 2021

Let's try this 🎉, thanks again!


seberg commented Oct 25, 2021

Sorry, random question that I've been thinking about: would it be possible to use this with shared memory?

Not sure I understand the question, but maybe I don't know the shared-memory use case well enough. This allows you to customize allocations. I suppose you could do something like:

with allocation_policy_designed_for_sharing():
    arr = np.arange(100000)

share_data(arr)  # Knows how `arr`'s data was allocated and may have customized where/how it was allocated.

But does that help?!


leofang commented Oct 25, 2021

Great work @mattip and everyone! Looking forward to the evolution of this allocator! @eliaskoromilas's numpy-allocator is great in that we have a Python-only playground, but I hope we can mature this support soon so that the Python allocator API can be made part of NumPy.

@jakirkham

But does that help?!

Quite possibly, yes. I am still thinking through how this might work, though being able to plug this into NumPy would be pretty useful.

@mattip mattip deleted the configurable_allocator branch December 17, 2021 08:44
saimn added a commit to saimn/astropy that referenced this pull request Nov 9, 2022
Fix astropy#12324 : Fix the warnings emitted by Numpy when transferring
ownership by setting ``NPY_ARRAY_OWNDATA``:

    RuntimeWarning: Trying to dealloc data, but a memory policy is not
    set.  If you take ownership of the data, you must set a base owning
    the data (e.g. a PyCapsule).

The warning was added in [1], and is now visible only when setting the
NUMPY_WARN_IF_NO_MEM_POLICY environment variable [2]. For more details
and a fix see [3].

[1] numpy/numpy#17582
[2] numpy/numpy#20200
[3] https://github.com/numpy/numpy/blob/main/doc/source/reference/c-api/data_memory.rst#what-happens-when-deallocating-if-there-is-no-policy-set
saimn added a commit to saimn/astropy that referenced this pull request on Nov 15, 2022 (same commit message as above)
saimn added a commit to saimn/astropy that referenced this pull request on Nov 18, 2022 (same commit message as above)
saimn added a commit to saimn/astropy that referenced this pull request on Dec 11, 2022 (same commit message as above)
saimn added a commit to saimn/astropy that referenced this pull request on Dec 11, 2022 (same commit message as above)
saimn added a commit to saimn/astropy that referenced this pull request on Dec 13, 2022 (same commit message as above)
dougbrn pushed a commit to dougbrn/astropy that referenced this pull request on Mar 13, 2023 (same commit message as above)

Successfully merging this pull request may close these issues.

ENH: allow using aligned memory allocation, or exposing an API for memory management