
ENH: allow using aligned memory allocation, or exposing an API for memory management #17467

Closed
mattip opened this issue Oct 6, 2020 · 15 comments · Fixed by #17582

Comments

mattip commented Oct 6, 2020

Submitting this here since it is easier to link to all the issues and PRs dealing with memory management. I will submit it to the mailing list after a first round of discussion here, if needed.

There are three allocations involved in instantiating a PyArrayObject: the Python allocation of the PyObject itself, the PyArray_malloc/PyArray_realloc/PyArray_free calls that allocate the array dims and strides, and the PyDataMem_NEW/PyDataMem_NEWZEROED/PyDataMem_FREE calls for the array data.
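
To illustrate the split, here is a toy, self-contained C sketch (plain malloc stand-ins, not the actual NumPy code): the object header, the dims/strides metadata, and the data buffer come from three separate routines, so the data layer can be swapped out independently.

#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for the three layers; NumPy really uses the Python object
   allocator, PyArray_malloc, and PyDataMem_NEW/PyDataMem_NEWZEROED here. */
typedef struct {
    int   ndim;
    long *dims_and_strides;   /* layer 2: metadata */
    char *data;               /* layer 3: data buffer */
} toy_array;

static toy_array *toy_array_new(int ndim, const long *dims, size_t itemsize)
{
    toy_array *arr = malloc(sizeof(*arr));                    /* layer 1: "object" */
    if (arr == NULL) {
        return NULL;
    }
    arr->ndim = ndim;
    arr->dims_and_strides = malloc(2 * ndim * sizeof(long));  /* layer 2 */
    size_t nbytes = itemsize;
    for (int i = 0; i < ndim; i++) {
        nbytes *= (size_t)dims[i];
    }
    arr->data = calloc(nbytes, 1);                            /* layer 3 (zeroed) */
    if (arr->dims_and_strides == NULL || arr->data == NULL) {
        free(arr->dims_and_strides);
        free(arr->data);
        free(arr);
        return NULL;
    }
    memcpy(arr->dims_and_strides, dims, ndim * sizeof(long));
    return arr;
}

int main(void)
{
    const long dims[2] = {3, 4};
    toy_array *a = toy_array_new(2, dims, sizeof(double));
    if (a != NULL) {
        free(a->data);
        free(a->dims_and_strides);
        free(a);
    }
    return 0;
}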

I would like to make the case for an API for overriding the internal use of PyDataMem_NEW, PyDataMem_NEWZEROED and PyDataMem_FREE. Work on this was begun in gh-5470 and referred to in issue gh-5312 (asking for an aligned allocator). In addition, there is PR gh-5457, which provides a global interface for specifying aligned allocations.

As mentioned in that issue, alignment can be crucial for some applications, but in general it is just extra overhead, so it should be configurable by the user/application.

I think I would prefer providing an API hook as in gh-5470. That PR actually had two parts: configurable PyDataMem_* overrides and a hook mechanism. The hook mechanism was merged (does anyone use it?), but the part I am interested in would require an API change.

A fallback proposal would be to allow specifying a default allocation alignment, as in gh-5457, perhaps using PyArray_malloc_aligned and friends, which became part of NumPy with the random API refactor.
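
For reference, a minimal sketch of what a default-aligned data allocator could look like in plain C11 (the helper name my_aligned_malloc is hypothetical; this is not the PyArray_malloc_aligned API):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* aligned_alloc (C11) requires the size to be a multiple of the alignment,
   so round the request up first; this works for power-of-two alignments. */
static void *my_aligned_malloc(size_t size, size_t alignment)
{
    size_t padded = (size + alignment - 1) & ~(alignment - 1);
    return aligned_alloc(alignment, padded);
}

int main(void)
{
    /* e.g. 64-byte alignment so SIMD loads never straddle a cache line */
    double *buf = my_aligned_malloc(1000 * sizeof(double), 64);
    if (buf == NULL) {
        return 1;
    }
    printf("64-byte aligned: %s\n", ((uintptr_t)buf % 64 == 0) ? "yes" : "no");
    free(buf);
    return 0;
}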

seberg commented Oct 6, 2020

I am not opposed; for me it was mainly something that other people were always more interested in (e.g. Julian). One issue I see glancing at the PR is that we currently support a user providing the memory block (with NumPy taking ownership), and that usage is broken by gh-5470. So I wonder if we would need an allocation-strategy flag on the array, or something similar?

Other than that, the code seems simple enough that it is not an issue, unless there is some negative performance implication. One more question: you want such higher-aligned memory, but you do not actually care about knowing whether an array provides that alignment? So this doesn't affect the question in gh-17359, right?

mattip commented Oct 8, 2020

So this doesn't affect the question in gh-17359, right?

Right, it is orthogonal.

mattip commented Oct 8, 2020

we currently support a user providing the memory block (with NumPy taking ownership), and that usage is broken by gh-5470

As envisioned in the PR, the functions are part of the PyArrayObject, so each ndarray object would carry the information with it. The PR would have to make sure we use the correct functions consistently in all the right places: everywhere PyDataMem_NEW, PyDataMem_NEWZEROED, and PyDataMem_FREE are used (and realloc would have to be carefully checked).
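
To spell out the property being relied on here (a plain-C sketch, not NumPy's actual struct layout): the release routine travels with the object, so the data is always freed by the allocator that created it, even if a global default changes later.

#include <stdio.h>
#include <stdlib.h>

/* Sketch: each "array" remembers the free routine that matches its data. */
typedef void (*data_free_fn)(void *ptr);

typedef struct {
    void        *data;
    data_free_fn data_free;   /* stored at creation time */
} toy_array;

static void default_free(void *ptr) { puts("default free"); free(ptr); }
static void pinned_free(void *ptr)  { puts("pinned free");  free(ptr); }

static toy_array make_array(void *(*alloc)(size_t), data_free_fn dealloc, size_t n)
{
    toy_array a = { alloc(n), dealloc };
    return a;
}

int main(void)
{
    toy_array a = make_array(malloc, default_free, 64);
    toy_array b = make_array(malloc, pinned_free, 64);  /* imagine a pinned allocator */
    /* Even if a global "current allocator" were swapped at this point, each
       array still releases its data with the routine it was created with. */
    a.data_free(a.data);
    b.data_free(b.data);
    return 0;
}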

seberg commented Oct 8, 2020

Ah, I missed that it was stored on the array object. That seems like a good solution for being able to change things safely. (I am a bit worried that the cache must only be used if it is the current default allocator; maybe it should be part of the allocator struct, but that is a small thing.)

So maybe this had also stalled because it required extending the array struct? I think I am in favor of being open to that in any case; it is also useful for cleaning up buffer handling (and who knows what else).

leofang commented Oct 13, 2020

Speaking of memory management, it would be great if NumPy provided an API, in either Python (preferred) or C (we can still make it work), for swapping in a memory manager implemented by third parties. This need comes from GPU libraries: we use something called pinned/page-locked memory to speed up host-device transfers, and in CuPy, for example, there is a custom memory manager to handle this. But it's not very convenient when we want every NumPy array to be backed by our memory pool, so it would be better if NumPy offered an API to do so:

# very rough sketch just to illustrate the idea
import numpy as np
import cupy as cp

np.set_allocator(cp.cuda.alloc_pinned_memory)
a = np.empty(100)  # use pinned memory

This would make our lives a lot easier (related: cupy/cupy#3625).

cc possibly interested people @jakirkham @kmaehashi @gmarkall

@jakirkham

cc @pentschev @hmaarrfk (also likely interested)

mattip commented Oct 19, 2020

@leofang could you take a look at gh-17582? Here is the rendered documentation. Do you require overriding memcpy as well?

@jakirkham

Thanks Matti! 😄

There can be some value in overriding memcpy if we know we are copying to/from pinned memory and/or device memory, as this could be done asynchronously on a stream. Additionally, there are tools like gdrcopy that could speed up copies further if plugged in. Though I guess how useful this would be depends on how complex it is to implement and use. Do you have a sense of either of these?

mattip commented Oct 20, 2020

I think memcpy would be tricky to get right. It might be good if we could do this in phases, leaving memcpy for a later phase, but that depends on there being "only" performance penalties (and not errors) when using memcpy instead of the more advanced functions. Is that true?

leofang commented Oct 20, 2020

Thank you @mattip for considering our needs! The rendered doc looks great -- it seems the memory hooks would allow us to hack something together so that a numpy.ndarray (not a cupy one) could also live on a GPU! Not saying it'd be useful, but it's very interesting.

but that depends on there being "only" performance penalties (and not errors) when using memcpy instead of the more advanced functions. Is that true?

As far as I can tell, this is a valid observation. If it's hard to override memcpy, I suppose we could still add some hacks to do the data transfer. But @jakirkham and @pentschev might have other concerns? I was never able to make gdrcopy work, so I can't comment on that.

@jakirkham

Yeah, I think it is possible to handle memcpy alternatives outside of NumPy, so I don't think it is strictly necessary (even less so since it sounds complicated to do).

@pentschev

I agree with the comments from @jakirkham and @leofang: it would be useful to override memcpy, but it doesn't seem like a blocker, and we could work around it outside of NumPy using the hooks.

mattip commented Oct 21, 2020

@pentschev I am not sure what you meant by "using the hooks". How would you detect that a chunk of memory was modified? Just to be clear: I am reasonably sure I correctly identified the malloc/realloc/free call sites inside NumPy and replaced them with function calls that use the hooks. However, I did not touch the ~250 uses of memcpy inside NumPy. That is a large task, and one I would prefer to defer. My question was whether people contemplating using the malloc/realloc/free hooks foresee problems due to glibc's memcpy being used internally by NumPy.

@pentschev

Sorry @mattip, I think I confused myself in several ways above. When I wrote about hooks, I was thinking of memory allocations; internal memcpy calls really have nothing to do with that. Generally speaking, it seems overriding memcpy could be useful, but I'm not sure in which cases it would be necessary, so I'm OK with deferring that to future work.

jakirkham commented Oct 21, 2020

Yeah, I don't think we are worried about NumPy using memcpy internally. In terms of moving things device-to-host or host-to-device, we would end up doing that ourselves. The OS (and thus memcpy) should be able to make smart choices about pinned memory under the hood.

seberg pushed a commit that referenced this issue Oct 25, 2021
Fixes gh-17467. Adds a public struct PyDataMem_Handler to hold memory manipulation routines, and two new API functions: PyDataMem_SetHandler to replace the current routines with new ones, and PyDataMem_GetHandlerName to get the string name of the current routines (either globally or for a specific ndarray object). This also changes the size of the ndarray object, which now holds the PyDataMem_Handler that was active when it was created, so subsequent actions on its data memory remain consistent.

Tests and documentation are included. Along the way, I found some places in the code where the current policy is inconsistent (all data memory handling should have gone through npy_*_cache, not PyDataMem_*), so even if this is rejected it might improve the cache handling.

The PyDataMem_Handler has fields to override memcpy; these are currently not implemented, and memcpy in the code base is untouched. I think this PR is invasive enough as-is; if desired, memcpy can be handled in a follow-up PR.
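
To make the shape of the new API concrete, here is a sketch of installing a custom handler as described in NEP 49; the struct fields and the "mem_handler" PyCapsule wrapping follow my reading of the NEP and may differ in detail from the merged code:

/* Part of a CPython extension module; needs Python.h and the NumPy headers. */
#include <Python.h>
#include <numpy/arrayobject.h>

/* Thin wrappers around the system allocator; a real handler could route
   these to an aligned or pinned-memory allocator instead. */
static void *my_malloc(void *ctx, size_t size) { return malloc(size); }
static void *my_calloc(void *ctx, size_t nelem, size_t elsize) { return calloc(nelem, elsize); }
static void *my_realloc(void *ctx, void *ptr, size_t size) { return realloc(ptr, size); }
static void  my_free(void *ctx, void *ptr, size_t size) { free(ptr); }

/* Layout as sketched in NEP 49 (assumption: field order/names match the NEP). */
static PyDataMem_Handler my_handler = {
    "my_handler",   /* name reported when querying the handler */
    1,              /* version */
    {
        NULL,       /* ctx pointer passed back to every routine */
        my_malloc,
        my_calloc,
        my_realloc,
        my_free,
    },
};

static PyObject *install_handler(PyObject *self, PyObject *args)
{
    /* The handler travels wrapped in a PyCapsule named "mem_handler";
       PyDataMem_SetHandler installs it and returns the previous handler. */
    PyObject *capsule = PyCapsule_New(&my_handler, "mem_handler", NULL);
    if (capsule == NULL) {
        return NULL;
    }
    PyObject *old = PyDataMem_SetHandler(capsule);
    Py_DECREF(capsule);
    return old;   /* the caller can keep this to restore the default later */
}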


* ENH: add and use global configurable memory routines

* ENH: add tests and a way to compile c-extensions from tests

* fix allocation/free exposed by tests

* DOC: document the new APIs (and some old ones too)

* BUG: return void from FREE, also some cleanup

* MAINT: changes from review

* fixes from linter

* setting ndarray->descr on 0d or scalars mess with FREE

* make scalar allocation more consistent wrt np_alloc_cache

* change formatting for sphinx

* remove memcpy variants

* update to match NEP 49

* ENH: add a python-level get_handler_name

* ENH: add core.multiarray.get_handler_name

* Allow closure-like definition of the data mem routines

* Fix incompatible pointer warnings

* Note PyDataMemAllocator and PyMemAllocatorEx differentiation

Co-authored-by: Matti Picus <matti.picus@gmail.com>

* Redefine default allocator handling

* Always allocate new arrays using the current_handler

* Search for the mem_handler name of the data owner

* Sub-comparisons don't need a local mem_handler

* Make the default_handler a valid PyDataMem_Handler

* Fix PyDataMem_SetHandler description (NEP discussion)

* Pass the allocators by reference

* Implement allocator context-locality

* Fix documentation, make PyDataMem_GetHandler return const

* remove import of setuptools==49.1.3, doesn't work on python3.10

* Fix refcount leaks

* fix function signatures in test

* Return early on PyDataMem_GetHandler error (VOID_compare)

* Add context/thread-locality tests, allow testing custom policies

* try to fix cygwin extension building

* YAPF mem_policy test

* Less empty lines, more comments (tests)

* Apply suggestions from code review (set an exception and)

Co-authored-by: Matti Picus <matti.picus@gmail.com>

* skip test on cygwin

* update API hash for changed signature

* TST: add gc.collect to make sure cycles are broken

* Implement thread-locality for PyPy

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

* Update numpy/core/tests/test_mem_policy.py

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>

* fixes from review

* update circleci config

* fix test

* make the connection between OWNDATA and having a allocator handle more explicit

* improve docstring, fix flake8 for tests

* update PyDataMem_GetHandler() from review

* Implement allocator lifetime management

* update NEP and add best-effort handling of error in PyDataMem_UserFREE

* ENH: fix and test for blindly taking ownership of data

* Update doc/neps/nep-0049.rst

Co-authored-by: Elias Koromilas <elias.koromilas@gmail.com>