
ENH: add locking around initializing the argparse cache #26430

Open · wants to merge 3 commits into main
Conversation

@ngoldbaum (Member) commented May 13, 2024

The argparse cache relies on a struct that gets statically defined in each function using npy_parse_arguments. The struct caches the interned string names of keyword arguments. Up until now, the GIL prevented data races while initializing the cache. Nothing prevents the same Python function from being simultaneously called in multiple threads in the free-threaded build, so we need something a little more sophisticated there.
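For context, the pattern in question looks roughly like this (a simplified sketch: the initialized flag, the struct layout, and the helper are illustrative, not numpy's exact npy_argparse.h definitions):

#include <Python.h>

#define _NPY_MAX_KWARGS 15  /* illustrative value */

typedef struct {
    int initialized;                            /* illustrative flag */
    PyObject *kw_strings[_NPY_MAX_KWARGS + 1];  /* interned kwarg names */
} _NpyArgParserCache;

/* Each function using npy_parse_arguments has one static cache: */
static _NpyArgParserCache cache;

static void
ensure_cache_initialized(_NpyArgParserCache *c)
{
    /* Under the GIL this check-then-fill is race-free; in the
     * free-threaded build two threads can both observe
     * initialized == 0 and initialize the cache concurrently. */
    if (!c->initialized) {
        c->kw_strings[0] = PyUnicode_InternFromString("axis");
        c->kw_strings[1] = NULL;
        c->initialized = 1;
    }
}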

It turns out that the symbols in the until-now private pyatomic.h header are exported by Python.h starting in Python 3.13. I've used some of the atomic operations to implement an internal-only single-initialization API, npy_call_once, and then used that API in the argument parser.
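As an illustration, a single-initialization helper built on those atomics could look like the following minimal sketch (not the PR's exact code; it assumes the _Py_atomic_load_int, _Py_atomic_compare_exchange_int, and _Py_atomic_store_int functions that Python.h exposes in 3.13):

#include <Python.h>

/* States: 0 = uninitialized, 1 = initializing, 2 = done. */
typedef int npy_once_flag;

static void
npy_call_once(npy_once_flag *flag, void (*init_func)(void))
{
    if (_Py_atomic_load_int(flag) == 2) {
        return;  /* fast path: already initialized */
    }
    int expected = 0;
    if (_Py_atomic_compare_exchange_int(flag, &expected, 1)) {
        init_func();                   /* we won the race: initialize */
        _Py_atomic_store_int(flag, 2);
    }
    else {
        while (_Py_atomic_load_int(flag) != 2) {
            /* another thread is initializing; spin until it finishes */
        }
    }
}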

There are some other caches like this, so I've factored things so that I can re-use npy_call_once elsewhere in the codebase. Because the pyatomic symbols are only exposed starting in Python 3.13, I've decided to only make use of npy_call_once in the free-threaded build: everything is guarded by preprocessor checks for Py_GIL_DISABLED, and this should be a no-op change for the regular build.

I re-implemented this using a single atomic flag and a PyThread_type_lock mutex, per Sam's suggestion.

@ngoldbaum added the "39 - no-GIL" label (PRs and issues related to support for free-threaded CPython, a.k.a. no-GIL, PEP 703) on May 13, 2024
@ngoldbaum (Member, Author)

I just got feedback that using the private pyatomic API isn't a good idea and that I should rewrite this to use C11 atomics directly. I'll also need to see whether that causes issues on Windows, since MSVC added C11 atomics in Visual Studio 2022 and several of the Windows builds use Visual Studio 2019.

@charris (Member) commented May 13, 2024

> several of the Windows builds use Visual Studio 2019

If we want to upgrade to 2022, starting the process now would be a good idea. @h-vetinari Are there constraints on doing this?

@colesbury commented May 13, 2024

If you need additional atomics beyond this, then I'd consider figuring out how to use Visual Studio 2022. Otherwise, I'd rearchitect this a bit as follows:

In each _NpyArgParserCache you have an atomic_int initialized field. On MSVC, you can #define atomic_int volatile int. Use this for the fast-path check.

You also have a single PyThread_type_lock that protects the slow path. You don't need a PyThread_type_lock per _NpyArgParserCache, just a single global one. The lack of concurrency for initializing argument parsing caches doesn't matter. You need to initialize the lock early on with PyThread_allocate_lock before any arguments are parsed.

So in pseudo-code it would look something like:

#include <Python.h>  /* PyThread_type_lock, PyThread_acquire_lock, ... */

#ifdef __STDC_NO_ATOMICS__
#define atomic_int volatile int  /* MSVC: no <stdatomic.h> */
#else
#include <stdatomic.h>
#endif

static PyThread_type_lock global_cache_lock;

typedef struct {
    atomic_int initialized;
    ...
} _NpyArgParserCache;

// cache initialization (double-checked locking)
if (!cache->initialized) {
    PyThread_acquire_lock(global_cache_lock, 1);  /* 1: block until acquired */
    if (!cache->initialized) {  /* re-check under the lock */
        initialize_keywords(cache);
        cache->initialized = 1;
    }
    PyThread_release_lock(global_cache_lock);
}

// module initialization
global_cache_lock = PyThread_allocate_lock();

EDIT: fixed bug in pseudo-code

@h-vetinari (Contributor) commented May 13, 2024

> If we want to upgrade to 2022, starting the process now would be a good idea. @h-vetinari Are there constraints on doing this?

VS2019 is now EOL, so you'd have every right to move on. I'm guessing MSFT will also start removing it from public CI offerings in the not-too-distant future (if they follow a similar timeline as for VS2017), at which point it becomes a moot question anyway.

What it means is that everyone wanting to compile against compiled numpy artefacts (in the past that was mostly libnpymath and libnpyrandom; not sure what the status of #20880 is) on Windows will also have to use VS2022, due to the requirement that the consuming toolchain be at least as new as the producing one.

And in this case, it would probably also force conda-forge to move on, as numpy has ~1000 compiled dependencies, and at that point it becomes pretty much unavoidable. I can bring this up in the core call, but my personal opinion is that you should not make suboptimal engineering choices just to support EOL toolchains (obviously a question of degree, but certainly if the win is substantial).

As VS2022 is ABI-compatible with VS2019 and a drop-in replacement, I really don't see much of an issue arising from the toolchain version virality (I didn't hear of any relevant case in the transition from VS2017->VS2019 either, though I asked around for feedback).

CC @rgommers

@rgommers (Member)
> And in this case, it would probably also force conda-forge to move on, as numpy has ~1000 compiled dependencies, and at that point it becomes pretty much unavoidable.

Not necessarily; the only large user of the libnpymath static library that I know of is SciPy. And recently we've gotten rid of libnpyrandom in SciPy, so there are zero known users for that one.

> not sure what the status of #20880 is

There is a path and a mostly working prototype to remove libnpymath. As you'd expect, that is fairly complex though and it's been on hold for a couple of months now.

> If you need additional atomics beyond this, then I'd consider figuring out how to use Visual Studio 2022.

If we have significant needs here for supporting the free-threaded build, then I guess we have to bite the bullet.

> I've used some of the atomic operations to implement an internal-only single-initialization API

@ngoldbaum since we are now using atomics, please keep in mind that we will need build config changes before this PR goes in. On some platforms like armv7, GCC ships libatomic as a separate library instead of as part of libgcc. That should be handled like so:

https://github.com/scipy/scipy/blob/be0d4263c61a36f177f3ec764d20fcfc6dcb7d0d/scipy/meson.build#L439-L476
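The linked meson check essentially tries to compile and link a small C test program, first without and then with -latomic, and adds a libatomic dependency only if the first attempt fails. A sketch of the kind of program involved (illustrative, not SciPy's exact probe):

#include <stdatomic.h>
#include <stdint.h>

int main(void)
{
    _Atomic uint64_t x = 0;
    /* A 64-bit read-modify-write: on some 32-bit targets this lowers
       to a libatomic library call rather than an inline instruction. */
    atomic_fetch_add(&x, 1);
    return (int)atomic_load(&x) - 1;
}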

@ngoldbaum (Member, Author)

I tried using a single global mutex initialized during module setup, but I found that spawned threads saw uninitialized mutexes. Maybe this issue is specific to PyThread_type_lock? It does work if each cache has a mutex though, so I'm going to try pushing that after integrating Ralf's meson suggestion.

@ngoldbaum changed the title from "ENH: add single-initialization API and use it for the argparse cache" to "ENH: add locking around initializing the argparse cache" on May 15, 2024
@ngoldbaum (Member, Author)

It turns out the issue I was seeing was caused by initializing the lock in the _multiarray_umath module but not the _multiarray_tests module, so the test I added using a function in that module was seeing uninitialized data. Now we're using a global lock following Sam's original suggestion.

@h-vetinari (Contributor)

> If we have significant needs here for supporting the free-threaded build, then I guess we have to bite the bullet.

FWIW, we discussed this in conda-forge/core, and I invited people to comment here if they care; AFAICT there was no opposition or concern about numpy using VS2022. We might even be able to stay on VS2019, because under the hood we're already using VS2022, just targeting the vc142 toolset. It will depend on how that shakes out, but in any case, don't worry about conda-forge w.r.t. this.

@ngoldbaum (Member, Author)

@rgommers I never tried running CI on this PR without the build changes, and I know you were interested in taking a look. Should I try it to see if they're actually necessary? I know we do test via QEMU on ARM with GCC, so at least a few of the emulated tests should fail without the libatomic detection if it is needed.

@rgommers (Member)

That may indeed be interesting to do once, yes. Now that I look at this again: the reason for the uint64_t was (IIRC) that libatomic as a separate dependency was specifically necessary when using atomic operations on 64-bit types on 32-bit platforms.

@@ -12,6 +12,16 @@

#include "arrayfunction_override.h"

static PyThread_type_lock global_mutex;

int init_argparse_mutex() {


Should this be: int init_argparse_mutex(void)?

/* Null terminated list of keyword argument name strings */
PyObject *kw_strings[_NPY_MAX_KWARGS+1];
} _NpyArgParserCache;

NPY_NO_EXPORT int init_argparse_mutex();


init_argparse_mutex(void);
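For background on this suggestion: in C, unlike C++, an empty parameter list declares a function taking unspecified arguments, so only the (void) form gives a true zero-argument prototype. A small illustration:

/* With an empty parameter list, the compiler cannot check call sites,
   so a mistaken call like init_argparse_mutex(42) may compile without
   a diagnostic: */
int init_argparse_mutex();      /* arguments unspecified in C */

/* The (void) prototype makes zero-argument calls enforceable: */
int init_argparse_mutex(void);  /* takes exactly no arguments */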

@@ -214,6 +214,43 @@ else
lapack_dep = declare_dependency(dependencies: [lapack, blas_dep])
endif

# Determine whether it is necessary to link libatomic. This could be the case
# e.g. on 32-bit platforms when atomic operations are used on 64-bit types.


RISC-V with GCC is weird in that it supports 64-bit atomic ops without -latomic, but not 8-bit or 16-bit ops.

So in CPython we had to adjust the test to check for both 64-bit and 8-bit atomic operations:
https://github.com/python/cpython/blob/406ffb5293a8c9ca315bf63de1ee36a9b33f9aaf/configure.ac#L7417-L7422

If you are only using atomic_int then it might not matter.
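Following that CPython check, a probe could exercise both widths so that RISC-V's partial support is caught (a sketch in the same style as above, not CPython's exact test):

#include <stdatomic.h>
#include <stdint.h>

int main(void)
{
    _Atomic uint64_t big = 0;
    _Atomic uint8_t small = 0;
    atomic_fetch_add(&big, 1);    /* 64-bit RMW */
    atomic_fetch_add(&small, 1);  /* 8-bit RMW: needs libatomic on RISC-V GCC */
    return (int)(atomic_load(&big) + atomic_load(&small)) - 2;
}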

@@ -4,6 +4,13 @@
#include <Python.h>
#include "numpy/ndarraytypes.h"

#ifdef __STDC_NO_ATOMICS__
#define atomic_int volatile int


Caveat: this is "good enough" for x86 MSVC and also Windows' "ARM64EC" platform target, but not other ARM64 targets on Windows.

If we need to support other Windows ARM64 targets, we'll need a bit more work. Hopefully, MSVC will support <stdatomic.h> soon, but I wouldn't hold my breath for it.

https://learn.microsoft.com/en-us/cpp/build/reference/volatile-volatile-keyword-interpretation

@ngoldbaum (Member, Author) commented May 23, 2024

Thanks for clarifying! We don't test or build wheels for Windows ARM64 right now, but I think we might need to support _M_ARM64 (see e.g. #22530). According to the MSVC docs, ARM64EC sets _M_X64.

I see pyatomic_msc.h has:

static inline uint64_t
_Py_atomic_load_uint64(const uint64_t *obj)
{
#if defined(_M_X64) || defined(_M_IX86)
    return *(volatile uint64_t *)obj;
#elif defined(_M_ARM64)
    return (uint64_t)__ldar64((unsigned __int64 volatile *)obj);
#else
#  error "no implementation of _Py_atomic_load_uint64"
#endif
}

And storing is implemented using e.g. _InterlockedCompareExchange64 from intrin.h.

That all seems reasonable to include in NumPy to support this.
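As a rough sketch of that approach (modeled on pyatomic_msc.h but simplified; my_atomic_store_uint64 is a hypothetical name), an atomic store can be built from an interlocked exchange whose return value is discarded:

#include <intrin.h>
#include <stdint.h>

static inline void
my_atomic_store_uint64(uint64_t *obj, uint64_t value)
{
    /* _InterlockedExchange64 is a full-barrier atomic swap available
       on x64 and ARM64; discarding the returned old value makes it
       act as an atomic store. */
    (void)_InterlockedExchange64((volatile __int64 *)obj, (__int64)value);
}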

Even with the experimental atomics support in MSVC 2022, they still set __STDC_NO_ATOMICS__, so we'd need to go out of our way to detect and use it.

Labels: 01 - Enhancement · 39 - no-GIL (PRs and issues related to support for free-threaded CPython, a.k.a. no-GIL, PEP 703)