Change the layout of PyArray_Descr wrt. structured dtypes? #19735
Comments
It's perhaps worth remembering that tuple unpacking is essentially free, and the memory layout is just the same as the struct you propose, apart from the extra PyObject header. The point about dicts is salient, though.
Fair point about tuple unpacking, but I guess PyLong_AsSsize_t (converting the offset from a Python int to a C ssize_t) is not (well, it's probably not that expensive, but it's still a call into libpython; at least it does show up in my profiling...). See https://github.com/python/cpython/blob/ae5259171b8ef62165e061b9dea7ad645a5131a2/Objects/longobject.c#L484
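To make the object layout under discussion concrete at the Python level (the dtype here is illustrative, not taken from the thread): each value in the fields dict is a tuple whose offset is a Python int, which is what the C side has to convert on every access.

```python
import numpy as np

dt = np.dtype([("a", "i4"), ("b", "f8")])
# Each value in dt.fields is a (dtype, offset) tuple.  Unpacking the tuple
# is cheap at the Python level, but at the C level the offset is a Python
# int that must go through PyLong_AsSsize_t on every access.
descr, offset = dt.fields["b"]
print(descr, offset)
```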
Yeah, this would be a breaking ABI change, so I doubt it will be possible – we would be relying on the few users to be fine with us breaking them. I may consider that we can assume that nobody mutates the dict (because that would be awful), so you might get away with providing a more efficient access pattern. And of course we could probably (and arguably ideally would) slowly deprecate it and replace it with structured-dtype-specific API. I am not motivated to try any of this mid-term, though.

If you want to look into the dictionary point: even if you need to "unpack by name", you could do what argument parsing used to do (before FASTCALL, and if you have "few" names): iterate the dict and match against all names directly in a for loop. That should be a lot faster for fairly few fields, at least.
I had a quick look at the dict iteration option. Unfortunately, it fails in the presence of dtype field titles, because those also appear in the ->fields dict :-(
Hmm, true. Although, you may be able to ignore them
I think you cut off mid-sentence.
Sorry, just one verb too many ;), but let me explain (without double-checking). If titles are used, I believe the field tuple is always of length 3 and includes the actual title, I think? So if
No, the triple appears when keying off either the name or the title:
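A reconstruction of what the elided example presumably showed (my sketch, not the original snippet): with a title, the same (dtype, offset, title) triple is stored under both the name key and the title key.

```python
import numpy as np

# A dtype with one field named "a" and titled "A": the same
# (dtype, offset, title) triple appears in dt.fields under both keys.
dt = np.dtype({"names": ["a"], "formats": ["i4"], "titles": ["A"]})
print(sorted(dt.fields))
print(dt.fields["a"] == dt.fields["A"])
```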
You could compare whether the key equals the metadata, but you could also have a pathological case where the field name and the metadata are equal...
I don't understand; that does not sound right to me. The C code e.g. has this (somewhat weird) code:
we obviously assume that this will practically always work, so the last entry (if it exists) is always the title. So if
TBH I haven't tried, but IIUC you propose doing something like (at the C-level):
? Or do you propose some other loop?
Yes, that is what I propose.
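In Python terms, the proposed loop might look something like this (a sketch; iter_real_fields is a hypothetical helper mirroring the C-level PyDict_Next iteration, and the skip condition is the "key equals the last tuple entry" test discussed above):

```python
import numpy as np

def iter_real_fields(dt):
    # Skip entries keyed by a title: when titles are used, the stored
    # tuple has length 3 and its last element is the title, so a key
    # equal to that last element is a title entry, not a name entry.
    for key, tup in dt.fields.items():
        if len(tup) == 3 and key == tup[2]:
            continue
        yield key, tup[0], tup[1]

dt = np.dtype({"names": ["a", "b"], "formats": ["i4", "f8"],
               "titles": ["A", "B"]})
print(list(iter_real_fields(dt)))
```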
True, but IIRC that is invalid.
Sorry for insisting here: you are saying that a field having the same name and title is invalid? Is this limitation documented somewhere?
No idea if it is documented somewhere, but:
or
Ah, great, thanks.
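As discussed just above, numpy rejects a field whose title equals its name; a quick hedged demonstration at the Python level (the exact exception message is not checked here):

```python
import numpy as np

# A (title, name) field spec with title == name is rejected by the
# dtype constructor -- the "invalid" case discussed above.
try:
    np.dtype([(("a", "a"), "i4")])
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```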
OK, here's a quick patch that implements the "iterate over fields" approach (I can post it as a draft PR if you prefer), and it passes all the tests:

diff --git c/numpy/core/src/multiarray/arraytypes.c.src w/numpy/core/src/multiarray/arraytypes.c.src
index b3ea7544d..29f27ddfe 100644
--- c/numpy/core/src/multiarray/arraytypes.c.src
+++ w/numpy/core/src/multiarray/arraytypes.c.src
@@ -790,34 +790,51 @@ VOID_getitem(void *input, void *vap)
NPY_NO_EXPORT int PyArray_CopyObject(PyArrayObject *, PyObject *);
-/* Given a structured PyArrayObject arr, index i and structured datatype descr,
- * modify the dtype of arr to contain a single field corresponding to the ith
- * field of descr, recompute the alignment flag, and return the offset of the
- * field (in offset_p). This is useful in preparation for calling copyswap on
- * individual fields of a numpy structure, in VOID_setitem. Compare to inner
- * loops in VOID_getitem and VOID_nonzero.
+/* Given a structured PyArrayObject arr, dict internal pointer ppos, and
+ * structured datatype descr, modify the dtype of arr to contain a single
+ * field corresponding to the ppos field of descr (as defined by PyDict_Next),
+ * recompute the alignment flag, and return the offset of the field (in
+ * offset_p). Entries in descr->fields corresponding to field titles are
+ * skipped.
+ *
+ * This is useful in preparation for calling copyswap on individual fields of
+ * a numpy structure, in VOID_setitem. Compare to inner loops in VOID_getitem
+ * and VOID_nonzero.
+ *
+ * Returns whether a field was correctly loaded. This function does not
+ * distinguish between end-of-iteration and field parsing errors, so the number
+ * of iterations should be controlled by an external for loop.
*
* WARNING: Clobbers arr's dtype and alignment flag, should not be used
* on the original array!
*/
NPY_NO_EXPORT int
-_setup_field(int i, PyArray_Descr *descr, PyArrayObject *arr,
- npy_intp *offset_p, char *dstdata)
+_setup_next_field(Py_ssize_t *ppos, PyArray_Descr *descr, PyArrayObject *arr,
+ npy_intp *offset_p, char *dstdata)
{
- PyObject *key;
- PyObject *tup;
- PyArray_Descr *new;
+ PyObject *name, *tup;
+ PyArray_Descr *newdescr;
npy_intp offset;
- key = PyTuple_GET_ITEM(descr->names, i);
- tup = PyDict_GetItem(descr->fields, key);
- if (_unpack_field(tup, &new, &offset) < 0) {
- return -1;
+ while (1) {
+ if (!PyDict_Next(descr->fields, ppos, &name, &tup)) {
+ return 0;
+ }
+ if (PyTuple_GET_SIZE(tup) < 2) {
+ return 0;
+ }
+ if ((PyTuple_GET_SIZE(tup) == 3)
+ && PyObject_RichCompareBool(name, PyTuple_GET_ITEM(tup, 2), Py_EQ)) {
+ continue; // This is a title entry, skip it.
+ }
+ break;
}
+ newdescr = (PyArray_Descr *)PyTuple_GET_ITEM(tup, 0);
+ offset = PyLong_AsSsize_t(PyTuple_GET_ITEM(tup, 1));
- ((PyArrayObject_fields *)(arr))->descr = new;
- if ((new->alignment > 1) &&
- ((((uintptr_t)dstdata + offset) % new->alignment) != 0)) {
+ ((PyArrayObject_fields *)(arr))->descr = newdescr;
+ if ((newdescr->alignment > 1) &&
+ ((((uintptr_t)dstdata + offset) % newdescr->alignment) != 0)) {
PyArray_CLEARFLAGS(arr, NPY_ARRAY_ALIGNED);
}
else {
@@ -825,7 +842,7 @@ _setup_field(int i, PyArray_Descr *descr, PyArrayObject *arr,
}
*offset_p = offset;
- return 0;
+ return 1;
}
/* Helper function for VOID_setitem, which uses the copyswap or casting code to
@@ -838,14 +855,14 @@ _copy_and_return_void_setitem(PyArray_Descr *dstdescr, char *dstdata,
PyArrayObject *dummy_arr = (PyArrayObject *)&dummy_struct;
npy_int names_size = PyTuple_GET_SIZE(dstdescr->names);
npy_intp offset;
- npy_int i;
+ int i;
+ Py_ssize_t pos = 0;
int ret;
/* Fast path if dtypes are equal */
if (PyArray_EquivTypes(srcdescr, dstdescr)) {
for (i = 0; i < names_size; i++) {
- /* neither line can ever fail, in principle */
- if (_setup_field(i, dstdescr, dummy_arr, &offset, dstdata)) {
+ if (!_setup_next_field(&pos, dstdescr, dummy_arr, &offset, dstdata)) {
return -1;
}
PyArray_DESCR(dummy_arr)->f->copyswap(dstdata + offset,
@@ -910,11 +927,14 @@ VOID_setitem(PyObject *op, void *input, void *vap)
PyArrayObject_fields dummy_fields = get_dummy_stack_array(ap);
PyArrayObject *dummy_arr = (PyArrayObject *)&dummy_fields;
+ Py_ssize_t pos = 0;
for (i = 0; i < names_size; i++) {
PyObject *item;
- if (_setup_field(i, descr, dummy_arr, &offset, ip) == -1) {
+ if (!_setup_next_field(&pos, descr, dummy_arr, &offset, ip)) {
+ /* The for loop controls the number of iterations, so this
+ * should never signal end-of-iteration. */
failed = 1;
break;
}
@@ -936,10 +956,12 @@ VOID_setitem(PyObject *op, void *input, void *vap)
PyArrayObject_fields dummy_fields = get_dummy_stack_array(ap);
PyArrayObject *dummy_arr = (PyArrayObject *)&dummy_fields;
+ Py_ssize_t pos = 0;
for (i = 0; i < names_size; i++) {
/* temporarily make ap have only this field */
- if (_setup_field(i, descr, dummy_arr, &offset, ip) == -1) {
+ if (!_setup_next_field(&pos, descr, dummy_arr, &offset, ip)) {
+ /* see comment above re: _setup_next_field */
failed = 1;
break;
} Strangely enough, the benchmarks are all over the place (though with more slowdowns than speedups...):
Do you have any idea why so many benchmarks that (apparently) do not deal with structured dtypes are affected that much?
Nope, still waiting for someone to tell me :). My best guess is that we should be using profile-guided optimization to avoid random changes in compiler optimization. Especially
OK, feel free to close this if you don't think it's going to go anywhere, although it may still be worth revisiting this in the long run (fwiw I think structured dtypes are great (I tend to stay away from pandas) and would be happy to shave as much overhead as possible from them, even independently of the loadtxt() use case).
Sounds like a good idea, but also sounds like a fair bit of engineering work on the build side :/ |
Well, the issue probably has a point, although I guess if this idea were useful the

The one other thing I could imagine is caching the preparation work and storing it on the descriptor (since we can add information so long as we still keep the dictionary around). That might work for the casting machinery (and also
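The caching idea can be sketched at the Python level (this is a hypothetical illustration of the concept, not numpy API; _field_cache and unpacked_fields are invented names):

```python
import numpy as np

_field_cache = {}  # hypothetical cache keyed by the dtype; not a numpy feature

def unpacked_fields(dt):
    # Cache the unpacked (name, dtype, offset) triples for a structured
    # dtype, so that repeated per-field loops avoid the dict lookups and
    # tuple unpacking.  np.dtype objects are hashable, so they can serve
    # as cache keys.
    try:
        return _field_cache[dt]
    except KeyError:
        triples = [(name,) + dt.fields[name][:2] for name in dt.names]
        _field_cache[dt] = triples
        return triples

dt = np.dtype([("a", "i4"), ("b", "f8")])
print(unpacked_fields(dt))
```

At the C level the analogue would be extra (non-ABI-visible) storage on the descriptor, populated lazily, while the fields dict is kept for compatibility.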
Not sure it should be hard, but there is a problem in gcc 9-11, so you may have to use gcc 8 or a dev branch (gcc 10 and 11 are fixed, but only very recently, so I am not sure there is a release with the fix yet). I will probably only bother trying it once debian gives me a default gcc with the fix/workaround.
Yes, but that would probably have been just a couple of percent at most; it looks like the compiler just moved stuff around at random and shifted performance everywhere by much more :/ So I still have no idea whether this can really be made to matter for loadtxt or not.
I don't have the profiles at hand right now, but I definitely saw _setup_field show up quite a bit in them, and IIRC numpy/numpy/core/src/multiarray/arraytypes.c.src, lines 897 to 932 in c0d8696,
is basically the loop we're trying to optimize for the new loadtxt implementation.
Yes, caching the alignment info (numpy/numpy/core/src/multiarray/arraytypes.c.src, lines 819 to 820 in c0d8696;
modulo is "slow" :/) would be another thing to do.
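Since type alignments are powers of two for the basic numpy types, one cheap alternative to the modulo is a mask (a sketch of the idea, not what any patch here does; is_aligned is an invented helper):

```python
# For power-of-two align: addr % align == addr & (align - 1),
# and the mask form avoids a division instruction.
def is_aligned(addr, alignment):
    return addr & (alignment - 1) == 0

print(is_aligned(16, 8), is_aligned(12, 8), is_aligned(12, 4))
```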
Not to give you more work, but I played a bit with cpython recently and
Hmmm, the modulo thing may be silly there. Except for
OK, I tried the following patch, which passes all tests, and only checks the alignment flag where it is actually needed:

diff --git c/numpy/core/src/multiarray/arraytypes.c.src w/numpy/core/src/multiarray/arraytypes.c.src
index b3ea7544d..681dd9ce7 100644
--- c/numpy/core/src/multiarray/arraytypes.c.src
+++ w/numpy/core/src/multiarray/arraytypes.c.src
@@ -792,7 +792,7 @@ NPY_NO_EXPORT int PyArray_CopyObject(PyArrayObject *, PyObject *);
/* Given a structured PyArrayObject arr, index i and structured datatype descr,
* modify the dtype of arr to contain a single field corresponding to the ith
- * field of descr, recompute the alignment flag, and return the offset of the
+ * field of descr, clear the alignment flag, and return the offset of the
* field (in offset_p). This is useful in preparation for calling copyswap on
* individual fields of a numpy structure, in VOID_setitem. Compare to inner
* loops in VOID_getitem and VOID_nonzero.
@@ -816,14 +816,7 @@ _setup_field(int i, PyArray_Descr *descr, PyArrayObject *arr,
}
((PyArrayObject_fields *)(arr))->descr = new;
- if ((new->alignment > 1) &&
- ((((uintptr_t)dstdata + offset) % new->alignment) != 0)) {
- PyArray_CLEARFLAGS(arr, NPY_ARRAY_ALIGNED);
- }
- else {
- PyArray_ENABLEFLAGS(arr, NPY_ARRAY_ALIGNED);
- }
-
+ PyArray_CLEARFLAGS(arr, NPY_ARRAY_ALIGNED);
*offset_p = offset;
return 0;
}
@@ -836,6 +829,7 @@ _copy_and_return_void_setitem(PyArray_Descr *dstdescr, char *dstdata,
PyArray_Descr *srcdescr, char *srcdata){
PyArrayObject_fields dummy_struct;
PyArrayObject *dummy_arr = (PyArrayObject *)&dummy_struct;
+ PyArray_Descr *new;
npy_int names_size = PyTuple_GET_SIZE(dstdescr->names);
npy_intp offset;
npy_int i;
@@ -848,8 +842,12 @@ _copy_and_return_void_setitem(PyArray_Descr *dstdescr, char *dstdata,
if (_setup_field(i, dstdescr, dummy_arr, &offset, dstdata)) {
return -1;
}
- PyArray_DESCR(dummy_arr)->f->copyswap(dstdata + offset,
- srcdata + offset, 0, dummy_arr);
+ new = PyArray_DESCR(dummy_arr);
+ if ((new->alignment == 1)
+ || ((((uintptr_t)dstdata + offset) % new->alignment) == 0)) {
+ PyArray_ENABLEFLAGS(dummy_arr, NPY_ARRAY_ALIGNED);
+ }
+ new->f->copyswap(dstdata + offset, srcdata + offset, 0, dummy_arr);
}
return 0;
}

(To be clear, on linux the benchmarks went from the earlier numbers to a rather different set of results, which means that likely different compiler optimizations again made a mess of the comparison.)
Going to close this. Changing the ABI is not on the table, and I think this needs a clear proposal that keeps the ABI rather than an open issue. Plus, the original motivation should be mostly gone with the loadtxt speedups. Basically, there may be some nice things doable here, but I don't think there is much use in having an open TODO item about it.
Feature
I have recently posted a series of PRs that speed up np.loadtxt() many times over; starting from #19687, some of the speedup comes from directly assigning tuples to a structured array (see that PR for a long description), which implicitly assigns each tuple element to the corresponding field in the structured array. It turns out that, when profiling the code after #19734, this assignment process now represents a significant part of the runtime. Specifically, around 10% of the runtime is spent in
numpy/numpy/core/src/multiarray/arraytypes.c.src, line 804 in 158159d;
numpy/numpy/core/src/multiarray/arraytypes.c.src, line 813 in 158159d; and
numpy/numpy/core/src/multiarray/common.c, lines 334 to 345 in 158159d.
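For context, the hot per-field loop is exercised by ordinary tuple-to-structured-array assignment; the values below are illustrative, not from the thread:

```python
import numpy as np

# Assigning a tuple to one element of a structured array implicitly
# assigns each tuple element to the matching field, one field at a time.
dt = np.dtype([("x", "i4"), ("y", "f8")])
arr = np.empty(2, dtype=dt)
arr[0] = (1, 2.5)
arr[1] = (3, 4.5)
print(arr["x"].tolist(), arr["y"].tolist())
```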
For reference, the structured dtype information in PyArray_Descr is currently stored in the names and fields members (see the PyArray_Descr docs).
Because fields is a Python dict storing Python tuples, we have to pay for Python dict lookups and, even worse, for unpacking Python tuples of ints (in _unpack_field). It would thus perhaps be better (for the performance of struct arrays in general, and of loadtxt in particular) to switch this info to something like a struct-based layout (I don't claim this is the best possibility), which would make iteration over fields much faster on the C side (avoiding a dict lookup when accessing by field index, and in all cases avoiding having to unpack the (descr, offset) tuple). Keeping fields would simply avoid making dtype.fields slower (the alternative being to dynamically regenerate the mappingproxy on attribute access), but it is otherwise unneeded on the C side.

Does that sound like a reasonable proposal? Are there any ABI guarantees on the layout of PyArray_Descr that would prevent something along these lines?