
Change the layout of PyArray_Descr wrt. structured dtypes? #19735

Closed
anntzer opened this issue Aug 23, 2021 · 22 comments

@anntzer
Contributor

anntzer commented Aug 23, 2021

Feature

I have recently posted a series of PRs that speed up np.loadtxt() many times over; starting from #19687, some of the speedup comes from directly assigning tuples to a structured array (see that last PR for a long description), which implicitly assigns each tuple element to the corresponding field in the structured array. It turns out that, when profiling the code after #19734, this assignment process now represents a significant part of the runtime. Specifically, around 10% of the runtime is spent in

_setup_field(int i, PyArray_Descr *descr, PyArrayObject *arr,
including around 2.5% in each of
tup = PyDict_GetItem(descr->fields, key);
and
/**
* unpack tuple of dtype->fields (descr, offset, title[not-needed])
*
* @param "value" should be the tuple.
*
* @return "descr" will be set to the field's dtype
* @return "offset" will be set to the field's offset
*
* returns -1 on failure, 0 on success.
*/
NPY_NO_EXPORT int
_unpack_field(PyObject *value, PyArray_Descr **descr, npy_intp *offset)

For reference, the structured dtype information in PyArray_Descr is currently stored in the following fields:

PyObject *fields
If this is non-NULL, then this data-type-descriptor has fields described by a Python dictionary whose keys are names (and also titles if given) and whose values are tuples that describe the fields. Recall that a data-type-descriptor always describes a fixed-length set of bytes. A field is a named sub-region of that total, fixed-length collection. A field is described by a tuple composed of another data-type-descriptor and a byte offset. Optionally, the tuple may contain a title which is normally a Python string. These tuples are placed in this dictionary keyed by name (and also title if given).

PyObject *names
An ordered tuple of field names. It is NULL if no field is defined.

(copy-pasted from the docs).
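For a concrete picture, here is what names and fields look like on a small structured dtype (a minimal illustration with explicitly sized types so the offsets are deterministic):

```python
import numpy as np

# A two-field structured dtype: a 4-byte int followed by an 8-byte float.
dt = np.dtype([("x", "<i4"), ("y", "<f8")])

# ->names is an ordered tuple of field names.
print(dt.names)         # ('x', 'y')

# ->fields maps each name to a (sub-dtype, byte-offset) tuple,
# which is exactly what _unpack_field pulls apart on the C side.
print(dict(dt.fields))  # {'x': (dtype('int32'), 0), 'y': (dtype('float64'), 4)}
```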

Because fields is a Python dict storing Python tuples, we have to pay for Python dict lookups and, even worse, for unpacking Python tuples of ints (in _unpack_field). It would thus perhaps be better (for the performance of structured arrays in general, and of loadtxt in particular) to switch this info to something like, for example (I don't claim this is the best possibility),

PyObject *fields;  // same as previously.
struct { PyObject *name; PyArray_Descr *subfield_fmt; npy_intp offset; } *native_fields;  // array, one entry per field
PyObject *name_to_index_dict;

which would make iteration over fields much faster on the C side (avoiding a dict lookup if accessing by field index, and in all cases avoiding having to unpack the (descr, offset) tuple). Keeping fields would simply avoid making dtype.fields slower (the alternative being to dynamically regenerate the mappingproxy on attribute access), but is otherwise unneeded on the C side.

Does that sound like a reasonable proposal? Are there any ABI guarantees on the layout of PyArray_Descr, which would prevent something along these lines?

@eric-wieser
Member

It's perhaps worth remembering that tuple unpacking is essentially free, and the memory layout is just the same as the struct you propose apart from the extra PyObject header. The point about dicts is salient though.

@anntzer
Contributor Author

anntzer commented Aug 23, 2021

Fair point about tuple unpacking, but I guess PyLong_AsSsize_t (converting the offset from a Python int to a C ssize_t) is not (well, it's probably not that expensive, but it's still a call into libpython; at least it does show up in my profiling...). See https://github.com/python/cpython/blob/ae5259171b8ef62165e061b9dea7ad645a5131a2/Objects/longobject.c#L484

@seberg
Member

seberg commented Aug 23, 2021

Does that sound like a reasonable proposal? Are there any ABI guarantees on the layout of PyArray_Descr, which would prevent something along these lines?

Yeah, this would be a breaking ABI change, so I doubt it will be possible – we would be relying on the few users being fine with us breaking them. We may be able to assume that nobody mutates the dict (because that would be awful), so you might get away with providing a more efficient access pattern. And of course we can probably (and arguably should ideally) deprecate it slowly and replace it with structured-dtype-specific API.

I am not motivated to try any of this mid-term, though. If you want to look into the dictionary point, it might well be faster to iterate the dict rather than look fields up by name.

Even if you need to "unpack by name" you could do what argument parsing used to do (before FASTCALL, and if you have "few" names): iterate the dict and match against all names directly in a for loop. That should be a lot faster, for fairly few fields at least.
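In pure Python, the pattern described here (a single pass over the dict, matching keys against the wanted names, instead of one hashed lookup per name) could be sketched roughly like this; the dict contents below are just a stand-in mimicking descr->fields, not real NumPy objects:

```python
# Hypothetical stand-in for descr->fields: name -> (sub-dtype, byte offset).
fields = {"x": ("int32", 0), "y": ("float64", 4), "z": ("int16", 12)}
names = ("x", "y", "z")  # the expected field order, as in descr->names

def offsets_by_iteration(fields, names):
    """One pass over the dict; compare each key against all wanted names."""
    out = [None] * len(names)
    for key, (subdtype, offset) in fields.items():
        for i, name in enumerate(names):
            if key == name:   # on the C side this can start as a pointer compare
                out[i] = offset
                break
    return out

print(offsets_by_iteration(fields, names))  # [0, 4, 12]
```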

@anntzer
Contributor Author

anntzer commented Aug 23, 2021

I had a quick look at the dict iteration option. Unfortunately, it fails in the presence of dtype field titles, because those also appear in the ->fields dict :-(

@seberg
Member

seberg commented Aug 23, 2021

Hmm, true. Although, you may be able to ignore them using by checking if the result and key.

@anntzer
Contributor Author

anntzer commented Aug 23, 2021

I think you cut off mid-sentence.

@seberg
Member

seberg commented Aug 23, 2021

Sorry, just a verb too much ;), but let me explain (without double checking). If titles are used, I believe the field tuple is always length 3 and includes the actual title, I think? So if key is field[2], the entry belongs to a title rather than a name. But it has been a bit since I fought with this, so I may well be wrong.

@anntzer
Contributor Author

anntzer commented Aug 23, 2021

No, the triple appears when keying off either the name or the title:

$ python -c 'import numpy as np; print(np.dtype([(("name", "meta"), int)]).fields)'
{'meta': (dtype('int64'), 0, 'name'), 'name': (dtype('int64'), 0, 'name')}

You could compare whether the key equals the metadata, but you could also have a pathological case where the field name and the metadata are equal...

@seberg
Member

seberg commented Aug 23, 2021

but you could also have a pathological case where the field name and the metadata are equal...

I don't understand; that does not sound right to me. The C code, e.g., has this (somewhat weird) snippet:

if (!PyArg_ParseTuple(value, "Oi|O", &new, &offset, &title)) {

we obviously assume that this will practically always work, so the last entry (if it exists) is always the title. So checking key == title will skip "by title" entries.

@anntzer
Contributor Author

anntzer commented Aug 23, 2021

TBH I haven't tried, but IIUC you propose doing something like (at the C-level):

for name, tuple in descr->fields.items():
    if len(tuple) == 3 and name == tuple[-1]: continue
    # ok, now we only have real fields
    # do whatever we want...

?
In that case if a field has the same name and title then it will be incorrectly skipped.

Or do you propose some other loop?

@seberg
Member

seberg commented Aug 23, 2021

Yes, that is what I propose.

if a field has the same name and title then it will be incorrectly skipped

True, but IIRC that is invalid.

@anntzer
Contributor Author

anntzer commented Aug 23, 2021

Sorry for insisting here: You are saying that a field having the same name and title is invalid? Is this limitation documented somewhere?

@seberg
Member

seberg commented Aug 23, 2021

No idea if it is documented somewhere, but:

In [1]: np.dtype([(("a", "a"), "f")])
ValueError: title already used as a name or title.

or

In [1]: np.dtype([(("a", "b"), "f"), ("b", "f")])
ValueError: field 'b' occurs more than once

@anntzer
Contributor Author

anntzer commented Aug 23, 2021

Ah, great, thanks.

@anntzer
Contributor Author

anntzer commented Aug 23, 2021

OK, here's a quick patch that implements the "iterate over fields" approach (I can post it as a draft PR if you prefer); it passes all the tests:

diff --git c/numpy/core/src/multiarray/arraytypes.c.src w/numpy/core/src/multiarray/arraytypes.c.src
index b3ea7544d..29f27ddfe 100644
--- c/numpy/core/src/multiarray/arraytypes.c.src
+++ w/numpy/core/src/multiarray/arraytypes.c.src
@@ -790,34 +790,51 @@ VOID_getitem(void *input, void *vap)
 
 NPY_NO_EXPORT int PyArray_CopyObject(PyArrayObject *, PyObject *);
 
-/* Given a structured PyArrayObject arr, index i and structured datatype descr,
- * modify the dtype of arr to contain a single field corresponding to the ith
- * field of descr, recompute the alignment flag, and return the offset of the
- * field (in offset_p). This is useful in preparation for calling copyswap on
- * individual fields of a numpy structure, in VOID_setitem.  Compare to inner
- * loops in VOID_getitem and VOID_nonzero.
+/* Given a structured PyArrayObject arr, dict internal pointer ppos, and
+ * structured datatype descr, modify the dtype of arr to contain a single
+ * field corresponding to the ppos field of descr (as defined by PyDict_Next),
+ * recompute the alignment flag, and return the offset of the field (in
+ * offset_p).  Entries in descr->fields corresponding to field titles are
+ * skipped.
+ *
+ * This is useful in preparation for calling copyswap on individual fields of
+ * a numpy structure, in VOID_setitem.  Compare to inner loops in VOID_getitem
+ * and VOID_nonzero.
+ *
+ * Returns whether a field was correctly loaded.  This function does not
+ * distinguish between end-of-iteration and field parsing errors, so the number
+ * of iterations should be controlled by an external for loop.
  *
  * WARNING: Clobbers arr's dtype and alignment flag, should not be used
  *          on the original array!
  */
 NPY_NO_EXPORT int
-_setup_field(int i, PyArray_Descr *descr, PyArrayObject *arr,
-            npy_intp *offset_p, char *dstdata)
+_setup_next_field(Py_ssize_t *ppos, PyArray_Descr *descr, PyArrayObject *arr,
+                  npy_intp *offset_p, char *dstdata)
 {
-    PyObject *key;
-    PyObject *tup;
-    PyArray_Descr *new;
+    PyObject *name, *tup;
+    PyArray_Descr *newdescr;
     npy_intp offset;
 
-    key = PyTuple_GET_ITEM(descr->names, i);
-    tup = PyDict_GetItem(descr->fields, key);
-    if (_unpack_field(tup, &new, &offset) < 0) {
-        return -1;
+    while (1) {
+        if (!PyDict_Next(descr->fields, ppos, &name, &tup)) {
+            return 0;
+        }
+        if (PyTuple_GET_SIZE(tup) < 2) {
+            return 0;
+        }
+        if ((PyTuple_GET_SIZE(tup) == 3)
+                && PyObject_RichCompareBool(name, PyTuple_GET_ITEM(tup, 2), Py_EQ)) {
+            continue;  // This is a title entry, skip it.
+        }
+        break;
     }
+    newdescr = (PyArray_Descr *)PyTuple_GET_ITEM(tup, 0);
+    offset = PyLong_AsSsize_t(PyTuple_GET_ITEM(tup, 1));
 
-    ((PyArrayObject_fields *)(arr))->descr = new;
-    if ((new->alignment > 1) &&
-                ((((uintptr_t)dstdata + offset) % new->alignment) != 0)) {
+    ((PyArrayObject_fields *)(arr))->descr = newdescr;
+    if ((newdescr->alignment > 1) &&
+                ((((uintptr_t)dstdata + offset) % newdescr->alignment) != 0)) {
         PyArray_CLEARFLAGS(arr, NPY_ARRAY_ALIGNED);
     }
     else {
@@ -825,7 +842,7 @@ _setup_field(int i, PyArray_Descr *descr, PyArrayObject *arr,
     }
 
     *offset_p = offset;
-    return 0;
+    return 1;
 }
 
 /* Helper function for VOID_setitem, which uses the copyswap or casting code to
@@ -838,14 +855,14 @@ _copy_and_return_void_setitem(PyArray_Descr *dstdescr, char *dstdata,
     PyArrayObject *dummy_arr = (PyArrayObject *)&dummy_struct;
     npy_int names_size = PyTuple_GET_SIZE(dstdescr->names);
     npy_intp offset;
-    npy_int i;
+    int i;
+    Py_ssize_t pos = 0;
     int ret;
 
     /* Fast path if dtypes are equal */
     if (PyArray_EquivTypes(srcdescr, dstdescr)) {
         for (i = 0; i < names_size; i++) {
-            /* neither line can ever fail, in principle */
-            if (_setup_field(i, dstdescr, dummy_arr, &offset, dstdata)) {
+            if (!_setup_next_field(&pos, dstdescr, dummy_arr, &offset, dstdata)) {
                 return -1;
             }
             PyArray_DESCR(dummy_arr)->f->copyswap(dstdata + offset,
@@ -910,11 +927,14 @@ VOID_setitem(PyObject *op, void *input, void *vap)
 
             PyArrayObject_fields dummy_fields = get_dummy_stack_array(ap);
             PyArrayObject *dummy_arr = (PyArrayObject *)&dummy_fields;
+            Py_ssize_t pos = 0;
 
             for (i = 0; i < names_size; i++) {
                 PyObject *item;
 
-                if (_setup_field(i, descr, dummy_arr, &offset, ip) == -1) {
+                if (!_setup_next_field(&pos, descr, dummy_arr, &offset, ip)) {
+                    /* The for loop controls the number of iterations, so this
+                     * should never signal end-of-iteration. */
                     failed = 1;
                     break;
                 }
@@ -936,10 +956,12 @@ VOID_setitem(PyObject *op, void *input, void *vap)
 
             PyArrayObject_fields dummy_fields = get_dummy_stack_array(ap);
             PyArrayObject *dummy_arr = (PyArrayObject *)&dummy_fields;
+            Py_ssize_t pos = 0;
 
             for (i = 0; i < names_size; i++) {
                 /* temporarily make ap have only this field */
-                if (_setup_field(i, descr, dummy_arr, &offset, ip) == -1) {
+                if (!_setup_next_field(&pos, descr, dummy_arr, &offset, ip)) {
+                    /* see comment above re: _setup_next_field */
                     failed = 1;
                     break;
                 }

Strangely enough, the benchmark results are all over the place (though with more slowdowns than speedups...):

       before           after         ratio                                                                                                                                                         
     [85754153]       [a10b003b]                                                                                                                                                                    
     <loadtxt-approx-split-line>       <_wip/fast-structured-dtype-iter>                    
+         176±3μs          324±1μs     1.84  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rad2deg'>, 4, 4, 'd')
+     3.33±0.03μs       4.07±0.6μs     1.22  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'complex128')                                                                              
+      27.1±0.1μs       32.8±0.1μs     1.21  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('divide', 2, 'D')                                                                                  
+      12.3±0.1μs       14.6±0.1μs     1.19  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 10000, <class 'numpy.int8'>)
+      27.9±0.1μs       33.0±0.2μs     1.19  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int16'>)                                                                       
+     18.2±0.09μs         21.0±1μs     1.15  bench_core.CountNonzero.time_count_nonzero_axis(2, 10000, <class 'bool'>)                                                                              
+      31.4±0.1μs       35.9±0.3μs     1.14  bench_core.CountNonzero.time_count_nonzero_axis(3, 10000, <class 'numpy.int32'>)
+     4.50±0.06μs       5.14±0.4μs     1.14  bench_lib.Nan.time_nanmax(200, 0)                
+     2.71±0.01ms      3.07±0.01ms     1.13  bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'numpy.int64'>)                                                                     
+      4.44±0.1μs       5.01±0.3μs     1.13  bench_lib.Nan.time_nanmin(200, 2.0)                 
+      4.06±0.2μs      4.53±0.09μs     1.12  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'numpy.int8'>)                                                                    
+     2.55±0.02ms       2.84±0.2ms     1.11  bench_indexing.IndexingSeparate.time_mmap_fancy_indexing                                                                                               
+      26.2±0.1μs       29.1±0.1μs     1.11  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('divide', 1, 'D')
+     4.48±0.02μs       4.99±0.3μs     1.11  bench_lib.Nan.time_nanmax(200, 0.1)                                                                                                                    
+      14.4±0.6μs       15.9±0.2μs     1.11  bench_core.CountNonzero.time_count_nonzero_axis(1, 10000, <class 'numpy.int32'>)
+     4.12±0.04μs      4.55±0.06μs     1.10  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'numpy.int64'>)
+     4.62±0.05μs       5.09±0.3μs     1.10  bench_lib.Nan.time_nanmax(200, 50.0)                                                                                                                   
+     3.89±0.06μs       4.28±0.2μs     1.10  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'numpy.int32'>)                                                                         
+     3.83±0.04μs       4.21±0.2μs     1.10  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'numpy.int8'>)  
+     4.50±0.06μs       4.95±0.3μs     1.10  bench_lib.Nan.time_nanmin(200, 0)                                                                                                                      
+     5.70±0.05μs      6.24±0.06μs     1.10  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'object'>)       
+     5.60±0.03μs       6.11±0.2μs     1.09  bench_core.CountNonzero.time_count_nonzero_axis(1, 100, <class 'object'>)
+      6.60±0.1μs       7.20±0.3μs     1.09  bench_lib.Nan.time_nansum(200, 0)                                                                                                                      
+     3.87±0.08μs       4.22±0.1μs     1.09  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'bool'>)                                                                          
+     6.68±0.02μs       7.28±0.3μs     1.09  bench_lib.Nan.time_nansum(200, 2.0)                                                                                                                    
+     4.32±0.02μs       4.70±0.2μs     1.09  bench_core.CountNonzero.time_count_nonzero_axis(2, 100, <class 'numpy.int64'>)
+     4.12±0.07μs       4.49±0.1μs     1.09  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'numpy.int16'>)                                                                   
+     4.38±0.06μs       4.76±0.2μs     1.09  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int64'>)                                                                   
+     4.68±0.02μs       5.07±0.2μs     1.08  bench_lib.Nan.time_nanmax(200, 90.0)                 
+     4.08±0.08μs      4.41±0.02μs     1.08  bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'numpy.int32'>)                                                                   
+      11.2±0.1μs       12.1±0.1μs     1.08  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('add', 4, 'F')                                                                                     
+     4.48±0.07μs       4.83±0.2μs     1.08  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 100, <class 'numpy.int8'>)                                                                    
+     6.76±0.02μs       7.27±0.3μs     1.07  bench_lib.Nan.time_nansum(200, 90.0)
+     24.6±0.09μs       26.5±0.1μs     1.07  bench_ma.UFunc.time_scalar_1d(False, True, 10)                                                                                                         
+     2.36±0.01μs      2.53±0.01μs     1.07  bench_scalar.ScalarMath.time_addition_pyint('int32')                                                                                                   
+     4.36±0.08μs      4.67±0.06μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int16'>)    
+     4.52±0.03μs       4.85±0.2μs     1.07  bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'numpy.int64'>)          
+     11.7±0.06μs       12.6±0.6μs     1.07  bench_lib.Nan.time_nanargmax(200, 0.1)                                                                                                                 
+      24.2±0.2μs       26.0±0.4μs     1.07  bench_ma.UFunc.time_scalar_1d(True, False, 100)                                                                                                        
+     4.29±0.02μs       4.59±0.2μs     1.07  bench_core.CountNonzero.time_count_nonzero_axis(2, 100, <class 'numpy.int16'>)          
+     6.93±0.01μs       7.41±0.4μs     1.07  bench_lib.Nan.time_nanprod(200, 50.0)                                                                                                                  
+     4.39±0.09μs      4.69±0.06μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int32'>)    
+      67.9±0.3μs         72.5±3μs     1.07  bench_lib.Nan.time_nanquantile(200, 2.0)                                                                                                               
+     3.84±0.03μs       4.10±0.1μs     1.07  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'bool'>)           
+      19.7±0.1μs       21.0±0.4μs     1.06  bench_ma.UFunc.time_scalar_1d(True, True, 10)                                                                                                          
+        98.0±1μs          104±2μs     1.06  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 1, 'mean')   
+      4.35±0.1μs       4.63±0.1μs     1.06  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'numpy.int8'>)                                                                    
+     20.5±0.07μs       21.7±0.2μs     1.06  bench_ufunc_strides.AVX_cmplx_arithmetic.time_ufunc('multiply', 4, 'D')                                                                                
+      25.2±0.2μs      26.7±0.06μs     1.06  bench_ma.UFunc.time_scalar_1d(False, True, 100)
+       178±0.3ms          188±2ms     1.06  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('bool'), 300000) 
+     17.3±0.06μs       18.2±0.2μs     1.06  bench_ma.UFunc.time_1d(True, True, 10)                                                                                                                 
+     24.4±0.06μs       25.7±0.3μs     1.06  bench_ma.UFunc.time_scalar_1d(True, False, 10)                                                                                                         
+      62.0±0.3ms       65.5±0.3ms     1.06  bench_trim_zeros.TrimZeros.time_trim_zeros(dtype('int64'), 30000)                    
+      29.3±0.2μs       30.9±0.3μs     1.06  bench_ma.UFunc.time_scalar_1d(False, True, 1000)                                                                                                       
+      44.9±0.1μs         47.4±1μs     1.05  bench_lib.Nan.time_nanvar(200, 90.0)                                                                                                                   
+     2.80±0.01μs      2.95±0.05μs     1.05  bench_scalar.ScalarMath.time_addition_pyint('float16')                          
+     1.83±0.02μs      1.92±0.02μs     1.05  bench_core.CorrConv.time_convolve(50, 10, 'same')
+      22.7±0.2μs       23.9±0.2μs     1.05  bench_ma.UFunc.time_scalar_1d(True, True, 1000)                                                                                                        
+     2.36±0.03μs      2.48±0.02μs     1.05  bench_scalar.ScalarMath.time_addition_pyint('int16')
+      17.5±0.1μs       18.4±0.2μs     1.05  bench_ma.UFunc.time_1d(True, True, 100)                                                                                                                
+      22.0±0.2μs       23.1±0.2μs     1.05  bench_ma.UFunc.time_1d(True, False, 100)                                                                                                               
-       114±0.5μs          109±4μs     0.95  bench_indexing.Indexing.time_op('indexes_rand_', ':,I', '')          
-      36.0±0.5μs       33.9±0.3μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sqrt'>, 4, 1, 'f')                                                                                        
-      28.3±0.4μs       26.6±0.4μs     0.94  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'absolute'>, 2, 1, 'f')             
-      38.0±0.1μs       35.3±0.2μs     0.93  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 4, 1, 'f')                     
-        523±10ns          486±3ns     0.93  bench_array_coercion.ArrayCoercionSmall.time_array_dtype_not_kwargs([1])                                                                               
-     2.19±0.01ms      2.00±0.03ms     0.92  bench_lib.Pad.time_pad((1, 1, 1, 1, 1), 8, 'mean')                                                                                                     
-        85.9±5μs       78.1±0.2μs     0.91  bench_function_base.Sort.time_argsort('merge', 'int16', ('sorted_block', 1000))
-         126±4μs          114±2μs     0.91  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 4, 'd')                                                                                        
-      24.0±0.5μs       20.9±0.1μs     0.87  bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 10000, <class 'numpy.int16'>)
-         198±3μs          172±3μs     0.87  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sin'>, 4, 4, 'f')           
-     2.73±0.05ms      2.22±0.01ms     0.81  bench_core.CountNonzero.time_count_nonzero_multi_axis(3, 1000000, <class 'numpy.int8'>)                                                                
-         209±3μs        162±0.3μs     0.77  bench_lib.Pad.time_pad((4, 4, 4, 4), 8, 'constant')                                                                                                    
-         195±3μs          138±1μs     0.71  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 2, 'd')                                                                                        
-         189±1μs          127±2μs     0.67  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'fabs'>, 4, 4, 'f')

Do you have any idea why so many benchmarks which (apparently) do not deal with structured dtypes are affected that much?

@seberg
Member

seberg commented Aug 23, 2021

Do you have any idea why so many benchmarks which (apparently) do not deal with structured dtypes are affected that much?

Nope, still waiting for someone to tell me :). My best guess is that we should be using profile-guided optimizations to avoid random changes in compiler optimization. Especially count_nonzero is notorious.

@anntzer
Contributor Author

anntzer commented Aug 23, 2021

Nope, still waiting for someone to tell me :)

OK, feel free to close this if you don't think it's going to go anywhere, although it may still be worth revisiting in the long run (fwiw, I think structured dtypes are great (I tend to stay away from pandas) and would be happy to shave as much overhead as possible off them, even independently of the loadtxt() use case).

we should be using profile-guided optimizations to avoid random changes in compiler optimization

Sounds like a good idea, but also sounds like a fair bit of engineering work on the build side :/

@seberg
Member

seberg commented Aug 23, 2021

Well, the issue probably has a point, although if this idea were useful, the IO benchmarks should see at least some positive change in timing. I guess you checked that VOID_setitem is called in your case?

The one other thing I could imagine is caching the preparation work and storing it on the descriptor (since we can add information as long as we still keep the dictionary around). That might work for the casting machinery (and also setitem, once "upgraded" to the new API). But it may also require some care, since we allow some mutability :( – although I am not sure whether that is only for names.

Sounds like a good idea, but also sounds like a fair bit of engineering work on the build side :/

Not sure it should be hard, but there is a problem with PGO in gcc 9–11, so you may have to use gcc 8 or a dev branch (the gcc 10 and 11 branches are fixed, but only very recently, so I am not sure there is a release with the fix yet). I will probably bother trying it as soon as Debian gives me a default gcc with the fix/workaround.

@anntzer
Contributor Author

anntzer commented Aug 23, 2021

the io benchmarks should see at least some positive change in timing.

Yes, but that would probably have been just a couple of percent at most; it looks like the compiler just moved stuff around at random and shifted performance everywhere by much more :/ So I still have no idea whether this can really be made to matter for loadtxt or not.

I guess you checked that VOID_setitem is called in your case?

I don't have the profiles at hand right now, but I absolutely saw _setup_field show up quite a bit in them, and IIRC

else if (PyTuple_Check(op)) {
    /* if it's a tuple, copy field-by-field to ap, */
    npy_intp names_size = PyTuple_GET_SIZE(descr->names);
    if (names_size != PyTuple_Size(op)) {
        errmsg = PyUnicode_FromFormat(
                "could not assign tuple of length %zd to structure "
                "with %" NPY_INTP_FMT " fields.",
                PyTuple_Size(op), names_size);
        PyErr_SetObject(PyExc_ValueError, errmsg);
        Py_DECREF(errmsg);
        return -1;
    }
    PyArrayObject_fields dummy_fields = get_dummy_stack_array(ap);
    PyArrayObject *dummy_arr = (PyArrayObject *)&dummy_fields;
    for (i = 0; i < names_size; i++) {
        PyObject *item;
        if (_setup_field(i, descr, dummy_arr, &offset, ip) == -1) {
            failed = 1;
            break;
        }
        item = PyTuple_GetItem(op, i);
        if (item == NULL) {
            failed = 1;
            break;
        }
        /* use setitem to set this field */
        if (PyArray_SETITEM(dummy_arr, ip + offset, item) < 0) {
            failed = 1;
            break;
        }
    }
}

is basically the loop we're trying to optimize for the new loadtxt implementation.
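For context, the Python-level operation that drives that loop is a tuple being assigned to one element of a structured array (a minimal reproduction, not the actual loadtxt code):

```python
import numpy as np

dt = np.dtype([("x", "<i4"), ("y", "<f8")])
arr = np.zeros(2, dtype=dt)

# Assigning a tuple to a structured element dispatches each tuple item
# to the corresponding field (the VOID_setitem / _setup_field path above).
arr[0] = (1, 2.5)
print(arr[0]["x"], arr[0]["y"])  # 1 2.5
```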

caching the preparation work

Yes, caching the alignment info

if ((new->alignment > 1) &&
((((uintptr_t)dstdata + offset) % new->alignment) != 0)) {

(modulo is "slow" :/) would be another thing to do.
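As an aside (my suggestion, not something from the thread): when the alignment is a power of two, the modulo can be replaced by a bit mask on the C side, avoiding a division. A quick Python check of the equivalence:

```python
# For power-of-two align: addr % align == addr & (align - 1).
# (Parentheses matter in Python: == binds tighter than &.)
for align in (1, 2, 4, 8, 16):
    for addr in range(64):
        assert addr % align == (addr & (align - 1))
print("mask form matches modulo")
```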

I will probably bother trying it as soon as debian gives me a default gcc with the fix/workaround.

Not to give you more work, but I played a bit with cpython recently and ./configure --enable-optimizations automatically doing build + testsuite->profile + build-with-pgo is quite nice.

@seberg
Member

seberg commented Aug 23, 2021

Hmmm, the modulo thing may be silly there. Except for copyswapn, it is not worthwhile to even figure out the alignment (just assume unaligned). And I am not even sure we really use copyswapn.

@anntzer
Contributor Author

anntzer commented Aug 24, 2021

OK, I tried the following patch, which passes all tests and only checks the flag if we're going to use copyswap (or does that not need alignment checking either?):

diff --git c/numpy/core/src/multiarray/arraytypes.c.src w/numpy/core/src/multiarray/arraytypes.c.src
index b3ea7544d..681dd9ce7 100644
--- c/numpy/core/src/multiarray/arraytypes.c.src
+++ w/numpy/core/src/multiarray/arraytypes.c.src
@@ -792,7 +792,7 @@ NPY_NO_EXPORT int PyArray_CopyObject(PyArrayObject *, PyObject *);
 
 /* Given a structured PyArrayObject arr, index i and structured datatype descr,
  * modify the dtype of arr to contain a single field corresponding to the ith
- * field of descr, recompute the alignment flag, and return the offset of the
+ * field of descr, clear the alignment flag, and return the offset of the
  * field (in offset_p). This is useful in preparation for calling copyswap on
  * individual fields of a numpy structure, in VOID_setitem.  Compare to inner
  * loops in VOID_getitem and VOID_nonzero.
@@ -816,14 +816,7 @@ _setup_field(int i, PyArray_Descr *descr, PyArrayObject *arr,
     }
 
     ((PyArrayObject_fields *)(arr))->descr = new;
-    if ((new->alignment > 1) &&
-                ((((uintptr_t)dstdata + offset) % new->alignment) != 0)) {
-        PyArray_CLEARFLAGS(arr, NPY_ARRAY_ALIGNED);
-    }
-    else {
-        PyArray_ENABLEFLAGS(arr, NPY_ARRAY_ALIGNED);
-    }
-
+    PyArray_CLEARFLAGS(arr, NPY_ARRAY_ALIGNED);
     *offset_p = offset;
     return 0;
 }
@@ -836,6 +829,7 @@ _copy_and_return_void_setitem(PyArray_Descr *dstdescr, char *dstdata,
                               PyArray_Descr *srcdescr, char *srcdata){
     PyArrayObject_fields dummy_struct;
     PyArrayObject *dummy_arr = (PyArrayObject *)&dummy_struct;
+    PyArray_Descr *new;
     npy_int names_size = PyTuple_GET_SIZE(dstdescr->names);
     npy_intp offset;
     npy_int i;
@@ -848,8 +842,12 @@ _copy_and_return_void_setitem(PyArray_Descr *dstdescr, char *dstdata,
             if (_setup_field(i, dstdescr, dummy_arr, &offset, dstdata)) {
                 return -1;
             }
-            PyArray_DESCR(dummy_arr)->f->copyswap(dstdata + offset,
-                    srcdata + offset, 0, dummy_arr);
+            new = PyArray_DESCR(dummy_arr);
+            if ((new->alignment == 1)
+                    || ((((uintptr_t)dstdata + offset) % new->alignment) == 0)) {
+                PyArray_ENABLEFLAGS(dummy_arr, NPY_ARRAY_ALIGNED);
+            }
+            new->f->copyswap(dstdata + offset, srcdata + offset, 0, dummy_arr);
         }
         return 0;
     }

(To be clear, linux perf explicitly showed the modulo operation by itself taking a few percent (<5%) of the loadtxt benchmark runtime.) Unfortunately, the benchmarks are once again all over the place, ranging from

+       107±0.4μs         165±10μs     1.55  bench_lib.Nan.time_nanmin(200000, 0)
+         109±2μs         167±20μs     1.53  bench_lib.Nan.time_nanmin(200000, 0.1)
+      33.1±0.3μs       49.3±0.4μs     1.49  bench_io.CopyTo.time_copyto_sparse
+     3.35±0.03μs      4.86±0.04μs     1.45  bench_itemselection.Take.time_contiguous((1000, 3), 'clip', 'longfloat')
+     2.90±0.02μs       4.13±0.1μs     1.42  bench_itemselection.Take.time_contiguous((1000, 2), 'raise', 'complex128')
+       137±0.5μs         194±30μs     1.41  bench_lib.Nan.time_nanmin(200000, 2.0)
+      31.4±0.4μs       44.4±0.7μs     1.41  bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 1000))
+      57.8±0.3μs         80.0±1μs     1.38  bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 100))
+         314±7μs          432±1μs     1.37  bench_core.PackBits.time_packbits_axis0(<class 'bool'>)
+      56.4±0.3μs         76.5±3μs     1.36  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'sign'>, 1, 2, 'f')

to

-       124±0.5μs         85.5±3μs     0.69  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'bool'>)
-         187±3μs          128±4μs     0.68  bench_ufunc_strides.Unary.time_ufunc(<ufunc 'degrees'>, 1, 2, 'd')
-      71.5±0.1μs       47.5±0.2μs     0.66  bench_function_base.Sort.time_sort('quick', 'int16', ('uniform',))
-      63.2±0.1μs       41.6±0.4μs     0.66  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'bool'>)
-      63.0±0.2μs       41.1±0.2μs     0.65  bench_core.CountNonzero.time_count_nonzero(1, 1000000, <class 'numpy.int8'>)
-         125±1μs       81.0±0.4μs     0.65  bench_core.CountNonzero.time_count_nonzero(2, 1000000, <class 'numpy.int8'>)
-      1.58±0.4μs      1.02±0.01μs     0.64  bench_itemselection.PutMask.time_sparse(True, 'complex64')
-        192±10μs        122±0.6μs     0.64  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'bool'>)
-        202±40μs        122±0.8μs     0.61  bench_core.CountNonzero.time_count_nonzero(3, 1000000, <class 'numpy.int8'>)
-     7.10±0.03μs      4.06±0.02μs     0.57  bench_io.CopyTo.time_copyto_dense

which means that different compiler optimizations likely made a mess of the comparison again.

seberg commented May 22, 2022

Going to close this. Changing the ABI is not on the table, and I think this needs a clear proposal that keeps the ABI, rather than an open issue. Plus, the original motivation should be mostly gone with the loadtxt speedups.

Basically, there may be some nice things doable here, but I don't think there is a use in having a TODO item about it open.

seberg closed this as completed May 22, 2022