ENH: Adding __array_ufunc__ capability to MaskedArrays (again) #22914

greglucas · 2023-01-03T00:25:29Z

This is a rebase and update of several previous PRs: #21977 and #16022, which were both reverted due to issues downstream. I've kept a bunch of smaller commits showing the updates to comments to try and make it easier for review.

I've tested with Astropy and there are still two failing tests over there, which I don't quite understand yet, so I'm hoping a review might be able to point out some hints for what I'm missing. It looks like we do enter MaskedColumn.__array_wrap__, but with a context=None, so that wrap isn't undoing anything because it doesn't know which function it came from... I'm not sure if this means the update needs to be handled in numpy or astropy.

astropy/units/tests/test_quantity_ufuncs.py::TestQuantityMathFuncs::test_power_array_array3
astropy/table/tests/test_column.py::TestColumn::test_numpy_boolean_ufuncs[MaskedColumn]

cc @rcomer, @seberg, @mhvk since you've all had some great comments and help on the previous PRs!

Summary

This enables any ufunc numpy operations that are called on a MaskedArray to use the masked version of that function automatically without needing to resort to np.ma.func() calls directly.

Example

import numpy as np
a = np.ma.array(np.arange(10), [0, 1] * 5)
x = a < 10
# x.data == [True,  True,  True,  True,  True, False, False, False, False, False]
y = np.ma.less(a, 10)
# y.data == [True,  True,  True,  True,  True,  True, False,  True, False, True]

Even though data values in a are masked, they are still acted upon and evaluated. Changing that underlying data. However, calling the masked less version doesn't do anything with the data in the array. Note that this is a big change but to a somewhat ambiguous case of what happens to masked values under function evaluation.

The real reason I began looking into this was to not evaluate the function in the masked values because I'd already premasked the condition that would throw warnings at me. See #4959 for evaluating log10 at locations less than 0 that were under the mask, so still threw warnings.

import numpy as np
A = arange(-2, 5)/2
Am = numpy.ma.masked_less_equal(A, 0)
np.log10(Am) # Previously threw a warning, now works as expected
np.ma.log10(Am) # No warnings before or after

Implementation

I haven't added any documentation yet because I wanted to check and see if this was even the right way of approaching this idea. I struggled to figure out how to pass the out= object pointers around in a consistent manner, so I am definitely interested in suggestions on improving that. I had to update quite a few seemingly random areas of the Masked codebase to make that work everywhere. There are a few tests that I had to update because values under the mask were being compared, and those have now changed with this implementation applying the masked version of ufuncs everywhere.

This is a pretty major (underlying) API change, so I expect there to be quite a bit of discussion on the best approaches here.

Linked Issues

Fixes #4959
Closes #15200

This enables any ufunc numpy operations that are called on a MaskedArray to use the masked version of that function automatically without needing to resort to np.ma.func() calls.

Fixes the problem reported at numpy#21977 (comment) The reduce method here effectively calls itself with an unmasked MaskedArray (mask=nomask) and then expects either a MaskedArray or a scalar. This change ensures that an ordinary ndarray is converted to a MaskedArray, following the pattern already used in mean and var in this module.

Adapted from the problem reported at numpy#21977 (comment)

Now we are calling np.power() in std() which goes through the ufunc machinery, so we don't want to pass any additional unsafe casting kwargs that aren't allowed within the masked implementation.

Move the np-ufunc check to the top of the routine so we immediately go to super() when necessary. Before we were returning NotImplemented if an arg wasn't able to be handled. Update the arg instance check to defer for everything but another class that has implemented __array_ufunc__

This allows for subclasses to be handled correctly

This is handled in the C code now within the ufunc machinery.

mhvk · 2023-01-03T21:34:05Z

@greglucas - the astropy.units error I think we would need to work around - as is, we don't really treat MaskedArray correctly and it shows here; we will just have to deal with it...
But the astropy.table error is more perhaps more serious: MaskedColumn is a subclass of MaskedArray and it assumes it does not have to worry about __array_ufunc__. While we could likely define it and make things work again (indeed, that might be good), we may just be the canary in the coalmine: any other MaskedArray subclasses may well break too.

mhvk · 2023-01-03T21:46:12Z

Actually, I quickly tried to implement __array_ufunc__ myself on MaskedColumn, and it exposed a bug in the implementation here even for np.ma.MaskedArray:

In [2]: np.signbit(np.ma.MaskedArray([1., 2.], mask=[True, False]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 np.signbit(np.ma.MaskedArray([1., 2.], mask=[True, False]))

TypeError: operand type(s) all returned NotImplemented from __array_ufunc__(<ufunc 'signbit'>, '__call__', masked_array(data=[--, 2.0],
             mask=[ True, False],
       fill_value=1e+20)): 'MaskedArray'

The reason is that if the function is not defined in np.ma, you just hand off to super(), i.e., np.ndarray.__array_ufunc__. But that one returns NotImplemented because it does not know how to deal with MaskedArray. I think you'd have to extract the data and copy the mask over. Though perhaps easier is to define np.ma.signbit!

mhvk · 2023-01-03T21:47:37Z

p.s. Might want to check test completeness by checking all functions in np.testing.overrides.get_overridable_numpy_ufuncs() are covered...

seberg · 2023-01-04T08:15:57Z

numpy/ma/core.py

+        if ma_ufunc is np_ufunc:
+            # We didn't have a Masked version of the ufunc, so we need to
+            # call the ndarray version to prevent infinite recursion.
+            return super().__array_ufunc__(np_ufunc, method, *inputs, **kwargs)


You will need to unwrap the data arrays here and then call things again. On the plus side, it may actually make sense to always just do that.
Also, you should not use super() here, but rather call the original ufunc I think, because after unwrapping it could be that you need to dispatch to a different argument.

There is the small change that the ufuncs behave a bit different from the np.ma.ufunc version (and maybe even better?): see gh-22347.
So there might even be a point of just ignoring what np.ma.func does.

After the fallout from the "simple" gh-22046, I am a bit hesitant about changing these subtleties, we probably need to test very carefully and I am wondering if we should push it of for a "2.0" (which I hope to bring up now for in a year). In that case, a possible approach could be to hide these definitions behind an environment variable (feature flag) maybe.

mhvk · 2023-01-04T18:30:22Z

Maybe worth pulling @seberg's comment in the main thread: a question is whether the regular numpy functions should just call the np.ma ones, including the, to me, rather strange use of domains (see gh-22347), or rather do a simpler wrapping where masks of input arguments simply or-d together, without worry about the values of the inputs. Arguing against this of course is that one then is basically moving to a new class (which I did for astropy; see https://docs.astropy.org/en/latest/utils/masked/index.html and in particular the discussion of differences with np.ma.MaskedArray).

But probably it is easier to just add the few missing ufunc to np.ma and work with what you have!

greglucas · 2023-01-05T22:08:54Z

p.s. Might want to check test completeness by checking all functions in np.testing.overrides.get_overridable_numpy_ufuncs() are covered...

But probably it is easier to just add the few missing ufunc to np.ma and work with what you have!

Thanks for pointing me to that function! Somewhat surprisingly, there are actually 30 ufuncs that have no implementation in np.ma currently! Most of them are easy, but I don't think there is any wrapper class that handles multi-output functions yet (ex: https://numpy.org/doc/stable/reference/generated/numpy.frexp.html)

I went digging through some of the old mailing list archives that I wasn't aware of and came across this really nice discussion:
https://mail.python.org/archives/list/numpy-discussion@python.org/thread/OSGGDTJX6XZK3LXLNQYGZP766FD3YFQQ/#QNVWET3O7KXFGAB7NVZNWUGHE766TGNL
and if I'm parsing it all correctly it seems like people are basically on the same page and working out the final details is all that is left. I had also forgotten that @ahaldane had pointed me at their implementation of a new class in one of the previous PRs:
https://github.com/ahaldane/ndarray_ducktypes/blob/master/ndarray_ducktypes/MaskedArray.py

My goal for this was to have a bit of a stepping stone before going all-in on a new class. However, I am now possibly seeing that this might just be playing a game of whack-a-mole that I'm sure will change some behavior and upset someone, so maybe we should just do a hard-breaking-change all at once instead of introduce these small papercuts. (i.e. work on the better implementation for a 2.0 release rather than adding these now)

I think one of the things missing from many of the discussions is that many downstream libraries make use of np.sum(), np.stack(), etc... without thinking about what instance is being passed into them (various arrays from numpy, pandas, dask, masked), so I do think it is important to try and start dispatching the np.func to np.ma.func specific methods for the Masked class since that seems to be the odd one out at the moment...

My desires from a new class:

No auto-masking of "bad" computation locations, let the user do that apriori if they want

I haven't seen anyone argue for this yet, and this looks like it goes way back in git blame.

No warnings when doing a computation of a bad value under a mask

It would be annoying to get divide by zero warnings on your computations when you've already pre-masked the data (see my example at the top of the PR for the motivation here!)

np.func gets forwarded to np.ma.func whenever possible (implement __array_ufunc__ and __array_function__)

It is quite surprising right now to do np.stack(ma, ma) and get a Masked instance back, but with an incorrect mask! (it would be better to just get a plain ndarray IMO)

Proper subclass wrapping/heirarchy preservation

Should fall out of (3) if implemented properly with NotImplemented being returned.

I'll try and take a better look at @ahaldane's implementation in nd_duck_array and @mhvk's implementation in Astropy. I think right now we are missing having a clear path between where we are now and where we want to go, so something a little more concrete with actions of what is needed might be nice. I'll try and give it some thought.

mhvk · 2023-01-05T22:33:19Z

Thanks, that's a great summary! There is only one item I'd add: think carefully how to deal with underlying data other than ndarray. For instance, part of the reason I made a separate implementation in astropy was that the current MaskedArray simply cannot be used to create MaskedQuantity -- with MaskedArray holding a Quantity attributes like the unit become hidden (and I think @ahaldane's version did not make that easy either). I found it worked much better by having Quantity on top, effectively holding a masked array as data (in our case, with corresponding class hierarchy MaskedQuantity(Quantity, MaskedNDArray)). But for something like dask, it might be better to have both the data and the mask be dask arrays, while for Xarray, that would not make sense.

Overall, it certainly is not crazy to move in small steps! I think a bigger question here is whether when wrapping regular numpy ufunc, one wants to just get the np.ma.<ufunc> equivalent, or something that moves closer to a final goal (i.e., as you note, no masking of bad elements but also no warnings for masked ones [note that the latter may be quite tricky...]).

MuellerSeb · 2023-06-15T11:59:33Z

Any news here?

greglucas and others added 19 commits January 2, 2023 17:15

ENH: Adding __array_ufunc__ capability to MaskedArrays.

ab732d0

This enables any ufunc numpy operations that are called on a MaskedArray to use the masked version of that function automatically without needing to resort to np.ma.func() calls.

DOC: Adding improvement note for MaskedArray ufunc

63df460

nomask in nomask out

a63e97a

TST: add a test for ma.minimum.reduce with axis keyword

997d27d

Adapted from the problem reported at numpy#21977 (comment)

MNT: Remove __add__ and other unnecessary overrides

bd091a6

ENH: Add sign ufunc to masked arrays

b5b9ad3

ENH: Remove masked ufunc power restriction

d0ac064

Now we are calling np.power() in std() which goes through the ufunc machinery, so we don't want to pass any additional unsafe casting kwargs that aren't allowed within the masked implementation.

ENH: Add masked invert ufunc

dd74620

MNT: Remove some restrictions on masked array ufunc implementations

c638cdc

FIX: Pass kwargs through ufunc reduce

de22beb

MNT: Add power ufunc to masked arrays

9564f27

FIX: add back getmaskarray to __array_wrap__ of masked arrays

a7ba76f

FIX: view should return the type(self) rather than MaskedArray

fa6c56f

This allows for subclasses to be handled correctly

FIX: Revert sum and trace overrides from masked arrays

cbfd86f

ENH: Add rint and update round method

4a58583

FIX: Change back partition/argpartition

2acf530

FIX: Remove compress updates

e6e80e6

github-actions bot added the 01 - Enhancement label Jan 3, 2023

MAINT: Remove unused delegate_binop code

366dfc3

This is handled in the C code now within the ufunc machinery.

greglucas force-pushed the masked-ufunc-2 branch from 1bf272f to 366dfc3 Compare January 3, 2023 00:26

charris added the component: numpy.ma masked arrays label Jan 3, 2023

seberg reviewed Jan 4, 2023

View reviewed changes

charris mentioned this pull request Jan 20, 2023

ENH: Add __array_function__ implementation to MaskedArray #22913

Open

rhugonnet mentioned this pull request Mar 30, 2023

Update Raster array interface to work with masked arrays once available in NumPy GlacioHack/geoutils#358

Open

rcomer mentioned this pull request May 12, 2023

Cube.collapsed() should produce an un-masked Cube if the original is un-masked SciTools/iris#5314

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Adding __array_ufunc__ capability to MaskedArrays (again) #22914

ENH: Adding __array_ufunc__ capability to MaskedArrays (again) #22914

greglucas commented Jan 3, 2023

mhvk commented Jan 3, 2023

mhvk commented Jan 3, 2023

mhvk commented Jan 3, 2023

seberg Jan 4, 2023

mhvk commented Jan 4, 2023

greglucas commented Jan 5, 2023

mhvk commented Jan 5, 2023

MuellerSeb commented Jun 15, 2023

ENH: Adding __array_ufunc__ capability to MaskedArrays (again) #22914

Are you sure you want to change the base?

ENH: Adding __array_ufunc__ capability to MaskedArrays (again) #22914

Conversation

greglucas commented Jan 3, 2023

Summary

Example

Implementation

Linked Issues

mhvk commented Jan 3, 2023

mhvk commented Jan 3, 2023

mhvk commented Jan 3, 2023

seberg Jan 4, 2023

Choose a reason for hiding this comment

mhvk commented Jan 4, 2023

greglucas commented Jan 5, 2023

mhvk commented Jan 5, 2023

MuellerSeb commented Jun 15, 2023