Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Adding __array_ufunc__ capability to MaskedArrays (again) #22914

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

greglucas
Copy link
Contributor

This is a rebase and update of several previous PRs: #21977 and #16022, which were both reverted due to issues downstream. I've kept a bunch of smaller commits showing the updates to comments to try and make it easier for review.

I've tested with Astropy and there are still two failing tests over there, which I don't quite understand yet, so I'm hoping a review might be able to point out some hints for what I'm missing. It looks like we do enter MaskedColumn.__array_wrap__, but with a context=None, so that wrap isn't undoing anything because it doesn't know which function it came from... I'm not sure if this means the update needs to be handled in numpy or astropy.

astropy/units/tests/test_quantity_ufuncs.py::TestQuantityMathFuncs::test_power_array_array3
astropy/table/tests/test_column.py::TestColumn::test_numpy_boolean_ufuncs[MaskedColumn]

cc @rcomer, @seberg, @mhvk since you've all had some great comments and help on the previous PRs!

Summary

This enables any ufunc numpy operations that are called on a MaskedArray to use the masked version of that function automatically without needing to resort to np.ma.func() calls directly.

Example

import numpy as np
a = np.ma.array(np.arange(10), [0, 1] * 5)
x = a < 10
# x.data == [True,  True,  True,  True,  True, False, False, False, False, False]
y = np.ma.less(a, 10)
# y.data == [True,  True,  True,  True,  True,  True, False,  True, False, True]

Even though data values in a are masked, they are still acted upon and evaluated. Changing that underlying data. However, calling the masked less version doesn't do anything with the data in the array. Note that this is a big change but to a somewhat ambiguous case of what happens to masked values under function evaluation.

The real reason I began looking into this was to not evaluate the function in the masked values because I'd already premasked the condition that would throw warnings at me. See #4959 for evaluating log10 at locations less than 0 that were under the mask, so still threw warnings.

import numpy as np
A = arange(-2, 5)/2
Am = numpy.ma.masked_less_equal(A, 0)
np.log10(Am) # Previously threw a warning, now works as expected
np.ma.log10(Am) # No warnings before or after

Implementation

I haven't added any documentation yet because I wanted to check and see if this was even the right way of approaching this idea. I struggled to figure out how to pass the out= object pointers around in a consistent manner, so I am definitely interested in suggestions on improving that. I had to update quite a few seemingly random areas of the Masked codebase to make that work everywhere. There are a few tests that I had to update because values under the mask were being compared, and those have now changed with this implementation applying the masked version of ufuncs everywhere.

This is a pretty major (underlying) API change, so I expect there to be quite a bit of discussion on the best approaches here.

Linked Issues

Fixes #4959
Closes #15200

greglucas and others added 19 commits January 2, 2023 17:15
This enables any ufunc numpy operations that are called on a
MaskedArray to use the masked version of that function automatically
without needing to resort to np.ma.func() calls.
Fixes the problem reported at
numpy#21977 (comment)

The reduce method here effectively calls itself with an unmasked
MaskedArray (mask=nomask) and then expects either a MaskedArray or
a scalar.  This change ensures that an ordinary ndarray is
converted to a MaskedArray, following the pattern already used in
mean and var in this module.
Now we are calling np.power() in std() which goes through the ufunc
machinery, so we don't want to pass any additional unsafe casting
kwargs that aren't allowed within the masked implementation.
Move the np-ufunc check to the top of the routine so we immediately
go to super() when necessary. Before we were returning NotImplemented
if an arg wasn't able to be handled.

Update the arg instance check to defer for everything but another
class that has implemented __array_ufunc__
This allows for subclasses to be handled correctly
This is handled in the C code now within the ufunc machinery.
@mhvk
Copy link
Contributor

mhvk commented Jan 3, 2023

@greglucas - the astropy.units error I think we would need to work around - as is, we don't really treat MaskedArray correctly and it shows here; we will just have to deal with it...
But the astropy.table error is more perhaps more serious: MaskedColumn is a subclass of MaskedArray and it assumes it does not have to worry about __array_ufunc__. While we could likely define it and make things work again (indeed, that might be good), we may just be the canary in the coalmine: any other MaskedArray subclasses may well break too.

@mhvk
Copy link
Contributor

mhvk commented Jan 3, 2023

Actually, I quickly tried to implement __array_ufunc__ myself on MaskedColumn, and it exposed a bug in the implementation here even for np.ma.MaskedArray:

In [2]: np.signbit(np.ma.MaskedArray([1., 2.], mask=[True, False]))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 np.signbit(np.ma.MaskedArray([1., 2.], mask=[True, False]))

TypeError: operand type(s) all returned NotImplemented from __array_ufunc__(<ufunc 'signbit'>, '__call__', masked_array(data=[--, 2.0],
             mask=[ True, False],
       fill_value=1e+20)): 'MaskedArray'

The reason is that if the function is not defined in np.ma, you just hand off to super(), i.e., np.ndarray.__array_ufunc__. But that one returns NotImplemented because it does not know how to deal with MaskedArray. I think you'd have to extract the data and copy the mask over. Though perhaps easier is to define np.ma.signbit!

@mhvk
Copy link
Contributor

mhvk commented Jan 3, 2023

p.s. Might want to check test completeness by checking all functions in np.testing.overrides.get_overridable_numpy_ufuncs() are covered...

if ma_ufunc is np_ufunc:
# We didn't have a Masked version of the ufunc, so we need to
# call the ndarray version to prevent infinite recursion.
return super().__array_ufunc__(np_ufunc, method, *inputs, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need to unwrap the data arrays here and then call things again. On the plus side, it may actually make sense to always just do that.
Also, you should not use super() here, but rather call the original ufunc I think, because after unwrapping it could be that you need to dispatch to a different argument.

There is the small change that the ufuncs behave a bit different from the np.ma.ufunc version (and maybe even better?): see gh-22347.
So there might even be a point of just ignoring what np.ma.func does.

After the fallout from the "simple" gh-22046, I am a bit hesitant about changing these subtleties, we probably need to test very carefully and I am wondering if we should push it of for a "2.0" (which I hope to bring up now for in a year). In that case, a possible approach could be to hide these definitions behind an environment variable (feature flag) maybe.

@mhvk
Copy link
Contributor

mhvk commented Jan 4, 2023

Maybe worth pulling @seberg's comment in the main thread: a question is whether the regular numpy functions should just call the np.ma ones, including the, to me, rather strange use of domains (see gh-22347), or rather do a simpler wrapping where masks of input arguments simply or-d together, without worry about the values of the inputs. Arguing against this of course is that one then is basically moving to a new class (which I did for astropy; see https://docs.astropy.org/en/latest/utils/masked/index.html and in particular the discussion of differences with np.ma.MaskedArray).

But probably it is easier to just add the few missing ufunc to np.ma and work with what you have!

@greglucas
Copy link
Contributor Author

p.s. Might want to check test completeness by checking all functions in np.testing.overrides.get_overridable_numpy_ufuncs() are covered...

But probably it is easier to just add the few missing ufunc to np.ma and work with what you have!

Thanks for pointing me to that function! Somewhat surprisingly, there are actually 30 ufuncs that have no implementation in np.ma currently! Most of them are easy, but I don't think there is any wrapper class that handles multi-output functions yet (ex: https://numpy.org/doc/stable/reference/generated/numpy.frexp.html)


I went digging through some of the old mailing list archives that I wasn't aware of and came across this really nice discussion:
https://mail.python.org/archives/list/numpy-discussion@python.org/thread/OSGGDTJX6XZK3LXLNQYGZP766FD3YFQQ/#QNVWET3O7KXFGAB7NVZNWUGHE766TGNL
and if I'm parsing it all correctly it seems like people are basically on the same page and working out the final details is all that is left. I had also forgotten that @ahaldane had pointed me at their implementation of a new class in one of the previous PRs:
https://github.com/ahaldane/ndarray_ducktypes/blob/master/ndarray_ducktypes/MaskedArray.py

My goal for this was to have a bit of a stepping stone before going all-in on a new class. However, I am now possibly seeing that this might just be playing a game of whack-a-mole that I'm sure will change some behavior and upset someone, so maybe we should just do a hard-breaking-change all at once instead of introduce these small papercuts. (i.e. work on the better implementation for a 2.0 release rather than adding these now)

I think one of the things missing from many of the discussions is that many downstream libraries make use of np.sum(), np.stack(), etc... without thinking about what instance is being passed into them (various arrays from numpy, pandas, dask, masked), so I do think it is important to try and start dispatching the np.func to np.ma.func specific methods for the Masked class since that seems to be the odd one out at the moment...

My desires from a new class:

  1. No auto-masking of "bad" computation locations, let the user do that apriori if they want

I haven't seen anyone argue for this yet, and this looks like it goes way back in git blame.

  1. No warnings when doing a computation of a bad value under a mask

It would be annoying to get divide by zero warnings on your computations when you've already pre-masked the data (see my example at the top of the PR for the motivation here!)

  1. np.func gets forwarded to np.ma.func whenever possible (implement __array_ufunc__ and __array_function__)

It is quite surprising right now to do np.stack(ma, ma) and get a Masked instance back, but with an incorrect mask! (it would be better to just get a plain ndarray IMO)

  1. Proper subclass wrapping/heirarchy preservation

Should fall out of (3) if implemented properly with NotImplemented being returned.


I'll try and take a better look at @ahaldane's implementation in nd_duck_array and @mhvk's implementation in Astropy. I think right now we are missing having a clear path between where we are now and where we want to go, so something a little more concrete with actions of what is needed might be nice. I'll try and give it some thought.

@mhvk
Copy link
Contributor

mhvk commented Jan 5, 2023

Thanks, that's a great summary! There is only one item I'd add: think carefully how to deal with underlying data other than ndarray. For instance, part of the reason I made a separate implementation in astropy was that the current MaskedArray simply cannot be used to create MaskedQuantity -- with MaskedArray holding a Quantity attributes like the unit become hidden (and I think @ahaldane's version did not make that easy either). I found it worked much better by having Quantity on top, effectively holding a masked array as data (in our case, with corresponding class hierarchy MaskedQuantity(Quantity, MaskedNDArray)). But for something like dask, it might be better to have both the data and the mask be dask arrays, while for Xarray, that would not make sense.

Overall, it certainly is not crazy to move in small steps! I think a bigger question here is whether when wrapping regular numpy ufunc, one wants to just get the np.ma.<ufunc> equivalent, or something that moves closer to a final goal (i.e., as you note, no masking of bad elements but also no warnings for masked ones [note that the latter may be quite tricky...]).

@MuellerSeb
Copy link

Any news here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants