BUG: MaskedArray with bytes dtype cannot be compared if fill_value is set #21770

mhvk · 2022-06-15T23:34:27Z

Describe the issue:

Astropy is having a number of test failures as a result of #21041. Some are just due to the new FutureWarning, which are fine. But some are due to something rather strange - that MaskedArray comparisons depend on whether fill_value is set.

Reproduce the code example:

import numpy as np
ma = np.ma.MaskedArray(['a', 'b'])
ma > ma
# masked_array(data=[False, False],
#              mask=False,
#        fill_value=True)

ma
# masked_array(data=['a', 'b'],
#              mask=False,
#        fill_value='N/A',
#             dtype='<U1')

ma > ma
# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# Input In [6], in <cell line: 1>()
# ----> 1 ma > ma

# TypeError: '>' not supported between instances of 'MaskedArray' and 'MaskedArray'

The real problem in the above is that showing `ma` actually sets its `_fill_value` and that somehow causes problems. (Explicitly setting `._fill_value` also gives a `TypeError` - and in `astropy` it is explicitly set.)

ma._fill_value = None
ma > ma

masked_array(data=[False, False],

mask=False,

fill_value=True)

Error message:

No response

NumPy/Python version information:

1.24.0.dev0+185.g4cb688938 3.10.5 (main, Jun 8 2022, 09:26:22) [GCC 11.3.0]

The text was updated successfully, but these errors were encountered:

charris · 2022-06-15T23:40:46Z

Perhaps we should just back out #21041 for 1.23. It was a bit of a rush to backport it for the Numba folks and that may have been too optimistic.

seberg · 2022-06-16T00:41:10Z

Sorry, I had not realized there was so much oddity digging into those failures :(!

There are two main things going on:

We incorrectly silence the following error. This is because I missed to update the "error fixup" code to know that strings now go through the ufunc. The error is the following:

File ~/forks/numpy/build/testenv/lib/python3.9/site-packages/numpy/ma/core.py:3061, in MaskedArray.__array_finalize__(self, obj)
   3059 # Finalize the fill_value
   3060 if self._fill_value is not None:
-> 3061     self._fill_value = _check_fill_value(self._fill_value, self.dtype)
   3062 elif self.dtype.names is not None:
   3063     # Finalize the default fill_value for structured arrays
   3064     self._fill_value = _check_fill_value(None, self.dtype)

File ~/forks/numpy/build/testenv/lib/python3.9/site-packages/numpy/ma/core.py:467, in _check_fill_value(fill_value, ndtype)
    462         except (OverflowError, ValueError) as e:
    463             # Raise TypeError instead of OverflowError or ValueError.
    464             # OverflowError is seldom used, and the real problem here is
    465             # that the passed fill_value is not compatible with the ndtype.
    466             err_msg = "Cannot convert fill_value %s to dtype %s"
--> 467             raise TypeError(err_msg % (fill_value, ndtype)) from e
    468 return np.array(fill_value)

(may be pretty harmless on its own, but makes this a lot more confusing!)

In the above traceback we see __array_finalize__ now springing into action due to using the ufunc np.equal. That code actually seems to use self._fill_value. The fun part is that self._fill_value can be None, but if it is, it is automatically filled in when self.fill_value is used.
The first time around, it is None and things work. The second time around it is not None and the fill value is not compatible with the ufunc result.
(This likely indicates more subtle bugs, but...)

Now, that fill-value doesn't make sense for ufuncs that return a boolean output probably.

There is at least one more change:

The old code did not check for RICHCMP_GIVE_UP_IF_NEEDED. This code is not affected by that, but I am somewhat worried that might have subtle consequences.
Especially, because MaskedArray does only implement == and != but not the other comparisons.

My preference is that for the main branch, we aim for fixing the two issues:

Fixup the "error sanitizing" so that the above error is raised.
Add __le__, etc. to MaskedArray, because that will handle these paths (and any binop deferring) correctly.

For 1.23, I guess revering is the way to go. Fixing this seems like it would require both of the above fixes, and after that is gone there is still a tiny chance of subtleties around RICHCMP_GIVE_UP_IF_NEEDED

seberg · 2022-06-16T01:18:54Z

Actually, maybe the error sanitizing mainly toggles between FutureWarning and DeprecationWarning that might just cause subtle differences.
(And I am not sure yet which one is actually better; but I need to read the code closer probably.)

So it might just be the second one (plus another fixup) on main.

seberg · 2022-06-16T03:26:42Z

Now I am wondering if we should just add a special case to (I clearly should sleep on this :)):

numpy/numpy/ma/core.py

Lines 453 to 465 in c2db91c

    
           if isinstance(fill_value, str) and (ndtype.char not in 'OSVU'): 
        
               # Note this check doesn't work if fill_value is not a scalar 
        
               err_msg = "Cannot set fill value of string with array of dtype %s" 
        
               raise TypeError(err_msg % ndtype) 
        
           else: 
        
               # In case we want to convert 1e20 to int. 
        
               # Also in case of converting string arrays. 
        
               try: 
        
                   fill_value = np.array(fill_value, copy=False, dtype=ndtype) 
        
               except (OverflowError, ValueError) as e: 
        
                   # Raise TypeError instead of OverflowError or ValueError. 
        
                   # OverflowError is seldom used, and the real problem here is 
        
                   # that the passed fill_value is not compatible with the ndtype.

But of course, this becomes an obstacle course. The fill value is:

array('N/A', dtype='<U3')

and the reason for the failure is that:

>>> np.array(np.array('N/A', dtype='<U3'), dtype=bool)
ValueError: invalid literal for int() with base 10: 'N/A'

(just like casting) fails because it wants to interpret it as a number to cast...

So in the sense, the very long path to why this failure even occurs, is that our string to bool cast is still arguably broken :).

pllim · 2022-06-16T14:30:39Z

I cannot tell if this bug also causes all these other failures at https://github.com/astropy/astropy/runs/6918680788?check_suite_focus=true but the traceback suggests something about ufunc . If some are caused by other bugs, I can open new issues. Please let me know. Thanks!

xref astropy/astropy#13332 (comment)

p.s. We also see failures downstream of astropy that uses astropy.units, so this bug has far-reaching consequences for Astropy Project. E.g., see https://github.com/spacetelescope/jdaviz/runs/6919891933?check_suite_focus=true

seberg · 2022-06-16T14:37:50Z

Oh wow, I thought I had even run the astropy test-suite, but maybe the RuntimeWarnings were just warnings then.

I suspect all (or almost all) of them are related to gh-21437. That PR makes sure that we add warnings for all cast operations, that means basically things like:

np.array([np.nan]).astype(int)  # will give an "invalid value" RuntimeWarnig.
np.array([3e300]).astype(np.float32)  # will give an "overflow" RuntimeWarning

If the warning is raised, maybe the unit machinery will then think that a conversion is invalid (rather than overflowing?).

Lets continue on the astropy issue (don't hesitate to ping me there) or a new issue?

seberg · 2022-06-16T14:38:21Z

NOTE: I realize there is a bit of a "double" influx, but this change has of course nothing to do with the upcoming NumPy 1.23.

charris · 2022-06-16T15:01:41Z

I've reverted the #21041 backport for 1.23.0, see #21777.

mhvk · 2022-06-16T15:03:02Z

Just to confirm that these new bugs are unrelated to 1.23 and to this issue.

mhvk · 2022-06-16T15:05:53Z

@seberg - nice sleuthing on the issue. Though why is this now triggered and not before?

seberg · 2022-06-16T15:09:10Z

@mhvk ah, sorry... because before __lt__ etc. were not implemented, and the ndarray.__lt__ was not subclass aware:

In [2]: import numpy as np

In [3]: ma = np.ma.MaskedArray(["a", "b"])

In [4]: ma > ma
Out[4]: array([False, False])

seberg · 2022-06-16T15:19:03Z

I suppose the one "fix" short of reverting would be to convert the input arrays to bare NumPy arrays ahead of calling into np.equal, etc. (when the input is "string"). But, I am not really convinced that it is worth it.

mhvk · 2022-06-16T15:40:25Z

It sounds to me like it is best to hash this out for 1.24...

This defines the comparison operators (other than `==` and `!=`) explicitly for masked arrays. The mask is ignored for the resuling `res._data` (unlike `==` and `!=` which take the mask into account. Closes numpygh-21770, although the way that masked arrays propagate the fill-value seems generally broken and error prone.

mhvk added the 00 - Bug label Jun 15, 2022

mhvk changed the title ~~BUG: MaskedArray with bytes dtype cannot be compared if fill_value is unusual~~ BUG: MaskedArray with bytes dtype cannot be compared if fill_value is set Jun 15, 2022

mhvk mentioned this issue Jun 15, 2022

BUG: Table test failures with np 1.23.0rc3 astropy/astropy#13332

Closed

charris added this to the 1.23.0 release milestone Jun 16, 2022

charris mentioned this issue Jun 16, 2022

REV: Revert "ENH: Implement string comparison ufuncs (or almost) " #21777

Merged

charris modified the milestones: 1.23.0 release, 1.24.0 release Jun 21, 2022

seberg mentioned this issue Jun 21, 2022

BUG: Define <, <=, >, >= for masked arrays #21812

Merged

charris closed this as completed in #21812 Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: MaskedArray with bytes dtype cannot be compared if fill_value is set #21770

BUG: MaskedArray with bytes dtype cannot be compared if fill_value is set #21770

mhvk commented Jun 15, 2022

charris commented Jun 15, 2022

seberg commented Jun 16, 2022

seberg commented Jun 16, 2022

seberg commented Jun 16, 2022 •

edited

pllim commented Jun 16, 2022 •

edited

seberg commented Jun 16, 2022

seberg commented Jun 16, 2022

charris commented Jun 16, 2022

mhvk commented Jun 16, 2022

mhvk commented Jun 16, 2022

seberg commented Jun 16, 2022

seberg commented Jun 16, 2022

mhvk commented Jun 16, 2022

BUG: MaskedArray with bytes dtype cannot be compared if fill_value is set #21770

BUG: MaskedArray with bytes dtype cannot be compared if fill_value is set #21770

Comments

mhvk commented Jun 15, 2022

Describe the issue:

Reproduce the code example:

masked_array(data=[False, False],

mask=False,

fill_value=True)

Error message:

NumPy/Python version information:

charris commented Jun 15, 2022

seberg commented Jun 16, 2022

seberg commented Jun 16, 2022

seberg commented Jun 16, 2022 • edited

pllim commented Jun 16, 2022 • edited

seberg commented Jun 16, 2022

seberg commented Jun 16, 2022

charris commented Jun 16, 2022

mhvk commented Jun 16, 2022

mhvk commented Jun 16, 2022

seberg commented Jun 16, 2022

seberg commented Jun 16, 2022

mhvk commented Jun 16, 2022

seberg commented Jun 16, 2022 •

edited

pllim commented Jun 16, 2022 •

edited