Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cube.collapsed() should produce an un-masked Cube if the original is un-masked #5314

Open
trexfeathers opened this issue May 11, 2023 · 4 comments
Labels
Feature: Statistics Label for reduction-like operations e.g., collapsing, aggregrating, rolling-window Status: Decision Required

Comments

@trexfeathers
Copy link
Contributor

✨ Feature Request

Motivation

For consistency with Cube.aggregated_by(). Many aggregators return masked arrays regardless of the input array, and we do not want to unnecessarily propagate masks where they are not needed. Users that genuinely want masked arrays can do so by making sure the Cube is masked before collapsing. Detailed discussion here: #4970 (comment)

Additional context

Click to expand this section...

Relevant lines in Cube.aggregated_by():

iris/lib/iris/cube.py

Lines 4235 to 4239 in c0153ae

# Ensure plain ndarray is output if plain ndarray was input.
if ma.isMaskedArray(aggregateby_data) and not ma.isMaskedArray(
input_data
):
aggregateby_data = ma.getdata(aggregateby_data)

@trexfeathers trexfeathers added the Feature: Statistics Label for reduction-like operations e.g., collapsing, aggregrating, rolling-window label May 11, 2023
@bouweandela
Copy link
Member

bouweandela commented May 12, 2023

Wouldn't it make more sense to fix the aggregators instead of the function using them? To users writing their own custom aggregators, the current behaviour may be surprising.

Looking at the example aggregator in this comment #4970 (comment), it looks like the issue could in part be solved by using just the plain numpy functions, since they should dispatch to the appropriate array function, e.g.:

>>> import numpy as np
>>> np.average(np.ma.array([[1], [2.]]), axis=0)
masked_array(data=[1.5],
             mask=False,
       fill_value=1e+20)

There would also be no need to distinguish between the lazy/non-lazy version of the aggregator anymore if dispatch can be used, e.g.:

>>> import dask.array as da
>>> np.average(da.array([[1], [2.]]), axis=0)
dask.array<mean_agg-aggregate, shape=(1,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
>>> np.average(da.array(np.ma.array([[1], [2.]])), axis=0)
dask.array<mean_agg-aggregate, shape=(1,), dtype=float64, chunksize=(1,), chunktype=numpy.MaskedArray>

@rcomer
Copy link
Member

rcomer commented May 12, 2023

Masked arrays do not yet fully support NEP13/18, though there are some open PRs to address that numpy/numpy#22914, numpy/numpy#22913. Some plain numpy functions will work sensibly on masked arrays, but the implementation is piecemeal.

@bouweandela
Copy link
Member

You're right about the partial implementation. I just thought I should mention it. As a temporary workaround, you could write your own function that selects the right array module based on the input array, see e.g. cupy.get_array_module or NEP-37, and use that to select the right array module when implementing aggregators or other code that needs to handle different types of arrays. If at some point in the future NEP-18 is fully implemented, it would be easy to take that out again by just replacing the value returned by that function by numpy everywher. Of course, that still only works if all array modules implement the required statistics function.

@rcomer
Copy link
Member

rcomer commented May 22, 2023

It does seem worth checking which functions do dispatch properly and updating those, starting with MEAN.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature: Statistics Label for reduction-like operations e.g., collapsing, aggregrating, rolling-window Status: Decision Required
Projects
Status: No status
Development

No branches or pull requests

4 participants