
MAINT, ENH: Refactor percentile and quantile methods #19857

Merged
merged 12 commits into numpy:main from bzah:feature/rewrite_percentile on Nov 4, 2021

Conversation

bzah (Contributor) commented Sep 9, 2021

This is quite a big pull request, especially for a newcomer like me.
I hope it respects NumPy's standards.

This PR should solve #10736.
It is a rewrite of the work done in Ouranosinc/xclim#802, with some additional cases handled.

API changes

The interpolation option for quantile/percentile has been expanded.
Interpolation can be passed either as a string or as a value of the QuantileInterpolation enum.
This gives more flexibility for adding new interpolation methods and is clearer than using magic strings.

Note: In the issue #10736, it was asked to remove some existing methods but I don't think it should be part of the same PR.

Algorithm changes

indices computation

The logic for computing indices (a.k.a. virtual_indices) has changed a bit.
Instead of computing q * (Nx - 1), it is now (q * Nx) - 1, which seems more reasonable; see the example below.
However, to avoid breaking the existing behavior of nearest, midpoint, lower, and higher, I kept the old formula for them, even though it probably gives wrong results.

Example

Let's say you want the 50th percentile of an ordered list of size 100.
If array indices start at zero,
0.5 * (100 - 1) gives 49.5,
which is wrong because the 50th percentile falls exactly on index 49.
But 0.5 * 100 - 1 gives the expected 49.

This has no consequence for the result of the default interpolation method (method 7), because the formula with alpha == beta == 1 takes it into account.
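The arithmetic in the example above can be checked directly (a minimal sketch of the two index formulas, not NumPy's actual code):

```python
n = 100  # size of the ordered list
q = 0.5  # the 50th percentile

old_index = q * (n - 1)  # 49.5: falls between indices 49 and 50
new_index = q * n - 1    # 49.0: lands exactly on index 49

print(old_index, new_index)
```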

Rewrite of existing interpolation methods

The existing nearest, midpoint, lower and higher methods have been transformed to fit the QuantileInterpolation model.

Performance improvements

A further performance improvement would be possible for nanpercentile, but it is no longer part of this PR.
The performance of nanpercentile/nanquantile has been significantly improved, especially when the requested quantiles are a scalar or a short list. In most cases it should perform around 50 times faster than before.
I will add a link to the performance report here once I re-run the perf tests; in the meantime you can find the report of the xclim version here: https://gist.github.com/bzah/2a84d050b8a1aed1b40a2ed1526e1f12. It might not be accurate, though.

BvB93 (Member) commented Sep 9, 2021

Just as an initial comment: this PR contains a lot of stylistic changes to the existing code base that are unrelated to the np.quantile rework. These changes shouldn't really be in this PR, and, honestly, the sheer number of them makes the reviewing process a lot more difficult.

seberg (Member) commented Sep 9, 2021

Nice! Sounds like there are a couple of nuts to crack still. Some notes:

  • I do think we have to retain the dtype for all non-interpolating methods (R may not care about this) as you noted.
  • The fuzz thing has me slightly worried, but I need to have a closer look. It feels like this may depend on the data type of q passed in by the user which is not ideal?
  • About the stacklevel: Generally it is 2 + one for each additional helper function level. In this case the @array_function decorator adds one level, so it's 3 + each helper level; 4 is probably right. You can test this manually: if you trigger the warning in a session, it should point to your line of code; if you lower the stacklevel by one, it should then point to somewhere inside NumPy.
  • I think you added drop_nans for the internal API? Could you "drop" that ;)? I don't doubt that it is useful to speed up nanquantile, but it adds complexity to an already complicated change.
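The stacklevel counting described above can be sketched like this (a toy example; public_api and _helper are hypothetical names, not NumPy functions):

```python
import warnings

def _helper():
    # 1 (this frame) + 1 (public_api) + 1 (the user's call site) = 3;
    # a decorator such as @array_function_dispatch would add one more level
    warnings.warn("behaviour deprecated", DeprecationWarning, stacklevel=3)

def public_api():
    _helper()

public_api()  # the warning points at this line, not inside the helpers
```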

Some API things we probably need to hash out, but I think these can be discussed later:

  • My instinct is to add a new method= keyword argument and deprecate interpolation entirely. This probably means adding method="legacy_nearest" or so that code can retain the (discouraged) behaviour if necessary and avoid the warning. (But this is a discussion for the mailing list and I think the current design makes it nice to modify it.)
  • What happens if the percentile is outside of the data range? My first instinct would be to raise an error, because it is easier to go from error -> extrapolation than the other way around. I think this may be your last point? I think some of the non-interpolating methods must fail for quantile={0, 1}.

EDIT: As Bas noted, it would be nice if you can avoid running black on the code (at least for unmodified lines).

bzah (Contributor, Author) commented Sep 9, 2021

Just as an initial comment: this PR contains a lot of stylistic changes to the existing code base that are unrelated to the np.quantile rework. These changes shouldn't really be in this PR, and, honestly, the sheer number of them makes the reviewing process a lot more difficult.

Yes, you are right. I'm used to running an auto-linter all the time, but I understand that it makes the review difficult.
I will revert the unrelated changes and address @seberg's comments tomorrow.

About drop_nans: I completely agree that it makes the code barely readable. I will remove it, at least for the initial review; later, if the performance boost is deemed worth it, it could be re-added or refactored in.

bzah (Contributor, Author) commented Sep 14, 2021

I did quite a big refactoring; the implementation is now much closer to what the former _quantile was and is hopefully much easier to understand.
I still have a single unit test to fix before moving this from draft to ready.

bzah (Contributor, Author) commented Sep 14, 2021

@seberg What is the expected behavior of _quantile when the input array is made of booleans?
I'm asking because the failing unit test is a specific case of #19154.

The unit test was previously handled in a strange manner using if np.issubdtype(indices.dtype, np.integer).
Having integer indices will almost never happen in real life, IMO, especially when alpha and beta are floats.

So if I understand the expected behavior for booleans, I could try to fix the unit test and #19154 in one commit.
Otherwise I can re-add if np.issubdtype(indices.dtype, np.integer) to work around the unit test.

bashtage (Contributor) commented

@bzah It seems that quantile doesn't work at all for boolean arrays, although I don't think that is the correct behavior.

```
np.quantile(np.array([1,0,0,0,1,0,0,0,1,0,0,0], dtype=bool), 0.25)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
```

bzah (Contributor, Author) commented Sep 14, 2021

@bzah It seems that quantile doesn't work at all for boolean arrays, although I don't think that is the correct behavior.

```
np.quantile(np.array([1,0,0,0,1,0,0,0,1,0,0,0], dtype=bool), 0.25)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
```

Yes, in the former implementation of _quantile, it only works with booleans if the indices are integers.
In other words, it works if (ap.shape[axis] * q) % 1 == 0.
I think the only unit test for booleans is numpy/lib/tests/test_function_base.py::TestQuantile::test_correct_quantile_value,
and it is a specific case of this.

seberg (Member) commented Sep 14, 2021

There was a change; I think it used to work, but we can probably get away with doing what we think is right. My feeling is that the correct thing is to return the input dtype unmodified for any of the non-interpolating methods. For the interpolating methods, you get whatever bool * float gives, which is a float.
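The promotion described here can be seen with NumPy's type rules (a small illustration of the bool * float behaviour, not the PR's code):

```python
import numpy as np

# interpolating methods multiply samples by float weights, which promotes:
assert np.result_type(np.bool_, np.float64) == np.float64

# a boolean sample times an interpolation weight is already a float
assert (np.bool_(True) * 0.5).dtype == np.float64
```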

bzah (Contributor, Author) commented Sep 14, 2021

Well, I don't get it.
Given:

```python
arr = np.asarray([True, False])
result = np.percentile(arr, 0.5, interpolation="any linear interpolation")
```

If arr is cast to integers, this would give a result close to 0.5.
What can the user make of it?
Does it really make sense to have a float result for boolean input?

I guess if someone has a real-world use case for percentile on booleans, it would help me understand.

seberg (Member) commented Sep 14, 2021

Does it really make sense to have a float result for boolean input?

No, to be honest, it doesn't. But NumPy (like Python) usually pretends that booleans are integers and thus "numerical". Since it is already an error right now, I don't mind if you keep it that way until a clear decision is made, though.

So the argument is that True * 0.5 or np.mean([True, True]) make just as little sense, but both mostly work. We just apply the same (currently typical) logic to median. From a type-safety perspective booleans probably should not be numerical; this is inherited from Python (which may well have gotten it because Python 1 did not have booleans, I believe!).
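The analogy above is easy to verify (this is plain Python/NumPy behaviour, not anything introduced by this PR):

```python
import numpy as np

assert True * 0.5 == 0.5             # Python bools act as 0/1 in arithmetic
assert np.mean([True, True]) == 1.0  # NumPy reduces booleans as integers
assert np.mean([True, False]) == 0.5
```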

bzah (Contributor, Author) commented Sep 14, 2021

OK, thanks @seberg!
I'm adding @pytest.mark.xfail to the broken unit test with a reference to #19154.
I think the PR will then be ready to leave draft.

Thanks a lot for all the reviews already done!

bzah added a commit to bzah/numpy that referenced this pull request Sep 14, 2021
The work on this branch removed the handling of an unnecessary specific case
(when indices are integers), which was covered by this test case.
However, percentile/quantile has been broken for a while when the input array
is made of booleans, and it's not clear what should be done to fix this.

The unit test case now behaves like any other boolean array
and raises a TypeError.

See
- numpy#19857 (comment)
- numpy#19154
@bzah bzah marked this pull request as ready for review September 14, 2021 16:10
bzah (Contributor, Author) commented Sep 15, 2021

I'm removing the fuzz correction.
With fuzz == epsilon * 4 (what R does), it only handles some edge cases where virtual_index < 8.
With fuzz == epsilon (what I did), it only handles virtual_index < 2, which is worthless.

Moreover, by "handling" I mean it forces the result to be the left bound instead of interpolating to something very close to the left bound.
So unless I'm missing something, I really don't think it is necessary.
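The claim that the fuzz only matters for small virtual indices follows from the spacing of doubles: an absolute offset of 4 * eps is absorbed by rounding once the index is large enough (a sketch of the floating-point argument; R's actual quantile code is not reproduced here):

```python
import numpy as np

eps = np.finfo(np.float64).eps  # ~2.22e-16

# near 1.0 the spacing between doubles is eps, so 4 * eps is visible
assert 1.0 + 4 * eps != 1.0
# near 10.0 the spacing is 8 * eps, so adding 4 * eps rounds away
assert 10.0 + 4 * eps == 10.0
# with fuzz == eps, the effect disappears even sooner
assert 2.0 + eps == 2.0
```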

This also means there should not be any other issue with the license as far as I understand it.

bzah (Contributor, Author) commented Sep 17, 2021

On the CI, I get a Cygwin error:

```
Maximum tries of 3 reached. Throwing error.
Could not move the bad package to the failure directory. It will show as installed.
C:\ProgramData\chocolatey\lib\Cygwin
The error:
(32) The process cannot access the file because it is being used by another process: [\\?\C:\ProgramData\chocolatey\lib-bad\Cygwin\tools\setup-x86_64.exe]
```

Is this usual? Can I do something to fix it?

charris (Member) commented Sep 17, 2021

You can ignore the Cygwin error, it is on the CI side.

@charris changed the title from "Feature/rewrite percentile" to "MAINT, ENH: Refactor percentile" on Sep 18, 2021
charris (Member) commented Sep 18, 2021

Could you add a release note? It would go in doc/release/upcoming_changes, where you can also see examples.

bzah and others added 12 commits November 4, 2021 14:50
- Added the missing linear interpolation methods.
- Updated the existing unit tests.

- Added pytest.mark.xfail for boolean arrays
See
- numpy#19857 (comment)
- numpy#19154
Hopefully fix the docstrings of percentile, nanpercentile,
quantile, and nanquantile so that CircleCI passes.
- Some changes were unrelated to the PR and have been reverted, including renaming and moving the logic around.
- Also renamed _quantile_ureduce_func to its original name
Also removed unused imports
Also added unit test for it.
On some platforms float128 and complex256 do not exist.
Using (c)longdouble aliases should work on all platforms.
@seberg seberg force-pushed the feature/rewrite_percentile branch 2 times, most recently from cdc2088 to f7911c6 Compare November 4, 2021 20:11
seberg (Member) commented Nov 4, 2021

Thanks @bzah. Let's put this in. If there are any remaining concerns, please open an issue or mention them here so we can follow up, hopefully before the next release.

I am planning to follow up with a rename of interpolation to method before the NumPy 1.22 release (I think "interpolation" is not a good name, and I do not want users of the new methods to have to migrate later).

If anyone wants to take a closer look at the documentation to see if it can be improved to provide more guidelines, that would be great. (I think, for example, the default method could be considered a sample quantile/percentile, while most others are various population estimates; if that is correct, it might be helpful to explain it earlier and more clearly.)

@seberg seberg merged commit fe7b1dc into numpy:main Nov 4, 2021
@bzah bzah deleted the feature/rewrite_percentile branch November 5, 2021 08:06
jbrockmendel (Contributor) commented

@seberg Are the quantile changes something we need to adapt to on our (pandas) end, or to address here?

seberg (Member) commented Nov 8, 2021

@jbrockmendel I am honestly not quite sure yet. The formula used for calculating the "index" seems not ideal (potentially a few ULPs less precise than it could be for the default method). I am not sure if we can reorder the formula in general, or whether that would lead to worse characteristics for some methods.

@bzah suggested using a different (simplified/old) formula for the current default; I suppose we could go through all methods to see if we can clearly simplify a few more. If the changes in pandas seem too strange to be true, I am happy to take that as enough reason to specialize at least the default.

One thing I am curious about: if this is the reason for the change, I wonder whether the code that tests this will misbehave for all methods except the ones NumPy previously implemented! For all other methods, the repeated samples would change the result (my working hypothesis is that this happens, but I am not familiar with Series.expanding).

jbrockmendel (Contributor) commented

If the changes in pandas seem too strange to be true

The few test failures I've looked at have been small floating-point errors, e.g. https://github.com/pandas-dev/pandas/runs/4131499766?check_suite_focus=true#step:7:104

```
>           assert q1 == q2
E           assert 0.18000000000000005 == 0.18000000000000002
```

seberg (Member) commented Nov 8, 2021

@bzah I just realized that we already need the virtual_indices formula in the dictionary anyway for the non-interpolating methods. So I think the idea of special-casing the formula for "linear" is good (maybe with a very short comment explaining why?).

Is alpha == beta == 1 special or is alpha == beta maybe already enough to have a simplified version?

Do you want to make a quick PR?
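For context, here is a toy version of a generic virtual-index formula versus the simplified alpha == beta == 1 form (a sketch in the Hyndman and Fan style; this is not NumPy's exact internal expression, and the function names are made up for illustration):

```python
import numpy as np

def virtual_index_general(n, q, alpha, beta):
    # generic zero-based virtual index; note the association of terms
    # differs from the simplified form below
    return n * q + (alpha + q * (1 - alpha - beta)) - 1

def virtual_index_linear(n, q):
    # algebraically equal to the general form when alpha == beta == 1
    return q * (n - 1)

# the two evaluation orders agree mathematically, but the last few
# ULPs can differ, which is enough to move an interpolated quantile
qs = np.linspace(0.0, 1.0, 1001)
for n in (5, 12, 100):
    diff = np.abs(virtual_index_general(n, qs, 1.0, 1.0) - virtual_index_linear(n, qs))
    assert diff.max() < 1e-12
```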

bzah (Contributor, Author) commented Nov 9, 2021

Yup, I'm on it.
Edit: see #20331.

seberg added a commit to seberg/numpy that referenced this pull request Jun 10, 2022
seberg added a commit to seberg/numpy that referenced this pull request Jun 10, 2022