
MAINT, ENH: Refactor percentile and quantile methods #19857

Merged
merged 12 commits into numpy:main from bzah:feature/rewrite_percentile on Nov 4, 2021

Conversation

bzah (Contributor) commented Sep 9, 2021

This is quite a big pull request, especially for a newcomer like me.
I hope it respects NumPy's standards.

This PR should solve #10736.
It is a rewrite of the work done in Ouranosinc/xclim#802, with some additional cases handled.

API changes

The interpolation option for quantile/percentile has been expanded.
Interpolation can be passed either as a string or as a value of the QuantileInterpolation enum.
This gives more flexibility for adding new interpolation methods and is clearer than using magic strings.

Note: In the issue #10736, it was asked to remove some existing methods but I don't think it should be part of the same PR.

Algorithm changes

indices computation

The logic for computing indices (a.k.a. virtual_indices) has changed a bit.
Instead of computing q * (Nx - 1), it is now (q * Nx) - 1, which seems more reasonable; see the example below.
However, to avoid breaking the existing behavior of nearest, midpoint, lower, and higher, I kept the old formula for them, even though it probably gives wrong results.

Example

Let's say you want the 50th percentile of an ordered list of size 100.
If array indices start at zero,
0.5 * (100 - 1) gives 49.5,
which is wrong because the 50th percentile falls exactly on index 49.
But 0.5 * 100 - 1 gives the expected 49.

This has no consequence for the result of the default interpolation method (method 7), because the formula with alpha == beta == 1 takes it into account.
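The arithmetic in the example above can be checked directly (a minimal sketch of the two index formulas, not NumPy's actual code):

```python
n = 100  # size of the ordered list
q = 0.5  # the 50th percentile

old_index = q * (n - 1)  # 49.5: falls between indices 49 and 50
new_index = q * n - 1    # 49.0: lands exactly on index 49

print(old_index, new_index)
```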

Rewrite of existing interpolation methods

The existing nearest, midpoint, lower and higher methods have been transformed to fit the QuantileInterpolation model.

Performance improvements

A further performance improvement would be possible for nanpercentile, but it is no longer part of this PR.
The performance of nanpercentile/nanquantile has been significantly improved, especially when the requested quantiles are a scalar or a short list. In most cases it should perform around 50 times faster than before.
I will add a link to the performance report here once I re-run the perf tests; in the meantime you can find the report of the xclim version here: https://gist.github.com/bzah/2a84d050b8a1aed1b40a2ed1526e1f12. It might not be accurate, though.

BvB93 (Member) commented Sep 9, 2021

Just as an initial comment: this PR contains a lot of stylistic changes to the existing code base that are unrelated to the np.quantile rework. These changes shouldn't really be in this PR, and, honestly, the sheer number of them makes the reviewing process a lot more difficult.

seberg (Member) commented Sep 9, 2021

Nice! Sounds like there are a couple of nuts to crack still. Some notes:

  • I do think we have to retain the dtype for all non-interpolating methods (R may not care about this) as you noted.
  • The fuzz thing has me slightly worried, but I need to have a closer look. It feels like this may depend on the data type of q passed in by the user which is not ideal?
  • About the stacklevel: Generally it is 2 + one for each additional helper function level. In this case the @array_function decorator adds one level, so it's 3 + each helper level; 4 is probably right. You can test this manually: if you trigger the warning in a session, it should point to your line of code; if you lower the stacklevel by one, it should then point to somewhere inside NumPy.
  • I think you added drop_nans for the internal API? Could you "drop" that ;)? I don't doubt that it is useful to speed up nanquantile, but it adds complexity to an already complicated change.
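The stacklevel counting described above can be sketched like this (a toy example; public_api and _helper are hypothetical names, not NumPy functions):

```python
import warnings

def _helper():
    # 1 (this frame) + 1 (public_api) + 1 (the user's call site) = 3;
    # a decorator such as @array_function_dispatch would add one more level
    warnings.warn("behaviour deprecated", DeprecationWarning, stacklevel=3)

def public_api():
    _helper()

public_api()  # the warning points at this line, not inside the helpers
```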

Some API things we probably need to hash out, but I think these can be discussed later:

  • My instinct is to add a new method= keyword argument and deprecate interpolation entirely. This probably means adding method="legacy_nearest" or so that code can retain the (discouraged) behaviour if necessary and avoid the warning. (But this is a discussion for the mailing list and I think the current design makes it nice to modify it.)
  • What happens if the percentile is outside of the data range? My first instinct would be to raise an error, because it is easier to go from error -> extrapolation than the other way around. I think this may be your last point? I think some of the non-interpolating methods must fail for quantile={0, 1}.

EDIT: As Bas noted, it would be nice if you can avoid running black on the code (at least for unmodified lines).

bzah (Contributor, Author) commented Sep 9, 2021

Just as an initial comment: this PR contains a lot of stylistic changes to the existing code base that are unrelated to the np.quantile rework. These changes shouldn't really be in this PR, and, honestly, the sheer number of them makes the reviewing process a lot more difficult.

Yes, you are right. I'm used to running an auto-linter all the time, but I understand that it makes the review difficult.
I will revert the unrelated changes and address @seberg's comments tomorrow.

About drop_nans: I completely agree that it makes the code barely readable. I will remove it, at least for the initial review; later, if the performance boost is deemed worth it, it could be re-added or refactored in.

bzah (Contributor, Author) commented Sep 14, 2021

I did quite a big refactoring; the implementation is now much closer to what the former _quantile was and is hopefully much easier to understand.
I still have a single unit test to fix before moving this from draft to ready.

bzah (Contributor, Author) commented Sep 14, 2021

@seberg What is the expected behavior of _quantile when the input array is made of booleans?
I'm asking because the failing unit test is a specific case of #19154.

The unit test was previously handled in a strange manner using if np.issubdtype(indices.dtype, np.integer).
Having integer indices will almost never happen in real life, IMO, especially when alpha and beta are floats.

So if I understand the expected behavior for booleans, I could try to fix the unit test and #19154 in one commit.
Otherwise I can re-add if np.issubdtype(indices.dtype, np.integer) to work around the unit test.

bashtage (Contributor) commented

@bzah It seems that quantile doesn't work at all for boolean arrays, although I don't think that is the correct behavior.

```
np.quantile(np.array([1,0,0,0,1,0,0,0,1,0,0,0], dtype=bool), 0.25)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
```

bzah (Contributor, Author) commented Sep 14, 2021

@bzah It seems that quantile doesn't work at all for boolean arrays, although I don't think that is the correct behavior.

```
np.quantile(np.array([1,0,0,0,1,0,0,0,1,0,0,0], dtype=bool), 0.25)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
```

Yes, in the former implementation of _quantile, it only works with booleans if the indices are integers.
In other words, it works if (ap.shape[axis] * q) % 1 == 0.
I think the only unit test for booleans is numpy/lib/tests/test_function_base.py::TestQuantile::test_correct_quantile_value,
and it is a specific case of this.

seberg (Member) commented Sep 14, 2021

There was a change; I think it used to work, but we can probably get away with doing what we think is right. My feeling is that the correct thing is to return the input dtype unmodified for any of the non-interpolating methods. For the interpolating methods, you get whatever bool * float gives, which is a float.
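The promotion described here can be seen with NumPy's type rules (a small illustration of the bool * float behaviour, not the PR's code):

```python
import numpy as np

# interpolating methods multiply samples by float weights, which promotes:
assert np.result_type(np.bool_, np.float64) == np.float64

# a boolean sample times an interpolation weight is already a float
assert (np.bool_(True) * 0.5).dtype == np.float64
```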

bzah (Contributor, Author) commented Sep 14, 2021

Well, I don't get it.
Given:

```python
arr = np.asarray([True, False])
result = np.percentile(arr, 0.5, interpolation="any linear interpolation")
```

If arr is cast to integers, this would give a result close to 0.5.
What can the user make of it?
Does it really make sense to have a float result for boolean input?

I guess if someone has a real-world use case for percentile on booleans, it would help me understand.

seberg (Member) commented Sep 14, 2021

Does it really make sense to have a float result for boolean input?

No, to be honest, it doesn't. But NumPy (like Python) usually pretends that booleans are integers and thus "numerical". Since it is already an error right now, I don't mind if you keep it that way until a clear decision is made, though.

So the argument is that True * 0.5 or np.mean([True, True]) make just as little sense, but both mostly work. We just apply the same (currently typical) logic to median. From a type-safety perspective booleans probably should not be numerical; this is inherited from Python (which may well have gotten it because Python 1 did not have booleans, I believe!).
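The analogy above is easy to verify (this is plain Python/NumPy behaviour, not anything introduced by this PR):

```python
import numpy as np

assert True * 0.5 == 0.5             # Python bools act as 0/1 in arithmetic
assert np.mean([True, True]) == 1.0  # NumPy reduces booleans as integers
assert np.mean([True, False]) == 0.5
```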

bzah (Contributor, Author) commented Sep 14, 2021

OK, thanks @seberg!
I'm adding @pytest.mark.xfail to the broken unit test with a reference to #19154.
I think the PR will then be ready to leave draft.

Thanks a lot for all the reviews already done!

bzah added a commit to bzah/numpy that referenced this pull request Sep 14, 2021
The work on this branch removed the handling of an unnecessary specific case
(when indices are integers), which was covered by this test case.
However, percentile/quantile has been broken for a while when the input array
is made of booleans, and it's not clear what should be done to fix this.

The unit test case now behaves like any other boolean array
and raises a TypeError.

See
- numpy#19857 (comment)
- numpy#19154
@bzah bzah marked this pull request as ready for review September 14, 2021 16:10
bzah (Contributor, Author) commented Sep 15, 2021

I'm removing the fuzz correction.
With fuzz == epsilon * 4 (what R does), it only handles some edge cases where virtual_index < 8.
With fuzz == epsilon (what I did), it only handles virtual_index < 2, which is worthless.

Moreover, by "handling" I mean it forces the result to be the left bound instead of interpolating to something very close to the left bound.
So unless I'm missing something, I really don't think it is necessary.
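The claim that the fuzz only matters for small virtual indices follows from the spacing of doubles: an absolute offset of 4 * eps is absorbed by rounding once the index is large enough (a sketch of the floating-point argument; R's actual quantile code is not reproduced here):

```python
import numpy as np

eps = np.finfo(np.float64).eps  # ~2.22e-16

# near 1.0 the spacing between doubles is eps, so 4 * eps is visible
assert 1.0 + 4 * eps != 1.0
# near 10.0 the spacing is 8 * eps, so adding 4 * eps rounds away
assert 10.0 + 4 * eps == 10.0
# with fuzz == eps, the effect disappears even sooner
assert 2.0 + eps == 2.0
```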

This also means there should not be any other issue with the license as far as I understand it.

bzah (Contributor, Author) commented Sep 17, 2021

On the CI, I get a Cygwin error:

```
Maximum tries of 3 reached. Throwing error.
Could not move the bad package to the failure directory. It will show as installed.
C:\ProgramData\chocolatey\lib\Cygwin
The error:
(32) The process cannot access the file because it is being used by another process: [\\?\C:\ProgramData\chocolatey\lib-bad\Cygwin\tools\setup-x86_64.exe]
```

Is this usual? Can I do something to fix it?

charris (Member) commented Sep 17, 2021

You can ignore the Cygwin error, it is on the CI side.

@charris changed the title from "Feature/rewrite percentile" to "MAINT, ENH: Refactor percentile" on Sep 18, 2021
charris (Member) commented Sep 18, 2021

Could you add a release note? It would go in doc/release/upcoming_changes, where you can also see examples.

bzah and others added 12 commits November 4, 2021 14:50
- Added the missing linear interpolation methods.
- Updated the existing unit tests.

- Added pytest.mark.xfail for boolean arrays
See
- numpy#19857 (comment)
- numpy#19154
Hopefully fix the docstrings of percentile, nanpercentile,
quantile, and nanquantile so that CircleCI passes.
- Some changes were unrelated to the PR and have been reverted, including renaming and moving the logic around.
- Also renamed _quantile_ureduce_func to its original name
Also removed unused imports
Also added unit test for it.
On some platforms float128 and complex256 do not exist.
Using (c)longdouble aliases should work on all platforms.
@seberg seberg force-pushed the feature/rewrite_percentile branch 2 times, most recently from cdc2088 to f7911c6 Compare November 4, 2021 20:11
seberg (Member) commented Nov 4, 2021

Thanks @bzah. Let's put this in. If there are any remaining concerns, please open an issue or mention them here so we can follow up, hopefully before the next release.

I am planning to follow up with a rename of interpolation to method before the NumPy 1.22 release (I think "interpolation" is not a good name, and I do not want users of the new methods to have to migrate later).

If anyone wants to take a closer look at the documentation to see if it can be improved to provide more guidelines, that would be great. (I think, for example, the default method could be considered a sample quantile/percentile, while most others are various population estimates; if that is correct, it might be helpful to explain it earlier and more clearly.)

@seberg seberg merged commit fe7b1dc into numpy:main Nov 4, 2021
@bzah bzah deleted the feature/rewrite_percentile branch November 5, 2021 08:06
jbrockmendel (Contributor) commented

@seberg Are the quantile changes something we need to adapt to on our (pandas) end, or to address here?

seberg (Member) commented Nov 8, 2021

@jbrockmendel I am honestly not quite sure yet. The formula used for calculating the "index" seems not ideal (potentially a few ULPs less precise than it could be for the default method). I am not sure if we can reorder the formula in general, or whether that would lead to worse characteristics for some methods.

@bzah suggested using a different (simplified/old) formula for the current default; I suppose we could go through all methods to see if we can clearly simplify a few more. If the changes in pandas seem too strange to be true, I am happy to take that as enough reason to specialize at least the default.

One thing I am curious about: if this is the reason for the change, I wonder whether the code that tests this will misbehave for all methods except the ones NumPy previously implemented! For all other methods, the repeated samples would change the result (my working hypothesis is that this happens, but I am not familiar with Series.expanding).

jbrockmendel (Contributor) commented

If the changes in pandas seem too strange to be true

The few test failures I've looked at have been small floating-point errors, e.g. https://github.com/pandas-dev/pandas/runs/4131499766?check_suite_focus=true#step:7:104

```
>           assert q1 == q2
E           assert 0.18000000000000005 == 0.18000000000000002
```

seberg (Member) commented Nov 8, 2021

@bzah I just realized that we already need the virtual_indices formula in the dictionary anyway for the non-interpolating methods. So I think the idea of special-casing the formula for "linear" is good (maybe with a very short comment explaining why?).

Is alpha == beta == 1 special or is alpha == beta maybe already enough to have a simplified version?

Do you want to make a quick PR?
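For context, here is a toy version of a generic virtual-index formula versus the simplified alpha == beta == 1 form (a sketch in the Hyndman and Fan style; this is not NumPy's exact internal expression, and the function names are made up for illustration):

```python
import numpy as np

def virtual_index_general(n, q, alpha, beta):
    # generic zero-based virtual index; note the association of terms
    # differs from the simplified form below
    return n * q + (alpha + q * (1 - alpha - beta)) - 1

def virtual_index_linear(n, q):
    # algebraically equal to the general form when alpha == beta == 1
    return q * (n - 1)

# the two evaluation orders agree mathematically, but the last few
# ULPs can differ, which is enough to move an interpolated quantile
qs = np.linspace(0.0, 1.0, 1001)
for n in (5, 12, 100):
    diff = np.abs(virtual_index_general(n, qs, 1.0, 1.0) - virtual_index_linear(n, qs))
    assert diff.max() < 1e-12
```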

bzah (Contributor, Author) commented Nov 9, 2021

Yup, I'm on it.
Edit: see #20331.

seberg added a commit to seberg/numpy that referenced this pull request Jun 10, 2022
seberg added a commit to seberg/numpy that referenced this pull request Jun 10, 2022