DOC: Return type for quantiles seems to depend on quantile method. #22323

Open
aschaffer opened this issue Sep 21, 2022 · 4 comments

Comments

@aschaffer

Describe the issue:

Per the quantile docs:

"If the input contains integers or floats smaller than float64, the output data-type is float64. Otherwise, the output data-type is the same as that of the input."

For example, for an integer source array, the result should be converted to float64, regardless of the selected method.

However,

>>> import numpy as np
>>> arr1 = np.array([1,2,2,40,1,1,2,1,0,10,3,3,40,15,3,7,5,4,7,3,5,1,0,9], dtype = int)
>>> qs_arr = np.array([0.001, 0.37, 0.42, 0.67, 0.83, 0.99, 0.39, 0.49, 0.5])

>>> r1 = np.quantile(arr1, qs_arr, method = 'inverted_cdf')
>>> r1.dtype
dtype('int64')

>>> r2 = np.quantile(arr1, qs_arr, method = 'interpolated_inverted_cdf')
>>> r2.dtype
dtype('float64')

Nothing in the docs mentions that the output dtype depends on the selected method.

Reproduce the code example:

import numpy as np
arr1 = np.array([1,2,2,40,1,1,2,1,0,10,3,3,40,15,3,7,5,4,7,3,5,1,0,9], dtype = int)
qs_arr = np.array([0.001, 0.37, 0.42, 0.67, 0.83, 0.99, 0.39, 0.49, 0.5])

r1 = np.quantile(arr1, qs_arr, method = 'inverted_cdf')
r1.dtype

r2 = np.quantile(arr1, qs_arr, method = 'interpolated_inverted_cdf')
r2.dtype

Error message:

r1.dtype != r2.dtype
True

NumPy/Python version information:

1.23.0 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0]

Context for the issue:

No response

@seberg seberg changed the title BUG: Return type for quantiles seems to depend on quantile method. DOC: Return type for quantiles seems to depend on quantile method. Sep 21, 2022
@seberg
Member

seberg commented Sep 21, 2022

Thanks for the note @aschaffer, that is indeed incorrect. We fixed things up in gh-19857 (and follow-ups). IIRC there was also a tiny bit of back and forth here over time (at least for boolean inputs).

In any case, the comment is correct for all interpolating/continuous methods, I believe (should maybe double check the code). What is important is that all methods that give "discontinuous" results have no interpolation, and we (now?) retain the input dtype faithfully.
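To make the split described above easier to check, here is a minimal sketch that prints the result dtype for a few methods. It assumes NumPy 1.22 or later (where the `method` keyword exists); the shortened array, the quantile values, and the selection of methods are illustrative choices, not taken from the issue. The exact integer dtype depends on the platform.

```python
import numpy as np

# Illustrative integer input and quantiles (not the exact data from the issue).
arr = np.array([1, 2, 2, 40, 1, 1, 2, 1, 0, 10, 3, 3], dtype=int)
qs = np.array([0.1, 0.37, 0.5, 0.83])

# Two methods that pick an existing sample, followed by three that compute
# a weighted combination of neighbouring samples.
methods = [
    "inverted_cdf",
    "closest_observation",
    "averaged_inverted_cdf",
    "interpolated_inverted_cdf",
    "linear",
]

dtypes = {m: np.quantile(arr, qs, method=m).dtype for m in methods}

for m, dt in dtypes.items():
    print(f"{m:28s} {dt}")
```

If the behaviour matches the discussion, the sample-picking methods should report the input dtype and the others float64.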

@seberg
Member

seberg commented Sep 21, 2022

OK, double checking the rule for interpolated values: the actual "rule" is more complicated and drops implicitly out of the interpolation calculation.

However, unless you care about object or longdouble dtype (for q), the rule seems correct as stated.
I think the only niche difference is np.quantile([1, 2, 3], 1) (or 0), with q being integral. However, that seems very niche, and I do not think the exact behavior in that case can be considered "specified" or fixed.
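A small sketch for inspecting that corner case on a given NumPy version. The variable names are hypothetical, and no expected dtype is stated for the integral q because, as noted above, that behavior is not considered specified and may differ between releases.

```python
import numpy as np

data = [1, 2, 3]

# q passed as a Python int (the niche case) versus as a float.
res_int_q = np.quantile(data, 1)
res_float_q = np.quantile(data, 1.0)

# The values agree; the dtypes may or may not, depending on the NumPy version.
print("q=1   ->", res_int_q, res_int_q.dtype)
print("q=1.0 ->", res_float_q, res_float_q.dtype)
```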

@aschaffer
Author

It's probably worth noting that for discontinuous methods that return one or the other end of the interval within which the quantile input falls, it might make sense to return results of the same type as the source array. But for continuous methods, or discontinuous methods that could return mid-intervals (e.g., averaged_inverted_cdf), some conversion to floating point is necessary if the source array is integer(-like).
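A minimal sketch of the mid-interval point, assuming NumPy 1.22 or later. The four-element array is an illustrative choice: with n = 4 and q = 0.5, n*q is an integer, so averaged_inverted_cdf averages the two neighbouring samples while inverted_cdf picks the lower one.

```python
import numpy as np

arr = np.array([1, 2, 3, 4], dtype=int)

# inverted_cdf returns an existing sample, so an integer dtype can hold it.
low = np.quantile(arr, 0.5, method="inverted_cdf")

# averaged_inverted_cdf returns the midpoint of samples 2 and 3 here,
# which is not representable as an integer.
mid = np.quantile(arr, 0.5, method="averaged_inverted_cdf")

print(low, low.dtype)
print(mid, mid.dtype)  # the value is 2.5
```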

@seberg
Member

seberg commented Sep 22, 2022

For the discontinuous methods we "always" return the same dtype as the input. But at least the averaged_inverted_cdf method is actually both discontinuous and interpolated (sorry, I had forgotten about that). So when I say "discontinuous" above, that one is not included, because the important thing is that it is also "interpolated" (to a degree).

For the interpolated ones, we take into account the dtype of q in a bit of a round-about way. That somewhat makes sense (NumPy rarely ignores a dtype), but could be disputed and even changed.

Luckily, for numerical types it mainly leads to that upcast to float64, with the only odd case being q=0 and q=1, and I can live with that being considered "undefined"...
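To see how the dtype of q feeds into the result for an interpolating method, a sketch like the following can be run. It uses the default linear method; the set of q dtypes and the variable names are illustrative assumptions, and the printed dtypes are left unstated since the exact rule is described above as roundabout and open to change.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 10], dtype=int)

# The same quantiles expressed in different floating-point dtypes.
q_by_dtype = {
    "float32": np.array([0.25, 0.5], dtype=np.float32),
    "float64": np.array([0.25, 0.5], dtype=np.float64),
    "longdouble": np.array([0.25, 0.5], dtype=np.longdouble),
}

result_dtypes = {name: np.quantile(arr, q).dtype for name, q in q_by_dtype.items()}

for name, dt in result_dtypes.items():
    print(f"q dtype {name:10s} -> result dtype {dt}")
```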
