deprecate scalar conversions for rank>0 arrays #10404

Closed
nschloe opened this issue Jan 14, 2018 · 36 comments · Fixed by #10615
Labels: 07 - Deprecation · 62 - Python API

Comments


nschloe commented Jan 14, 2018

NumPy allows conversion of arrays to Python scalars as long as they have size 1, i.e.,

```python
float(numpy.array(1.0))          # 1.0
float(numpy.array([1.0]))        # 1.0
float(numpy.array([[[[1.0]]]]))  # 1.0

# TypeError: only size-1 arrays can be converted to Python scalars
float(numpy.array([1.0, 1.0]))
```

I'm fine with rank-0 conversions, but I found that making the conversion conditional on array size can easily lead to inconsistencies downstream. See, for example, sympy/sympy#13924, where multiplication suddenly behaves differently depending on the length of the input array.

Moreover, this feature is actually redundant for rank>0 arrays since the value can simply be retrieved from the array via x[0].
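
(For illustration: the value of a size-1 array can always be retrieved explicitly; `x.item()` is the most general spelling, since `x[0]` still returns an array for rank>1 inputs.)

```python
import numpy

x = numpy.array([1.0])
x[0]      # 1.0 as numpy.float64 -- explicit indexing
x.item()  # 1.0 as a Python float -- works for size-1 arrays of any rank
numpy.array([[[[1.0]]]]).item()  # 1.0
```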

When grepping for "converted to Python scalar" in the numpy sources, one finds that numpy/lib/user_array.py has

"only rank-0 arrays can be converted to Python scalars."

which seems to be the more consistent solution.

I would hence like to suggest deprecating scalar conversion for rank>0, size-1 arrays.


seberg commented Jan 14, 2018

Personally, I agree with you (there is a reason we got rid of the operator.index equivalent, which was causing obvious problems), and I would encourage anyone to try it.
But… I expect there are some annoying cases to solve, and it might have a pretty large downstream impact, which would make it a very slow change indeed (I am not sure; I think I might have tried it at some point way back, heh).


eric-wieser commented Jan 15, 2018

> or rank-1 and size-1

Not even - size == 1 is all that is checked:

```python
>>> float(np.array([[[1]]]))
1.0
```

> Personally agree with you

Me too. The deprecation warning could stick around for a very long time, and making noise downstream seems pretty reasonable if that noise identifies their bugs.


njsmith commented Jan 15, 2018

I agree with @seberg -- this change would indeed make the API cleaner, but I'm dubious about whether we can actually pull it off, and whether the pain would be worth the benefit. The next step would be to try making the change and see how much stuff breaks (e.g. by running the test suites of big downstream projects like scipy, scikit-learn, astropy, ...). That'd at least give us some data to discuss.

@nschloe nschloe changed the title deprecate scalar conversions for rank-1 arrays deprecate scalar conversions for rank>0 arrays Jan 15, 2018

nschloe commented Jan 15, 2018

Thanks for the comments. I've adapted the original post to reflect the fact that arrays of larger rank can be converted to floats as well.

I understand that it can be painful to deprecate things, particularly if it's not clear how much a piece of functionality is actually used downstream. I would suggest that I make the change locally, compile scipy, sympy, pandas, scikit-learn, and astropy (any other suggestions?) against it, and report back with the experience.


nschloe commented Jan 16, 2018

Alright, I've now dug into scipy and this is what I found:

  • The use of explicit int/float conversions on size-1 arrays is not that common. In replacing them with the proper expressions, I even discovered a small bug in scipy.

  • When disallowing float conversions on size-1 arrays, the following code becomes illegal:

    ```python
    a = numpy.random.rand(10, 1)
    a[0] = numpy.array([1.0])
    # ValueError: setting an array element with a sequence.
    ```

    While arguably unclean (you should just write a[0] = 1.0), this usage is more common. It probably constitutes a roadblock for the suggested change.

I'll leave it up to you to comment and/or close this issue.


shoyer commented Jan 22, 2018

@nschloe I don't follow why that second case becomes illegal. It seems like it should be well defined, since the array a[0, :] has shape (1,). Maybe that was a bug in your implementation, possibly due to preexisting assumptions in NumPy?


nschloe commented Jan 22, 2018

> I don't follow why that second case becomes illegal.

@shoyer I'll try to explain why I think it failed. The array element a[0] in

```python
a = numpy.random.rand(10, 1)
a[0] = numpy.array([1.0])
```

is a scalar (i.e., it has the empty shape ()), and you're trying to set it with an object of shape (1,). For this to work, the rhs needs to be converted to a scalar automatically. If scalar conversion is disabled, you get an error.


eric-wieser commented Jan 22, 2018

@nschloe: That might be considered a sort of reverse broadcasting, which currently succeeds in some cases but not others:

```python
a = np.zeros((1, 3))
b = np.empty((3,))
b[...] = a   # leading 1-dimensions are removed
b[...] += a  # but this fails!
```

Thinking again, allowing a size == 1 array to decay to a scalar seems pretty consistent with broadcasting rules in general, to me - if () can magically promote to (1, 1, 1, 1), then is it any worse to allow (1, 1, 1, 1) to decay to ()?
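
For reference, the forward direction and the explicit spelling of the reverse "decay" look like this (plain NumPy, shown here for illustration):

```python
import numpy as np

s = np.array(2.0)                        # shape ()
np.broadcast_to(s, (1, 1, 1, 1)).shape   # (1, 1, 1, 1) -- forward promotion
a = np.ones((1, 1, 1, 1))
a.reshape(()).shape                      # () -- the reverse, made explicit
float(a)                                 # 1.0 -- the implicit decay under discussion
```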

@eric-wieser

@nschloe: in your branch, does `a[0, ...] = np.array([0])` succeed?


seberg commented Jan 22, 2018 via email


eric-wieser commented Jan 22, 2018

> There may be a bit annoyance with object arrays

I think that collectively, we've hit the nail on the head there - reverse broadcasting has no place in scalar assignments, because it just doesn't generalize to object arrays. I think that

```python
a[0] = some_1d_len1_array
```

ought to be forbidden, since its semantics are dtype-dependent - code written this way for numeric arrays will already fail in unexpected ways for object arrays. If the reverse broadcasting is desired by the user, then they should be forced to write

```python
a[0, ...] = some_1d_len1_array
```

which does the same thing for all dtypes. Making them write it this way eliminates a class of bugs, which is exactly what we should be aiming for with DeprecationWarnings.


Whether reverse broadcasting is actually desirable is also perhaps questionable, but you've made it clear to me that it's an orthogonal issue.
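
To make the dtype dependence concrete, a small illustration (some_1d_len1_array stands in for any length-1 1-d array; the implicit decay in the numeric case is what this thread proposes to deprecate):

```python
import numpy as np

some_1d_len1_array = np.array([1.0])

num = np.zeros(3)                  # float dtype
num[0] = some_1d_len1_array        # decays to the scalar 1.0 (the implicit conversion)

obj = np.empty(3, dtype=object)
obj[0] = some_1d_len1_array        # stores the array itself as the element

type(num[0]), type(obj[0])         # (numpy.float64, numpy.ndarray)
```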


seberg commented Jan 22, 2018

Yeah, it's orthogonal. If you attempt this, you basically have to put in place the machinery that allows deprecating the other too. But that is probably all there is to it.

EDIT: OK, it is not quite orthogonal, since for a while the behaviour between some array-likes would differ.


njsmith commented Jan 22, 2018

I think @shoyer's point was that if a has shape (10, 1), then a[0] should expand to a[0, ...] and have shape (1,), so it's not a scalar.

@eric-wieser

@njsmith: I mistakenly read that as shape (10,)


nschloe commented Feb 16, 2018

I've finally gotten around to setting up a branch for this, master...nschloe:bug-10404. Here are the results:

```python
import numpy

a = numpy.random.rand(10, 1)
a[0] = numpy.array([1.0])       # okay
a[0] = numpy.array(1.0)         # okay
a[0] = 1.0                      # okay

b = numpy.random.rand(10)
b[0] = numpy.array([1.0])       # ValueError: setting an array element with a sequence.
b[0, ...] = numpy.array([1.0])  # okay
b[0] = numpy.array(1.0)         # okay
b[0] = 1.0                      # okay
```

This all makes sense, and the error message is accurate too. Instead of recommending the insertion of `...` (as @eric-wieser suggests), I find the simple `b[0] = 1.0` more appropriate. Perhaps there are cases where `...` is actually needed, though.

Yeah, so this is a breaking change that would need a long deprecation period first. Should I turn the ValueError into a deprecation warning and submit a PR where we can discuss further, or are there other suggestions?


seberg commented Feb 16, 2018

Nothing wrong with PRs in any case, we can always close them ;). It might be a very long and possibly painful deprecation, unfortunately. About the assignment thing, currently that type of broadcasting is also supported for non-scalar LHS, which may or may not be something to look into/synchronize.

The problem with these things is that you often don't know how bad this kind of change is for downstream until they start seriously using it.

@eric-wieser

If you make a PR we can at least identify where we rely on this internally


nschloe commented Feb 16, 2018

➡️ PR #10615.

@EricCousineau-TRI

As another pain point, this behavior is a tad confusing when combined with pybind11, where you can define overloads.
When pybind11 evaluates its overloads, it first does a "no convert" pass, where it attempts to convert from Python to C++ with no (or very little) implicit conversion, passing over each overload. If this fails, it passes over the same overload set again, this time permitting implicit conversions. This overload behavior makes things great when you want to distinguish between int and float overloads, but it makes things painful when NumPy can implicitly convert its arrays to scalars while lists themselves do not get collapsed to scalars.

I ran into this when trying to write overloads for a function, say func, for argument types np.array[double] (returns "Vector") and int (returns "Int"). (I was actually trying to use this where one call constructed from an array, while another was sugar for creating a fresh array of a given size.) Some example runs:

```python
func(0)               # -> "Int"     Good.
func([0.])            # -> "Vector"  Great.
func(np.array([0.]))  # -> "Int"     What???
```

I originally ran into this when writing a custom dtype, but found this was the behavior for floats too.

I will file a pybind11 issue as well, with a workaround that is specific to pybind11, but it would have been nice if NumPy were a tad stricter about how it treats scalars vs. single-item arrays.
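
For anyone who wants to reproduce the effect without pybind11, here is a toy Python dispatcher mimicking the two-pass resolution described above (illustrative only; func and its passes are stand-ins, not pybind11's actual machinery):

```python
import numpy as np

def func(x):
    # Pass 1: exact matches only.
    if isinstance(x, int) and not isinstance(x, bool):
        return "Int"
    # Pass 2: implicit conversions, tried in declaration order.
    # NumPy lets any size-1 array pass the int() check, so np.array([0.])
    # is captured by the Int overload before the Vector one is tried.
    try:
        int(x)
        return "Int"
    except (TypeError, ValueError):
        return "Vector"

print(func(0))                 # Int
print(func([0.0]))             # Vector -- int([0.0]) raises TypeError
print(func(np.array([0.0])))   # Int    -- the surprising case
```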


pv commented Nov 30, 2019

Scalar conversion of size-1 arrays is, I think, quite a central part of NumPy semantics, and I believe no small amount of code assumes it (as scipy/scipy#11147 also seems to suggest). IMO, this would be more like NumPy 2.0 stuff.


eric-wieser commented Nov 30, 2019

> IMO, this would be more like NumPy 2.0 stuff.

Perhaps there's a compromise where we can still deprecate it in 1.x, but with a far longer deprecation period than normal - perhaps PendingDeprecationWarning would be appropriate here.
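
A minimal Python-level sketch of what that could look like (strict_float is a hypothetical name for illustration; the real change would live in NumPy's C-level conversion path):

```python
import warnings

import numpy as np

def strict_float(a):
    # Sketch of the proposed semantics: 0-d arrays convert silently,
    # rank>0 size-1 arrays warn, everything else fails as before.
    a = np.asarray(a)
    if a.ndim > 0:
        if a.size != 1:
            raise TypeError("only size-1 arrays can be converted to Python scalars")
        warnings.warn(
            "conversion of rank>0 size-1 arrays to scalars is deprecated; "
            "use a.item() or index the array instead",
            PendingDeprecationWarning,
            stacklevel=2,
        )
    return float(a.item())

strict_float(np.array(1.0))    # 1.0 -- 0-d conversion stays silent
strict_float(np.array([1.0]))  # 1.0 -- warns (hidden by default warning filters)
```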


seberg commented Nov 30, 2019

@eric-wieser how are we going to deprecate/futurewarn, though? Return new scalars that warn when in-place operations happen, or when someone tries to hash them? I suppose that is actually possible, but I guess it might be very noisy. OTOH, considering how much I dislike this behaviour, maybe there is a point in trying it out, even if we hide it behind some env variable for now.


eric-wieser commented Nov 30, 2019

Isn't it as simple as replacing the ValueError in the linked PR with a warning?


pv commented Nov 30, 2019

I don't think the length of the deprecation period matters here --- either the API is kept stable or it is not. The arguments for what is gained by this don't seem so clearly laid out above (or at least I didn't get them by reading it twice :). Unlike with strict casting and float indices, it's not as clear that the presence of this coercion in code would usually be a bug (only the sympy and pybind11 corner cases are mentioned above; I'm not sure whether the object-array assignment is the same).

@eric-wieser

The way I see it is:

  • The current behavior hides potential bugs (which I agree we need to expand upon before actually deprecating this, preferably in the release notes)
  • It is always possible to write code that does not rely on this behavior, in all remotely recent versions of numpy
  • Even if libraries like scipy do not have any bugs caused by this coercion, it is likely that they expose a similar class of pitfall to their callers
  • Users can always silence warnings

> either the API is kept stable or it is not

Perhaps then this should be a UserWarning instead - indicating that the API is unlikely to change, but that the user is nonetheless advised not to do what they're doing.


nschloe commented Dec 1, 2019

> Perhaps then this should be a UserWarning instead -

👍

I think there's no need to hard-break people's code, so you're right: a deprecation is perhaps a bit harsh. However, when using float conversion of rank>0 arrays, you're almost always making a mistake in how you treat your arrays. A typical case would be returning [val] from an objective function for optimization when it really should be a scalar. Right now, NumPy irons out that mistake for you, but in my experience your code quality improves when you deal with it (e.g., scipy/scipy#11147).
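
A sketch of that objective-function case (brentq here is just one routine that happens to consume scalar-valued objectives; whether a given SciPy routine converts the return value via float() internally is an implementation detail, so treat this as illustrative):

```python
import numpy as np
from scipy.optimize import brentq

def f(x):
    return np.array([x - 1.0])  # mistake: rank-1, size-1 -- should be a scalar

# Today NumPy silently converts the return value to a scalar; with the
# deprecation, a warning points straight at the real mistake in f.
brentq(f, 0.0, 2.0)  # 1.0
```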

More typical anti-patterns caught by this change:

```python
import numpy
a = numpy.array([1.0e-1], dtype=numpy.float16)
# ...
val = float(a)  # you want a[0]
```

```python
import numpy
p = numpy.array([3.14])
a = numpy.array([1.0, 2 * p, 1 / p])  # whoops: a has dtype object
#
b = numpy.zeros((10, 3))
b[0] = a  # !
```


pv commented Dec 1, 2019

float(a) and a[0] mean different things --- the former is an explicit scalar conversion and type cast, whereas the latter extracts a (potentially >0-dim) sub-array from an array, and the two are not interchangeable. Safe scalar coercion can be done with a.reshape([]) ... but that's quite ugly.
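
Side by side, the explicit spellings look like this (standard NumPy; reshape(()) is equivalent to the reshape([]) above):

```python
import numpy as np

a = np.array([1.0])

float(a.item())       # explicit: pull out the single element, then cast
float(a.reshape(()))  # go through a 0-d view; 0-d conversion is uncontroversial
a[0]                  # indexing: yields np.float64(1.0), a sub-element, not a cast
```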


nschloe commented Dec 1, 2019

> float(a) and a[0] mean different things

Indeed. And until now, in many cases where the user should have used a[0], you'll find float(a)/int(a) instead (even in scipy). Now you get a warning for that.

@eric-wieser

@nschloe, I think your scipy PR might be fixing the symptom and not the cause - for anyone else reading along, I've been trying a different approach in scipy/scipy#11158.


pv commented Dec 1, 2019

In scipy/scipy#11147, the majority of the changes do not appear to be actual bugs --- most of them are extracting a Python scalar of a given type from an array known to have size 1 (with an exception raised if not), i.e., float(x) as shorthand for float(x.item()).


seberg commented Dec 2, 2019

In principle, arr.item() means something very similar. Although I think it always defaults to .item(0) and will not notify you that the array has additional dimensions. The main issue with float() is that it uses reversed broadcasting rules. Also, we already deprecated this for __nonzero__ because of the problems of inconsistent indexing.
I agree that there is a lot of correct usage out there, though. Similar to squeeze: although it has problems, it is probably often used correctly.
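
For concreteness, current .item() behavior (as checked against recent NumPy; note that the no-argument form does insist on size 1):

```python
import numpy as np

np.array([[1.5]]).item()      # 1.5 -- size 1; extra dimensions are ignored
np.array([1.5, 2.5]).item(0)  # 1.5 -- explicit flat index, works for any size
# np.array([1.5, 2.5]).item() raises:
# ValueError: can only convert an array of size 1 to a Python scalar
```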

seberg pushed a commit that referenced this issue Apr 20, 2023
This PR reflects some of the progress achieved in issue #10404 and is used to assess the impact of the changes.

With the changes in this PR, `float(numpy.array([1.0]))` now gives a warning; likewise some other things:
```python
import numpy

a = numpy.random.rand(10, 1)
a[0] = numpy.array([1.0])       # okay
a[0] = numpy.array(1.0)         # okay
a[0] = 1.0                      # okay

b = numpy.random.rand(10)
b[0] = numpy.array([1.0])       # ValueError: setting an array element with a sequence.
b[0, ...] = numpy.array([1.0])  # okay
b[0] = numpy.array(1.0)         # okay
b[0] = 1.0                      # okay
```
This aligns the behavior of numpy arrays with that of lists:
```python
float([3.14]) 
```
```
TypeError: float() argument must be a string or a number, not 'list'
```
```python
import numpy as np

a = np.random.rand(5)
a[0] = [3.14]
```
```
ValueError: setting an array element with a sequence.
```

Fixes #10404.

seberg commented Apr 20, 2023

I have decided to give the deprecation a shot. I think the situation is a bit better now (SciPy, pandas, and matplotlib should be pretty much clean). Also, the changes in SciPy were by far the largest, which gives me hope that downstream libraries are not as badly affected.

That said, I do suspect we should take this slow, and I am happy to revert if we notice that it is painful for downstream/end-users. (Or anyone here has concerns.)

@rgommers

Thanks! I like this approach; hopefully it will not be too bumpy of a ride.

@josef-pkt

This looks pretty horrible for statsmodels; we need to add a million squeezes to 1-d, one-element arrays.
statsmodels/statsmodels#8821 (comment)

Does the same occur when assigning, e.g., a 2-dim row to a 1-dim row slice of a 2-dim array?
(I don't know whether statsmodels has cases with an extra dim in higher-dim assignments.)

(Aside: the current test failures with this are only in 2 unit tests that check that no warning is raised)


kalvdans commented May 9, 2023

I tried the patch on our numpy-heavy codebase at my company by putting
numpy @ git+https://github.com/numpy/numpy@main in requirements.txt. Of the 11065 unit-tested code lines, 12 needed to change. Most were places with an index-remapping array, like linestrengths[elements == Z], where Z is the atomic number we are interested in and elements is a list of atomic numbers of the same length as the linestrengths array. This expression returns a one-element vector that was later used as a scalar.


nschloe commented May 9, 2023

@kalvdans Thanks for sharing! Having to explicitly convert the expression to a scalar makes sense to me. After all, the result of the operation could in principle have any size:

```python
idx = elements == Z
assert np.sum(idx) == 1
idx = np.flatnonzero(idx)[0]  # position of the single match (not idx[0], which is just a bool)

res = linestrengths[idx]
```

Explicit is better than implicit

jarrodmillman added a commit to jarrodmillman/scikit-image that referenced this issue Aug 14, 2023
jarrodmillman added a commit to scikit-image/scikit-image that referenced this issue Aug 17, 2023
* Fix for NumPy 1.25

* Revert #6973

* Revert "Don't test numpy prerelease on azure (#6996)"

This reverts commit e57c23d.

* Skip warning check on older, noisier versions of NumPy

* Remove upperpin

* Use stable sort in _raveled_offsets_and_distances

NumPy enabled optimizations for quicksort on AVX-512 enabled
processors [1]. We sort based on neighbor distances which can be equal.
The default quicksort is not stable and could have different behavior on
AVX-512. Test this assumption.

[1] https://numpy.org/devdocs/release/1.25.0-notes.html#faster-np-argsort-on-avx-512-enabled-processors

* Fix forgotten doctest in _offsets_to_raveled_neighbors

after updating to stable sort internally.

* Handle rank>0 scalar arrays

numpy/numpy#10404

* Use stable sort

* Use stable sort in max_tree (debug)

See if it does anything.

* Use stable sort in _get_high_intensity_peaks (debug)

See if it does anything.

* Upper pin numpy<1.26 for tests

---------

Co-authored-by: Stefan van der Walt <stefanv@berkeley.edu>
Co-authored-by: Lars Grüter <lagru@mailbox.org>