ENH: Improve performance of np.broadcast_arrays and np.broadcast_shapes #26160
Conversation
Nice! To me, it seems fine that arrays are only made read-only if their shape actually changes, so what you have would seem an improvement, even if it means the tests have to be adjusted slightly.
Some queries inside about what gives the real improvement in time.
Also, small thing, but could you ensure the indentation remains correct?
```diff
@@ -546,13 +546,12 @@ def broadcast_arrays(*args, subok=False):
     # return np.nditer(args, flags=['multi_index', 'zerosize_ok'],
     #                  order='C').itviews

-    args = tuple(np.array(_m, copy=None, subok=subok) for _m in args)
+    args = [np.array(_m, copy=None, subok=subok) for _m in args]
```
Does this really help? I thought that these days a list comprehension and creating a tuple via an iterator made very little speed difference, and `*args` should be very slightly faster for a tuple.
It does help, although you are right that the cost of list comprehensions has gone down. On Python 3.12:

```python
In [16]: import dis
    ...: args = [1, 2, 3]
    ...: %timeit tuple(2*j for j in args)
    ...: %timeit tuple([2*j for j in args])
298 ns ± 22.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
108 ns ± 0.781 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```

So about 200 ns for a small size, which does matter in the benchmarks.
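The comparison can also be reproduced with the stdlib `timeit` module outside IPython (a sketch; absolute numbers depend on machine and Python version):

```python
import timeit

args = [1, 2, 3]

# Building a tuple from a generator expression vs. from a list comprehension.
# Both produce the same tuple; the list-comprehension path is faster because
# the generator protocol adds per-item overhead.
t_gen = timeit.timeit(lambda: tuple(2 * j for j in args), number=100_000)
t_list = timeit.timeit(lambda: tuple([2 * j for j in args]), number=100_000)

print(f"generator -> tuple: {t_gen:.4f} s")
print(f"list comp -> tuple: {t_list:.4f} s")
```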
Yes, I confirm this. It is funny because I really thought Python had solved the speed difference, but clearly I was wrong.
```python
shape = _broadcast_shape(*args)

if all(array.shape == shape for array in args):
```
This is a nice catch.
```python
if all(array.shape == shape for array in args):
    # Common case where nothing needs to be broadcasted.
    return args
result = [array if array.shape == shape
```
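The fast path under discussion can be sketched as follows (a hypothetical `broadcast_arrays_sketch`, not the actual numpy implementation, which also handles `subok`, copy semantics, and the writeability question below):

```python
import numpy as np

def broadcast_arrays_sketch(*args):
    # Simplified illustration of the fast path: if every input already has
    # the common broadcast shape, return the inputs without creating views.
    arrays = [np.asarray(a) for a in args]
    shape = np.broadcast_shapes(*(a.shape for a in arrays))
    if all(a.shape == shape for a in arrays):
        # Common case: nothing needs to be broadcast.
        return tuple(arrays)
    # Only arrays whose shape differs get a (read-only) broadcast view.
    return tuple(a if a.shape == shape else np.broadcast_to(a, shape)
                 for a in arrays)

x, y = broadcast_arrays_sketch(np.ones((3, 1)), np.ones((1, 4)))
print(x.shape, y.shape)  # (3, 4) (3, 4)
```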
Similar to the above, I wonder if it is actually slower to directly return `tuple(array if array.shape ...)`?
Just saw the emails and hadn't noticed this, so a brief chiming in. I am on edge about the "unpredictable" readonly flag: that is the exact kind of thing that the tuple vs. list return discussion was about, although I don't know if e.g. … (the magic …). EDIT: Actually, probably need …
Sorry for the delay - the changes look great to me now, thanks!
```diff
@@ -478,7 +478,7 @@ def broadcast_shapes(*args):
     >>> np.broadcast_shapes((6, 7), (5, 6, 1), (7,), (5, 1, 7))
     (5, 6, 7)
     """
-    arrays = [np.empty(x, dtype=[]) for x in args]
+    arrays = [np.empty(x, dtype=bool) for x in args]
```
This may rely on the kernel giving us virtual memory very quickly, but that seems fine.
Actually, I didn't quite see what the old version did: `dtype=[]` gives a structured dtype with size 0. Why is `dtype=bool` better? Would it be equally fast if we defined an `empty_dtype = np.dtype([])` outside and used that?
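For reference, `np.dtype([])` is a structured dtype with itemsize 0, so helper arrays created with it allocate essentially no data regardless of shape (a quick illustration):

```python
import numpy as np

empty_dtype = np.dtype([])      # structured dtype with no fields
print(empty_dtype.itemsize)     # 0
print(np.dtype(bool).itemsize)  # 1

# With a zero-itemsize dtype, the array carries no data to speak of,
# no matter how large its shape is:
a = np.empty((10_000, 10_000), dtype=empty_dtype)
print(a.nbytes)                 # 0

# With dtype=bool, the buffer grows with the number of elements:
b = np.empty((10, 10), dtype=bool)
print(b.nbytes)                 # 100
```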
Ah, nice catch, that might remove the need for the micro-optimization and give us an (effectively) 0-sized array.
I think it's fine, but if someone follows up, that would be great!
See #26599 - the effect is actually quite bad for large array sizes so worth fixing.
@mhvk Thanks for picking this up! I agree with the analysis.
OK, fair, I guess sometimes it was already not readonly. Although I do wonder if always creating the view wouldn't be better. But thanks @eendebakpt and @mhvk.
This makes the speed independent of the actual shapes (as it used to be before gh-26160), but still fast.
Results:
Benchmark script:
There are some tests failing/modified due to the results not being writable. Maybe that is ok, but I am not sure.
See DEP: finish deprecating readonly result from `numpy.broadcast_arrays` and the links in that issue.
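To see why the read-only question comes up at all: broadcast results are views that repeat elements via zero strides along the stretched axes, so several "logical" elements share memory and writing through the view is unsafe. A small demonstration (the value of the `writeable` flag has varied across numpy versions, which is what the deprecation issue is about):

```python
import numpy as np

a = np.ones((3, 1))
b = np.ones((1, 4))
x, y = np.broadcast_arrays(a, b)

# The stretched axis has stride 0: no data is copied, elements are shared.
print(x.shape, x.strides)   # (3, 4), with stride 0 on the stretched axis
print(x.flags.writeable)    # depends on the numpy version in use
```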
Notes:

- The main gain comes from avoiding `_broadcast_to` on arrays that do not require broadcasting. Also, two generators are avoided. `np.broadcast_arrays` matters for the argument parsing in the random distributions of `scipy.stats`.
- An alternative is `np.nditer`. Adding … makes the method much faster, but there are two failing tests, both related to the result of `np.nditer` being read-only (output included in the details below). I could not find an option to make the output of `np.nditer` writable.
- `np.broadcast_shapes` is made faster (and more memory efficient) by selecting a more efficient `dtype` for the helper arrays.
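The `np.nditer` alternative mentioned in the notes corresponds to the one-liner commented out in the source; a sketch showing that its broadcast views come back read-only, which is the problem described above:

```python
import numpy as np

a = np.ones((3, 1))
b = np.ones((1, 4))

# itviews gives broadcast views of the operands. nditer operands are
# read-only by default, so these views cannot serve as writable
# broadcast_arrays results.
views = np.nditer((a, b), flags=['multi_index', 'zerosize_ok'],
                  order='C').itviews
print([v.shape for v in views])   # [(3, 4), (3, 4)]
print(views[0].flags.writeable)   # False
```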