ENH: Improve performance of np.broadcast_arrays and np.broadcast_shapes #26160
Conversation
Nice! To me, it seems fine that arrays are only made read-only if their shape actually changes, so what you have would seem an improvement, even if it means the tests have to be adjusted slightly.
Some queries inside about what gives the real improvement in time.
Also, small thing, but could you ensure the indentation remains correct?
```diff
@@ -546,13 +546,12 @@ def broadcast_arrays(*args, subok=False):
     # return np.nditer(args, flags=['multi_index', 'zerosize_ok'],
     #                  order='C').itviews

-    args = tuple(np.array(_m, copy=None, subok=subok) for _m in args)
+    args = [np.array(_m, copy=None, subok=subok) for _m in args]
```
Does this really help? I thought that these days a list comprehension and creating a tuple via an iterator made very little speed difference, and `*args` should be very slightly faster for a tuple.
It does help, although you are right that the cost of list comprehensions has gone down. On Python 3.12:

```python
In [16]: import dis
    ...: args = [1, 2, 3]
    ...: %timeit tuple(2*j for j in args)
    ...: %timeit tuple([2*j for j in args])
298 ns ± 22.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
108 ns ± 0.781 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```

So about 200 ns for a small size, which does matter in the benchmarks.
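The comparison can also be reproduced with the stdlib `timeit` module outside IPython (a sketch; absolute numbers depend on machine and Python version):

```python
import timeit

args = [1, 2, 3]

# Building a tuple from a generator expression vs. from a list comprehension.
# Both produce the same tuple; the list-comprehension path is faster because
# the generator protocol adds per-item overhead.
t_gen = timeit.timeit(lambda: tuple(2 * j for j in args), number=100_000)
t_list = timeit.timeit(lambda: tuple([2 * j for j in args]), number=100_000)

print(f"generator -> tuple: {t_gen:.4f} s")
print(f"list comp -> tuple: {t_list:.4f} s")
```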
Yes, I confirm this. It is funny because I really thought Python had solved the speed difference, but clearly I was wrong.
```python
shape = _broadcast_shape(*args)

if all(array.shape == shape for array in args):
```
This is a nice catch.
```python
if all(array.shape == shape for array in args):
    # Common case where nothing needs to be broadcasted.
    return args
result = [array if array.shape == shape
```
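The fast path under discussion can be sketched as follows (a hypothetical `broadcast_arrays_sketch`, not the actual numpy implementation, which also handles `subok`, copy semantics, and the writeability question below):

```python
import numpy as np

def broadcast_arrays_sketch(*args):
    # Simplified illustration of the fast path: if every input already has
    # the common broadcast shape, return the inputs without creating views.
    arrays = [np.asarray(a) for a in args]
    shape = np.broadcast_shapes(*(a.shape for a in arrays))
    if all(a.shape == shape for a in arrays):
        # Common case: nothing needs to be broadcast.
        return tuple(arrays)
    # Only arrays whose shape differs get a (read-only) broadcast view.
    return tuple(a if a.shape == shape else np.broadcast_to(a, shape)
                 for a in arrays)

x, y = broadcast_arrays_sketch(np.ones((3, 1)), np.ones((1, 4)))
print(x.shape, y.shape)  # (3, 4) (3, 4)
```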
Similar to the above, I wonder if it is actually slower to directly return `tuple(array if array.shape ...)`?
Just saw the emails and hadn't noticed this, so a brief chiming in. I am on edge about the "unpredictable" readonly flag: that is the exact kind of thing that the tuple vs. list return discussion was about, although I don't know if e.g. … (the magic …). EDIT: Actually, probably need …
Sorry for the delay - the changes look great to me now, thanks!
```diff
@@ -478,7 +478,7 @@ def broadcast_shapes(*args):
     >>> np.broadcast_shapes((6, 7), (5, 6, 1), (7,), (5, 1, 7))
     (5, 6, 7)
     """
-    arrays = [np.empty(x, dtype=[]) for x in args]
+    arrays = [np.empty(x, dtype=bool) for x in args]
```
This may rely on the kernel giving us virtual memory very quickly, but that seems fine.
Actually, I didn't quite see what the old version did: `dtype=[]` gives a structured dtype with size 0. Why is `dtype=bool` better? Would it be equally fast if we defined an `empty_dtype = np.dtype([])` outside and used that?
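For reference, `np.dtype([])` is a structured dtype with itemsize 0, so helper arrays created with it allocate essentially no data regardless of shape (a quick illustration):

```python
import numpy as np

empty_dtype = np.dtype([])      # structured dtype with no fields
print(empty_dtype.itemsize)     # 0
print(np.dtype(bool).itemsize)  # 1

# With a zero-itemsize dtype, the array carries no data to speak of,
# no matter how large its shape is:
a = np.empty((10_000, 10_000), dtype=empty_dtype)
print(a.nbytes)                 # 0

# With dtype=bool, the buffer grows with the number of elements:
b = np.empty((10, 10), dtype=bool)
print(b.nbytes)                 # 100
```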
Ah, nice catch, that might remove the need for the micro-optimization and give us an (effectively) 0-sized array.
I think it's fine, but if someone follows up, that would be great!
See #26599 - the effect is actually quite bad for large array sizes so worth fixing.
@mhvk Thanks for picking this up! I agree with the analysis.
OK, fair, I guess sometimes it was already not readonly. Although I do wonder if always creating the view wouldn't be better. But thanks @eendebakpt and @mhvk.
This makes the speed independent of the actual shapes (as it used to be before gh-26160), but still fast.
Results:
Benchmark script:
There are some tests failing/modified due to the results not being writable. Maybe that is ok, but I am not sure.
See DEP: finish deprecating readonly result from `numpy.broadcast_arrays` and the links in that issue.
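To see why the read-only question comes up at all: broadcast results are views that repeat elements via zero strides along the stretched axes, so several "logical" elements share memory and writing through the view is unsafe. A small demonstration (the value of the `writeable` flag has varied across numpy versions, which is what the deprecation issue is about):

```python
import numpy as np

a = np.ones((3, 1))
b = np.ones((1, 4))
x, y = np.broadcast_arrays(a, b)

# The stretched axis has stride 0: no data is copied, elements are shared.
print(x.shape, x.strides)   # (3, 4), with stride 0 on the stretched axis
print(x.flags.writeable)    # depends on the numpy version in use
```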
Notes:

- The main gain comes from avoiding `_broadcast_to` on arrays that do not require broadcasting. Also, two generators are avoided. `np.broadcast_arrays` matters for the argument parsing in the random distributions of `scipy.stats`.
- An alternative is `np.nditer`. Adding … makes the method much faster, but there are two failing tests, both related to the result of `np.nditer` being read-only (output included in the details below). I could not find an option to make the output of `np.nditer` writable.
- `np.broadcast_shapes` is made faster (and more memory efficient) by selecting a more efficient `dtype` for the helper arrays.
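The `np.nditer` alternative mentioned in the notes corresponds to the one-liner commented out in the source; a sketch showing that its broadcast views come back read-only, which is the problem described above:

```python
import numpy as np

a = np.ones((3, 1))
b = np.ones((1, 4))

# itviews gives broadcast views of the operands. nditer operands are
# read-only by default, so these views cannot serve as writable
# broadcast_arrays results.
views = np.nditer((a, b), flags=['multi_index', 'zerosize_ok'],
                  order='C').itviews
print([v.shape for v in views])   # [(3, 4), (3, 4)]
print(views[0].flags.writeable)   # False
```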