
ENH: add scalar special cases for boolean logical loops #8924

Closed · wants to merge 1 commit

Conversation

juliantaylor (Contributor):

Scalar logical loops just boil down to memcpy or memset. While not very
useful in general, some cases like masked arrays can profit when
creating zero stride boolean arrays for readonly views.
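
To illustrate the idea, a minimal sketch (assuming one-byte NumPy booleans; the function name and signature are made up and are not the PR's actual loop code): a boolean OR against a zero-stride scalar operand collapses to memset or memcpy.

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical sketch, not the PR's code: when the second operand of
     * a boolean OR loop has zero stride (a scalar), the whole inner loop
     * collapses to memset or memcpy depending on the scalar's value. */
    static void
    bool_or_with_scalar(unsigned char *out, const unsigned char *in,
                        unsigned char scalar, size_t n)
    {
        if (scalar) {
            memset(out, 1, n);   /* x | True == True for every element */
        }
        else {
            memcpy(out, in, n);  /* x | False == x, a plain copy */
        }
    }

logical_and is symmetric: a True scalar reduces to memcpy, a False scalar to memset(out, 0, n).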

juliantaylor (Contributor, author):

See gh-8910 for a masked-array change that makes use of these loops.

eric-wieser (Member):

How do timings compare for cases that do and don't take advantage of this?

juliantaylor (Contributor, author):

It should easily be a factor of 8-16; an SSE unit can perform 16 boolean ops at once.
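
For context on where that factor comes from, a rough SSE2 sketch (illustrative only, not NumPy's actual SIMD loop; the name and signature are made up): one 128-bit OR handles 16 boolean bytes per instruction, versus one byte per iteration for a plain scalar loop.

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    /* Illustrative sketch: OR 16 boolean bytes per SSE2 instruction,
     * with a scalar tail for the remainder. */
    static void
    bool_or_sse2(unsigned char *out, const unsigned char *a,
                 const unsigned char *b, size_t n)
    {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(out + i), _mm_or_si128(va, vb));
        }
        for (; i < n; i++) {
            out[i] = a[i] | b[i];  /* scalar tail */
        }
    }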

eric-wieser (Member) commented Apr 11, 2017:

I'm more concerned about the impact of the branching for small loop sizes when we aren't in the special case this is optimizing for.

What determines the number of elements in the inner loop? Does the ufunc always pick a contiguous dimension, or the longest one, or what?

juliantaylor (Contributor, author) commented Apr 11, 2017:

The ufunc machinery overhead is an order of magnitude higher than the cost of the inner loops. The code inside them only starts mattering when the arrays become larger than a few thousand elements.

The inner loop spans the full array if the operation can be expressed via the strides without buffering.
With buffering (casting, broadcasting, reductions), the inner loop processes chunks of 8192 contiguous elements (configurable via np.setbufsize).

eric-wieser (Member) commented Apr 12, 2017:

This helps, but it's still slower than working with the full arrays (on my MSVC build, anyway).

Setup:

In [1]: b = np.ma.make_mask_none((100, 100), writeable=False)
In [2]: f = np.ma.make_mask_none((100, 100), writeable=True)

Before:

In [3]: %timeit b | b
100000 loops, best of 3: 12.7 µs per loop

In [4]: %timeit b | f
100000 loops, best of 3: 12.5 µs per loop

In [5]: %timeit f | f
1000000 loops, best of 3: 1.86 µs per loop

After:

In [3]: %timeit b | b
100000 loops, best of 3: 8.83 µs per loop

In [4]: %timeit b | f
100000 loops, best of 3: 2.59 µs per loop

In [5]: %timeit f | f
1000000 loops, best of 3: 1.82 µs per loop

juliantaylor force-pushed the scalar-bool branch 2 times, most recently from dd8f47e to 53f9d22 on April 12, 2017 at 10:24.
juliantaylor (Contributor, author):

I didn't add code for scalar | scalar as that probably doesn't happen in practice; the masked-array code usually checks both operands for nomask before doing a masked operation. But I have added it now and cleaned up the code a bit.

That it is still slower than the full-array case is actually due to the iterator: in the full-array case a fastpath skips the nditer setup. Apparently this fastpath does not trigger when a zero-d array is involved, so the expensive iterator gets set up. It should still be good enough; the mask operation is only a fraction of the cost of the full masked-array operation.

Review comment on the new inner-loop code (excerpt):

     */
    if (steps[0] == 0) {
        if (steps[1] == 0) {
            BOOL_SCALAR_OP(1, 1, ip2, (*args[0] @OP@ *args[1]));
Reviewer (Member):

I think we might want const bool val = *args[0] @OP@ *args[1], so that it doesn't get calculated repeatedly inside the loop?

juliantaylor (Contributor, author) replied:

Shouldn't be a relevant code path, but why not.
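
For illustration, a sketch of the suggested hoisting written as a plain C function, outside NumPy's @OP@ loop templating (the function name and signature here are made up):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical sketch of the reviewer's suggestion: when both input
     * strides are zero, compute the scalar result once and fill the
     * output with it, instead of re-evaluating the op per element. */
    static void
    bool_or_scalar_scalar(unsigned char *out, const unsigned char *in1,
                          const unsigned char *in2, size_t n)
    {
        const unsigned char val = *in1 | *in2;  /* hoisted out of the loop */
        memset(out, val, n);
    }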

charris (Member) commented Jan 25, 2021:

Does the SIMD work supersede this? @Qiyu8 @seiko2plus Thoughts?

Qiyu8 (Member) commented Jan 26, 2021:

IMO, a memcpy/memset method is provided here to replace the SIMD/uloop method for special cases such as readonly masked arrays, but I don't see enough evidence to prove the benefits. If it shows better performance than SIMD/uloop, then we should accept it. OTOH, this PR reminds me that #16960 is extending the current SSE2 functions to universal intrinsics; we can run a benchmark after #16960 is merged.

Base automatically changed from master to main on March 4, 2021 at 02:03.
charris (Member) commented Apr 21, 2021:

@juliantaylor @eric-wieser I'm inclined to close this; thoughts?

charris (Member) commented Feb 20, 2023:

I'm going to close this. Thanks Julian.

charris closed this on Feb 20, 2023.