Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Added libdivide for floor divide #17727

Merged
merged 32 commits into from Dec 2, 2020

Conversation

ganesh-k13
Copy link
Member

@ganesh-k13 ganesh-k13 commented Nov 7, 2020

Improvements:

download
Specs: ryzen 3600(12 cores), 32 GB memory

Changes:

  1. Added a macro check to use libdivide or not: NPY_USE_LIBDIVIDE
  2. Added libdivide to floor divide.
  3. Optimisation to remove % in floor divide code.
  4. *Added libdivide to timedelta.
  5. *Benchmark tests for ints to show improvements.
  6. *UT for ints and timedelta floor divide.

*new

Profiler code:

https://gist.github.com/ganesh-k13/6201227c5d3d65902c6eaf71357e72b1

Note: This is just a POC to show some performance improvements

resolves: #14959

cc:
@eric-wieser
@mattip
@seberg

@mattip
Copy link
Member

mattip commented Nov 7, 2020

It seems this PR conflates two optimizations:

  • using libidivide
  • optimize the loop for the case where is2 == 0.

I would be interested to see if the second optimization alone is enough to convince the compiler to optimize the division, as was mentioned in the original issue.

@seberg seberg marked this pull request as draft November 7, 2020 21:36
@seberg
Copy link
Member

seberg commented Nov 7, 2020

Hmm, that is true, maybe we should start off with some simple optimizations (see if we can remove the % operation and what happens if optimize (or let the compiler optimize) the is2 == 0 loop.

As far as I can tell, libdivide also supports AVX2 and AVX512 (if defines are set), which we could use but will require looking into the universal intrinsics probably.

@ganesh-k13
Copy link
Member Author

I used the PEP definition of // to write the logic:

  1. Cast to float to get float division value and floor it. Hence, no mod required.
  2. Used the constant loop logic as before for compiler optimization.

Unfortunately, no improvement or not anything visible, attaching my results:
download (1)

Please see the code changes I did in the latest commit, I expected at least a 20-30% improvement, is there an error in my logic?

@mattip
Copy link
Member

mattip commented Nov 8, 2020

According to the license of libdivide, we can use the zlib variant, which allows us to distribute and modify the source. We need to be careful to document any future modifications for Universal SIMD and should give attribution in our LICENSES_bundled.txt

@ganesh-k13
Copy link
Member Author

How can we set special build environment variables for the CI to pickup, modify the description? I want NPY_USE_LIBDIVIDE to be set.

@mattip
Copy link
Member

mattip commented Nov 8, 2020

Under which build conditions would it be better not to use NPY_USE_LIBDIVIDE? Perhaps it would be better to rethink why you might want that in the first place, if at all.

@ganesh-k13
Copy link
Member Author

ganesh-k13 commented Nov 8, 2020

Will do these later this week:

  • Add proper license
  • > document any future modifications for Universal SIMD

@mattip
Copy link
Member

mattip commented Nov 9, 2020

If the header and license is not included in the tarball when doing python setup.py sdist, it should be added to ./MANIFEST.in. We should set up a CI job to make sure we can build from the sdist tarball.

Edit: s/header/header and license/

@ganesh-k13
Copy link
Member Author

ganesh-k13 commented Nov 10, 2020

The files are getting included in the tar ball @mattip :

numpy-1.20.0.dev0+d3d932e/numpy/core/include/numpy/libdivide/                                                                                                                   
numpy-1.20.0.dev0+d3d932e/numpy/core/include/numpy/libdivide/LICENSE.txt                
numpy-1.20.0.dev0+d3d932e/numpy/core/include/numpy/libdivide/libdivide.h 

@ganesh-k13
Copy link
Member Author

ganesh-k13 commented Nov 10, 2020

Currently, I am trying to break down libdivide header by following: #13516 (comment)

I hope this is the right approach to support multi-platform SIMD

@mattip
Copy link
Member

mattip commented Nov 10, 2020

I think it is fine to leave universal simd for another PR. Other reviewers: should this hit the mailing list/get a release note?

@ganesh-k13
Copy link
Member Author

ganesh-k13 commented Nov 10, 2020

Oh, sure @mattip , it seems like theres more we can do with libdivide itself, different algos we can choose from, etc. Right now it's the most generic one, aimed to work with compatibility. Will raise future PRs with experimentations.

@seberg
Copy link
Member

seberg commented Nov 10, 2020

Since this doesn't change anything user-facing, I don't think we need to hit the mailing list or even add release note. It might be nice to make a release note to summarize some speed improved also due to SIMD (maybe we should add a performance category to the release notes, since performance changes always seem to interest everyone even though they are rarely important from a compatibility perspective).

We should double check that zlib is fine for inclusion, so that this has definitely no affect on the NumPy license.

@charris
Copy link
Member

charris commented Nov 10, 2020

It might be nice to make a release note to summarize some speed improved also due to SIMD

I want some such section in the release notes and would be grateful is someone else provides it.

npy_set_floatstatus_divbyzero();
*((@type@ *)op1) = 0;
}
else if (((in1 > 0) != (in2 > 0)) && (in1 % in2 != 0)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can libdivide compute in1 % in2 for you? It seems a bit silly to use libdivide only to then perform a remainder calculation without it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly think we can avoid this % and rewrite it as postprocessing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I still don't know how removing the % gave no performance boost :). The compiler is magically optimizing something. #17727 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting. @ganesh-k13 two things: First make sure you are dividing a positive by a negative number (or vice versa), otherwise this is not hit at all. Second, was the timing difference with libdivide? I guess it might be the compiler is smart enough to optimize the modulo away, but I would be surprised if it is smart enough when libdivide is being used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They seem to have not done it yet: ridiculousfish/libdivide#9

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this changes is subtract one for rounding purproses, Now unless there is some edge case again, I think you can just do without the subtract, and then move the if to later, so that if (res < 0) && (res * in2 != in1) { res -= 1}?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right about not hitting the case, @seberg , seems like in the profile script I forgot to invert the signs. Above method seems to work, few edge cases to iron out(like <= 0, etc), will try them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found three edge cases:

  1. res is 0, then possible negative divisor/dividend
  2. divisor is 0 or -0, handled by putting inside else
  3. dividend is 0, handled by the same logic as 1.

Let me know if any more are there.
[EDIT]: Can use the same logic in sliding as well.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks for all the quick followups. I am happy with the current loop macro setups (we can always change them easily anyway).

There is only one big issue remaining from my side, and that is the empty-array case, which we have to guard against in the constant branch. (And it is slightly scary tests did not find it, although it may be that we have tests for it, and we would just have to use valgrind or so to notice the issue reliably).

numpy/core/src/umath/loops.c.src Outdated Show resolved Hide resolved
@@ -249,6 +249,29 @@ def test_division_int(self):
assert_equal(x // 100, [0, 0, 0, 1, -1, -1, -1, -1, -2])
assert_equal(x % 100, [5, 10, 90, 0, 95, 90, 10, 0, 80])

@pytest.mark.parametrize("input_dtype",
[np.int8, np.int16, np.int32, np.int64])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more than necessary, but that is fine. The different unit cases are not super interesting for the division code, but frankly a bit many tests are fine :).

numpy/core/tests/test_umath.py Outdated Show resolved Hide resolved
numpy/core/tests/test_umath.py Outdated Show resolved Hide resolved
self.x = np.arange(size)

def time_floor_divide(self, size):
self.x//8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose dividing by 8 is one of the best case scenarios? A bit curious how things behave for dividing by a less weird number? (but honestly, just curious, I trust that libdivide is worth it).

One thing I am more curious about is how much the speedup is for the small integers, like int8, etc? I guess there are not many specialized registers for those, so the upcast is probably just as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not happy with the initial tests, so I rewrote them, please let me know
[EDIT] After improved commit a5e1235

       before           after         ratio                                                                                                                                     
     [3dca0c71]       [8912ffd9]                                                                                                                                                
     <master>         <enh_14959-libdivide>                                                                                                                                     
-     2.36±0.01μs      2.13±0.04μs     0.90  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 0)                                              
-     2.11±0.01μs      1.84±0.05μs     0.87  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -43)                                            
-     2.11±0.02μs      1.79±0.01μs     0.85  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 43)                                             
-     2.19±0.01μs      1.66±0.01μs     0.76  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, -8)                                             
-     2.18±0.03μs      1.65±0.01μs     0.76  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int8'>, 8)                                              
-      45.1±0.3ms      25.6±0.05ms     0.57  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 0)                                             
-         285±2μs        137±0.2μs     0.48  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -43)                                           
-       287±0.5μs          138±1μs     0.48  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 43)                                            
-       121±0.8ms       53.5±0.2ms     0.44  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 43)
-       122±0.2ms       53.3±0.3ms     0.44  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -43)
-      115±0.08ms         44.8±1ms     0.39  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 43)
-       116±0.6ms       44.4±0.7ms     0.38  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -43)
-       304±0.5μs          107±8μs     0.35  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 8)
-         126±1ms       42.3±0.3ms     0.34  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, -8)
-       126±0.1ms      42.4±0.07ms     0.34  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int64'>, 8)
-         302±1μs          100±2μs     0.33  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, -8)
-      42.7±0.1ms      12.9±0.04ms     0.30  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 0)
-       120±0.1ms      35.1±0.09ms     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, 8)
-       118±0.2ms      34.6±0.05ms     0.29  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int32'>, -8)
-        98.8±1μs       22.5±0.2μs     0.23  bench_ufunc.CustomScalarFloorDivideInt.time_floor_divide_int(<class 'numpy.int16'>, 0)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did choose 8 to show nice results :) . But yeah we get some speedup for a number like 43 with int8 itself.

doc/release/upcoming_changes/17727.performance.rst Outdated Show resolved Hide resolved
@ganesh-k13
Copy link
Member Author

Hey @seberg, any other changes needed here?

@seberg seberg self-requested a review November 27, 2020 00:46
Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I am happy to put this in, may merge in a bit. But maybe someone else wants to do a quick pass. (marked a few nitpicks, but nothing that matters). There might be some followup with other division loops possible, but not sure.

An exhaustive test for int32 might be neat (not as a test, just to be sure), but it should not be necessary.

LICENSES_bundled.txt Outdated Show resolved Hide resolved
numpy/core/src/umath/loops.c.src Outdated Show resolved Hide resolved
numpy/core/src/umath/loops.c.src Outdated Show resolved Hide resolved
@ganesh-k13
Copy link
Member Author

Thanks @seberg , I was planning to work on the universal intrinsics. We still have not used the SSE and AVX versions of libdivide. I will make a POC like last time to see the improvements and we can take a call if it's worth the change.

some followup with other division loops possible

We have covered ints and timedeltas, are there more? Will be happy to port to them as well in a follow-up.

@seberg
Copy link
Member

seberg commented Dec 1, 2020

Hmm, might have thought in the wrong direction about other loops, I guess libdivide probably does not have any unsigned integer versions? Should just double check which divide related loops might be similar...

@ganesh-k13
Copy link
Member Author

libdivide does support unsigned 32 and 64 bit ints. I did a quick browse of the code, the other division loops are for float only. Universal intrinsics will be a follow up PR.
Regarding:

An exhaustive test for int32 might be neat

Anything in particular I can test here? Any testing strategy? The current added UT tests for all boundry case numbers.

@mattip mattip merged commit 03692a7 into numpy:master Dec 2, 2020
@mattip
Copy link
Member

mattip commented Dec 2, 2020

Thanks @ganesh-k13

@mattip
Copy link
Member

mattip commented Dec 2, 2020

Further implementation and test improvements can be in follow-on PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

use libdivide to make integer division faster
5 participants