
ENH: improve performance of lexsort for integers. #26434

Open: toobaz wants to merge 1 commit into main from int_lexsort_kind

Conversation

@toobaz (Author) commented May 14, 2024

This PR implements an alternative, more performant code path for `lexsort` when it is passed (appropriate, see below) integer keys.
It does so by compressing the levels into a single array using bit shifts and then sorting the resulting array.
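
For illustration, here is a minimal sketch of the packing idea for non-negative integer keys (this is not the PR's code; `_pack_keys` is a made-up name):

```python
import numpy as np

def _pack_keys(keys):
    # Pack several non-negative integer key arrays into one uint64 array
    # so that a single stable sort of the packed values is equivalent to
    # a lexicographic sort of the keys. np.lexsort sorts by the last key
    # first, so the last key must end up in the most significant bits.
    keys = [np.asarray(k, dtype=np.uint64) for k in keys]
    bits = [max(int(k.max()).bit_length(), 1) for k in keys]
    assert sum(bits) <= 64, "keys too large for the packed fast path"
    packed = np.zeros_like(keys[0])
    for k, nb in zip(keys[::-1], bits[::-1]):  # last (primary) key first
        packed = (packed << np.uint64(nb)) | k
    return packed

a = np.array([1, 5, 1, 4, 3])
b = np.array([9, 4, 0, 4, 0])
packed = _pack_keys([a, b])  # b is the primary key, as in np.lexsort
assert np.array_equal(np.argsort(packed, kind="stable"),
                      np.lexsort([a, b]))
```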

The results of this are, at a glance (details below):

  • ~1.75x speedup for large arrays (N>2000)
  • ... but a much larger speedup (e.g. 8x in my application) can be obtained thanks to the possibility of selecting the desired sorting algorithm
  • ~0.75x slowdown for small arrays (N<1000)
  • a very small slowdown for the (rare, I think) case where the keys are integers but too large for the new code path (as reported in the docs, this is guaranteed not to happen, for instance, if all values are below MAX_INT / 2L, L being the number of levels)

The application I'm referring to is the use of lexsort inside COO sparse matrices in scipy.

The current PR was influenced by feedback received from @ngoldbaum on the mailing list. However, I did not use C, because the Python code I wrote is vectorized anyway, and writing the same code in C would take me too much time (my C skills being full of rust) and result in less readable code.

Assuming that, despite this, the approach is of interest to numpy (if this PR is not, I guess I will make an attempt directly in scipy?),

  • I must obviously solve the problem of where to place the Python code (I don't think the current file is right, since it contains lower-level code, and cleanly introducing the imports I need would result in a circular import; this is why there currently is a horrible import inside the function): guidance on this will be highly appreciated
  • I must also understand why, at the end of the function, I need to make `axis` explicit, but I think this is related to the previous point
  • I must integrate the tests I wrote into the codebase

A few other things to note:

  • the approach is similar to the one I took in the MultiIndex engine for pandas
  • a previous version of this PR (before feedback from @ngoldbaum) would allow the user to switch the new code path on or off, but indeed the checks required to establish whether it is feasible add only negligible overhead
  • that same version would also switch to Python int if the key values were too big to fit inside a uint... but that brings a strong performance penalty compared to the old code path, so it makes no sense if the user doesn't have the choice
  • I think it would be easy to patch the C code to support different sorting algorithms... at least easy for someone a bit more fluent in C, definitely not for me. If such a patch is introduced, this PR will not guarantee the 8x speedup on nearly-sorted arrays, just the ~1.75x speedup.

Below are the results of the performance tests, where:

  • "PR" is this branch, "base" is main (previous commit)
  • "mix" refers to random arrays, "sep" to arrays where two sorted (2-level) arrays are concatenated (so on which stable sort shines, because it has linear cost)
  • "sta" refers to tests with kind=stable (in the PR)
  • "big" refers to arrays where values are so large that the new code path cannot be taken (and a minimum of overhead is wasted in order to determine this is the case)

Absolute scale: (plot "abs")

Log scale: (plot "log")

Estimation of the speedup ratio when the new code path is feasible: (plot "speedup")

... and when it is not (note that it is never taken for N < 1000, so the loss in efficiency is pure overhead): (plot "speedup_big")

The code used for these tests is available here.

@toobaz (Author) commented May 14, 2024

I'm confused... the failing test is

         v = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
         assert_equal(np.lexsort(v), 0)

at https://github.com/numpy/numpy/blob/main/numpy/_core/tests/test_regression.py#L410

... but the docs state "Each element of `keys` along the zeroth axis must be an array-like object of the same shape." So why are we checking the result - rather than expecting an error - in a case where `keys` is 1D (that is, the elements of `keys` are scalars)?!

@ngoldbaum (Member) commented:
Scalars are a kind of array-like. My reading of the docs is that this case is allowed, with N=0. The test being triggered is 18 years old, so it looks like this behavior has been stable for essentially the entire life of the library.
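
For concreteness, the behaviour being preserved (this runs on current NumPy and matches the regression test quoted above):

```python
import numpy as np

# Each element of the 1-D input is read as a scalar array-like key, so
# lexsort accepts it and simply returns 0, as the regression test asserts.
v = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(np.lexsort(v))  # 0
```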

@toobaz force-pushed the int_lexsort_kind branch 3 times, most recently from 9fb59dd to 0ca9b19 on May 14, 2024, 23:40
@toobaz (Author) commented May 15, 2024

(I see the failing test is also failing in #26436, so I guess it's unrelated to this PR...)

# Check if we are only dealing with integers:
if all([isinstance(a, np.ndarray) and
        np.issubdtype(a.dtype, np.integer)
        for a in keys]):
@mattip (Member) commented:

Is there a tab here? The formatting is strange

@toobaz (Author) replied:

@mattip Outdated, but for the record: no, I had just aligned the content of the square brackets... something similar happens in the new version at https://github.com/numpy/numpy/pull/26434/files#diff-33968b6e8acadeaac162e52ce132395af75712779b23ad0e3cb1600fb36edcc6R659

@@ -434,11 +434,41 @@ def where(condition, x=None, y=None):


@array_function_from_c_func_and_dispatcher(_multiarray_umath.lexsort)
A Member commented on the diff above:

You should be using the `__array_function__` dispatcher for Python functions, and rename this function to `lexsort`.
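
For context, the usual pattern for pure-Python functions looks roughly like this (a sketch only, not the PR's code; the overrides import path differs between NumPy versions):

```python
import functools
from numpy._core import overrides  # 'numpy.core' on older NumPy versions

array_function_dispatch = functools.partial(
    overrides.array_function_dispatch, module='numpy')


def _lexsort_dispatcher(keys, axis=None):
    # The dispatcher only yields the array-like arguments so that the
    # __array_function__ protocol can inspect them; it holds no logic.
    if isinstance(keys, tuple):
        return keys
    return (keys,)


@array_function_dispatch(_lexsort_dispatcher)
def lexsort(keys, axis=-1):
    # The pure-Python implementation (integer fast path plus a fallback
    # to the C lexsort) would live here.
    ...
```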

@ngoldbaum (Member) commented:

We looked at this at the triage meeting. Overall we think this is a neat idea but there are some issues.

First, can you add a one- or two-line comment above the first check that dispatches to the fast integer path, explaining exactly what the fast path is doing? It would also be nice if you could link to an explanation of how the compression works, particularly if you can find any discussions about doing that for a sorting algorithm.

Do we really need the new kind argument? It's not great IMO to add a keyword argument which only applies based on particular properties of the input data.

We also found the coding style to be a little bit opaque. Can you try to break up some of the longer lines that combine many operations to make the algorithm a little clearer?

This is sort of on the bubble for the kind of performance-improving contribution we'd take; we also have to balance maintainability. Improvements that make this code easier to understand would tilt the balance towards merging this.

@ngoldbaum (Member) commented:

Oh, also: it would be nice if you could add a new benchmark for lexsort to the existing benchmark suite and show that it improves performance for the case you're targeting, and also that the performance hit is negligible for the un-optimized case. Ping here if you have trouble with the benchmark suite; happy to answer questions.
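
A rough sketch of what such an asv benchmark class could look like (the parameter names mirror the results posted later in this thread; the setup code here is illustrative, not the PR's actual benchmark):

```python
import numpy as np
from .common import Benchmark  # helper base class used by numpy's asv suite

class LexSort(Benchmark):
    params = [['int64', 'float64'],
              [2, 4],
              [100, 10000, 100000, 1000000],
              ['mixed', 'separated', 'big']]
    param_names = ['dtype', 'n_keys', 'size', 'layout']

    def setup(self, dtype, n_keys, size, layout):
        rng = np.random.default_rng(0)
        if layout == 'big':
            # values too large for the packed fast path
            keys = rng.integers(2**62, 2**63 - 1, size=(n_keys, size))
        elif layout == 'separated':
            # two sorted blocks concatenated (favourable for stable sort)
            half = np.sort(rng.integers(0, 1000, size=(n_keys, size // 2)),
                           axis=1)
            keys = np.concatenate([half, half], axis=1)
        else:  # 'mixed': random values
            keys = rng.integers(0, 1000, size=(n_keys, size))
        self.keys = keys.astype(dtype)

    def time_lexsort(self, dtype, n_keys, size, layout):
        np.lexsort(self.keys)
```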

@toobaz (Author) commented May 21, 2024

Thanks again for your great feedback, @ngoldbaum.

I will soon edit my PR accordingly, but first let me take a step back:

> Do we really need the new kind argument? It's not great IMO to add a keyword argument which only applies based on particular properties of the input data.

Indeed, we do not really need it in this PR (at least for my purposes, since the default "stable" is precisely what I need to get the huge efficiency gain!).

That said, I decided to take a look at whether supporting `kind` in the current code path might in fact be low-hanging fruit.

It turned out it isn't (at least not low-hanging enough for my awful C skills). My understanding is that lexsort basically calls a regular mergesort on a rearranged version of the input keys, so I tried to simply replace mergesort with the other kinds there... and I failed.

The attempt is here: main...toobaz:numpy:lexsort_kind

... but it fails brutally, already with `np.lexsort([[2, 7], [0, 0]], kind='h')`, which returns `[1, 0]` (while `np.lexsort([[2, 7], [0, 0]])`, `np.argsort([2, 7], kind='h')` and `np.lexsort([[2, 7], [0, 0]], kind='s')` all correctly return `[0, 1]`).

If there's any chance someone good at C can identify some stupid mistake I made, so that we get `kind` supported in general, great. Otherwise, if the problem is deeper, or nobody has time to check my (other) code, I will just remove support for `kind` from this PR.
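
For reference, here is the reproducer above spelled out as a script (the `kind=` arguments to `lexsort` only exist on the experimental branch linked above, so those calls are left commented out):

```python
import numpy as np

keys = [[2, 7], [0, 0]]

# On main these agree:
print(np.lexsort(keys))                     # [0 1]
print(np.argsort([2, 7], kind='heapsort'))  # [0 1]

# On the experimental branch (main...toobaz:numpy:lexsort_kind), lexsort
# accepts a kind argument; kind='s' reportedly still returns [0 1], but
# kind='h' returns [1 0], which is the bug described above.
# print(np.lexsort(keys, kind='s'))         # [0 1] (correct)
# print(np.lexsort(keys, kind='h'))         # [1 0] (wrong)
```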

@toobaz force-pushed the int_lexsort_kind branch 5 times, most recently from f104ce3 to 60e5b39 on June 3, 2024, 12:59
@toobaz (Author) commented Jun 3, 2024

OK, never mind the `kind` argument (at least in this PR).

This new version drops it, makes the code way more readable (I hope), and adds tests and benchmarks; here are the results:

| Change   | Before [76807f0e] <main>   | After [94dea95c] <int_lexsort_kind>   |   Ratio | Benchmark (Parameter)                                                      |
|----------|----------------------------|---------------------------------------|---------|----------------------------------------------------------------------------|
| +        | 1.70±0.02μs                | 2.07±0.02μs                           |    1.22 | bench_function_base.LexSort.time_lexsort('int64', 2, 100, 'mixed')         |
| +        | 1.81±0.03μs                | 2.20±0.02μs                           |    1.22 | bench_function_base.LexSort.time_lexsort('int64', 2, 100, 'separated')     |
| +        | 1.74±0.02μs                | 2.10±0.02μs                           |    1.21 | bench_function_base.LexSort.time_lexsort('int64', 2, 100, 'big')           |
| +        | 3.06±0.01μs                | 3.51±0.02μs                           |    1.15 | bench_function_base.LexSort.time_lexsort('int64', 4, 100, 'separated')     |
| +        | 2.38±0.08μs                | 2.71±0.01μs                           |    1.14 | bench_function_base.LexSort.time_lexsort('float64', 2, 100, 'mixed')       |
| +        | 2.50±0.06μs                | 2.85±0.01μs                           |    1.14 | bench_function_base.LexSort.time_lexsort('float64', 2, 100, 'separated')   |
| +        | 2.93±0.03μs                | 3.34±0.02μs                           |    1.14 | bench_function_base.LexSort.time_lexsort('int64', 4, 100, 'mixed')         |
| +        | 2.46±0.08μs                | 2.79±0.01μs                           |    1.13 | bench_function_base.LexSort.time_lexsort('float64', 2, 100, 'big')         |
| +        | 2.99±0.03μs                | 3.38±0.02μs                           |    1.13 | bench_function_base.LexSort.time_lexsort('int64', 4, 100, 'big')           |
| +        | 4.11±0.03μs                | 4.61±0.04μs                           |    1.12 | bench_function_base.LexSort.time_lexsort('float64', 4, 100, 'mixed')       |
| +        | 4.20±0.1μs                 | 4.67±0.03μs                           |    1.11 | bench_function_base.LexSort.time_lexsort('float64', 4, 100, 'big')         |
| -        | 12.6±0.03ms                | 7.12±0.06ms                           |    0.56 | bench_function_base.LexSort.time_lexsort('int64', 2, 100000, 'mixed')      |
| -        | 905±2μs                    | 490±5μs                               |    0.54 | bench_function_base.LexSort.time_lexsort('int64', 2, 10000, 'mixed')       |
| -        | 173±1ms                    | 92.8±0.8ms                            |    0.54 | bench_function_base.LexSort.time_lexsort('int64', 2, 1000000, 'mixed')     |
| -        | 171±2ms                    | 18.0±0.2ms                            |    0.11 | bench_function_base.LexSort.time_lexsort('int64', 2, 1000000, 'separated') |
| -        | 907±3μs                    | 85.7±1μs                              |    0.09 | bench_function_base.LexSort.time_lexsort('int64', 2, 10000, 'separated')   |
| -        | 12.7±0.1ms                 | 962±30μs                              |    0.08 | bench_function_base.LexSort.time_lexsort('int64', 2, 100000, 'separated')  |

As expected, small overhead costs on small arrays, large gains on large arrays.

@ngoldbaum I'm pretty sure the code is still in the wrong file (forcing me to `import numpy as np` within the function) and that, probably relatedly, I'm misusing the "dispatcher"; but I don't understand your suggestion to rename `_general_lexsort` to `lexsort`: the latter is the name of the API method, which is now a Python function without a corresponding C function. Should I rename the C method instead?

That said, there are two types of errors in the tests:

  • (some?) 32-bit platforms give wrong results; I will investigate this as soon as I can find a test machine (I guess a Raspberry Pi will do)
  • the `warnings.simplefilter("ignore", RuntimeWarning)` at line 610 breaks some assertion... and I need help on this one, because I really don't understand the error. It seems to refer to a warning being raised, but in my case a warning is being ignored!

[EDIT:] an example of the error I don't understand: https://github.com/numpy/numpy/actions/runs/9351623692/job/25738229144?pr=26434
