API: Optimize np.isin and np.in1d for integer arrays and add kind= #12065

Merged
merged 50 commits into from Jun 23, 2022
Changes from 16 commits
cedba62
MAINT: Optimize np.isin for integer arrays
MilesCranmer Oct 1, 2018
d2ea819
MAINT: Optimize np.isin for boolean arrays
MilesCranmer Oct 2, 2018
bcaabe1
TST: add tests for in1d/isin
MilesCranmer Oct 3, 2018
60c71bb
BENCH: Add benchmark for integer input to np.isin
MilesCranmer Dec 25, 2018
935e3d9
DOC: Add release notes for faster np.isin
MilesCranmer Dec 26, 2018
0f6108c
MAINT: Check for overflow in integral np.isin
MilesCranmer Dec 26, 2018
d643706
TST: Extend np.in1d tests to old algorithm
MilesCranmer Dec 27, 2018
afbcdf2
DOC: Move release notes for faster np.isin
MilesCranmer Jun 10, 2022
179d157
MAINT: Fix linting errors in in1d tests
MilesCranmer Jun 10, 2022
cbf7c9c
DOC: Undo change to old release notes
MilesCranmer Jun 10, 2022
f570065
MAINT: Change `_slow_integer` parameter to `method`
MilesCranmer Jun 10, 2022
68a1acf
Add check for methods
MilesCranmer Jun 10, 2022
d1a5309
MAINT: Update tests to use new `method` argument
MilesCranmer Jun 10, 2022
281dadc
DOC: Specify constraints of method in docstring
MilesCranmer Jun 10, 2022
a8da1ef
MAINT: Fix linting error in test
MilesCranmer Jun 10, 2022
bd41739
DOC: Describe default in docstring
MilesCranmer Jun 10, 2022
93e371f
MAINT: Fix use of dispatcher for isin
MilesCranmer Jun 10, 2022
8e14d1a
MAINT: Fix error message
MilesCranmer Jun 10, 2022
67ab480
DOC: Fix text in changelog
MilesCranmer Jun 10, 2022
530ccde
MAINT: Formatting changes for in1d
MilesCranmer Jun 10, 2022
d3081b6
DOC: Improve docstring explanation
MilesCranmer Jun 10, 2022
d7e2582
MAINT: bool instead of np.bool_ dtype
MilesCranmer Jun 10, 2022
9e6bc79
DOC: Clean up isin docstring
MilesCranmer Jun 10, 2022
7cb937c
DOC: Describe memory considerations in in1d/isin
MilesCranmer Jun 11, 2022
1ef3737
DOC: Clean up change log
MilesCranmer Jun 11, 2022
6d91753
MAINT: Add back in1d tests of old method
MilesCranmer Jun 11, 2022
7a1ee13
MAINT: Fix misplaced default in in1d test
MilesCranmer Jun 11, 2022
a8677bb
MAINT: Switch to old in1d for large memory usage
MilesCranmer Jun 17, 2022
cde60ce
MAINT: Switch parameter name to 'kind' over 'method'
MilesCranmer Jun 17, 2022
34a3358
TST: Use new "kind" argument over "method"
MilesCranmer Jun 17, 2022
8f57644
MAINT: kind now uses "mergesort" instead of "sort"
MilesCranmer Jun 17, 2022
c5db8e8
MAINT: Protect against integer overflow in in1d
MilesCranmer Jun 17, 2022
31f7395
MAINT: Clean up memory checking for in1d
MilesCranmer Jun 17, 2022
3533b86
MAINT: Clean up integer overflow check in in1d
MilesCranmer Jun 17, 2022
4b62918
MAINT: RuntimeError for unexpected integer overflow
MilesCranmer Jun 17, 2022
43b4daf
TST: validate in1d errors
MilesCranmer Jun 17, 2022
76bb035
DOC: Fix formatting issues in docstring
MilesCranmer Jun 17, 2022
0c339d4
DOC: Add missing indent in docstring
MilesCranmer Jun 18, 2022
d9d4fd5
DOC: Fix list format for sphinx
MilesCranmer Jun 18, 2022
4ed458f
MAINT: change kind names for in1d
MilesCranmer Jun 18, 2022
6a3c80f
DOC: Improve clarity of in1d docstring
MilesCranmer Jun 20, 2022
75dbbea
DOC: `assume_unique` does not affect table method
MilesCranmer Jun 20, 2022
4858094
DOC: Notes on `kind` to in1d/isin docstring
MilesCranmer Jun 20, 2022
bb71875
MAINT: Minor suggestions from code review
MilesCranmer Jun 22, 2022
64de8b2
MAINT: Remove positionality of kind in isin
MilesCranmer Jun 22, 2022
408e611
DOC: Clean up change log
MilesCranmer Jun 22, 2022
c4aa5d8
DOC: Rephrase docstring of in1d/isin
MilesCranmer Jun 22, 2022
6244c06
MAINT: Fix edgecase for bool containers
MilesCranmer Jun 22, 2022
3b117e7
TST: Reduce code re-use with pytest mark
MilesCranmer Jun 22, 2022
1d3bdd1
TST: Skip empty arrays for kind="table"
MilesCranmer Jun 22, 2022
19 changes: 19 additions & 0 deletions benchmarks/benchmarks/bench_lib.py
Expand Up @@ -137,3 +137,22 @@ def setup(self, array_size, percent_nans):

    def time_unique(self, array_size, percent_nans):
        np.unique(self.arr)


class Isin(Benchmark):
    """Benchmarks for `numpy.isin`."""

    param_names = ["size", "highest_element"]
    params = [
        [10, 100000, 3000000],
        [10, 10000, int(1e8)]
    ]

    def setup(self, size, highest_element):
        self.array = np.random.randint(
                low=0, high=highest_element, size=size)
        self.in_array = np.random.randint(
                low=0, high=highest_element, size=size)

    def time_isin(self, size, highest_element):
        np.isin(self.array, self.in_array)
8 changes: 8 additions & 0 deletions doc/release/upcoming_changes/12065.performance.rst
@@ -0,0 +1,8 @@
Faster version of ``np.isin`` and ``np.in1d`` for integer arrays
----------------------------------------------------------------
``np.in1d`` (used by ``np.isin``) can now switch to a faster algorithm
(more than 10x faster in some cases) when it is given two integer arrays.
The algorithm bears similarities to a counting sort in that it
uses the ``test_elements`` argument to index a boolean helper
array with ``True`` values where elements exist. The ``element``
argument simply indexes this array of booleans.
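
The helper-array idea described in this release note can be sketched in a few lines of standalone NumPy. This is only an illustration, not the code merged in this PR: the function name `isin_int_table` is hypothetical, and the sketch assumes integer inputs whose values fit in `int64` and whose range is small enough to allocate a boolean table for.

```python
import numpy as np


def isin_int_table(element, test_elements, invert=False):
    """Illustrative lookup-table membership test (hypothetical helper)."""
    element = np.asarray(element)
    test_elements = np.asarray(test_elements).ravel()
    if not (np.issubdtype(element.dtype, np.integer)
            and np.issubdtype(test_elements.dtype, np.integer)):
        raise TypeError("this sketch only handles integer arrays")
    if test_elements.size == 0:
        # Nothing can be a member of an empty set.
        return np.full(element.shape, invert, dtype=bool)

    # Assumption: values fit in int64, so offsets cannot wrap around
    # for narrow dtypes such as int8.
    flat = element.ravel().astype(np.int64)
    tests = test_elements.astype(np.int64)
    lo = int(tests.min())
    hi = int(tests.max())

    # table[k] is True when the value lo + k occurs in test_elements.
    table = np.zeros(hi - lo + 1, dtype=bool)
    table[tests - lo] = True

    # Values outside [lo, hi] cannot be members, so only look up the rest.
    in_range = (flat >= lo) & (flat <= hi)
    out = np.zeros(flat.shape, dtype=bool)
    out[in_range] = table[flat[in_range] - lo]

    if invert:
        out = ~out
    return out.reshape(element.shape)
```

The table costs memory proportional to the value range of `test_elements` rather than to its size, which is why the PR pairs this approach with a sort-based fallback.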
103 changes: 98 additions & 5 deletions numpy/lib/arraysetops.py
Expand Up @@ -516,12 +516,13 @@ def setxor1d(ar1, ar2, assume_unique=False):
return aux[flag[1:] & flag[:-1]]


def _in1d_dispatcher(ar1, ar2, assume_unique=None, invert=None):
def _in1d_dispatcher(ar1, ar2, assume_unique=None, invert=None,
method='auto'):
return (ar1, ar2)


@array_function_dispatch(_in1d_dispatcher)
def in1d(ar1, ar2, assume_unique=False, invert=False):
def in1d(ar1, ar2, assume_unique=False, invert=False, method='auto'):
"""
Test whether each element of a 1-D array is also present in a second array.

Expand All @@ -544,6 +545,19 @@ def in1d(ar1, ar2, assume_unique=False, invert=False):
        False where an element of `ar1` is in `ar2` and True otherwise).
        Default is False. ``np.in1d(a, b, invert=True)`` is equivalent
        to (but is faster than) ``np.invert(in1d(a, b))``.
    method : {'auto', 'sort', 'dictionary'}, optional
        The algorithm to use. This will not affect the final result,
        but will affect the speed. Default is 'auto'.

        - If 'sort', will use a sort-based approach.
        - If 'dictionary', will use a key-dictionary approach similar
          to a counting sort. This is only available for boolean and
          integer arrays.
        - If 'auto', will automatically choose the method which is
          expected to perform the fastest, which depends
          on the size and range of `ar2`. For larger sizes,
          'dictionary' is chosen. For larger range or smaller
          sizes, 'sort' is chosen.

.. versionadded:: 1.8.0

Expand Down Expand Up @@ -593,6 +607,70 @@ def in1d(ar1, ar2, assume_unique=False, invert=False):
    # Ensure that iteration through object arrays yields size-1 arrays
    if ar2.dtype == object:
        ar2 = ar2.reshape(-1, 1)
    # Convert booleans to uint8 so we can use the fast integer algorithm
    if ar1.dtype == np.bool_:
        ar1 = ar1.view(np.uint8)
    if ar2.dtype == np.bool_:
        ar2 = ar2.view(np.uint8)

    # Check if we can use a fast integer algorithm:
    integer_arrays = (np.issubdtype(ar1.dtype, np.integer) and
                      np.issubdtype(ar2.dtype, np.integer))

    if method not in ['auto', 'sort', 'dictionary']:
        raise ValueError(
            "Invalid method: {0}. ".format(method)
            + "Please use 'auto', 'sort' or 'dictionary'.")

    if integer_arrays and method in ['auto', 'dictionary']:
        ar2_min = np.min(ar2)
        ar2_max = np.max(ar2)
        ar2_size = ar2.size

        # Check for integer overflow
        with np.errstate(over='raise'):
            try:
                ar2_range = ar2_max - ar2_min

                # Optimal performance is for approximately
                # log10(size) > (log10(range) - 2.27) / 0.927.
                # See discussion on
                # https://github.com/numpy/numpy/pull/12065
                optimal_parameters = (
                    np.log10(ar2_size) >
                    ((np.log10(ar2_range + 1.0) - 2.27) / 0.927)
                )
            except FloatingPointError:
                optimal_parameters = False

        # Use the fast integer algorithm
        if optimal_parameters or method == 'dictionary':

            if invert:
                outgoing_array = np.ones_like(ar1, dtype=np.bool_)
            else:
                outgoing_array = np.zeros_like(ar1, dtype=np.bool_)

            # Make elements 1 where the integer exists in ar2
            # (or 0 when `invert` is set)
            if invert:
                isin_helper_ar = np.ones(ar2_range + 1, dtype=np.bool_)
                isin_helper_ar[ar2 - ar2_min] = 0
            else:
                isin_helper_ar = np.zeros(ar2_range + 1, dtype=np.bool_)
                isin_helper_ar[ar2 - ar2_min] = 1

            # Only look up elements of ar1 inside [ar2_min, ar2_max];
            # anything outside that interval cannot be in ar2
            basic_mask = (ar1 <= ar2_max) & (ar1 >= ar2_min)
            outgoing_array[basic_mask] = isin_helper_ar[ar1[basic_mask] -
                                                        ar2_min]

            return outgoing_array
    elif method == 'dictionary':
        raise ValueError(
            "'dictionary' method is only supported for boolean or "
            "integer arrays. "
            "Please select 'sort' or 'auto' for the method."
        )


# Check if one of the arrays may contain arbitrary objects
contains_object = ar1.dtype.hasobject or ar2.dtype.hasobject
Expand Down Expand Up @@ -637,12 +715,14 @@ def in1d(ar1, ar2, assume_unique=False, invert=False):
return ret[rev_idx]


def _isin_dispatcher(element, test_elements, assume_unique=None, invert=None):
def _isin_dispatcher(element, test_elements, assume_unique=None, invert=None,
method='auto'):
return (element, test_elements)


@array_function_dispatch(_isin_dispatcher)
def isin(element, test_elements, assume_unique=False, invert=False):
def isin(element, test_elements, assume_unique=False, invert=False,
method='auto'):
"""
Calculates ``element in test_elements``, broadcasting over `element` only.
Returns a boolean array of the same shape as `element` that is True
Expand All @@ -664,6 +744,19 @@ def isin(element, test_elements, assume_unique=False, invert=False):
        calculating `element not in test_elements`. Default is False.
        ``np.isin(a, b, invert=True)`` is equivalent to (but faster
        than) ``np.invert(np.isin(a, b))``.
    method : {'auto', 'sort', 'dictionary'}, optional
        The algorithm to use. This will not affect the final result,
        but will affect the speed. Default is 'auto'.

        - If 'sort', will use a sort-based approach.
        - If 'dictionary', will use a key-dictionary approach similar
          to a counting sort. This is only available for boolean and
          integer arrays.
        - If 'auto', will automatically choose the method which is
          expected to perform the fastest, which depends
          on the size and range of `test_elements`. For larger sizes,
          'dictionary' is chosen. For larger range or smaller
          sizes, 'sort' is chosen.

Returns
-------
Expand Down Expand Up @@ -737,7 +830,7 @@ def isin(element, test_elements, assume_unique=False, invert=False):
"""
element = np.asarray(element)
return in1d(element, test_elements, assume_unique=assume_unique,
invert=invert).reshape(element.shape)
invert=invert, method=method).reshape(element.shape)
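
The `'auto'` selection rule quoted in the `in1d` comments, `log10(size) > (log10(range) - 2.27) / 0.927`, can be tried out in isolation. The sketch below is an assumption-laden illustration rather than the merged code: `prefers_table` is a hypothetical name, and the range is computed with Python integers (which cannot overflow) instead of the `np.errstate` guard used in this diff.

```python
import math

import numpy as np


def prefers_table(ar2):
    """Return True when the heuristic favours the lookup-table path.

    Hypothetical helper; mirrors only the inequality from the diff.
    """
    ar2 = np.asarray(ar2)
    if ar2.size == 0:
        return False
    # Python ints have arbitrary precision, so this subtraction cannot
    # overflow even for the extremes of int64.
    value_range = int(ar2.max()) - int(ar2.min())
    threshold = (math.log10(value_range + 1.0) - 2.27) / 0.927
    return math.log10(ar2.size) > threshold


# Ten values packed into a small range: the table is cheap to build.
dense = np.arange(10)
# Ten values spread over roughly 1e8: the table would be mostly empty.
sparse = np.arange(10) * 10**7

print(prefers_table(dense), prefers_table(sparse))
```

These two inputs correspond roughly to the smallest and largest `highest_element` settings in the `Isin` benchmark at `size=10`.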


def _union1d_dispatcher(ar1, ar2):
Expand Down