API: Optimize np.isin and np.in1d for integer arrays and add kind=
#12065
Conversation
Posted this to the mailing list as well.
I believe the flag shouldn't be necessary at all. Your optimisation should simply work if both dtypes are integral. This shows how to check for that.
Have you done any checks to see what happens to performance when there are, say, a small number of large integers? I imagine at some point the cost of creating a huge and very sparse array becomes noticeable. Of course, I'd imagine such cases (assuming they exist) to be rare in practice.
Thanks @hameerabbasi, I will make it automatic. @adeak, you make a good point. I'll run some benchmarks on my machine and post them. I suppose we could use the metric float(np.max(ar2) - np.min(ar2))/ar2.size to decide whether to apply this optimization?
Here are some benchmarks: https://docs.google.com/spreadsheets/d/1vAx_zxQqqOTXOIFRMtafpUGnd-hquYOqQpxPa7JYR4c/edit?usp=sharing (red shows where the normal version is at least twice as fast as the new version). Surprisingly, even in the very rare case of np.max(ar) - np.min(ar) = 10^10 with only 10 elements, this version of np.isin() takes just 5 seconds on my machine. So maybe we can just use this for all integer arrays? And if one wants the regular algorithm, they can cast to a floating type. Thoughts?
Sorry, I edited my comment above - what I meant is that the worst performance is for 10 elements with a range of 10^10. Perhaps we should add a metric to bypass this algorithm in cases like this after all?
Looking at the benchmarks where the original algorithm is not twice as fast as the new one, I see the following figure. The y-axis is log10(array size) and the x-axis is log10(range); the color is log10(speedup), i.e., a color value of 1 means the new code is 10x faster. The dark part appears to have a linear top, aside from the rightmost column - I'm not sure what is going on there; maybe a memory problem?

This is all approximate and specific to my own machine, but I think being approximate is okay here - we really just want to prevent extreme slow behaviour from this addition. To block out the space where the old code is <2x faster than the new code, the line log10(size) = (log10(range) - 2.27) / 0.927 cuts it appropriately. Therefore, I can add this as the check for whether the new code or old code should run.

Edit: the rightmost column is all 1 because the normal algorithm is used at 8e8 range!
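That empirical cutoff could be sketched as a small helper like the one below. The function name is made up, and the constants 2.27 and 0.927 are the machine-specific fit from the benchmarks above, so treat this as an illustration rather than the actual dispatch code:

```python
import math

def table_method_is_faster(ar2_size, ar2_range):
    # Hypothetical dispatch check based on the empirical fit above:
    # the table method wins (old code is <2x faster) roughly when
    #     log10(size) > (log10(range) - 2.27) / 0.927
    # Constants are machine-specific, from the linked benchmark sheet.
    if ar2_range <= 0:
        return True  # degenerate range: the lookup table is tiny
    return math.log10(ar2_size) > (math.log10(ar2_range) - 2.27) / 0.927

print(table_method_is_faster(10**8, 10**4))   # large, dense array: use the table
print(table_method_is_faster(10, 10**10))     # 10 elements, huge range: fall back
```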
Added all suggestions. Please let me know what you think. Cheers,
Some other feedback. I still have some concerns about the memory consumption of this algorithm versus the original one, but I won't stop this moving forward based on those.
Thanks for the suggestions, I will implement them. Regarding memory: the helper array is already capped at 8x10^8 elements before falling back to the normal method, so 800 MB max. To reach that point, you would need size >= 10**((log10(range) - 2.27) / 0.927) = 14,303,321, i.e., at least 14,303,321 elements in ar2 to be within the optimal-parameter cutoff. And to have values spanning a range of up to 8x10^8, you would need at least 32-bit integers, so your input array is already a minimum of ~57 MB. So I think you would already be dealing with memory-heavy calculations in that case, and the extra usage may be expected. What do you think?
Do you have a comparison versus the original algorithm, possibly? That would be more helpful. I'm just commenting with general tips; I know nothing of the algorithms used. In case your algorithm is worse: 800 MB might not be much on today's multi-gigabyte desktops, but we have to keep in mind IoT devices and things like Raspberry Pis as well, which is the use case I'm more concerned about. (It might make sense to add back the "switch" I asked you to remove when I wasn't aware of this earlier; I apologize for backtracking.)
So from another benchmark, it looks like the new algorithm also does better on memory usage than the traditional method, at least in one case. In the same Google sheet (sheet name: "Worst case memory usage"), I show the memory usage as recorded by the memory_profiler package: https://pypi.org/project/memory_profiler/. Memory is sampled every 0.1 seconds (I add sleep(.1) at major steps in the code to increase the resolution). For range ~ 8x10^8 - 1, I record a max memory usage of 1.8 GB for the normal algorithm and 1.4 GB for the fast algorithm. So perhaps I should just turn the range < 8e8 requirement off?
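To put rough numbers on this without a profiler: a dtype=bool lookup table costs one byte per value in the span of ar2, so its footprint can be estimated directly. The snippet below is illustrative only (the function name is made up); the sheet's measurements came from memory_profiler:

```python
import numpy as np

def table_footprint_bytes(ar2):
    # Bytes of the boolean lookup table the fast path would allocate:
    # one byte per value in [min(ar2), max(ar2)], since dtype=bool
    # stores one byte per element.
    return int(ar2.max()) - int(ar2.min()) + 1

# An int32 input spanning the 8e8 cap with ~14.3M elements (roughly the
# cutoff size discussed above) already occupies ~57 MB itself.
ar2 = np.arange(0, 8 * 10**8, 56, dtype=np.int32)
print(table_footprint_bytes(ar2) / 1e6)  # table: roughly 800 MB
print(ar2.nbytes / 1e6)                  # input: roughly 57 MB
```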
Sounds good to me, thanks for being so thorough. I appreciate your patience with my review. After my remaining comments are addressed, this gets a +1 from me.
All requests have been addressed. Thanks for your help @hameerabbasi! Cheers,
Great, I'd make sure the tests hit the new area of the code, both with bool data types and one integer data type. Other than that, this is a +1 from me.
The tests actually already hit the new functionality with their integer tests. I will add some integer tests which explicitly hit the old code (which requires range > 186 x size^0.927), as well as boolean tests, which don't exist yet.
Added suggestions and tests.
This gets a +1 from me -- Thanks @MilesCranmer for the PR.
Anything else I need to do, or people I need to email, to have this merged? Thanks,
I don't have write access. Perhaps ping the mailing list?
Does this have any impact on existing asv benchmarks?
@mattip makes a valid point about the benchmarks: if there aren't any asv benchmarks with np.isin, it would be nice to add them. This has my approval either way though; I don't want to make the barrier to entry too high.
I just looked and I couldn't find any usage of isin or in1d in the benchmarks.
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Thanks! All suggestions implemented.
The parametrization is causing some empty array tests to fail.
Thanks, fixed!
Anything else left, or could this be merged?
Looks good to me now, let's give it a shot. Thanks @MilesCranmer and everyone who reviewed! Sorry it took a bit of iterating; new API makes things a bit slower to move. It would be awesome to follow up a bit on the tests especially, but that's not particularly important. If anyone has concerns about the API addition or sees a fixup that would be good, please comment or open an issue!
Awesome, thanks so much @seberg and everyone else who helped with the PR!
Hi-five on merging your first pull request to NumPy, @MilesCranmer! We hope you stick around! |
This might have introduced a bug; take a look at #21841
Fixed in #21842. The mistake was not unit-testing empty integer arrays; those tests are now added.
I'm not sure you can do much short of special-casing the "known bad" integer-looking dtypes. There's only this one built into NumPy. This is what seems like a problem to me:

```python
>>> np.timedelta64.mro()
[numpy.timedelta64,
 numpy.signedinteger,
 numpy.integer,
 numpy.number,
 numpy.generic,
 object]
>>> np.int64.mro()
[numpy.int64,
 numpy.signedinteger,
 numpy.integer,
 numpy.number,
 numpy.generic,
 object]
>>> np.integer.__subclasses__()
[numpy.signedinteger, numpy.unsignedinteger]
>>> np.signedinteger.__subclasses__()
[numpy.int8,
 numpy.int16,
 numpy.int32,
 numpy.int64,
 numpy.longlong,
 numpy.timedelta64]
>>> np.unsignedinteger.__subclasses__()
[numpy.uint8, numpy.uint16, numpy.uint32, numpy.uint64, numpy.ulonglong]
```

It seems clear to me that dropping the pure-python … So one question remains: can users have custom integer-looking dtypes that don't work with …? Weeell, I guess you could sacrifice some purity and pull out the …
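One hedged way to express that special-casing (illustrative only, not necessarily the check the PR settled on) is to test dtype.kind rather than the scalar-type hierarchy: timedelta64 has kind 'm' even though, per the MRO above, its scalar type sits under numpy.signedinteger:

```python
import numpy as np

def looks_like_plain_integers(ar1, ar2):
    # dtype.kind is 'i'/'u' for signed/unsigned integers and 'b' for bool,
    # but 'm' for timedelta64 - so this rejects the "known bad" dtype even
    # though issubclass(np.timedelta64, np.integer) is True.
    return ar1.dtype.kind in "iub" and ar2.dtype.kind in "iub"

print(issubclass(np.timedelta64, np.integer))  # True: the surprising MRO above
print(looks_like_plain_integers(np.array([1, 2]), np.array([True])))
print(looks_like_plain_integers(np.array([1, 2]),
                                np.array([1], dtype="timedelta64[s]")))
```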
For what it's worth, timedelta64 definitely should not be a subclass of int. This can and should be changed. I have an old PR to fix this that was never finished up if someone wants to help: #10701 The only real debate was whether timedelta64 and datetime64 should share a base class or not. |
Yes, but it is a change with somewhat unclear side-effects downstream? In this case, I would prefer adding an explicit solution here for now (and test for datetime/timedelta). |
This PR fixes the issue discussed in #12065 and #21843, where timedelta64 was noted to be a subtype of numpy.integer. This should in principle detect any case where int(np.min(ar2)) fails. This PR also adds unit tests for these cases.

* TST: Create in1d test for timedelta input
* MAINT: fix in1d for timedelta input
* TST: in1d raise ValueError for timedelta input
* MAINT: Clean up type checking for isin kind="table"
* TST: Add test for mixed boolean/integer in1d
* MAINT: Increase readability of in1d type checking
* STY: Apply small code style tweaks (this is probably really mainly my personal opinion...)

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
This commit adds some simple functionality to np.isin and np.in1d that greatly increases performance when both arrays are integral. It works by creating a boolean array with elements set to True at the positions corresponding to values present in the parent array (ar2), and False otherwise. This array is then indexed by the (shifted) child array (ar1) to create the output.
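The described approach can be sketched as follows. This is a simplified illustration of the idea, not the actual NumPy implementation, and the function name is made up:

```python
import numpy as np

def isin_via_table(ar1, ar2):
    # Sketch of the lookup-table idea: build a boolean "table" covering
    # [min(ar2), max(ar2)], mark the values present in ar2, then index it
    # with the (shifted) elements of ar1.
    ar1 = np.asarray(ar1)
    ar2 = np.asarray(ar2)
    ar2_min = int(ar2.min())
    ar2_max = int(ar2.max())
    table = np.zeros(ar2_max - ar2_min + 1, dtype=bool)
    table[ar2 - ar2_min] = True
    # Elements of ar1 outside [ar2_min, ar2_max] cannot occur in ar2,
    # so only the in-range elements need a table lookup.
    in_range = (ar1 >= ar2_min) & (ar1 <= ar2_max)
    result = np.zeros(ar1.shape, dtype=bool)
    result[in_range] = table[ar1[in_range] - ar2_min]
    return result

print(isin_via_table([1, 5, 7, 6], [5, 6]))  # [False  True False  True]
```

The table costs one byte per value in the span of ar2, which is why the range-vs-size cutoff discussed earlier in the thread matters.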