
API: Optimize np.isin and np.in1d for integer arrays and add kind= #12065

Merged
merged 50 commits into numpy:main on Jun 23, 2022

Conversation

MilesCranmer
Contributor

This commit adds some simple functionality to np.isin and np.in1d that greatly increases performance when both arrays are integral. It works by creating a boolean array whose elements are set to 1 at the values present in the parent array (ar2) and 0 otherwise. This array is then indexed by the child array (ar1) to produce the output.
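The idea described above can be sketched as follows (a minimal illustration of the lookup-table approach, not the actual PR code; the function name and the handling of out-of-range values are assumptions):

```python
import numpy as np

def in1d_table(ar1, ar2):
    # Hypothetical sketch of the lookup-table idea, not the PR's code.
    # Build a boolean table spanning the value range of ar2.
    lo, hi = int(ar2.min()), int(ar2.max())
    table = np.zeros(hi - lo + 1, dtype=bool)
    table[ar2 - lo] = True
    # Values of ar1 outside [lo, hi] cannot appear in ar2.
    in_range = (ar1 >= lo) & (ar1 <= hi)
    result = np.zeros(ar1.shape, dtype=bool)
    result[in_range] = table[ar1[in_range] - lo]
    return result
```

For example, in1d_table(np.array([1, 5, 9]), np.array([5, 6, 7])) gives [False, True, False].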

@MilesCranmer
Contributor Author

Posted this to the mailing list as well

@hameerabbasi
Contributor

I believe the flag shouldn't be necessary at all. Your optimisation should simply work if both dtypes are integral. This shows how to check for that.
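For reference, the standard way to make that dtype check is np.issubdtype (shown here as a small sketch with example arrays):

```python
import numpy as np

# np.issubdtype tests whether an array's dtype falls under a given
# abstract scalar type, e.g. np.integer for all signed/unsigned ints.
ints = np.array([1, 2, 3])
floats = np.array([1.0, 2.0])

print(np.issubdtype(ints.dtype, np.integer))    # integer dtype
print(np.issubdtype(floats.dtype, np.integer))  # not an integer dtype
```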

@adeak
Contributor

adeak commented Oct 1, 2018

Have you done any checks to see what happens to performance when there are, say, a small number of large integers? I imagine at one point the cost of creating a huge and very sparse array might become noticeable. Of course I'd imagine such cases (assuming they exist) to be rare in nature.

@MilesCranmer
Contributor Author

Thanks @hameerabbasi, I will make it automatic.

@adeak you make a good point. I'll make some benchmarks on my machine and post them. I suppose we can use the metric float(np.max(ar2) - np.min(ar2))/ar2.size to decide whether to apply this optimization?
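The proposed metric can be written out as follows (a sketch of the heuristic being discussed; the function name is my own):

```python
import numpy as np

def value_density(ar2):
    # Value range divided by element count: a large ratio means the
    # boolean lookup table would be huge and very sparse.
    return float(np.max(ar2) - np.min(ar2)) / ar2.size

dense = np.arange(1000)        # ratio just under 1: table pays off
sparse = np.array([0, 10**9])  # ratio 5e8: table would be wasteful
```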

@MilesCranmer
Contributor Author

MilesCranmer commented Oct 1, 2018

Here are some benchmarks: https://docs.google.com/spreadsheets/d/1vAx_zxQqqOTXOIFRMtafpUGnd-hquYOqQpxPa7JYR4c/edit?usp=sharing

(Red color shows when the normal version is twice as fast as the new version)

Surprisingly, even in the very rare case of (np.max(ar) - np.min(ar)) = 10^10 with only 10 elements, this version of np.isin() takes only 5 seconds on my machine. So maybe we can just use this for all integer arrays? And if one wants to use the regular algorithm, they can cast to a floating type. Thoughts?

(Inline review comment on numpy/lib/arraysetops.py, resolved)
@MilesCranmer
Contributor Author

Sorry, I edited my above comment - instead I meant the worst performance is for 10 elements with a range of 10^10. Perhaps we should add some metric to bypass this algorithm in cases like this after all?

@MilesCranmer
Contributor Author

MilesCranmer commented Oct 1, 2018

So in looking at the benchmarks where the original algorithm is not twice as good as the new one, I see the following figure. The y-axis is log10(array size) and the x-axis is log10(range) for the benchmarks. The color is the log10(speedup) - i.e., a value of 1 for color means the new code is 10x faster.

[Screenshot: heatmap of log10(speedup) over log10(array size) vs. log10(range)]

The dark region looks to have a linear top, aside from the rightmost column. I'm not sure what is going on there; maybe a memory problem? This is all rough and specific to my own machine, but I think that's okay here: we really just want to prevent extreme slowdowns from this addition.

To block out the space where the old code is <2x faster than the new code, the line:

log10(size) = (log10(range) - 2.27) / 0.927

cuts it appropriately. Therefore, I can add this check for whether the new code or old code should run.

Edit: The rightmost column is all 1 since the normal algorithm is used at 8e8 range!
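The fitted cutoff can be expressed as a simple predicate (a sketch using the constants above; the function name is an assumption, and the actual PR code may differ):

```python
import math

def use_table_method(size, value_range):
    # From the benchmark fit: prefer the lookup-table method whenever
    # log10(size) > (log10(range) - 2.27) / 0.927, i.e. whenever
    # range < 10**2.27 * size**0.927 (about 186 * size**0.927).
    return math.log10(size) > (math.log10(value_range) - 2.27) / 0.927
```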

@MilesCranmer
Contributor Author

With the conditional on array parameters above, here is a histogram of the log10(speedups) for all benchmarks. As you can see, the vast majority of settings result in speedups, and there are no harsh slowdowns anywhere.

[Screenshot: histogram of log10(speedup) across all benchmarks]

@MilesCranmer
Contributor Author

Added all suggestions. Please let me know what you think.

Cheers,
Miles

@MilesCranmer MilesCranmer changed the title ENH: Add fast_integers option to np.isin and np.in1d ENH: Optimize np.isin and np.in1d for integer arrays Oct 1, 2018
@hameerabbasi (Contributor) left a comment

Some other feedback. I still have some concerns about the memory consumption of this algorithm versus the original one, but I won't stop this moving forward based on those.

(Inline review comments on numpy/lib/arraysetops.py, resolved)
@MilesCranmer
Contributor Author

Thanks for the suggestions, I will implement them.

Regarding memory: the memory usage is already capped at 8x10^8 elements for the helper array before using the normal method so 800 MB max. To get to that point, you would need greater than:

size >= 10**((log10(range) - 2.27) / 0.927) = 14,303,321

So, minimum 14,303,321 elements in ar2 for it to be within the optimal parameter cutoff. In order to have values all the way up to 8x10^8 elements, you would need at least 32-bit integers, so your input array is minimum 58 MB already. So I think you would already be dealing with memory-heavy calculations in that case so maybe it is expected. What do you think?
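That threshold comes from inverting the benchmark cutoff at the 8x10^8 cap; reproducing the arithmetic (an editorial check, not code from the PR):

```python
import math

# Smallest ar2 size at which the heuristic still chooses the table
# method for the maximum table range of 8e8 elements.
min_size = 10 ** ((math.log10(8e8) - 2.27) / 0.927)
print(round(min_size))  # on the order of 14.3 million
```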

@hameerabbasi
Contributor

hameerabbasi commented Oct 2, 2018

800 MB

Do you have a comparison versus the original algorithm possibly? That would be more helpful. I'm just commenting with general tips, I know nothing of the algorithms used.

In case your algorithm is worse: 800 MB might not be much on today's multi-gigabyte-memory desktops, but we have to keep in mind IoT devices and things like Raspberry Pis as well, which is the use-case I'm more concerned about. (It might make sense to add back the "switch" I asked you to remove when I wasn't aware of this earlier, I apologize for backtracking.)

@MilesCranmer
Contributor Author

MilesCranmer commented Oct 2, 2018

So from another benchmark, it looks like this new algorithm also does better for memory usage than the traditional method, at least for one case.

In the same Google sheet (sheet name: "Worst case memory usage"), I show the memory usage as recorded by the memory_profiler package (https://pypi.org/project/memory_profiler/). The memory is sampled every 0.1 seconds (I add sleep(.1) at major steps in the code to increase the resolution).

Basically, for parameters:

range ~ 8x10^8 - 1
size ~ 15x10^6,

I record, for the normal algorithm:

max memory usage = 1.8 GB

and the fast algorithm:

max memory usage = 1.4 GB.

So perhaps I should just turn the range < 8e8 requirement off?

@hameerabbasi
Contributor

So perhaps I should just turn the range < 8e8 requirement off?

Sounds good to me, thanks for being so thorough. I appreciate your patience with my review. After my remaining comments are addressed, this gets a +1 from me.

@MilesCranmer
Contributor Author

All requests have been added. Thanks for your help @hameerabbasi!

Cheers,
Miles

@hameerabbasi (Contributor) left a comment

Great, I'd make sure the tests hit the new area of the code, both with bool data types and one integer data type. Other than that, this is a +1 from me.

(Inline review comment on numpy/lib/arraysetops.py, resolved)
@MilesCranmer
Contributor Author

The tests actually already hit the new functionality with their integer tests. I will add some integer tests which explicitly hit the old code (which requires range > 186 x size^0.927) as well as boolean tests, which don't exist yet.

@MilesCranmer
Contributor Author

Added suggestions and tests.

@hameerabbasi (Contributor) left a comment

This gets a +1 from me -- Thanks @MilesCranmer for the PR.

@MilesCranmer
Contributor Author

Anything else I need to do/people I need to email to have this merged?

Thanks,
Miles

@hameerabbasi
Contributor

I don't have write access. Perhaps ping the mailing list?

@mattip
Member

mattip commented Oct 9, 2018

Does this have any impact on existing asv benchmarks?

@hameerabbasi (Contributor) left a comment

@mattip makes a valid point about the benchmarks: if there aren't any asv benchmarks using np.isin, it would be nice to add them. This has my approval either way, though; I don't want to make the barrier to entry too high.

(Inline review comment on numpy/lib/arraysetops.py, resolved)
@MilesCranmer
Contributor Author

I just looked and I couldn't find any usage of isin or in1d in the benchmarks.

@MilesCranmer
Contributor Author

Thanks! All suggestions implemented.

@seberg
Member

seberg commented Jun 22, 2022

I am still curious if the ar1 = ar1 + np.uint(8) is actually necessary, but it doesn't matter. Yes we can skip the pytest.raises() for now (I do think it is fine to mix it in a file).

The parametrization is causing some empty array tests to fail because [] is np.array([]) which is float64 (and thus gets correctly rejected). (Maybe more, but those are the ones I noticed.)

@MilesCranmer
Contributor Author

Thanks, fixed!

@MilesCranmer
Contributor Author

Anything else left or could this be merged?

@seberg
Member

seberg commented Jun 23, 2022

Looks good to me now, let's give it a shot. Thanks @MilesCranmer and everyone who reviewed! Sorry it took a bit of iterating; new API makes things a bit slower to move.

It would be awesome to follow-up a bit on the tests especially, but not particularly important.

If anyone has concerns about the API addition or sees a fixup that would be good, please comment or open an issue!

@seberg seberg merged commit 91e753c into numpy:main Jun 23, 2022
@seberg seberg changed the title MAINT: Optimize np.isin and np.in1d for integer arrays API: Optimize np.isin and np.in1d for integer arrays and add kind= Jun 23, 2022
@MilesCranmer
Contributor Author

Awesome, thanks so much @seberg and everyone else who helped with the PR!

@InessaPawson
Member

High-five on merging your first pull request to NumPy, @MilesCranmer! We hope you stick around!

@WarrenWeckesser
Member

This might have introduced a bug; take a look at #21841

@MilesCranmer
Contributor Author

Fixed in #21842. The mistake was not unit-testing empty integer arrays; those tests are now added.

@MilesCranmer
Contributor Author

MilesCranmer commented Jun 24, 2022

Other post-merge question: is 'timedelta64' technically a subtype of np.integer? If so I think that might break things.

(I'm trying to make a similar method for unique1d, but it has a unittest for timedelta64 which gets past the integer check)

Edit: yep, looks like this is a potential bug:
[Screenshot: REPL session showing timedelta64 passing the np.integer check]

How should I check for pure integer arrays (i.e., no datetimes or other dtypes which can't get evaluated by int())?

@adeak
Contributor

adeak commented Jun 24, 2022

How should I check for pure integer arrays (i.e., no datetimes or other dtypes which can't get evaluated by int())?

I'm not sure you can do much short of special-casing the "known bad" integer-looking dtypes. There's only this one such dtype built into NumPy.

This is what seems like a problem to me:

>>> np.timedelta64.mro()
[numpy.timedelta64,
 numpy.signedinteger,
 numpy.integer,
 numpy.number,
 numpy.generic,
 object]

>>> np.int64.mro()
[numpy.int64,
 numpy.signedinteger,
 numpy.integer,
 numpy.number,
 numpy.generic,
 object]

int64 and timedelta64 are sibling classes, which means you don't have any superdtype for int64 that doesn't include timedelta64. But it seems to be the only such edge case:

>>> np.integer.__subclasses__()
[numpy.signedinteger, numpy.unsignedinteger]

>>> np.signedinteger.__subclasses__()
[numpy.int8,
 numpy.int16,
 numpy.int32,
 numpy.int64,
 numpy.longlong,
 numpy.timedelta64]

>>> np.unsignedinteger.__subclasses__()
[numpy.uint8, numpy.uint16, numpy.uint32, numpy.uint64, numpy.ulonglong]

It seems clear to me that dropping the pure-python int range check is not an option, we should just force the weird integer-like dtypes to use the old code path.

So one question remains: can users have custom integer-looking dtypes that don't work with int? If yes, we would have to manually list the "known good" integer types, plus bool_. If we're happy with breaking on that, we could just test for "np.integer but not np.timedelta64".

Weeell, I guess you could sacrifice some purity and pull the ar2_range check out before the if integer_arrays branch. If the check fails, set integer_arrays to False...
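A dtype guard along the lines discussed here might look like the following (a hypothetical sketch; the helper name is my own, and the actual fix landed in a follow-up PR):

```python
import numpy as np

def is_plain_integer(dtype):
    # Accept bool and true integer dtypes, but reject timedelta64,
    # which also sits under np.signedinteger in the scalar hierarchy.
    if np.issubdtype(dtype, np.timedelta64):
        return False
    return np.issubdtype(dtype, np.integer) or dtype == np.bool_
```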

@shoyer
Member

shoyer commented Jun 24, 2022

For what it's worth, timedelta64 definitely should not be a subclass of int. This can and should be changed.

I have an old PR to fix this that was never finished up if someone wants to help: #10701

The only real debate was whether timedelta64 and datetime64 should share a base class or not.

@seberg
Member

seberg commented Jun 27, 2022

For what it's worth, timedelta64 definitely should not be a subclass of int. This can and should be changed.

Yes, but it is a change with somewhat unclear side-effects downstream? In this case, I would prefer adding an explicit solution here for now (and test for datetime/timedelta).

seberg added a commit that referenced this pull request Jun 29, 2022
This PR fixes the issue discussed on #12065 and #21843 where 'timedelta64' was noted to be a subtype of numpy.integer. This in principle should detect any cases where int(np.min(ar2)) fails. This PR also adds unittests for these.

* TST: Create in1d test for timedelta input

* MAINT: fix in1d for timedelta input

* TST: in1d raise ValueError for timedelta input

* MAINT: Clean up type checking for isin kind="table"

* TST: Add test for mixed boolean/integer in1d

* MAINT: Increase readability of in1d type checking

* STY: Apply small code style tweaks

This is probably really mainly my personal opinion...

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Labels: 03 - Maintenance, 30 - API, 62 - Python API, component: numpy.lib, triaged