API: Optimize np.isin and np.in1d for integer arrays and add kind=
#12065
Conversation
Posted this to the mailing list as well.
I believe the flag shouldn't be necessary at all. Your optimisation should simply work if both dtypes are integral. This shows how to check for that.
Have you done any checks to see what happens to performance when there are, say, a small number of large integers? I imagine at some point the cost of creating a huge and very sparse array becomes noticeable. Of course, I'd imagine such cases (assuming they exist) to be rare in practice.
Thanks @hameerabbasi, I will make it automatic. @adeak, you make a good point. I'll run some benchmarks on my machine and post them. I suppose we could use the metric float(np.max(ar2) - np.min(ar2))/ar2.size to decide whether to apply this optimization?
Here are some benchmarks: https://docs.google.com/spreadsheets/d/1vAx_zxQqqOTXOIFRMtafpUGnd-hquYOqQpxPa7JYR4c/edit?usp=sharing (red shows where the normal version is at least twice as fast as the new version). Surprisingly, even in the very rare case of np.max(ar) - np.min(ar) = 10^10 with only 10 elements, this version of np.isin() takes just 5 seconds on my machine. So maybe we can just use this for all integer arrays? And if one wants the regular algorithm, they can cast to a floating type. Thoughts?
Sorry, I edited my comment above - what I meant is that the worst performance is for 10 elements with a range of 10^10. Perhaps we should add a metric to bypass this algorithm in cases like this after all?
Looking at the benchmarks where the original algorithm is not twice as fast as the new one, I see the following figure. The y-axis is log10(array size) and the x-axis is log10(range); the color is log10(speedup), i.e., a color value of 1 means the new code is 10x faster. The dark part appears to have a linear top, aside from the rightmost column - I'm not sure what is going on there; maybe a memory problem?

This is all approximate and specific to my own machine, but I think being approximate is okay here - we really just want to prevent extreme slow behaviour from this addition. To block out the space where the old code is <2x faster than the new code, the line log10(size) = (log10(range) - 2.27) / 0.927 cuts it appropriately. Therefore, I can add this as the check for whether the new code or old code should run.

Edit: the rightmost column is all 1 because the normal algorithm is used at 8e8 range!
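That empirical cutoff could be sketched as a small helper like the one below. The function name is made up, and the constants 2.27 and 0.927 are the machine-specific fit from the benchmarks above, so treat this as an illustration rather than the actual dispatch code:

```python
import math

def table_method_is_faster(ar2_size, ar2_range):
    # Hypothetical dispatch check based on the empirical fit above:
    # the table method wins (old code is <2x faster) roughly when
    #     log10(size) > (log10(range) - 2.27) / 0.927
    # Constants are machine-specific, from the linked benchmark sheet.
    if ar2_range <= 0:
        return True  # degenerate range: the lookup table is tiny
    return math.log10(ar2_size) > (math.log10(ar2_range) - 2.27) / 0.927

print(table_method_is_faster(10**8, 10**4))   # large, dense array: use the table
print(table_method_is_faster(10, 10**10))     # 10 elements, huge range: fall back
```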
Added all suggestions. Please let me know what you think. Cheers,
Some other feedback. I still have some concerns about the memory consumption of this algorithm versus the original one, but I won't stop this moving forward based on those.
Thanks for the suggestions, I will implement them. Regarding memory: the helper array is already capped at 8x10^8 elements before falling back to the normal method, so 800 MB max. To reach that point, you would need size >= 10**((log10(range) - 2.27) / 0.927) = 14,303,321, i.e., at least 14,303,321 elements in ar2 to be within the optimal-parameter cutoff. And to have values spanning a range of up to 8x10^8, you would need at least 32-bit integers, so your input array is already a minimum of ~57 MB. So I think you would already be dealing with memory-heavy calculations in that case, and the extra usage may be expected. What do you think?
Do you have a comparison versus the original algorithm, possibly? That would be more helpful. I'm just commenting with general tips; I know nothing of the algorithms used. In case your algorithm is worse: 800 MB might not be much on today's multi-gigabyte desktops, but we have to keep in mind IoT devices and things like Raspberry Pis as well, which is the use case I'm more concerned about. (It might make sense to add back the "switch" I asked you to remove when I wasn't aware of this earlier; I apologize for backtracking.)
So from another benchmark, it looks like the new algorithm also does better on memory usage than the traditional method, at least in one case. In the same Google sheet (sheet name: "Worst case memory usage"), I show the memory usage as recorded by the memory_profiler package: https://pypi.org/project/memory_profiler/. Memory is sampled every 0.1 seconds (I add sleep(.1) at major steps in the code to increase the resolution). For range ~ 8x10^8 - 1, I record a max memory usage of 1.8 GB for the normal algorithm and 1.4 GB for the fast algorithm. So perhaps I should just turn the range < 8e8 requirement off?
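To put rough numbers on this without a profiler: a dtype=bool lookup table costs one byte per value in the span of ar2, so its footprint can be estimated directly. The snippet below is illustrative only (the function name is made up); the sheet's measurements came from memory_profiler:

```python
import numpy as np

def table_footprint_bytes(ar2):
    # Bytes of the boolean lookup table the fast path would allocate:
    # one byte per value in [min(ar2), max(ar2)], since dtype=bool
    # stores one byte per element.
    return int(ar2.max()) - int(ar2.min()) + 1

# An int32 input spanning the 8e8 cap with ~14.3M elements (roughly the
# cutoff size discussed above) already occupies ~57 MB itself.
ar2 = np.arange(0, 8 * 10**8, 56, dtype=np.int32)
print(table_footprint_bytes(ar2) / 1e6)  # table: roughly 800 MB
print(ar2.nbytes / 1e6)                  # input: roughly 57 MB
```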
Sounds good to me, thanks for being so thorough. I appreciate your patience with my review. After my remaining comments are addressed, this gets a +1 from me.
All requests have been addressed. Thanks for your help @hameerabbasi! Cheers,
Great, I'd make sure the tests hit the new area of the code, both with bool data types and one integer data type. Other than that, this is a +1 from me.
The tests actually already hit the new functionality with their integer tests. I will add some integer tests which explicitly hit the old code (which requires range > 186 x size^0.927), as well as boolean tests, which don't exist yet.
Added suggestions and tests.
This gets a +1 from me -- Thanks @MilesCranmer for the PR.
Anything else I need to do, or people I need to email, to have this merged? Thanks,
I don't have write access. Perhaps ping the mailing list?
Does this have any impact on existing asv benchmarks?
@mattip makes a valid point about the benchmarks: if there aren't any asv benchmarks with np.isin, it would be nice to add them. This has my approval either way though; I don't want to make the barrier to entry too high.
I just looked and I couldn't find any usage of isin or in1d in the benchmarks.
Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
Thanks! All suggestions implemented.
The parametrization is causing some empty array tests to fail.
Thanks, fixed!
Anything else left, or could this be merged?
Looks good to me now, let's give it a shot. Thanks @MilesCranmer and everyone who reviewed! Sorry it took a bit of iterating; new API makes things a bit slower to move. It would be awesome to follow up a bit on the tests especially, but that's not particularly important. If anyone has concerns about the API addition or sees a fixup that would be good, please comment or open an issue!
Awesome, thanks so much @seberg and everyone else who helped with the PR!
Hi-five on merging your first pull request to NumPy, @MilesCranmer! We hope you stick around! |
This might have introduced a bug; take a look at #21841
Fixed in #21842. The mistake was not unit-testing empty integer arrays; those tests are now added.
I'm not sure you can do much short of special-casing the "known bad" integer-looking dtypes. There's only this one built into NumPy. This is what seems like a problem to me:

```python
>>> np.timedelta64.mro()
[numpy.timedelta64,
 numpy.signedinteger,
 numpy.integer,
 numpy.number,
 numpy.generic,
 object]
>>> np.int64.mro()
[numpy.int64,
 numpy.signedinteger,
 numpy.integer,
 numpy.number,
 numpy.generic,
 object]
>>> np.integer.__subclasses__()
[numpy.signedinteger, numpy.unsignedinteger]
>>> np.signedinteger.__subclasses__()
[numpy.int8,
 numpy.int16,
 numpy.int32,
 numpy.int64,
 numpy.longlong,
 numpy.timedelta64]
>>> np.unsignedinteger.__subclasses__()
[numpy.uint8, numpy.uint16, numpy.uint32, numpy.uint64, numpy.ulonglong]
```

It seems clear to me that dropping the pure-python … So one question remains: can users have custom integer-looking dtypes that don't work with …? Weeell, I guess you could sacrifice some purity and pull out the …
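One hedged way to express that special-casing (illustrative only, not necessarily the check the PR settled on) is to test dtype.kind rather than the scalar-type hierarchy: timedelta64 has kind 'm' even though, per the MRO above, its scalar type sits under numpy.signedinteger:

```python
import numpy as np

def looks_like_plain_integers(ar1, ar2):
    # dtype.kind is 'i'/'u' for signed/unsigned integers and 'b' for bool,
    # but 'm' for timedelta64 - so this rejects the "known bad" dtype even
    # though issubclass(np.timedelta64, np.integer) is True.
    return ar1.dtype.kind in "iub" and ar2.dtype.kind in "iub"

print(issubclass(np.timedelta64, np.integer))  # True: the surprising MRO above
print(looks_like_plain_integers(np.array([1, 2]), np.array([True])))
print(looks_like_plain_integers(np.array([1, 2]),
                                np.array([1], dtype="timedelta64[s]")))
```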
For what it's worth, timedelta64 definitely should not be a subclass of int. This can and should be changed. I have an old PR to fix this that was never finished up if someone wants to help: #10701 The only real debate was whether timedelta64 and datetime64 should share a base class or not. |
Yes, but it is a change with somewhat unclear side-effects downstream? In this case, I would prefer adding an explicit solution here for now (and test for datetime/timedelta). |
This PR fixes the issue discussed in #12065 and #21843, where timedelta64 was noted to be a subtype of numpy.integer. This should in principle detect any case where int(np.min(ar2)) fails. This PR also adds unit tests for these cases.

* TST: Create in1d test for timedelta input
* MAINT: fix in1d for timedelta input
* TST: in1d raise ValueError for timedelta input
* MAINT: Clean up type checking for isin kind="table"
* TST: Add test for mixed boolean/integer in1d
* MAINT: Increase readability of in1d type checking
* STY: Apply small code style tweaks (this is probably really mainly my personal opinion...)

Co-authored-by: Sebastian Berg <sebastian@sipsolutions.net>
This commit adds some simple functionality to np.isin and np.in1d that greatly increases performance when both arrays are integral. It works by creating a boolean array with elements set to True at the positions corresponding to values present in the parent array (ar2), and False otherwise. This array is then indexed by the (shifted) child array (ar1) to create the output.
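The described approach can be sketched as follows. This is a simplified illustration of the idea, not the actual NumPy implementation, and the function name is made up:

```python
import numpy as np

def isin_via_table(ar1, ar2):
    # Sketch of the lookup-table idea: build a boolean "table" covering
    # [min(ar2), max(ar2)], mark the values present in ar2, then index it
    # with the (shifted) elements of ar1.
    ar1 = np.asarray(ar1)
    ar2 = np.asarray(ar2)
    ar2_min = int(ar2.min())
    ar2_max = int(ar2.max())
    table = np.zeros(ar2_max - ar2_min + 1, dtype=bool)
    table[ar2 - ar2_min] = True
    # Elements of ar1 outside [ar2_min, ar2_max] cannot occur in ar2,
    # so only the in-range elements need a table lookup.
    in_range = (ar1 >= ar2_min) & (ar1 <= ar2_max)
    result = np.zeros(ar1.shape, dtype=bool)
    result[in_range] = table[ar1[in_range] - ar2_min]
    return result

print(isin_via_table([1, 5, 7, 6], [5, 6]))  # [False  True False  True]
```

The table costs one byte per value in the span of ar2, which is why the range-vs-size cutoff discussed earlier in the thread matters.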