Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fixes for three related stringdtype issues #26436

Merged
merged 5 commits into from
May 16, 2024

Conversation

ngoldbaum
Copy link
Member

Fixes #26420.

While working on the issue with where found in #26240, I noticed two other related issues. Since the fix for where depends on the other two fixes, I figured it would make sense to send them all as one PR.

First, it turns out that advanced indexing was broken for indexing any entry in an array that needs a heap allocation. On current main, this leads to errors or segfaults:

>>> import numpy as np
>>> a = np.array(["a", "c", "h"*25], dtype=np.dtypes.StringDType(na_object=None))
>>> a[[1,2]]
... elide traceback ...
MemoryError: Failed to load string in StringDType getitem

The fix is to properly set the output descriptor in the casting setup in array_subscript.

Second, I noticed that the nonzero function completely ignores nulls. I also noticed that the string to bool cast assumed all nulls are truthy and ignored the existence of nulls like None that should be falsey, following the behavior of object array:

>>> import numpy as np
>>> np.nonzero(np.array(['hello', np.nan, 'world'], dtype=object))
(array([0, 1, 2]),)
>>> np.nonzero(np.array(['hello', None, 'world'], dtype=object))
(array([0, 2]),)

This updates both nonzero and the string to bool cast to account for this and makes sure they behave identically.

Finally, the issue with where is also caused by not setting the input and output descriptors properly in the cast setup in PyArray_Where. The casting code here dates back to #23770, which was before stringdtype had an arena allocator. With the arena allocator, we need to be more careful about bookkeeping on input and output descriptors. This means we also need to setup a separate cast for the second input descriptor.

Also adds tests for all three fixed issues.

@ngoldbaum ngoldbaum requested a review from seberg May 14, 2024 21:25
@ngoldbaum ngoldbaum added the 09 - Backport-Candidate PRs tagged should be backported label May 14, 2024
@ngoldbaum
Copy link
Member Author

Hmm, looks like this breaks some other things, will investigate and update.

@ngoldbaum
Copy link
Member Author

Hmm, looks like this breaks some other things, will investigate and update.

Should be good now.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, changes look good, I am not sure the has-ref logic is quite spot on and if we should refine it.
Adding that comment would be nice.

numpy/_core/src/multiarray/multiarraymodule.c Show resolved Hide resolved
@ngoldbaum
Copy link
Member Author

Also thanks so much for the careful review! I completely missed the issue with the strides.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-ups, I like them!

@charris charris merged commit 731a10a into numpy:main May 16, 2024
65 checks passed
charris pushed a commit to charris/numpy that referenced this pull request May 16, 2024
* BUG: fix broken fancy indexing for stringdtype

* BUG: fix incorrect casting for stringdtype in PyArray_Where

* BUG: ensure casting to bool and nonzero treats null strings the same way

* MNT: refactor so itemsizes are correct

* MNT: refactor so trivial copy check aligns with casting setup
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label May 16, 2024
charris added a commit that referenced this pull request May 16, 2024
BUG: fixes for three related stringdtype issues (#26436)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Passing a stringdtype array to np.where can lose string data
3 participants