New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451
base: main
Are you sure you want to change the base?
String dtype: implement object-dtype based StringArray variant with NumPy semantics #58451
Conversation
# Specifically for StringArrayNumpySemantics, validate here we have a valid array | ||
if isinstance(left.dtype, StringDtype) and left.dtype.storage == "python_numpy": | ||
assert np.all( | ||
[np.isnan(val) for val in left._ndarray[left_na]] # type: ignore[attr-defined] | ||
), "wrong missing value sentinels" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit a custom check (and we don't do anything similarly for other types), but given I initially overlooked a case where we were creating string arrays with the wrong missing value sentinel because the tests don't actually catch that (two arrays with different missing value sentinels still pass as equal in case of EAs), I would prefer keeping this in at least on the short term.
dtype = StringDtype(storage="pyarrow_numpy") | ||
dtype = StringDtype() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a bunch more of such changes needed to update the code paths behind if using_pyarrow_string_dtype()
(which is essentially future.infer_string = True
) to properly handle the case without pyarrow installed. But given that this is not needed to actually get the tests passing (we are not testing with infer_strings enabled), would prefer keeping this for a follow-up PR.
Similarly like #54533 added a variant of the pyarrow-backed string dtype to use numpy nullable (NaN) semantics (
StringDtype(storage="pyarrow_numpy")
withArrowStringArrayNumpySemantics
), this PR does the same for our object-dtype numpy array backed StringArray: a newStringDtype(storage="python_numpy")
with the correspondingStringArrayNumpySemantics
array.Illustration for discussion in #57073
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.