Bug: fix Series.str.split when 'regex=None' for series having 'pd.ArrowDtype(pa.string())' dtype #58418

Open
wants to merge 9 commits into
base: main
Changes from 3 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -391,6 +391,7 @@ Conversion

Strings
^^^^^^^
- Bug in :meth:`Series.str.split` would not treat ``pat`` as regex when ``regex=None`` for series having ``pd.ArrowDtype(pa.string())`` dtype (:issue:`58321`)
- Bug in :meth:`Series.value_counts` would not respect ``sort=False`` for series having ``string`` dtype (:issue:`55224`)
-

2 changes: 1 addition & 1 deletion pandas/core/arrays/arrow/array.py
@@ -2579,7 +2579,7 @@ def _str_split(
             n = None
         if pat is None:
             split_func = pc.utf8_split_whitespace
-        elif regex:
+        elif regex or (regex is None and len(pat) != 1):
Member

Thanks for the PR - I'm not sure this is the right fix though. Do you see where the behavior deviates between the different string types? This current fix seems like it would apply a behavior change to all types

Contributor Author

Thanks for the review. The behavior deviates here.

string[pyarrow] goes through


while pd.ArrowDtype(pa.string()) goes through
def _str_split(

The docstring of str.split says this about regex: "If None and pat length is not 1, treats pat as a regular expression."

This behavior has been implemented in the first _str_split, but not in the second _str_split. So I added this condition to the second _str_split to fix the issue.
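The rule quoted from the docstring can be illustrated with a minimal pure-Python sketch. The helper names `treat_as_regex` and `split_value` are hypothetical and are not part of pandas; the sketch only mirrors the documented behavior on a single string, using the standard `re` module.

```python
import re
from typing import List, Optional


def treat_as_regex(pat: str, regex: Optional[bool]) -> bool:
    """Hypothetical helper: resolve the three-state ``regex`` flag to a bool."""
    if regex is None:
        # Documented rule: a pattern whose length is not 1 is treated as a regex.
        return len(pat) != 1
    return bool(regex)


def split_value(value: str, pat: str, regex: Optional[bool] = None) -> List[str]:
    """Split one string by regex or literally, depending on the resolved flag."""
    if treat_as_regex(pat, regex):
        return re.split(pat, value)
    return value.split(pat)


# regex=None and len(pat) == 3, so the pattern is interpreted as a regex.
print(split_value("230/270/270", r"/|-"))
# regex=False forces a literal match, so nothing is split here.
print(split_value("240-290-290", r"/|-", regex=False))
```

With `regex=None`, the pattern `/|-` splits on either separator; without the length check, a backend would search for the literal three-character string and leave the values unsplit, which is the symptom reported in the issue.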

Member

Ah OK, thanks, that is helpful. Is there a way to make these implementations look more alike? I see what you are trying to accomplish here, but it's hard to tell the corner cases where these may still diverge. Is there a reason why the implementations need to differ at all?

Contributor Author

My initial intention was to make as few changes as possible.

To make it more coherent, I would rather set regex=True for the corner case before calling _str_split in the code below. Do you think it's OK?

if is_re(pat):
    regex = True
result = self._data.array._str_split(pat, n, expand, regex)

Contributor Author
yuanx749, Apr 30, 2024

I moved the logic that determines whether pat is a regex outside, so that the two _str_split implementations look more alike. Could you review again?
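A rough sketch of that refactoring idea, in plain Python rather than pandas code: the caller resolves `regex` to an explicit bool once, and every backend receives the same value. The names `accessor_split`, `backend_a`, and `backend_b` are hypothetical stand-ins for the accessor and the two `_str_split` implementations, and `isinstance(pat, re.Pattern)` is assumed here to play the role of `is_re(pat)`.

```python
import re
from typing import Callable, List, Optional, Union

Pat = Union[str, "re.Pattern[str]"]


def backend_a(value: str, pat: Pat, regex: bool) -> List[str]:
    """Stand-in backend: regex split or plain str.split."""
    return re.split(pat, value) if regex else value.split(pat)


def backend_b(value: str, pat: Pat, regex: bool) -> List[str]:
    """Stand-in backend: always uses re, escaping literal patterns."""
    return re.split(pat if regex else re.escape(pat), value)


def accessor_split(
    backend: Callable[[str, Pat, bool], List[str]],
    value: str,
    pat: Pat,
    regex: Optional[bool] = None,
) -> List[str]:
    """Resolve ``regex`` once, then hand an explicit bool to the backend."""
    if isinstance(pat, re.Pattern):
        # A compiled pattern is always a regex.
        regex = True
    elif regex is None:
        # Documented rule: patterns whose length is not 1 are regexes.
        regex = len(pat) != 1
    return backend(value, pat, regex)


for backend in (backend_a, backend_b):
    print(backend.__name__, accessor_split(backend, "230/270/270", r"/|-"))
```

Because the decision is made in one place, the backends cannot disagree on how `regex=None` is interpreted, which addresses the concern about corner cases diverging between implementations.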

             split_func = functools.partial(pc.split_pattern_regex, pattern=pat)
         else:
             split_func = functools.partial(pc.split_pattern, pattern=pat)
10 changes: 10 additions & 0 deletions pandas/tests/extension/test_arrow.py
@@ -2296,6 +2296,16 @@ def test_str_split_pat_none(method):
     tm.assert_series_equal(result, expected)


+def test_str_split_regex_none():
Member

Can you move this test to pandas/tests/strings/test_split_partition.py, so we can parametrize this with all the different string dtype implementations, ensuring the different ones all behave the same?

Contributor Author
yuanx749, Apr 30, 2024

Done in the new commit.
But the tests look a bit ugly to me, because the expected output for pd.ArrowDtype(pa.string()) has a different array dtype than the expected output for the other string dtypes. Maybe it's better to keep the test separate in test_arrow.py?

+    # GH 58321
+    ser = pd.Series(["230/270/270", "240-290-290"], dtype=ArrowDtype(pa.string()))
+    result = ser.str.split(r"/|-", regex=None)
+    expected = pd.Series(
+        ArrowExtensionArray(pa.array([["230", "270", "270"], ["240", "290", "290"]]))
+    )
+    tm.assert_series_equal(result, expected)
+
+
 def test_str_split():
     # GH 52401
     ser = pd.Series(["a1cbcb", "a2cbcb", None], dtype=ArrowDtype(pa.string()))