Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: fix Series.str.split when 'regex=None' for series having 'pd.ArrowDtype(pa.string())' dtype #58418

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
Expand Up @@ -412,6 +412,7 @@ Conversion

Strings
^^^^^^^
- Bug in :meth:`Series.str.split` would not treat ``pat`` as regex when ``regex=None`` for series having ``pd.ArrowDtype(pa.string())`` dtype (:issue:`58321`)
- Bug in :meth:`Series.value_counts` would not respect ``sort=False`` for series having ``string`` dtype (:issue:`55224`)
-

Expand Down
3 changes: 3 additions & 0 deletions pandas/core/strings/accessor.py
Expand Up @@ -920,6 +920,9 @@ def split(
)
if is_re(pat):
regex = True
elif pat is not None and regex is None:
# regex is None so link to old behavior #43563
regex = len(pat) != 1
result = self._data.array._str_split(pat, n, expand, regex)
if self._data.dtype == "category":
dtype = self._data.dtype.categories.dtype
Expand Down
8 changes: 1 addition & 7 deletions pandas/core/strings/object_array.py
Expand Up @@ -339,14 +339,8 @@ def _str_split(
new_pat: str | re.Pattern
if regex is True or isinstance(pat, re.Pattern):
new_pat = re.compile(pat)
elif regex is False:
new_pat = pat
# regex is None so link to old behavior #43563
else:
if len(pat) == 1:
new_pat = pat
else:
new_pat = re.compile(pat)
new_pat = pat
Comment on lines -342 to +343
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to keep this, otherwise it would not be a bugfix for pd.ArrowDtype(pa.string()) but changing behaviour for all other string dtypes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed above #58418 (comment), the bug is caused by that this logic is not implemented in the _str_split method of pd.ArrowDtype(pa.string()). To fix it, and to make two _str_split implementations look more alike, I moved this logic outside before calling _str_split. So I think the behaviors for other dtypes have not been changed.

Moreover I think the existing tests have covered all combinations of parameters, and as long as they all pass, the behaviors should still be the same.


if isinstance(new_pat, re.Pattern):
if n is None or n == -1:
Expand Down
21 changes: 21 additions & 0 deletions pandas/tests/extension/test_arrow.py
Expand Up @@ -2296,6 +2296,27 @@ def test_str_split_pat_none(method):
tm.assert_series_equal(result, expected)


def test_str_split_regex_explicit():
# GH 58321
# adapted from tests/strings/test_split_partition.py
values = pd.Series("xxxjpgzzz.jpg", dtype=ArrowDtype(pa.string()))

# explicit regex = False split
result = values.str.split(r"\.jpg", regex=False)
exp = pd.Series(ArrowExtensionArray(pa.array([["xxxjpgzzz.jpg"]])))
tm.assert_series_equal(result, exp)

# non explicit regex split, pattern length == 1
result = values.str.split(r".")
exp = pd.Series(ArrowExtensionArray(pa.array([["xxxjpgzzz", "jpg"]])))
tm.assert_series_equal(result, exp)

# non explicit regex split, pattern length != 1
result = values.str.split(r".jpg")
exp = pd.Series(ArrowExtensionArray(pa.array([["xx", "zzz", ""]])))
tm.assert_series_equal(result, exp)


def test_str_split():
# GH 52401
ser = pd.Series(["a1cbcb", "a2cbcb", None], dtype=ArrowDtype(pa.string()))
Expand Down