Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: fix Series.str.split when 'regex=None' for series having 'pd.ArrowDtype(pa.string())' dtype #58418

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
Expand Up @@ -397,6 +397,7 @@ Conversion

Strings
^^^^^^^
- Bug in :meth:`Series.str.split` would not treat ``pat`` as regex when ``regex=None`` for series having ``pd.ArrowDtype(pa.string())`` dtype (:issue:`58321`)
- Bug in :meth:`Series.value_counts` would not respect ``sort=False`` for series having ``string`` dtype (:issue:`55224`)
-

Expand Down
20 changes: 19 additions & 1 deletion pandas/conftest.py
Expand Up @@ -92,7 +92,6 @@
except ImportError:
has_pyarrow = False
else:
del pa
has_pyarrow = True

import zoneinfo
Expand Down Expand Up @@ -1367,6 +1366,25 @@ def any_string_dtype(request):
return request.param


@pytest.fixture(
params=[
"object",
"string[python]",
pytest.param("string[pyarrow]", marks=td.skip_if_no("pyarrow")),
pytest.param("string[pyarrow_numpy]", marks=td.skip_if_no("pyarrow")),
pytest.param(pd.ArrowDtype(pa.string()), marks=td.skip_if_no("pyarrow")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually prefer to just add the pd.ArrowDtype(pa.string()) to the existing string dtypes instead of copying and creating a new fixture. Guessing that causes a lot of other test failures?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I forgot that pd.ArrowDtype(pa.string()) was not actually in the fixture, so my suggestion lead you a bit in the wrong way. Sorry!

Right now adding this to the main any_string_dtype fixture will indeed give quite some failures, yes. I agree that it might be better to actually do that (and it would be interesting to see which tests actually fail), but that's for another PR / out of scope for this bug fix (doing so would also require removing some tests are now only exist for the arrow string dtype, to not keep things duplicated).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a lot of other tests need to be adjusted if adding ArrowDtype to the fixture.

So for this PR, should I just add test in pandas/tests/extension/test_arrow.py?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #58495 so we can track the larger issue

]
)
def any_string_dtype_2(request):
"""
Parametrized fixture for string dtypes.
* 'object'
* 'string[python]'
* 'string[pyarrow]'
"""
return request.param


@pytest.fixture(params=tm.DATETIME64_DTYPES)
def datetime64_dtype(request):
"""
Expand Down
3 changes: 3 additions & 0 deletions pandas/core/strings/accessor.py
Expand Up @@ -910,6 +910,9 @@ def split(
)
if is_re(pat):
regex = True
elif pat is not None and regex is None:
# regex is None so link to old behavior #43563
regex = len(pat) != 1
result = self._data.array._str_split(pat, n, expand, regex)
if self._data.dtype == "category":
dtype = self._data.dtype.categories.dtype
Expand Down
8 changes: 1 addition & 7 deletions pandas/core/strings/object_array.py
Expand Up @@ -339,14 +339,8 @@ def _str_split(
new_pat: str | re.Pattern
if regex is True or isinstance(pat, re.Pattern):
new_pat = re.compile(pat)
elif regex is False:
new_pat = pat
# regex is None so link to old behavior #43563
else:
if len(pat) == 1:
new_pat = pat
else:
new_pat = re.compile(pat)
new_pat = pat
Comment on lines -342 to +343
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to keep this, otherwise it would not be a bugfix for pd.ArrowDtype(pa.string()) but changing behaviour for all other string dtypes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed above #58418 (comment), the bug is caused by that this logic is not implemented in the _str_split method of pd.ArrowDtype(pa.string()). To fix it, and to make two _str_split implementations look more alike, I moved this logic outside before calling _str_split. So I think the behaviors for other dtypes have not been changed.

Moreover I think the existing tests have covered all combinations of parameters, and as long as they all pass, the behaviors should still be the same.


if isinstance(new_pat, re.Pattern):
if n is None or n == -1:
Expand Down
32 changes: 24 additions & 8 deletions pandas/tests/strings/test_split_partition.py
Expand Up @@ -17,6 +17,10 @@
object_pyarrow_numpy,
)

pa = pytest.importorskip("pyarrow")

from pandas.core.arrays.arrow.array import ArrowExtensionArray


@pytest.mark.parametrize("method", ["split", "rsplit"])
def test_split(any_string_dtype, method):
Expand Down Expand Up @@ -59,27 +63,39 @@ def test_split_regex(any_string_dtype):
tm.assert_series_equal(result, exp)


def test_split_regex_explicit(any_string_dtype):
def test_split_regex_explicit(any_string_dtype_2):
# explicit regex = True split with compiled regex
regex_pat = re.compile(r".jpg")
values = Series("xxxjpgzzz.jpg", dtype=any_string_dtype)
result = values.str.split(regex_pat)
exp = Series([["xx", "zzz", ""]])
tm.assert_series_equal(result, exp)
values = Series("xxxjpgzzz.jpg", dtype=any_string_dtype_2)

if not isinstance(any_string_dtype_2, pd.ArrowDtype):
# ArrowDtype does not support compiled regex
result = values.str.split(regex_pat)
exp = Series([["xx", "zzz", ""]])
tm.assert_series_equal(result, exp)

# explicit regex = False split
result = values.str.split(r"\.jpg", regex=False)
exp = Series([["xxxjpgzzz.jpg"]])
if not isinstance(any_string_dtype_2, pd.ArrowDtype):
exp = Series([["xxxjpgzzz.jpg"]])
else:
exp = Series(ArrowExtensionArray(pa.array([["xxxjpgzzz.jpg"]])))
tm.assert_series_equal(result, exp)

# non explicit regex split, pattern length == 1
result = values.str.split(r".")
exp = Series([["xxxjpgzzz", "jpg"]])
if not isinstance(any_string_dtype_2, pd.ArrowDtype):
exp = Series([["xxxjpgzzz", "jpg"]])
else:
exp = Series(ArrowExtensionArray(pa.array([["xxxjpgzzz", "jpg"]])))
tm.assert_series_equal(result, exp)

# non explicit regex split, pattern length != 1
result = values.str.split(r".jpg")
exp = Series([["xx", "zzz", ""]])
if not isinstance(any_string_dtype_2, pd.ArrowDtype):
exp = Series([["xx", "zzz", ""]])
else:
exp = Series(ArrowExtensionArray(pa.array([["xx", "zzz", ""]])))
tm.assert_series_equal(result, exp)

# regex=False with pattern compiled regex raises error
Expand Down