
BUG: read_csv with engine pyarrow parsing multiple date columns #50056

Merged: 14 commits into pandas-dev:main on May 18, 2023

Conversation

lithomas1
Member

@lithomas1 lithomas1 added IO CSV read_csv, to_csv Arrow pyarrow functionality labels Dec 4, 2022
@@ -107,7 +107,7 @@ def _finalize_pandas_output(self, frame: DataFrame) -> DataFrame:
multi_index_named = False
frame.columns = self.names
# we only need the frame not the names
-frame.columns, frame = self._do_date_conversions(frame.columns, frame)
+_, frame = self._do_date_conversions(frame.columns, frame)
Member
This gives us back the frame with already changed column names?

Member Author
@lithomas1 lithomas1 Dec 5, 2022

Yeah, _do_date_conversions changes the names in the data_dict/frame too.

I'm actually not too sure why names is returned from this function again (I guess it might date from before dicts were ordered?).
EDIT: It's probably related to making a MultiIndex from the columns for the other engines. _do_date_conversions can always fix the frame directly, so this isn't relevant for the pyarrow engine.
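For illustration, a minimal sketch of the pattern being described (a simplified stand-in, not the actual parser code): a helper that converts and renames the date columns on the frame itself and also returns the new names, so a caller that only needs the frame can discard the returned names and read frame.columns afterwards.

import pandas as pd

def _toy_date_conversions(names, frame):
    # Simplified stand-in for _do_date_conversions: convert the date column
    # in place on the frame, rename it, and return both the updated names
    # and the updated frame.
    frame["date"] = pd.to_datetime(frame["date"])
    frame = frame.rename(columns={"date": "parsed_date"})
    return list(frame.columns), frame

frame = pd.DataFrame({"date": ["2000-01-03", "2000-01-04"], "A": [1, 2]})
# The returned names can be ignored; the returned frame already carries them.
_, frame = _toy_date_conversions(list(frame.columns), frame)
print(frame.columns.tolist())  # ['parsed_date', 'A']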

@github-actions
Contributor

github-actions bot commented Jan 5, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 5, 2023
@simonjayhawkins
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@lithomas1 lithomas1 requested a review from phofl April 24, 2023 16:59
@lithomas1 lithomas1 removed the Stale label Apr 24, 2023
@lithomas1 lithomas1 added this to the 2.1 milestone Apr 24, 2023
# In case of dict, we don't want to propagate through, so
# just set to pyarrow default of None

# Ideally, in future we disable pyarrow dtype inference (read in as string)
Member

We have to create a conversion option that is Arrow-only for this; otherwise we incur a big performance penalty.
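As a hedged sketch of what "read in as string" could mean at the pyarrow level (an assumption about one possible approach, not what pandas does today): read only the header to learn the column names, then force every column to string via pyarrow.csv.ConvertOptions(column_types=...), leaving any datetime conversion to a later pandas-level step. The file name data.csv is hypothetical.

import csv

import pyarrow as pa
import pyarrow.csv as pacsv

path = "data.csv"  # hypothetical input file

# Read only the header row to learn the column names.
with open(path, newline="") as f:
    names = next(csv.reader(f))

# Force every column to string so pyarrow performs no type (or timestamp) inference.
convert_options = pacsv.ConvertOptions(
    column_types={name: pa.string() for name in names}
)
table = pacsv.read_csv(path, convert_options=convert_options)
# A pandas-level parse_dates step could then decide which columns become datetimes.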

Member Author

Since there's no way to disable the parsing, it'll only get parsed once as a pyarrow timestamp/date.

I think in your other PR, you set it so that date parsing will be bypassed for Arrow Timestamp cols so we won't have double parsing of the input.

So there won't be a perf penalty, just a wrong result if you didn't want pyarrow to parse the date.
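A minimal sketch of the "no double parsing" idea described above (a hypothetical helper, not the code from the other PR): if a requested column already comes back as an Arrow timestamp, skip the to_datetime pass.

import pandas as pd
import pyarrow as pa

def maybe_parse_dates(frame: pd.DataFrame, parse_cols: list[str]) -> pd.DataFrame:
    # Hypothetical helper: only run to_datetime on columns that pyarrow has
    # not already parsed as timestamps, so the input is parsed at most once.
    for col in parse_cols:
        dtype = frame[col].dtype
        if isinstance(dtype, pd.ArrowDtype) and pa.types.is_timestamp(dtype.pyarrow_dtype):
            continue  # already an Arrow timestamp; avoid a second parse
        frame[col] = pd.to_datetime(frame[col])
    return frame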

Member

Maybe I am misunderstanding this, but don't we have to convert to NumPy to parse with to_datetime? That is what I meant by slow.

What happens if dtype_backend is set to pyarrow in this case?

Member Author

OK, I went back and checked the output, and it looks right, except for the one case where parse_dates is a dict that does a no-op, i.e. it maps the column to itself. This is a very uncommon case, though (I'm not sure why you would want to do that instead of passing a list).

Do you think it'd be better to fix the root cause (#52545) rather than special-case here? (I can try to take a look at that sometime soon.)

Here's what I get in the REPL btw.

>>> import pandas as pd
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow")
       index         A         B         C         D
0 2000-01-03  0.980269  3.685731 -0.364217 -1.159738
1 2000-01-04  1.047916 -0.041232 -0.161812  0.212549
2 2000-01-05  0.498581  0.731168 -0.537677  1.346270
3 2000-01-06  1.120202  1.567621  0.003641  0.675253
4 2000-01-07 -0.487094  0.571455 -1.611639  0.103469
5 2000-01-10  0.836649  0.246462  0.588543  1.062782
6 2000-01-11 -0.157161  1.340307  1.195778 -1.097007
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow")
                 index         A         B         C         D
0  2000-01-03 00:00:00  0.980269  3.685731 -0.364217 -1.159738
1  2000-01-04 00:00:00  1.047916 -0.041232 -0.161812  0.212549
2  2000-01-05 00:00:00  0.498581  0.731168 -0.537677  1.346270
3  2000-01-06 00:00:00  1.120202  1.567621  0.003641  0.675253
4  2000-01-07 00:00:00 -0.487094  0.571455 -1.611639  0.103469
5  2000-01-10 00:00:00  0.836649  0.246462  0.588543  1.062782
6  2000-01-11 00:00:00 -0.157161  1.340307  1.195778 -1.097007
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow").dtypes # Default, OK
index    timestamp[s][pyarrow]
A              double[pyarrow]
B              double[pyarrow]
C              double[pyarrow]
D              double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates=["index"]).dtypes # Parse dates as list OK
index    timestamp[s][pyarrow]
A              double[pyarrow]
B              double[pyarrow]
C              double[pyarrow]
D              double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates=False).dtypes # The bug I was talking about, no way to disable dates from being parsed
index    timestamp[s][pyarrow]
A              double[pyarrow]
B              double[pyarrow]
C              double[pyarrow]
D              double[pyarrow]
dtype: object
>>> pd.read_csv("pandas/tests/io/data/csv/test1.csv", engine="pyarrow", dtype_backend="pyarrow", parse_dates={"a": ["index"]}) # Buggy, returns datetime64[ns]
a     datetime64[ns]
A    double[pyarrow]
B    double[pyarrow]
C    double[pyarrow]
D    double[pyarrow]
dtype: object
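Based on the output above, a possible workaround sketch while the dict form stays buggy (assuming a rename after the fact is acceptable for the caller): parse via a list, which keeps the pyarrow dtype, and rename afterwards.

import pandas as pd

df = pd.read_csv(
    "pandas/tests/io/data/csv/test1.csv",
    engine="pyarrow",
    dtype_backend="pyarrow",
    parse_dates=["index"],
).rename(columns={"index": "a"})
# df["a"] keeps the timestamp[s][pyarrow] dtype instead of falling back to datetime64[ns].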

Member Author

@phofl OK to punt on this and fix the remaining issues in a follow-up?

pandas/io/parsers/base_parser.py (outdated review thread, resolved)
@lithomas1 lithomas1 requested a review from phofl May 1, 2023 15:29
@phofl phofl merged commit 634b940 into pandas-dev:main May 18, 2023
34 checks passed
@phofl
Member

phofl commented May 18, 2023

Thanks @lithomas1!

Can you open an issue about the remaining case?

Labels: Arrow (pyarrow functionality), IO CSV (read_csv, to_csv)
Successfully merging this pull request may close these issues:
BUG: read_csv with engine pyarrow doesn't handle parsing multiple date columns
3 participants