GH-41502: [Python] Fix reading column index with decimal values #41503

jrversteegh · 2024-05-02T16:56:39Z

Convert pandas "decimal" to "object" in numpy.

GitHub Issue: [Python] pandas roundtrip failing with decimal in column index #41502
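
A minimal sketch of the idea behind the conversion described above, assuming a lookup from pandas metadata type names to numpy dtypes. The table `PANDAS_TO_NUMPY` and the helpers `to_numpy_dtype` and `numpy_understands` are hypothetical names for illustration, not pyarrow's actual code. It shows that numpy has no dtype called "decimal", so the name has to be mapped to "object" before a dtype is built.

```python
import numpy as np

# Hypothetical lookup from pandas metadata type names to numpy dtypes.
# Illustrative only; this is not pyarrow's actual table.
PANDAS_TO_NUMPY = {
    "integer": np.int64,
    "floating": np.float64,
    "decimal": np.object_,  # numpy has no decimal dtype, so fall back to object
}


def to_numpy_dtype(pandas_type: str) -> np.dtype:
    """Translate a pandas type name, falling back to numpy's own parser."""
    mapped = PANDAS_TO_NUMPY.get(pandas_type)
    if mapped is not None:
        return np.dtype(mapped)
    return np.dtype(pandas_type)  # raises TypeError for names numpy does not know


def numpy_understands(name: str) -> bool:
    """Report whether numpy can build a dtype directly from this name."""
    try:
        np.dtype(name)
    except TypeError:
        return False
    return True


print(numpy_understands("decimal"))  # numpy alone rejects the name
print(to_numpy_dtype("decimal"))     # the lookup maps it to object
```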

github-actions · 2024-05-02T16:57:03Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions · 2024-05-02T16:58:58Z

⚠️ GitHub issue #41502 has been automatically assigned in GitHub to PR creator.

AlenkaF · 2024-05-07T07:13:22Z

Hi @jrversteegh, thank you for the contribution!

I think this is an elegant solution. Not sure if this was discussed before or not, can't find any similar issue on our issue tracker. I am sure @jorisvandenbossche will know straight away if this change fits or not.

From my side, a test needs to be added in python/pyarrow/tests/parquet/test_pandas.py.

jorisvandenbossche · 2024-05-07T09:06:43Z

That indeed looks like a good fix.

The error itself should already happen with just a roundtrip from pandas->pyarrow->pandas (without parquet), so you can add a test for this in python/pyarrow/tests/test_pandas.py.

jrversteegh · 2024-05-09T13:21:33Z

@AlenkaF

> From my side, a test needs to be added in python/pyarrow/tests/parquet/test_pandas.py.

Thanks for that suggestion. I tried, but this issue appears more involved than I expected. It looks like pyarrow expects column names to be strings. If they are not, it converts them (perhaps because the Parquet format requires this?).
My fix avoids the exception, but still converts the column names from decimal to string, which is an improvement but still undesirable. I'll have to see whether there is something I can do about that. To be continued.

jrversteegh · 2024-05-09T17:39:33Z

@AlenkaF @jorisvandenbossche I've added a test and restored the decimal index from strings. This looks like a bit of a kludge. I think it's because neither numpy nor pandas understands Decimal: to them it is just an object like any other, so you might expect similar issues for any other object type used as a column index. However, Decimal is not just any object, and it makes some sense to use it as an index, just like dates, so I feel it's a warranted addition. Let me know what you think.
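
To illustrate what restoring a decimal index from strings can look like, here is a small sketch using only pandas and the standard library. The helper `restore_decimal_index` is a hypothetical name and is not the code in the commit. It assumes every label is a string that `Decimal` can parse, and it covers a flat index only.

```python
from decimal import Decimal

import pandas as pd


def restore_decimal_index(index: pd.Index) -> pd.Index:
    """Rebuild an object-dtype index of Decimal values from string labels."""
    return pd.Index(
        [Decimal(label) for label in index],
        dtype=object,
        name=index.name,
    )


original = pd.Index([Decimal("1.10"), Decimal("2.5")], dtype=object)
as_strings = original.map(str)  # what a string-only column-name store would hold
restored = restore_decimal_index(as_strings)

print(list(as_strings))
print(list(restored))
```

Because `str(Decimal("1.10"))` keeps the trailing zero, parsing the string back yields a value with the same digits as the original label.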

Fix reading column index with decimal values

d475d87

github-actions bot added the Component: Python and awaiting review labels May 2, 2024

jrversteegh changed the title ~~Fix reading column index with decimal values~~ GH-41502: [Python] Fix reading column index with decimal values May 2, 2024

jrversteegh added 2 commits May 9, 2024 19:29

Add test for columns (multi) index with decimal values

5dfb521

Restore decimal index from strings

32c14a8
