Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41502: [Python] Fix reading column index with decimal values #41503

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jrversteegh
Copy link

@jrversteegh jrversteegh commented May 2, 2024

Fix for #41502

Convert pandas "decimal" to "object" in numpy.

Copy link

github-actions bot commented May 2, 2024

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@jrversteegh jrversteegh changed the title Fix reading column index with decimal values GH-41502: [Python] Fix reading column index with decimal values May 2, 2024
Copy link

github-actions bot commented May 2, 2024

⚠️ GitHub issue #41502 has been automatically assigned in GitHub to PR creator.

@AlenkaF
Copy link
Member

AlenkaF commented May 7, 2024

Hi @jrversteegh, thank you for the contribution!

I think this is an elegant solution. Not sure if this was discussed before or not, can't find any similar issue on our issue tracker. I am sure @jorisvandenbossche will know straight away if this change fits or not.

From my side, a test needs to be added in python/pyarrow/tests/parquet/test_pandas.py.

@jorisvandenbossche
Copy link
Member

That indeed looks like a good fix.

The error itself should already happen with just a roundtrip from pandas->pyarrow->pandas (without parquet), so you can add a test for this in python/pyarrow/tests/test_pandas.py.

@jrversteegh
Copy link
Author

@AlenkaF

From my side, a test needs to be added in python/pyarrow/tests/parquet/test_pandas.py.

Thanks for that suggestion. I tried, but this issue appears more involved than I expected. It looks like pyarrow expects column names to be strings. If not, it converts them (in turn because the parquet format expects this?).
My fix avoids the exception, but still converts the column names from decimal to string, which is better, but still undesirable. I'll have to see whether there is something I can do about that. To be continued.

@jrversteegh
Copy link
Author

@AlenkaF @jorisvandenbossche I've added a test and restored the decimal index from strings. This looks like a bit of a kludge. I think it's because both numpy and pandas don't understand Decimal. It's just an object, like any other, so you might expect similar issues for any other object type used as a column index. However, Decimal is not just any object and it does make some sense to use as an index, just like dates, so I feel it's a warranted addition. Let me know what you think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants