
[parquet_derive] support OPTIONAL (def_level = 1) columns by default #5716

Open
double-free opened this issue May 4, 2024 · 7 comments · May be fixed by #5717
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@double-free

double-free commented May 4, 2024

Problem Description

I'm working with parquet files written by pyarrow (via pandas). I came across parquet_derive, and it avoids boilerplate in my project.
The problem is that it doesn't work on parquet files written by pandas with the default setup; it throws this error:

Parquet error: must specify definition levels

After digging into this, I found that the problem is that the parquet file generated by pyarrow has def_level = 1, i.e., every column is OPTIONAL, even when it contains no null values.

[screenshot: parquet schema showing OPTIONAL columns with def_level = 1]

However, the macro generates code that does not pass definition levels, so it fails to parse columns declared OPTIONAL, even when there are no actual NULL values:

typed.read_records(num_records, None, None, &mut vals)?;

The API it calls is: https://docs.rs/parquet/latest/parquet/column/reader/struct.GenericColumnReader.html#method.read_records .

My Solution

The solution is straightforward. I have fixed the problem locally and I'm willing to contribute a pull request, but I don't know if this solution is reasonable within the scope of the whole arrow project.

Basically, I think we need to provide definition levels in read_records:

typed.read_records(num_records, None /*should use a Some(&mut Vec<i16>)*/, None, &mut vals)?;
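
To illustrate, here is a minimal sketch of what the generated read path could look like with this change. The names (`typed`, `num_records`, `vals`) follow the snippet above; only `read_records` itself is the real parquet API, and the actual generated code may differ:

```rust
// Sketch: read an OPTIONAL (max_def_level = 1) column into a plain Vec<T>.
let mut def_levels: Vec<i16> = Vec::with_capacity(num_records);
let (_, values_read, levels_read) =
    typed.read_records(num_records, Some(&mut def_levels), None, &mut vals)?;
// read_records returns (records, values, levels); if every decoded level
// corresponds to a decoded value, the column contains no actual NULLs.
assert_eq!(
    values_read, levels_read,
    "OPTIONAL column contains NULLs; cannot read into a non-Option field"
);
```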

In short, with this solution parquet_derive can handle:

  1. (already supported) parquet files with all columns REQUIRED
  2. (newly introduced) parquet files with OPTIONAL columns whose values are guaranteed to all be valid

Pros

  • This solution does not break current features
  • This solution makes parquet_derive more general in handling parquet files.

It passes the tests in parquet_derive_tests. I also added checks comparing the parsed records against the valid records, to avoid abusing this for columns that contain NULLs.

Cons

  • It will be slightly slower since it allocates an extra Vec<i16> for each column when invoking read_from_row_group.

I don't think that is a big deal, though, compared to the inconvenience of not supporting OPTIONAL columns. Moreover, we can use max_def_level (which is 0 for a REQUIRED column) to skip creating the Vec, as sketched below.
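
For example, a minimal sketch of that optimization, reusing the hypothetical names from the snippet above (`max_def_level()` is the real accessor on parquet's ColumnDescriptor; `descr` is assumed to be obtained from the row group's column metadata):

```rust
// Allocate a definition-level buffer only when the column can contain
// NULLs; max_def_level() == 0 means the column path is fully REQUIRED.
let mut def_levels: Option<Vec<i16>> = if descr.max_def_level() > 0 {
    Some(Vec::with_capacity(num_records))
} else {
    None
};
let (_, values_read, levels_read) =
    typed.read_records(num_records, def_levels.as_mut(), None, &mut vals)?;
if def_levels.is_some() {
    // Reject NULLs: a plain (non-Option) struct field cannot represent them.
    assert_eq!(values_read, levels_read, "column contains NULL values");
}
```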

@double-free added the enhancement label on May 4, 2024
@tustvold
Contributor

tustvold commented May 4, 2024

What happens if you specify the field as Option to match the schema?

@double-free
Author

> What happens if you specify the field as Option to match the schema?

I have tried; it does not work, because:

  1. the OPTIONAL, REQUIRED, and REPEATED tags are generated automatically by the macro.
  2. even if I work around (1) by forcing the field to be OPTIONAL, the actual parsing code, i.e.:

typed.read_records(num_records, None, None, &mut vals)?;

is not dynamic and still provides no definition levels.

@tustvold
Contributor

tustvold commented May 4, 2024

Hmm... My knowledge of parquet_derive is limited, but I would have expected an optional field to map to Option<T> on the annotated struct.

It might also be that it simply doesn't support nulls, in which case I'm a little unsure about adding partial support in a manner that might prevent adding such support in the future.

@double-free
Author

double-free commented May 4, 2024

> Hmm... My knowledge of parquet_derive is limited, but I would have expected an optional field to map to Option<T> on the annotated struct.
>
> It might also be that it simply doesn't support nulls, in which case I'm a little unsure about adding partial support in a manner that might prevent adding such support in the future.

I understand your concern, and that is why I said this in the description:

> but I don't know if this solution is reasonable within the scope of the whole arrow project.

Yes, I agree that in theory it is more natural and rigorous to map an OPTIONAL column to Option<T>. The pipeline should look like this:

OPTIONAL columns --> struct with Option<T> (by parquet_derive, not implemented) --> struct without Option<T> (by user)

However, in practice, when writing a parquet file, the default attribute of a column is OPTIONAL (see https://arrow.apache.org/docs/python/generated/pyarrow.field.html), regardless of whether the column contains any NULLs. These attributes are not even exposed to users in higher-level APIs. So most of the parquet files users deal with have OPTIONAL columns everywhere, even though the values in those columns are in fact all valid. I don't think it is handy for users to declare a struct with every field wrapped in Option<T>.

Another point I want to make is that this change DOES NOT mean the reader now supports Option<T> in structs; if you derive ParquetRecordReader for a struct containing Option<T>, it fails to compile as before:

 help: message: not implemented: Unsupported: Option(...)

The only difference is that if a parquet file is written with OPTIONAL columns whose values are in fact all valid, the reader should still be able to load the records into a struct WITHOUT Option<T>, i.e., the pipeline becomes:

OPTIONAL columns --> struct without Option<T> (by parquet_derive, with checks)

This change only relaxes parquet_derive's restriction on its parquet input, without introducing risk, since checks are done after parsing. If the user's input does contain NULL values, the parser will panic, as it does now.

To sum up, I think this change does not make things worse or unsafe. I really appreciate you taking the time to review this issue, and it would be even better if you could invite more experts to review it.

@tustvold
Contributor

tustvold commented May 4, 2024

> However, in practice, when writing a parquet file, the default attribute of a column is OPTIONAL

Is this the case if you set nullable to false? If so, I would probably raise a bug on pyarrow, as that is incorrect.

> This change only relaxes parquet_derive's restriction on its parquet input, without introducing risk, since checks are done after parsing. If the user's input does contain NULL values, the parser will panic, as it does now.

So long as we don't regress performance for existing workloads I suppose this is an acceptable workaround. I will try to take a look next week at your PR, although I will need to allocate enough time to get up to speed on that crate (there isn't really anyone maintaining it actively anymore).

FWIW, reading parquet via the arrow interface will be faster, especially for string columns, but I appreciate that you may prefer to stick to a row-oriented model.

@double-free
Author

> Is this the case if you set nullable to false? If so, I would probably raise a bug on pyarrow, as that is incorrect.

pyarrow is not the culprit. I believe "convention" is the one to blame. pandas has a to_parquet interface, and it did not accept a schema before pandas-dev/pandas#30270. Even after this improvement, most people use the default schema, which makes every column OPTIONAL (nullable=True in pyarrow).

> I will try to take a look next week at your PR, although I will need to allocate enough time to get up to speed on that crate

I really appreciate your time. My use case requires a row-oriented model, and this crate is still very useful to our project; that is why I raised this issue. I'm willing to spend time polishing the pull request.

@double-free
Author

double-free commented May 17, 2024

Hi @tustvold, have you had a chance to take a look at the PR? I have rebased it onto the latest master to resolve some conflicts.
