
[parquet_derive] support OPTIONAL (def_level = 1) columns by default #5716

Open
double-free opened this issue May 4, 2024 · 7 comments · May be fixed by #5717
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@double-free

double-free commented May 4, 2024

Problem Description

I'm working with parquet files written by pyarrow (via pandas). I came across parquet_derive, and it avoids boilerplate in my project.
The problem is that it doesn't work on parquet files written by pandas with the default setup; it throws this error:

Parquet error: must specify definition levels

After digging into this, I found that the problem is that the parquet file generated by pyarrow has def_level = 1, i.e., every column is OPTIONAL, even when it contains no null values.

[screenshot: parquet schema showing OPTIONAL columns with def_level = 1]

However, the macro generates code that does not pass definition levels, so it fails to parse columns declared OPTIONAL, even when there are no actual NULL values:

typed.read_records(num_records, None, None, &mut vals)?;

The API it calls is: https://docs.rs/parquet/latest/parquet/column/reader/struct.GenericColumnReader.html#method.read_records .

My Solution

The solution is straightforward. I have fixed the problem locally and I'm willing to contribute a pull request, but I don't know if this solution is reasonable within the scope of the whole arrow project.

Basically, I think we need to provide definition levels in read_records:

typed.read_records(num_records, None /*should use a Some(&mut Vec<i16>)*/, None, &mut vals)?;
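
To illustrate, here is a minimal sketch of what the generated read path could look like with this change. The names (`typed`, `num_records`, `vals`) follow the snippet above; only `read_records` itself is the real parquet API, and the actual generated code may differ:

```rust
// Sketch: read an OPTIONAL (max_def_level = 1) column into a plain Vec<T>.
let mut def_levels: Vec<i16> = Vec::with_capacity(num_records);
let (_, values_read, levels_read) =
    typed.read_records(num_records, Some(&mut def_levels), None, &mut vals)?;
// read_records returns (records, values, levels); if every decoded level
// corresponds to a decoded value, the column contains no actual NULLs.
assert_eq!(
    values_read, levels_read,
    "OPTIONAL column contains NULLs; cannot read into a non-Option field"
);
```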

In short, with this solution parquet_derive can handle:

  1. (already supported) parquet files with all columns REQUIRED
  2. (newly introduced) parquet files with OPTIONAL columns whose values are guaranteed to all be valid

Pros

  • This solution does not break current features
  • This solution makes parquet_derive more general in handling parquet files.

It passes the tests in parquet_derive_tests. I also added checks comparing the parsed records against the valid records, to avoid abusing this for columns that contain NULLs.

Cons

  • It will be slightly slower since it allocates an extra Vec<i16> for each column when invoking read_from_row_group.

I don't think that is a big deal, though, compared to the inconvenience of not supporting OPTIONAL columns. Moreover, we can use max_def_level (which is 0 for a REQUIRED column) to skip creating the Vec, as sketched below.
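
For example, a minimal sketch of that optimization, reusing the hypothetical names from the snippet above (`max_def_level()` is the real accessor on parquet's ColumnDescriptor; `descr` is assumed to be obtained from the row group's column metadata):

```rust
// Allocate a definition-level buffer only when the column can contain
// NULLs; max_def_level() == 0 means the column path is fully REQUIRED.
let mut def_levels: Option<Vec<i16>> = if descr.max_def_level() > 0 {
    Some(Vec::with_capacity(num_records))
} else {
    None
};
let (_, values_read, levels_read) =
    typed.read_records(num_records, def_levels.as_mut(), None, &mut vals)?;
if def_levels.is_some() {
    // Reject NULLs: a plain (non-Option) struct field cannot represent them.
    assert_eq!(values_read, levels_read, "column contains NULL values");
}
```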

@double-free added the enhancement label on May 4, 2024
@tustvold
Contributor

tustvold commented May 4, 2024

What happens if you specify the field as Option to match the schema?

@double-free
Author

> What happens if you specify the field as Option to match the schema?

I have tried; it does not work, because:

  1. the OPTIONAL, REQUIRED, and REPEATED tags are generated automatically by the macro.
  2. even if I work around (1) by forcing the field to be OPTIONAL, the actual parsing code, i.e.:

typed.read_records(num_records, None, None, &mut vals)?;

is not dynamic and still provides no definition levels.

@tustvold
Contributor

tustvold commented May 4, 2024

Hmm... My knowledge of parquet_derive is limited, but I would have expected an optional field to map to Option<T> on the annotated struct.

It might also be that it simply doesn't support nulls, in which case I'm a little unsure about adding partial support in a manner that might prevent adding such support in the future.

@double-free
Author

double-free commented May 4, 2024

> Hmm... My knowledge of parquet_derive is limited, but I would have expected an optional field to map to Option<T> on the annotated struct.
>
> It might also be that it simply doesn't support nulls, in which case I'm a little unsure about adding partial support in a manner that might prevent adding such support in the future.

I understand your concern, and that is why I said this in the description:

> but I don't know if this solution is reasonable within the scope of the whole arrow project.

Yes, I agree that in theory it is more natural and rigorous to map an OPTIONAL column to Option<T>. The pipeline should look like this:

OPTIONAL columns --> struct with Option<T> (by parquet_derive, not implemented) --> struct without Option<T> (by user)

However, in practice, when writing a parquet file, the default attribute of a column is OPTIONAL (see https://arrow.apache.org/docs/python/generated/pyarrow.field.html), regardless of whether the column contains any NULLs. These attributes are not even exposed to users in higher-level APIs. So most of the parquet files users deal with have OPTIONAL columns everywhere, even though the values in those columns are in fact all valid. I don't think it is handy for users to declare a struct with every field wrapped in Option<T>.

Another point I want to make is that this change DOES NOT mean the reader now supports Option<T> in structs; if you derive ParquetRecordReader for a struct containing Option<T>, it fails to compile as before:

 help: message: not implemented: Unsupported: Option(...)

The only difference is that if a parquet file is written with OPTIONAL columns whose values are in fact all valid, the reader should still be able to load the records into a struct WITHOUT Option<T>, i.e., the pipeline becomes:

OPTIONAL columns --> struct without Option<T> (by parquet_derive, with checks)

This change only relaxes parquet_derive's restriction on its parquet input, without introducing risk, since checks are done after parsing. If the user's input does contain NULL values, the parser will panic, as it does now.

To sum up, I think this change does not make things worse or unsafe. I really appreciate you taking the time to review this issue, and it would be even better if you could invite more experts to review it.

@tustvold
Contributor

tustvold commented May 4, 2024

> However, in practice, when writing a parquet file, the default attribute of a column is OPTIONAL

Is this the case if you set nullable to false? If so, I would probably raise a bug on pyarrow, as that is incorrect.

> This change only relaxes parquet_derive's restriction on its parquet input, without introducing risk, since checks are done after parsing. If the user's input does contain NULL values, the parser will panic, as it does now.

So long as we don't regress performance for existing workloads I suppose this is an acceptable workaround. I will try to take a look next week at your PR, although I will need to allocate enough time to get up to speed on that crate (there isn't really anyone maintaining it actively anymore).

FWIW, reading parquet via the arrow interface will be faster, especially for string columns, but I appreciate that you may prefer to stick to a row-oriented model.

@double-free
Author

> Is this the case if you set nullable to false? If so, I would probably raise a bug on pyarrow, as that is incorrect.

pyarrow is not the culprit. I believe "convention" is the one to blame. pandas has a to_parquet interface, and it did not accept a schema before pandas-dev/pandas#30270. Even after this improvement, most people use the default schema, which makes every column OPTIONAL (nullable=True in pyarrow).

> I will try to take a look next week at your PR, although I will need to allocate enough time to get up to speed on that crate

I really appreciate your time. My use case requires a row-oriented model, and this crate is still very useful to our project; that is why I raised this issue. I'm willing to spend time polishing the pull request.

@double-free
Author

double-free commented May 17, 2024

Hi @tustvold, have you had a chance to take a look at the PR? I have rebased it onto the latest master to resolve some conflicts.
