[parquet_derive] support OPTIONAL (def_level = 1) columns by default #5716
Comments
What happens if you specify the field as `Option<T>`?
I have tried, and it does not work, because `typed.read_records(num_records, None, None, &mut vals)?;` is not dynamic and still provides no definition levels.
Hmm... My knowledge of parquet_derive is limited, but I would have expected an optional field to map to `Option<T>`. It might also be that it simply doesn't support nulls, in which case I'm a little unsure about adding partial support in a manner that might prevent adding such support in the future.
I understand your concern, and that is why I said this in the description:
Yes, I agree it is more natural and rigorous to map an OPTIONAL column to `Option<T>`.
However, in practice, when writing a parquet file, the default attribute of a column is OPTIONAL (see https://arrow.apache.org/docs/python/generated/pyarrow.field.html), no matter whether there are any NULLs. That setting is not even exposed to users in higher-level APIs. So most of the parquet files users deal with have OPTIONAL columns everywhere, even though the values in those columns are in fact all valid. I don't think it is handy for users to declare a struct with every field wrapped in `Option<T>`.

Another point I want to make is that this change DOES NOT mean the reader supports NULL values. The only difference is that if a parquet file is written with OPTIONAL columns, but the values in those columns are in fact all valid, the reader should still be able to load the records into a struct WITHOUT `Option<T>` fields.
This change only relaxes parquet_derive's restriction on its parquet input, without introducing risk, since checks are done after parsing: if the user's input does have NULL values, the parser will panic, as it does now. To sum up, I think this change does not make things worse or unsafe. I really appreciate your time reviewing this issue, and it would be even better if you could invite more experts to review it.
Is this the case if you set …?
So long as we don't regress performance for existing workloads, I suppose this is an acceptable workaround. I will try to take a look at your PR next week, although I will need to allocate enough time to get up to speed on that crate (there isn't really anyone actively maintaining it anymore). FWIW, reading parquet via the arrow interface will be faster, especially for string columns, but I understand if you'd rather stick to a row-oriented model.
I really appreciate your time. My use case requires a row-oriented model, and this crate is still very useful to our project; that's why I raised this issue. I'm willing to spend time polishing this pull request.
Hi @tustvold, have you had a chance to take a look at the PR? I have rebased it onto the latest master to resolve some conflicts.
Problem Description
I'm working on parquet files written by `pyarrow` (embedded in `pandas`), and I came across `parquet_derive`, which avoids boilerplate in my project.

The problem is that it doesn't work on parquet files written by `pandas` with the default setup; it throws an error. After digging into this, I found that the parquet files generated by `pyarrow` have def_level = 1, i.e., every column, even one without any null values, is OPTIONAL.

However, the macro generates code that does not allow a definition level, so it fails to parse OPTIONAL columns even when there are no actual NULL values:
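A minimal sketch of the failing pattern, assuming an i64 column (illustrative, not the exact macro output; the function and variable names are made up):

```rust
use parquet::column::reader::ColumnReader;
use parquet::errors::Result;

// Illustrative sketch: the derived code obtains a typed column reader and
// calls read_records with `None` for both level buffers. That only works
// for REQUIRED columns (max_def_level == 0); an OPTIONAL column
// (max_def_level == 1) needs a definition-level buffer and fails here.
fn read_required_i64(column_reader: ColumnReader, num_records: usize) -> Result<Vec<i64>> {
    let mut vals: Vec<i64> = Vec::new();
    if let ColumnReader::Int64ColumnReader(mut typed) = column_reader {
        typed.read_records(num_records, None, None, &mut vals)?;
    }
    Ok(vals)
}
```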
The API it calls is: https://docs.rs/parquet/latest/parquet/column/reader/struct.GenericColumnReader.html#method.read_records .
My Solution
The solution is straightforward. I have fixed the problem locally, and I'm willing to contribute a pull request, but I don't know whether this solution is reasonable in the scope of the whole `arrow` project.

Basically, I think we need to provide definition levels in `read_records`:
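Continuing the sketch above (variable names are illustrative, not the actual patch):

```rust
// Illustrative sketch of the fix: supply a definition-level buffer so
// read_records also accepts OPTIONAL (max_def_level == 1) columns.
let mut vals: Vec<i64> = Vec::new();
let mut def_levels: Vec<i16> = Vec::new();
let (records_read, values_read, _levels_read) =
    typed.read_records(num_records, Some(&mut def_levels), None, &mut vals)?;
// The struct field is not Option<T>, so every record must carry a value;
// panic if the OPTIONAL column contains real NULLs.
assert_eq!(records_read, values_read, "column contains NULL values");
```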
In short, with this solution, `parquet_derive` can now handle plain (non-`Option`) struct fields backed by OPTIONAL columns:
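For example, a hypothetical record type (the field names and types are made up for illustration):

```rust
use parquet_derive::ParquetRecordReader;

// None of these fields is Option<T>, yet with the change the struct can
// be filled from columns that pyarrow marked OPTIONAL, as long as the
// columns contain no actual NULLs.
#[derive(ParquetRecordReader)]
struct Trade {
    symbol: String,
    price: f64,
    volume: i64,
}
```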
Pros

It passes the tests in `parquet_derive_tests`. I also added checks comparing the number of parsed records against the number of valid records, to avoid abusing it for columns with real NULLs.

Cons
It needs to allocate an extra `Vec<i16>` for each column when invoking `read_from_row_group`. I don't think it is a big deal, though, compared to the inconvenience of not supporting OPTIONAL columns. Moreover, we can make use of `max_def_level` (for a REQUIRED column, it is 0) to skip creating the Vec.
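A sketch of that optimization, assuming access to the column's `ColumnDescriptor` (surrounding names are illustrative):

```rust
// Only allocate the definition-level buffer when the column can actually
// be null: max_def_level() is 0 for a REQUIRED column.
let mut def_levels = if column_descr.max_def_level() > 0 {
    Some(Vec::<i16>::with_capacity(num_records))
} else {
    None
};
let (records_read, values_read, _levels_read) =
    typed.read_records(num_records, def_levels.as_mut(), None, &mut vals)?;
```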