Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-8492: Read Parquet Microsecond Columns as Bigint #2907

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

handmadecode
Copy link
Contributor

DRILL-8492: Read parquet microsecond columns as bigint

Description

Two new configuration options, store.parquet.reader.time_micros_as_int64 and store.parquet.reader.timestamp_micros_as_int64, have been added.

When reading Parquet columns of type time_micros and timestamp_micros, the returned value will be the original 64-bit integer value instead of a timestamp value truncated to milliseconds if the configuration option for the column type is true.

The implementation closely follows how the existing option store.parquet.reader.int96_as_timestamp is handled.

Both new options have the default value false to preserve existing behaviour as default.

Documentation

The new options can be set in "drill-module.conf", and also be set through the Web UI's Options page.

Testing

Unit tests have been added to org.apache.drill.exec.store.parquet.TestMicrosecondColumns.

It could be worth noting that microsecond columns can be compared to time or timestamp literals even if the option for reading the column's value as a 64-bit value is true:

SELECT * FROM file.parquet WHERE TO_TIME(time_micros_column/1000) > '09:32:58.174';
and
SELECT * FROM file.parquet WHERE TO_TIMESTAMP(timestamp_micros_column/1000000) > '2024-04-26 13:17:41.421';

@cgivre cgivre changed the title Read parquet microsecond columns as bigint DRILL-8492: Read Parquet Microsecond Columns as Bigint Apr 26, 2024
@cgivre cgivre added enhancement PRs that add a new functionality to Drill doc-impacting PRs that affect the documentation labels Apr 26, 2024
Copy link
Contributor

@cgivre cgivre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1 Thanks for the contribution!
Can you please update the documentation for the Parquet reader to include this? Otherwise looks good!

@handmadecode
Copy link
Contributor Author

LGTM +1 Thanks for the contribution! Can you please update the documentation for the Parquet reader to include this? Otherwise looks good!

Happy to contribute!

Do you mean the documentation in the drill-site repo? (https://github.com/apache/drill-site/blob/master/_docs/en/data-sources-and-file-formats/040-parquet-format.md)

@cgivre
Copy link
Contributor

cgivre commented May 7, 2024

LGTM +1 Thanks for the contribution! Can you please update the documentation for the Parquet reader to include this? Otherwise looks good!

Happy to contribute!

Do you mean the documentation in the drill-site repo? (https://github.com/apache/drill-site/blob/master/_docs/en/data-sources-and-file-formats/040-parquet-format.md)

That's the one!

Copy link
Contributor

@jnturton jnturton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work. I can backport this too because you've left default behaviour unchanged (and it's self contained). My only question is about ParquetReaderConfig -- you've added switches to the global config system so I don't think we also need them to appear in the storage plugin config?

@handmadecode
Copy link
Contributor Author

Awesome work. I can backport this too because you've left default behaviour unchanged (and it's self contained). My only question is about ParquetReaderConfig -- you've added switches to the global config system so I don't think we also need them to appear in the storage plugin config?

Thanks.

The reason I added the configuration properties to ParquetReaderConfig is that they are needed in org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector::addColumnMetadata (to decide whether the metadata will be compared to timestamps with millisecond precision or to raw microsecond values). I couldn't find an easy way to access the global configuration values from there, but the ParquetReaderConfig instance was already available.

However, I could very well have overlooked some way to access the global configuration, and I'd be grateful for any pointers.

@jnturton
Copy link
Contributor

jnturton commented May 13, 2024

However, I could very well have overlooked some way to access the global configuration, and I'd be grateful for any pointers.

We should existing find examples in the Parquet format plugin. E.g. the "old" reader is affected by the option store.parquet.reader.pagereader.async.

Follow up: I see use of the FragmentContext class for accessing config options in the old Parquet reader, perhaps it's a good a vehicle...

@handmadecode
Copy link
Contributor Author

Follow up: I see use of the FragmentContext class for accessing config options in the old Parquet reader, perhaps it's a good a vehicle...

FragmentContext is used to access the new config options where an instance already was available, i.e. in ColumnReaderFactory, ParquetToDrillTypeConverter, and DrillParquetGroupConverter.

FileMetadataCollector doesn't have access to a FragmentContext, only to a ParquetReaderConfig.
I can only find a FragmentContext in one of the two call paths to FileMetadataCollector::addColumnMetadata, so trying to inject a FragmentContext into FileMetadataCollector will probably have an impact on quite a few other classes.

@jnturton
Copy link
Contributor

jnturton commented May 17, 2024

It's always bugged me that we don't have a global way of accessing at least one of DrillbitContext, QueryContext, FragmentContext or just OptionManager. We hardly want to have to spray these things through APIs everywhere in Drill. I'll take a look at whether something can be done...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-impacting PRs that affect the documentation enhancement PRs that add a new functionality to Drill
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants