Improve file metadata lookup in Parquet SDF #4979

Open
clairemcginty opened this issue Sep 5, 2023 · 0 comments
Labels
enhancement New feature or request parquet

When using the new Parquet SplittableDoFn implementation to read a large number of files, the file metadata lookup (required to break individual files down into parallelizable row groups) can be a performance bottleneck because it is effectively single-threaded and sequential: in the worker graph, you'll see a single worker doing nothing but metadata lookups for 10-20 min before the actual splitting operations kick in. Setting the ParquetReadConfiguration.SplitGranularityFile option avoids this, but at the cost of available parallelism.

Can we improve this? Some ideas:

  1. Simplest -- just do file lookups in parallel.
  2. Introduce an option like ParquetReadConfiguration.UseEstimatedRowGroupSize -- basically, instead of reading every file's metadata, we can just sample a few files, and use their average value to extrapolate the rest.
  3. Write some kind of manifest file or metastore entry that maps each file to [number of row groups, row group byte size].
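
To make ideas 1 and 2 concrete, here is a rough, language-neutral sketch in plain Python rather than actual Scio code. Everything in it is hypothetical: `read_footer` stands in for whatever call reads a single file's Parquet footer, and `FileStats`, `lookup_parallel`, and `estimate_from_sample` are illustrative names, not existing APIs.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical shape: (number of row groups, average row-group size in bytes).
FileStats = Tuple[int, int]


def lookup_parallel(
    paths: List[str],
    read_footer: Callable[[str], FileStats],
    max_workers: int = 16,
) -> Dict[str, FileStats]:
    """Idea 1: issue the per-file footer reads concurrently, not one by one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        return dict(zip(paths, pool.map(read_footer, paths)))


def estimate_from_sample(
    paths: List[str],
    read_footer: Callable[[str], FileStats],
    sample_size: int = 10,
    seed: int = 0,
) -> Dict[str, FileStats]:
    """Idea 2: read a sample of footers and apply the sample mean to the rest."""
    if not paths:
        return {}
    rng = random.Random(seed)
    sample = rng.sample(paths, min(sample_size, len(paths)))
    stats = lookup_parallel(sample, read_footer)
    n = len(stats)
    avg_groups = max(1, round(sum(g for g, _ in stats.values()) / n))
    avg_bytes = round(sum(b for _, b in stats.values()) / n)
    # Sampled files keep their exact stats; all others get the estimate.
    return {p: stats.get(p, (avg_groups, avg_bytes)) for p in paths}
```

The second function does a fixed number of footer reads regardless of how many files there are, but the estimate is only as good as the sample: if row group counts vary a lot across files, the resulting splits would probably be uneven.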