parquet: ByteArrayEncoder allocates large unused FallbackEncoder for Parquet 2 #5755
Comments
This is likely a holdover from when buffers weren't resizable, itself a holdover from when parquet-rs was ported from parquet-mr. The size of this buffer can probably be drastically reduced without impacting performance.
@tustvold, do you think that would be a valuable contribution (plus a name change for the const)?
Reducing it to 1MB makes a lot of sense to me, given that it is a more typical page size.
I'll open a PR
Describe the bug
ByteArrayEncoder creates a FallbackEncoder unconditionally:

arrow-rs/parquet/src/arrow/arrow_writer/byte_array.rs, line 407 (at 1c86921)

This results in allocating at least 10MB (MAX_BIT_WRITER_SIZE) per column via DeltaBitPackEncoder:

arrow-rs/parquet/src/encodings/encoding/mod.rs, lines 314 to 316 (at 1c86921)

If the fallback encoder isn't used, this results in huge unused allocations.
To Reproduce
Use ArrowWriter to write strings with Parquet version 2, where DELTA_BINARY_PACKED is the default encoding.

Expected behavior
Fewer allocations; maybe lazily create the fallback encoder?
Additional context
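One way to realize the "lazily create" suggestion is to hold the fallback encoder in an `Option` and construct it on first use. The sketch below uses hypothetical stand-in types, not the actual arrow-rs definitions; `FALLBACK_BUFFER_SIZE` stands in for `MAX_BIT_WRITER_SIZE`:

```rust
// Stand-in for MAX_BIT_WRITER_SIZE (10MB), the allocation at issue.
const FALLBACK_BUFFER_SIZE: usize = 10 * 1024 * 1024;

/// Hypothetical fallback encoder that eagerly pre-allocates its buffer.
struct FallbackEncoder {
    buffer: Vec<u8>,
}

impl FallbackEncoder {
    fn new() -> Self {
        // Eager pre-allocation: this is the cost the issue is about.
        Self {
            buffer: Vec::with_capacity(FALLBACK_BUFFER_SIZE),
        }
    }
}

/// Hypothetical byte-array encoder that defers creating the fallback
/// encoder until a value actually needs the fallback path.
struct ByteArrayEncoder {
    fallback: Option<FallbackEncoder>,
}

impl ByteArrayEncoder {
    fn new() -> Self {
        // No allocation here; `fallback` stays `None` until needed.
        Self { fallback: None }
    }

    /// Create the fallback encoder on first use only.
    fn fallback(&mut self) -> &mut FallbackEncoder {
        self.fallback.get_or_insert_with(FallbackEncoder::new)
    }
}

fn main() {
    let mut enc = ByteArrayEncoder::new();
    assert!(enc.fallback.is_none()); // no 10MB allocation yet
    enc.fallback(); // first fallback use allocates
    assert_eq!(
        enc.fallback.as_ref().unwrap().buffer.capacity(),
        FALLBACK_BUFFER_SIZE
    );
}
```

Columns that never hit the fallback path would then pay nothing, at the cost of one `Option` check on the fallback path.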