persist: all* columnar encodings, with new Schema2 trait #27084
This PR introduces new `ColumnEncoder`, `ColumnDecoder`, and `Schema2` traits in Persist. They are very similar to the existing `PartEncoder`, `PartDecoder`, and `Schema` traits, but the new versions do not use the `Data` trait or `DynStruct` at all. Instead of generics, the new traits rely on enums and `arrow-rs` types directly. I opted to make this refactor because the more time I spend in this part of the code base, the more I believe the additional complexity the abstraction introduces is not worth the benefit.

Using the new traits, this PR also introduces a structured columnar encoder that handles all types of `Datum`s. Importantly (for performance), the new encoders downcast only once per Part, just like the existing ones. Micro-benchmarks show the encoding and decoding performance is the same between the two.

One benchmark I added was encoding and decoding JSON. The new structured encoders are about 2.5x faster than the existing decoding via Protobuf.
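To make the "enums instead of generics, downcast once per Part" idea concrete, here is a minimal std-only sketch. It is not the code in this PR: `Int64Column`, `StringColumn`, `ColumnType`, `DatumEncoder`, `DatumDecoder`, and `roundtrip` are hypothetical names, the simplified `Datum` enum has only three variants, and `std::any::Any` stands in for the type-erased `arrow-rs` array that the real traits would work with.

```rust
use std::any::Any;

// Hypothetical stand-ins for concrete arrow-rs arrays (assumption: std-only sketch).
#[derive(Debug, PartialEq)]
struct Int64Column(Vec<Option<i64>>);
#[derive(Debug, PartialEq)]
struct StringColumn(Vec<Option<String>>);

// Simplified stand-in for a row value; the real `Datum` has many more variants.
#[derive(Debug, Clone, PartialEq)]
enum Datum {
    Null,
    Int64(i64),
    String(String),
}

#[derive(Debug, Clone, Copy)]
enum ColumnType {
    Int64,
    String,
}

// The encoder is an enum over concrete builders rather than a generic type.
enum DatumEncoder {
    Int64(Vec<Option<i64>>),
    String(Vec<Option<String>>),
}

impl DatumEncoder {
    fn new(ty: ColumnType) -> Self {
        match ty {
            ColumnType::Int64 => DatumEncoder::Int64(Vec::new()),
            ColumnType::String => DatumEncoder::String(Vec::new()),
        }
    }

    // Appends one value, rejecting datums that do not match the column type.
    fn push(&mut self, datum: &Datum) -> Result<(), String> {
        match (self, datum) {
            (DatumEncoder::Int64(col), Datum::Int64(v)) => col.push(Some(*v)),
            (DatumEncoder::String(col), Datum::String(s)) => col.push(Some(s.clone())),
            (DatumEncoder::Int64(col), Datum::Null) => col.push(None),
            (DatumEncoder::String(col), Datum::Null) => col.push(None),
            (_, other) => return Err(format!("unexpected datum {other:?}")),
        }
        Ok(())
    }

    // Erases the concrete column type, like handing back a dynamically typed array.
    fn finish(self) -> Box<dyn Any> {
        match self {
            DatumEncoder::Int64(col) => Box::new(Int64Column(col)),
            DatumEncoder::String(col) => Box::new(StringColumn(col)),
        }
    }
}

// The decoder holds an already-downcast reference to the concrete column.
enum DatumDecoder<'a> {
    Int64(&'a Int64Column),
    String(&'a StringColumn),
}

impl<'a> DatumDecoder<'a> {
    // The only downcast happens here, once per column, not once per row.
    fn new(ty: ColumnType, col: &'a dyn Any) -> Option<Self> {
        match ty {
            ColumnType::Int64 => col.downcast_ref::<Int64Column>().map(DatumDecoder::Int64),
            ColumnType::String => col.downcast_ref::<StringColumn>().map(DatumDecoder::String),
        }
    }

    // Per-row access is a plain enum match with no dynamic type check.
    fn get(&self, idx: usize) -> Datum {
        match self {
            DatumDecoder::Int64(col) => col.0[idx].map_or(Datum::Null, Datum::Int64),
            DatumDecoder::String(col) => col.0[idx].clone().map_or(Datum::Null, Datum::String),
        }
    }
}

// Encodes `data` into a type-erased column and decodes it back.
fn roundtrip(ty: ColumnType, data: &[Datum]) -> Vec<Datum> {
    let mut encoder = DatumEncoder::new(ty);
    for datum in data {
        encoder.push(datum).expect("datum matches column type");
    }
    let column = encoder.finish();
    let decoder = DatumDecoder::new(ty, &*column).expect("column matches type");
    (0..data.len()).map(|idx| decoder.get(idx)).collect()
}
```

Calling `roundtrip(ColumnType::Int64, &data)` on a slice of `Datum::Int64` and `Datum::Null` values returns the same values. The point of the shape is that the dynamic type check is paid when the decoder is constructed, while the per-row path is only a match on the enum.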
TODOs
There are still a few very small TODOs in this PR, but I figured it would be useful to get it up for an initial review since I'm going to spend some cycles on other work. Those TODOs are:
- `Numeric` encoding. I have a PR up (persist: PackedNumeric encoding #27474) that introduces a possible encoding for `Numeric`, but I need to iterate on it once more. Right now we just encode using `ProtoNumeric`.
- `Range` encoding. I haven't put much thought in here, but right now we're just using `ProtoRange`. `Range`s are not common so it's probably fine to stick with this, but I wanted to call it out.
Motivation
Progress towards: #24830
Simplify our columnar/stats code. Caveat, it's easier to write code than it is to read code! I find this refactoring a bit simpler but other folks might not, which is totally fine!
Tips for reviewer
This PR is split into a few commits, you should spend the most energy on 1 and 2:
`SourceData`, largely just a wrapper around the types in 2
Checklist
`$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.