persist: all* columnar encodings, with new `Schema2` trait #27084

ParkMyCar · 2024-05-14T16:16:51Z

This PR introduces new ColumnEncoder, ColumnDecoder, and Schema2 traits in Persist. They are very similar to the existing PartEncoder, PartDecoder, and Schema traits, but the new versions do not use the Data trait or DynStruct at all. Instead of generics, the new traits rely on enums and arrow-rs types directly. I opted to make this refactor because the more time I've spent in this part of the code base, the more I've come to believe that the additional complexity the abstraction introduces is not worth the benefit.
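To illustrate the "enums instead of generics" idea, here is a minimal std-only sketch. Every name in it (`ToyValue`, `ToyColumn`, `ToyEncoder`) is hypothetical and made up for illustration; it is not the PR's actual ColumnEncoder or Schema2 definition, and plain `Vec`s stand in for arrow-rs arrays.

```rust
// Hypothetical sketch: none of these types are the PR's real definitions.

/// A toy value; stands in for a Datum in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyValue {
    Null,
    Int64(i64),
    Utf8(String),
}

/// A finished column; a plain Vec stands in for an arrow-rs array.
#[derive(Debug, PartialEq)]
pub enum ToyColumn {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

/// An encoder that dispatches on an enum variant rather than a generic parameter.
pub enum ToyEncoder {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

impl ToyEncoder {
    /// Appends one value, or returns an error if it does not fit the column type.
    pub fn append(&mut self, value: &ToyValue) -> Result<(), String> {
        match (self, value) {
            (ToyEncoder::Int64(col), ToyValue::Int64(v)) => col.push(Some(*v)),
            (ToyEncoder::Int64(col), ToyValue::Null) => col.push(None),
            (ToyEncoder::Utf8(col), ToyValue::Utf8(s)) => col.push(Some(s.clone())),
            (ToyEncoder::Utf8(col), ToyValue::Null) => col.push(None),
            (_, other) => return Err(format!("type mismatch for value {other:?}")),
        }
        Ok(())
    }

    /// Consumes the encoder and returns the finished column.
    pub fn finish(self) -> ToyColumn {
        match self {
            ToyEncoder::Int64(col) => ToyColumn::Int64(col),
            ToyEncoder::Utf8(col) => ToyColumn::Utf8(col),
        }
    }
}
```

The trade-off in a design like this is that the set of column types is closed and visible in one place, at the cost of a `match` where a generic would otherwise be monomorphized.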

Using the new traits, this PR also introduces a structured columnar encoder that handles all types of Datums. Importantly (for performance), the new encoders downcast only once per Part, just like the existing ones. Micro-benchmarks show the encoding and decoding performance is exactly the same between the two.
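The "downcast only once per Part" point can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `std::any::Any` stands in for a dynamically typed arrow-rs array, and `Int64Decoder` and `sum_downcast_per_row` are hypothetical names.

```rust
use std::any::Any;

/// Pattern to avoid: repeat the type check and downcast for every row.
pub fn sum_downcast_per_row(col: &dyn Any, len: usize) -> i64 {
    let mut total = 0;
    for idx in 0..len {
        // This downcast runs once per row.
        let typed = col
            .downcast_ref::<Vec<i64>>()
            .expect("expected an i64 column");
        total += typed[idx];
    }
    total
}

/// A decoder that downcasts once at construction and then keeps the
/// concretely typed slice for all later row accesses.
pub struct Int64Decoder<'a> {
    values: &'a [i64],
}

impl<'a> Int64Decoder<'a> {
    /// Performs the single downcast; returns None if the column has another type.
    pub fn new(col: &'a dyn Any) -> Option<Self> {
        let typed = col.downcast_ref::<Vec<i64>>()?;
        Some(Int64Decoder { values: typed.as_slice() })
    }

    /// Reads one row with no further dynamic type checks.
    pub fn decode(&self, idx: usize) -> i64 {
        self.values[idx]
    }
}
```

Constructing one `Int64Decoder` per column and calling `decode` in the row loop keeps the dynamic check out of the hot path.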

One benchmark I added was encoding and decoding JSON. The new structured encoders are about 2.5x faster than the existing decoding via Protobuf.

TODOs

There are still a few very small TODOs in this PR, but I figured it would be useful to get it up for an initial review since I'm going to spend some cycles on other work. Those TODOs are:

- Final Numeric encoding. I have a PR up (persist: PackedNumeric encoding #27474) that introduces a possible encoding for Numeric, but I need to iterate on it once more. Right now we just encode using ProtoNumeric.
- Final Range encoding. I haven't put much thought into this yet; right now we're just using ProtoRange. Ranges are not common, so it's probably fine to stick with this, but I wanted to call it out.
- More benchmarks.
- Statistics. I still need to implement statistics for all of the new types.

Motivation

Progress towards: #24830

Simplify our columnar/stats code. Caveat: it's easier to write code than it is to read code! I find this refactoring a bit simpler, but other folks might not, which is totally fine!

Tips for reviewer

This PR is split into a few commits; you should spend the most energy on 1 and 2:

1. New traits, and persist specific implementations for them.
2. New Columnar Datum encoders and decoders.
3. Small supporting bits for the encoders.
4. Encoder and decoder for SourceData, largely just a wrapper around the types in 2.
5. Test changes, additional benchmarks, and generated files.

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- N/a

ParkMyCar · 2024-06-07T16:29:42Z

Removed Dan because he's on parental leave; more than happy to make any follow-up changes if he has future comments!

ParkMyCar · 2024-06-07T16:31:47Z

@rjobanp if you have some time, I would appreciate a quick glance at commit 2, which is all of the Arrow-related stuff.

ParkMyCar requested review from danhhz and bkirwi May 14, 2024 16:17

ParkMyCar force-pushed the persist/refactor-out-dynstruct branch from cc566de to 9377a89 Compare May 14, 2024 16:22

ParkMyCar force-pushed the persist/refactor-out-dynstruct branch from 4d68c76 to fe27dab Compare June 7, 2024 16:10

ParkMyCar added 6 commits June 7, 2024 12:11

start, Persist types and traits, introduces the Schema2 trait

5b5310b

new DatumColumnar encoders and decoders

dd8d0c4

small supporting bits to the Datum encoders and decoders

52e7a50

storage specific types, e.g. encoder for SourceData

f8f2ed3

tests, benches, and generated files

b4c396b

clippy fixes

0925b78

ParkMyCar force-pushed the persist/refactor-out-dynstruct branch from fe27dab to 0925b78 Compare June 7, 2024 16:11

ParkMyCar changed the title ~~[dnm] persist: explore removing DynStruct~~ persist: all* columnar encodings, with new Schema2 trait Jun 7, 2024

ParkMyCar removed the request for review from danhhz June 7, 2024 16:28

ParkMyCar marked this pull request as ready for review June 7, 2024 16:29

ParkMyCar requested review from a team as code owners June 7, 2024 16:29

ParkMyCar requested a review from rjobanp June 7, 2024 16:29
