Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

persist: all* columnar encodings, with new Schema2 trait #27084

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

ParkMyCar
Copy link
Member

@ParkMyCar ParkMyCar commented May 14, 2024

This PR introduces new ColumnEncoder, ColumnDecoder, and Schema2 traits in Persist, which are very similar to the existing PartEncoder, PartDecoder, and Schema traits, but the new version does not use the Data trait or DynStruct at all. Instead of generics the new traits rely on enums and arrow-rs types directly. I opted to make this refactor because the more time I've spent in this part of the code base, I've started to believe the additional complexity the abstraction introduces is not worth the benefit.

Using the new traits, this PR also introduces a structured columnar encoder that handles all types of Datums. Importantly (for performance) the new encoders downcast only once per Part, just the existing ones. Micro-benchmarks show the encoding and decoding performance is exactly the same between the two.

One benchmark I added was encoding and decoding JSON. The new structured encoders are about 2.5x faster than the existing decoding via Protobuf.

TODOs

There are still a few very small TODOs in this PR, but I figured it would be useful to get it up for an initial review since I'm going to spend some cycles on other work. Those TODOs are:

  • Final Numeric encoding. I have a PR up persist: PackedNumeric encoding #27474 that introduces a possible encoding for Numeric, but need to iterate once more on it. Right now we just encoding using ProtoNumeric.
  • Final Range encoding. Haven't put much thought in here, but right now we're just using ProtoRange. Ranges are not common so it's probably fine to stick with this, but wanted to call it out.
  • More benchmarks.
  • Statistics. Still need to implement statistics for all of the new types.

Motivation

Progress towards: #24830

Simplify our columnar/stats code. Caveat, it's easier to write code than it is to read code! I find this refactoring a bit simpler but other folks might not, which is totally fine!

Tips for reviewer

This PR is split into a few commits, you should spend the most energy on 1 and 2:

  1. New traits, and persist specific implementations for them.
  2. New Columnar Datum encoders and decoders.
  3. Small supporting bits for the encoders
  4. Encoder and decoder for SourceData, largely just a wrapper around the types in 2
  5. Test changes, additional benchmarks, and generated files

Checklist

@ParkMyCar ParkMyCar requested review from danhhz and bkirwi May 14, 2024 16:17
@ParkMyCar ParkMyCar force-pushed the persist/refactor-out-dynstruct branch from cc566de to 9377a89 Compare May 14, 2024 16:22
@ParkMyCar ParkMyCar force-pushed the persist/refactor-out-dynstruct branch from 4d68c76 to fe27dab Compare June 7, 2024 16:10
@ParkMyCar ParkMyCar force-pushed the persist/refactor-out-dynstruct branch from fe27dab to 0925b78 Compare June 7, 2024 16:11
@ParkMyCar ParkMyCar changed the title [dnm] persist: explore removing DynStruct persist: all* columnar encodings, with new Schema2 trait Jun 7, 2024
@ParkMyCar ParkMyCar removed the request for review from danhhz June 7, 2024 16:28
@ParkMyCar
Copy link
Member Author

Removed Dan because he's on parental leave, more than happy to make any follow up changes if he has future comments!

@ParkMyCar ParkMyCar marked this pull request as ready for review June 7, 2024 16:29
@ParkMyCar ParkMyCar requested review from a team as code owners June 7, 2024 16:29
@ParkMyCar ParkMyCar requested a review from rjobanp June 7, 2024 16:29
@ParkMyCar
Copy link
Member Author

@rjobanp if you have some time, I would appreciate a quick glance at commit 2 which is all of the Arrow related stuff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant