breaking: varint encoding for column index and field lengths #314

Open
michaelkirk opened this issue Sep 29, 2023 · 2 comments

Comments

@michaelkirk
Collaborator

Currently column indexes are u16 and field lengths (for String, Binary, etc.) are u32. I expect that in practice column indexes would almost always fit in a 1-byte varint, and field lengths typically in 3 bytes (if not 2).

The properties data is already not random access; it must be processed serially. So there's no loss of functionality there.
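
To make the idea concrete, here is a minimal sketch of what varint-prefixed properties could look like. It assumes an unsigned LEB128 varint (the same scheme protobuf uses), which may not be what the prototype branch does, and it treats every column as a string so the reader doesn't need the header's column types. The names `encode_varint`, `decode_varint`, `write_string_property` and `read_string_properties` are made up for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if value < 0:
        raise ValueError("only unsigned values are supported in this sketch")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits carry the payload
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_string_property(column_index: int, text: str) -> bytes:
    """Hypothetical layout: varint column index, varint byte length, UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint(column_index) + encode_varint(len(data)) + data


def read_string_properties(buf: bytes) -> list[tuple[int, str]]:
    """Walk the buffer front to back, as the properties already must be read."""
    pos = 0
    props = []
    while pos < len(buf):
        column_index, pos = decode_varint(buf, pos)
        length, pos = decode_varint(buf, pos)
        props.append((column_index, buf[pos:pos + length].decode("utf-8")))
        pos += length
    return props
```

Since the reader only ever moves forward through the buffer, variable-width prefixes don't take away anything the fixed-width layout offered.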

This would be a major breaking change, so I don't expect it to be adopted anytime soon, but if you end up making a breaking format release in #81, you should consider piling this on.

I made a prototype here: https://github.com/michaelkirk/flatgeobuf/tree/mkirk/varint

I was working with OpenAddresses data, which is a lot of point geometries with short string columns. Using varints for column indexes and field lengths produces a file 85% the size of the original.
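
For a rough sense of where the savings come from, here is a back-of-the-envelope sketch of the per-property prefix overhead. It assumes unsigned LEB128 varints, and the example feature (five 10-byte string columns) is invented for illustration, not taken from the OpenAddresses data. `varint_len` and the two `*_property_size` helpers are hypothetical names.

```python
def varint_len(value: int) -> int:
    """Number of bytes an unsigned LEB128 varint needs for value."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def fixed_property_size(byte_len: int) -> int:
    """Current layout: u16 column index + u32 length + payload."""
    return 2 + 4 + byte_len


def varint_property_size(column_index: int, byte_len: int) -> int:
    """Assumed varint layout: varint column index + varint length + payload."""
    return varint_len(column_index) + varint_len(byte_len) + byte_len


# Hypothetical feature: five string columns, 10 bytes of UTF-8 each.
fixed_total = sum(fixed_property_size(10) for _ in range(5))
varint_total = sum(varint_property_size(i, 10) for i in range(5))
print(fixed_total, varint_total)
```

With short strings the 6-byte fixed prefix shrinks to 2 bytes per property. Geometry and the rest of the file are unaffected, so the whole-file ratio depends on how much of each feature is properties.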

@bjornharrtell
Member

85%! Ouch... :S I can definitely admit that the properties encoding deserved a bit more thought. I made it quickly after discovering that trying to encode it into a generic FlatBuffers schema was very wasteful of space.

But yeah, a breaking change isn't likely to happen anytime soon, if ever. Might as well make a new format entirely, perhaps with a custom binary encoding. I've been thinking lately, partly because of the discussion at #291, that the primary function of FlatBuffers (and protobuf) is to allow for evolving schemas, but as I see it now that's not an important feature: once a format becomes stable and more or less widespread, there is no room for evolution, even the backwards-compatible kind.

@bjornharrtell
Member

That said, a lot of short string columns perhaps isn't the most clever data representation.
