New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Alternate APIs with stronger typing #5700
Comments
I believe the narrow crate has some similar goals and might be worth checking out. FWIW I likewise found the lack of strong typing annoying when I first encountered the crate but have come to appreciate it for a number of reasons:
My 2 cents is that arrow makes little sense in statically typed contexts, specialized code will almost always win out both for ergonomics and performance. Perhaps using crates like serde-arrow to convert back and forth where necessary |
Thanks for your reply!
I may eventually be taking this to narrow instead, then, but I thought it might be worthwhile to explore whether the One True Arrow implementation from Apache wants it first. Your implementation has the most visibility, it's what the Apache Arrow web page links to, and it's what people wanting to know what arrow feels like in Rust will end up finding first. It is therefore unfortunate that it has such bad UX, needlessly discouraging people from using arrow in Rust programs.
I don't see how the API that I proposed so far leaks more implementation details than the current design of having &dyns that must be downcasted to specific types (which, in the eyes of the users, are very much implementation details). Just check out the example in the documentation of
If you don't call that implementation details leaking through, I don't know what it is! :)
I agree that having a type erased API is a good thing. What I disagree with is having that as the only option.
Generic code that is written with compile-time performance in mind (with dynamic dispatch in non-perf-critical sections) only significantly increase compilation time if they are instantiated a lot, for a lot of different concrete arguments. This is typically a problem for APIs that take I don't believe this is as much of a problem for container-like types, however, which are what you are building in arrow. The reason is that container types tend to be often instantiated with the same arguments, or in our case to have recursive instantiations that are themselves instantiated with the same arguments. The compiler knows how to deduplicate such instances. Also, it's a lot easier to end up with thousands of calls to iterator adapters with different callback types, than it is to end up with thousands of data columns with unrelated data types on your storage. Further, in a container type whose primary purpose is I/O, like Arrow, you can afford to use a lot more dynamic dispatch internally than a general-purpose container like
Here, you need to bear in mind that the API which I am proposing can be built as a relatively thin layer on top of the existing arrow code. Most complexity would be in the trait machinery required to turn e.g. a push of tuple into multiple pushes of tuple elements followed by a finish. This would largely be resolved at compile time, with run-time mostly dispatching into the optimized code that you already have. And in cases where it's not sufficient (e.g. complex layouts like list of tuples of lists), I'm proposing lower-level APIs that let you more directly target the underlying ListBuilders and StructBuilders for performance, at the expense of worse API ergonomics.
For sure, nothing beats specialized code, but efficient I/O is also a lot of code to write. You obviously put in a lot of care into having an efficient on-disk format and I/O patterns in parquet, and it would be a shame not to be able to reuse it. Or is there an alternative columnar I/O library with support for nested structs and lists that you can suggest which has equally developed Rust support ?
In my opinion, the idea of serde-arrow is elegant, but cannot be implemented efficiently and therefore shouldn't be the only ergonomic option, for multiple reasons:
|
If you wish to just add a strongly typed version of StructBuilder, that would be most welcome and I think well received. The builders have not been a major focus of efforts to this point, few kernels make use of them, and I think if you wanted to make iterative improvements to them that would be most welcome. I have mostly been encouraging people to just use the array constructors. However, the wording of your initial proposal was suggesting a more wholesale refactoring of the array abstractions and that was what I was responding to, aside from being incredibly disruptive to our downstream dependencies, there are real reasons why they are designed the way they are, which I articulated. I would also add that taking a very confrontational/derogatory tone is unlikely to do you favors in this endeavor |
Please excuse my tone, it was triggered by frustration from using the current API. Instead of pointlessly venting it on you, I'll try to use it as motivation fuel to get the API to the state where I would've liked to find it. My intent is indeed to build a strong typing facade that lies on top of the builders and array types that you already have. That is the only viable option from an API compatibility point of view, and it is also necessary because some arrow functionality cannot be exposed in a strongly typed manner (e.g. if people want to open a file whose content they know ~nothing about and query basic metadata or run datafusion SQL queries on it). Basically, in my dream scenario, the arrow-rs user tutorial story would eventually go from the current "use the high-level array API if you can, and the low-level buffer API if you must" advice, to a three-staged "use the high-level strongly typed API if you can, the mid-level dynamically typed array API if you need more dynamism, and the low-level buffers API if you must" advice. |
So, I have a design question. Currently, the Is this considered to be a desirable usage pattern, or should I forbid it in the strongly typed API by only allowing the semantic equivalent of A possible middle ground is to allow it, but not through the highest level API (i.e. if you really want to do it you can do it, but you will need to go through lower-level APIs to do so). |
Its important there is a way to do it, as the spec allows it and certain kernels like nullif could create such constructs, however, I see no reason for a high level API to permit it |
Hi @HadrienG2 My current approach is use derive macros on struct of arrays
This would implement where
The reason for returning a intermediary value with methods instead of a tuple or a struct with values is to not access any memory that the user may not want I also using the same approach for |
Hi @gstvg . Right now I am working on array builders, because that seemed like a better starting point since there was already a stronger-typed API available to constrain the problem. But the intent is to eventually cover array access too. I have strong feelings against using proc macros for this, for reasons that I explained in the OP:
In retrospect, one point which I should have added, although it was somewhat covered by "hard for users to reason about", is "hard to document". With proc macros (or macros in general indeed), it is hard to explain to users what methods the type generated by a macro will have, what these methods do, what is their usage contract if they are unsafe, etc. This is not to say that the process is super-easy with sufficiently generic code, but I think it's much easier and getting there in my prototype. For this reason, my intent is that if I provide proc macros at all, it will only be as an optional extension for support of user-defined types like custom structs and enums.
My preferred way to handle this problem (and a few other issues of row-based access like memory access inefficiency caused by all the transposes and bad compiler optimizations of complex tuples) is to provide slice-based accessors. For example, the data from a builder of tuples Overall, my plan so far has been to submit a WIP prototype here once I'm confident that I have a design that can scale to all the current builder/array types that arrow offers. Right now I have nulls, booleans, But if you want, I can push the code before that, so that you can have an early look at it. |
Motivation
There are a number of things that make using parquet+arrow feel like writing Python in Rust syntax today:
&dyn Any
s that must be downcasted to concrete types that can only be known by checking the documentation.i32
ori64
, even in situations where there is no sane use case for negative offsets.MapBuilder
you must insert a key with one function, insert a value with a different function, and insert an entry with a third function. Forget any of these, or insert unrelated calls in the middle, and either you will get a panic or it will not do what you want.Because of this, the API of the arrow and parquet crates often feels very foreign and frustrating to the Rust developer that is accustomed to strongly typed I/O abstractions like serde. As a result, in a recent project of mine that uses parquet for disk I/O, I ended up choosing to reimplementing a columnar in-memory data format myself, which is only converted to arrow arrays at I/O time, simply because the resulting ergonomics were so much better. In my opinion, this is a failure of the current arrow API design.
While some of this weak typing frustration is sometimes necessary (e.g. I don't think we can allow a user to open and analyze arbitrary parquet files of unknown content without it), it seems to me that in almost every use case, people actually know what is the type of the columns of the files they are reading and writing, and would be willing to write a tiny amount of code to specify it at the time where files are opened or arrays are created, in exchange for the convenience of dealing away with most of the above ceremony.
Proposal
To avoid making the initial post of this issue excessively long, I'll only cover array creation, since that is where arrow is the closest to strong typing today, with structs being the main pain point, and thus the explanation is quickest. However, I believe the API design that I'm describing can scale to all other arrow operations, including e.g. readout of parquet files.
Today, arrow needs about 100 entities (types + typedefs...) in
arrow::builder
to achieve some degree of strong typing, and still does so at the expense of subpar struct support. Instead, I believe better overall ergonomics and struct support could be achieved with a singleTypedBuilder<T>
type, at the expense of accepting the usability tradeoff that not allTypedBuilder
methods are available for all type parametersT
(a Rust API design feature whose use is not terribly common, but still occasionally seen in e.g. the standard library).Let's first see how this works with primitive types and collections of primitive types first, then I'll build upon it to explain how this formalism can be expanded to greatly improve the support for structs in arrow.
Primitive types and nullability
First, given
P
a primitive type, the API ofTypedBuilder<P>
is largely equivalent to that of the existingPrimitiveBuilder
, but with support for nulls disabled (there is no way to insert a null in the array being built).Since we're going through the trouble of redesigning the API, though, I would suggest using the opportunity to switch to "push" instead of "append" as a verb for adding things, because that terminology is universal in Rust's array-like types and arrow-rs is the outlier here.
If you want support for nulls, you can use
TypedBuilder<Option<P>>
instead, which allows for nulls via the standardSome
/None
mechanism. So far, this is all very easy.Lists
Want an array of lists of primitive types? You can use
TypedBuilder<List<P>>
. In this flavor of theTypedBuilder
API, the method for inserting a single element is replaced with two methods:push(&mut self, &[P])
method that appends a new list to the array in a single transaction, orthogonal to other methods. But it requires you to have a pre-made slice, which is not always the case, and therefore you also have...start_push(&mut self) -> ListBuilder<'_, P>
that returns an RAII guard that you can use to push elements to the list one by one. While this RAII guard is active, other methods ofTypedBuilder<List<P>>
, including other calls tostart_push
, cannot be used. Once the RAII guard is dropped, it terminates the current list like the existingarrow::builder::GenericListBuilder::append(true)
workflow.Fixed-sized lists are more interesting because this notion can be mapped into the Rust type system in two different ways:
TypedBuilder<FixedList<P, Const<N>>
. In this flavor of the fixed list API, we can have methods with verify at compile time that the sublists that are being pushed have the right lenght at compile time, using API designs likepush(&mut self, &[P; N])
.TypedBuilder<FixedList<P, Dyn>>
, or evenTypedBuilder<FixedList<P>>
if we makeDyn
a default type parameter ofFixedList
. This flavor of the fixed list API mostly looks and feels like the dynamically sizedTypedBuilder<List<P>>
flavor, except the size of sublists is specified at construction time and checked at insertion time, as in the existingarrow::builder::FixedSizeListBuilder
.Structs
Finally, we get to the point where I can talk about structs, which I believe are the biggest improvement provided by this API redesign.
So far, the arrow-rs API has largely modeled structs as vectors of fields in the Rust type system. This is maximally flexible, because it allows structs to have a number of fields that is not known at compile time. But I believe that in >90% of use cases, this flexibility is not needed, and users would be able to do what they want with an API that models structs as tuples of types. The outcome being, of course, better integration into a strongly typed programming language like Rust.
Unfortunately, this is also the point where we start to hit more serious limitations of the Rust type system, such as lack of variadic generics which prevents the existence of a
Struct<T, U, V, ...>
type. Nothing that can't be worked around with tuples, though.Primitive members
Let's start easy with structs whose members are all primitive types. In the proposed API, this would be modeled as a typed builder of tuples, i.e.
TypedBuilder<(P1, P2, P3)>
.As you would expect, this typed builder would have a
push(&mut self, (P1, P2, P3))
method that takes a tuple of elements. However, as you may know, this can easily result in bad LLVM codegen if many elements need to be inserted.Therefore, for performance-sensitive use cases, we'll also likely want to have an alternate
push_slices(&mut self, (&[P1], &[P2], &[P3]))
bulk insertion method which checks that all slices have the same length and then forwards the slices to the underlying primitive array builders.Non-primitive members
The
push(&mut self, (P1, P2, P3))
API design can scale to any type for which we can devise an element-wisepush()
method. Which, as we've seen, is pretty much always the case. However, it will not always scale efficiently.As I've alluded to above, if you're starting from columnar data, repeatedly pushing tuples of elements is often less efficient than bulk-pushing tuples of slices of elements. That's an LLVM limitation that will hopefully be lifted someday. But what will likely never be efficient on any compiler backend is when the struct fields have inner structure themselves. Think nullable fields, or tuples of lists of tuples.
This inefficiency may not a problem if the eventual goal is to write down the data to disk, because storage is so much slower than RAM that the extra overhead of reshuffling data in RAM is invisible. However, it can become unacceptable if the freshly built array is to be immediately used for in-RAM computations.
In this scenario, what we will likely want is for
TypedBuilder<(T, U, V)>
to have is astart_pushes(&mut self) -> (FieldBuilder<'_, T>, FieldBuilder<'_, U>, FieldBuilder<'_, V>)
method which transforms the builder of tuples into a tuple of builder of fields, essentially exposing the way a struct builder is implemented under the hood.The resulting
FieldBuilder
s, would mostly behave like aTypedBuilder
of the corresponding field type, but also act as an RAII guard that, when dropped, would perform the following validation actions:Since panics in destructors are frowned upon, there would obviously also be a
FieldBuilder::finish(self) -> Result<()>
method that users would be strongly encouraged to use instead of dropping the field builder. The drop-time panic would only be here to detect failures to callfinish()
.User-defined structs?
Some may have noticed that I have taken a slight logical shortcut by saying that arrow structs should be modeled as Rust tuples. Wouldn't the most logical design be to model them as Rust structs? This is actually true, and that would give us, on top of the typing guarantees of tuple builders...
push(&mut self, UserStruct)
that doesn't require round-tripping between the user struct type and a standard tuple type.const fn field<const NAME: &str, T>(&mut self) -> FieldBuilder<'_, T>
that is checked at compile time.However, since Rust doesn't have compile time reflection today, getting there would require implementing some derive macros. That's a big undertaking, for what is today a small benefit, so I don't think it should be done just for struct building support. But we'll come back to this question when we discuss unions.
Maps
Map arrays are basically arrays of lists of (key, value) tuples. Therefore, naive support for them can be provided using the above building blocks, with no extra supporting API.
However, it would be more ergonomic for users if we also provided interoperability with built-in rust map types in the form of a
push_map(&mut self, map: impl IntoIterator<Item = (K, V)>)
method. With this, users would get an easy way to push aHashMap
or aBTreeMap
into the map array.This way may not be optimally efficient from a memory access pattern/SIMD perspective, but it would likely be good enough for storage I/O use cases, and much easier to use. Choices at various points of the ergonomics vs efficiency tradeoff are good!
Unions
What could change my mind on derive macros is the need to also provide support for unions in the
TypedBuilder
API.If arrow structs map to Rust structs, then arrow unions map to Rust enums. However, while the Rust type system has a standard anonymous struct type today in the form of tuples, it does not have a standard anonymous enum type. It wouldn't be too hard to create one, but due to lack of variadic generics, it would have an awkward
Union<(T, U, V)>
signature, and using it would involve horrible visitor APIs in the style of C++'sstd::variant
.Much better ergonomics could be achieved if users could instead slap a
#[derive(Union)]
proc macro on an enum declarations and automatically generate all support code needed for an array of unions whose variants match the Rust enum's variants. Although this does require writing a proc macro...What the proc macro approach would give us is, of course, the ability to accept enum values as easily as a
TypedBuilder<Enum>::push(&mut self, Enum)
call.Alternatives
Statu quo
Even if I hate using it, I will readily admit that there are good reasons why the
arrow-rs
API currently looks the way it does:TypedBuilder
API can only be a higher-level complement of the current APIs, not a full replacement.TypedBuilder
API will require a complex trait-based dispatch machinery in the implementation in order to ultimately offload user requests into the equivalent of today's columnar builders.As a result, sticking with the statu quo is an obvious alternative to what I'm proposing here. The price to pay is that users will not be able to use arrow in scenarios where it should fit their needs, or only at the expense of experiencing very painful API ergonomics.
Dedicated builder types
Even if we do agree that an API with stronger typing is desirable and worthwhile, then the API design I'm proposing is not the only way to do it.
For example, we could keep the current design principle of having dedicated builder types for most element types (MapBuilder, ListBuilder...) to make implementation easier and avoid the awkward API design of having
TypedBuilder<T>
methods depend on whatT
is.This would, however, come at the expense of having plenty of builder types (which I personally found quite confusing when I started learning arrow) and also at the expense of making structs feel more foreign and awkward compared to other types. As far as I know, there is no way to do strongly typed structs in today's Rust without some variation of the tuple-based design that I am proposing.
More macros
The design that I am proposing purposely uses a minimal amount of procedural macros. Procedural macros are hard to write, hard to test, hard to maintain, hard for users to reason about, and hard on IDEs. In my opinion, they should therefore be a last-resort tool that one only uses for when no good alternative is available.
But with all that being said, there is a clear argument to be made that if we're starting to build a mapping between (a subset of) the Rust type system and the Arrow data model, we could go further an map a larger subset of the Rust type system by allowing for user-defined types, like structs and enums.
I briefly alluded to this possibility above, but only touched the surface of what is possible. We could go a lot further with macros, for example by turning some user struct
Foo { bar: Bar }
into a dedicatedFooBuilder
variant ofTypedBuilder
design, which features tailor-made API adaptations for theFoo
struct like abar_field(&mut self) -> BarBuilder<'_>
method.The benefits for users would be easier manipulation of individual struct fields and avoidance of humonguous deeply nested generic types. The drawbacks would be enormous implementation complexity on the arrow-rs side, and a more difficult API documentation story around macro-generated types.
I'm not fully rejecting this idea, but I think that it is not first-priority because...
TypedBuilder
to cleanly support std types anyway.TypedBuilder
work will provide plenty of valuable design and implementation lessons that will come in very handy if the need for a builder generator ever arises.Additional context
This API design proposal comes with the important disclaimer that I have not started any implementation work, not even a prototype, because I wanted to see how open you were to the general idea before investing significant time and effort into making it work.
As a result, there may be unforeseen implementation difficulties, or design problems that emerge only in more advanced use cases than those which I have thought about while drafting this on virtual paper. If you decide that this feature is worth having, it may evolve significantly from implementation feedback before it is eventually shipped to users.
There are also entire aspects of the proposed API that I am handwaving away as "basically the same idea, but applied in a different context", for example binary vs list or how reading strongly typed data from a
RecordBatch
would work. I did this so this issue would be shorter, and thus easier for me to write and for you to read. But it may also mean that there are design problems in this area that will only become apparent as I start working on it.If you're interested, I can try to write a more formal and complete spec to reduce the amount of unexplored design regions. I just preferred not to do so before knowing if there is any interest from your side.
Finally, when it comes to manpower, I'm a lone developer working on my spare time with plenty of others already running personal projects competing for my time. So if I end up being the one who implements this feature, it may take a long while to see the light of the day, if at all. Implementation assistance from other interested people with more time/financial resources would therefore be much appreciated.
The text was updated successfully, but these errors were encountered: