Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Alternate APIs with stronger typing #5700

Open
HadrienG2 opened this issue Apr 29, 2024 · 8 comments
Open

Proposal: Alternate APIs with stronger typing #5700

HadrienG2 opened this issue Apr 29, 2024 · 8 comments
Labels
question Further information is requested

Comments

@HadrienG2
Copy link
Contributor

HadrienG2 commented Apr 29, 2024

Motivation

There are a number of things that make using parquet+arrow feel like writing Python in Rust syntax today:

  • Everywhere one needs to deal with &dyn Anys that must be downcasted to concrete types that can only be known by checking the documentation.
  • Everything is nullable, often without an easy way to opt out of nullability.
  • Every offset is i32 or i64, even in situations where there is no sane use case for negative offsets.
  • API methods are not orthogonal, for example if you want to insert an entry in a MapBuilder you must insert a key with one function, insert a value with a different function, and insert an entry with a third function. Forget any of these, or insert unrelated calls in the middle, and either you will get a panic or it will not do what you want.
  • More generally, to get to the thing you want when reading a file, you often need to go through multiple layers of faillible downcasting of weak types to strong types, when ideally you would want instead to specify the type schema early on, check for errors there, and be done with it.

Because of this, the API of the arrow and parquet crates often feels very foreign and frustrating to the Rust developer that is accustomed to strongly typed I/O abstractions like serde. As a result, in a recent project of mine that uses parquet for disk I/O, I ended up choosing to reimplementing a columnar in-memory data format myself, which is only converted to arrow arrays at I/O time, simply because the resulting ergonomics were so much better. In my opinion, this is a failure of the current arrow API design.

While some of this weak typing frustration is sometimes necessary (e.g. I don't think we can allow a user to open and analyze arbitrary parquet files of unknown content without it), it seems to me that in almost every use case, people actually know what is the type of the columns of the files they are reading and writing, and would be willing to write a tiny amount of code to specify it at the time where files are opened or arrays are created, in exchange for the convenience of dealing away with most of the above ceremony.

Proposal

To avoid making the initial post of this issue excessively long, I'll only cover array creation, since that is where arrow is the closest to strong typing today, with structs being the main pain point, and thus the explanation is quickest. However, I believe the API design that I'm describing can scale to all other arrow operations, including e.g. readout of parquet files.

Today, arrow needs about 100 entities (types + typedefs...) in arrow::builder to achieve some degree of strong typing, and still does so at the expense of subpar struct support. Instead, I believe better overall ergonomics and struct support could be achieved with a single TypedBuilder<T> type, at the expense of accepting the usability tradeoff that not all TypedBuilder methods are available for all type parameters T (a Rust API design feature whose use is not terribly common, but still occasionally seen in e.g. the standard library).

Let's first see how this works with primitive types and collections of primitive types first, then I'll build upon it to explain how this formalism can be expanded to greatly improve the support for structs in arrow.

Primitive types and nullability

First, given P a primitive type, the API of TypedBuilder<P> is largely equivalent to that of the existing PrimitiveBuilder, but with support for nulls disabled (there is no way to insert a null in the array being built).

Since we're going through the trouble of redesigning the API, though, I would suggest using the opportunity to switch to "push" instead of "append" as a verb for adding things, because that terminology is universal in Rust's array-like types and arrow-rs is the outlier here.

If you want support for nulls, you can use TypedBuilder<Option<P>> instead, which allows for nulls via the standard Some/None mechanism. So far, this is all very easy.

Lists

Want an array of lists of primitive types? You can use TypedBuilder<List<P>>. In this flavor of the TypedBuilder API, the method for inserting a single element is replaced with two methods:

  • A push(&mut self, &[P]) method that appends a new list to the array in a single transaction, orthogonal to other methods. But it requires you to have a pre-made slice, which is not always the case, and therefore you also have...
  • A start_push(&mut self) -> ListBuilder<'_, P> that returns an RAII guard that you can use to push elements to the list one by one. While this RAII guard is active, other methods of TypedBuilder<List<P>>, including other calls to start_push, cannot be used. Once the RAII guard is dropped, it terminates the current list like the existing arrow::builder::GenericListBuilder::append(true) workflow.

Fixed-sized lists are more interesting because this notion can be mapped into the Rust type system in two different ways:

  • One flavor where the size of the list is known at compile time, which we could model as TypedBuilder<FixedList<P, Const<N>>. In this flavor of the fixed list API, we can have methods with verify at compile time that the sublists that are being pushed have the right lenght at compile time, using API designs like push(&mut self, &[P; N]).
  • One flavor where the size of the list is not known at compile time, which we could model as TypedBuilder<FixedList<P, Dyn>>, or even TypedBuilder<FixedList<P>> if we make Dyn a default type parameter of FixedList. This flavor of the fixed list API mostly looks and feels like the dynamically sized TypedBuilder<List<P>> flavor, except the size of sublists is specified at construction time and checked at insertion time, as in the existing arrow::builder::FixedSizeListBuilder.

Structs

Finally, we get to the point where I can talk about structs, which I believe are the biggest improvement provided by this API redesign.

So far, the arrow-rs API has largely modeled structs as vectors of fields in the Rust type system. This is maximally flexible, because it allows structs to have a number of fields that is not known at compile time. But I believe that in >90% of use cases, this flexibility is not needed, and users would be able to do what they want with an API that models structs as tuples of types. The outcome being, of course, better integration into a strongly typed programming language like Rust.

Unfortunately, this is also the point where we start to hit more serious limitations of the Rust type system, such as lack of variadic generics which prevents the existence of a Struct<T, U, V, ...> type. Nothing that can't be worked around with tuples, though.

Primitive members

Let's start easy with structs whose members are all primitive types. In the proposed API, this would be modeled as a typed builder of tuples, i.e. TypedBuilder<(P1, P2, P3)>.

As you would expect, this typed builder would have a push(&mut self, (P1, P2, P3)) method that takes a tuple of elements. However, as you may know, this can easily result in bad LLVM codegen if many elements need to be inserted.

Therefore, for performance-sensitive use cases, we'll also likely want to have an alternate push_slices(&mut self, (&[P1], &[P2], &[P3])) bulk insertion method which checks that all slices have the same length and then forwards the slices to the underlying primitive array builders.

Non-primitive members

The push(&mut self, (P1, P2, P3)) API design can scale to any type for which we can devise an element-wise push() method. Which, as we've seen, is pretty much always the case. However, it will not always scale efficiently.

As I've alluded to above, if you're starting from columnar data, repeatedly pushing tuples of elements is often less efficient than bulk-pushing tuples of slices of elements. That's an LLVM limitation that will hopefully be lifted someday. But what will likely never be efficient on any compiler backend is when the struct fields have inner structure themselves. Think nullable fields, or tuples of lists of tuples.

This inefficiency may not a problem if the eventual goal is to write down the data to disk, because storage is so much slower than RAM that the extra overhead of reshuffling data in RAM is invisible. However, it can become unacceptable if the freshly built array is to be immediately used for in-RAM computations.

In this scenario, what we will likely want is for TypedBuilder<(T, U, V)> to have is a start_pushes(&mut self) -> (FieldBuilder<'_, T>, FieldBuilder<'_, U>, FieldBuilder<'_, V>) method which transforms the builder of tuples into a tuple of builder of fields, essentially exposing the way a struct builder is implemented under the hood.

The resulting FieldBuilders, would mostly behave like a TypedBuilder of the corresponding field type, but also act as an RAII guard that, when dropped, would perform the following validation actions:

  • If it is the first FieldBuilder to be dropped, record the number of elements that were pushed into the underlying StructBuilder.
  • If it is not the first FieldBuilder to be dropped, check that the same number of elements were pushed as into the other FieldBuilders, otherwise panic.

Since panics in destructors are frowned upon, there would obviously also be a FieldBuilder::finish(self) -> Result<()> method that users would be strongly encouraged to use instead of dropping the field builder. The drop-time panic would only be here to detect failures to call finish().

User-defined structs?

Some may have noticed that I have taken a slight logical shortcut by saying that arrow structs should be modeled as Rust tuples. Wouldn't the most logical design be to model them as Rust structs? This is actually true, and that would give us, on top of the typing guarantees of tuple builders...

  • Automatic StructArray column names that repeat the names of struct fields.
  • A convenience push(&mut self, UserStruct) that doesn't require round-tripping between the user struct type and a standard tuple type.
  • In future Rust where const generics are more powerful, we could also envision things like a const fn field<const NAME: &str, T>(&mut self) -> FieldBuilder<'_, T> that is checked at compile time.

However, since Rust doesn't have compile time reflection today, getting there would require implementing some derive macros. That's a big undertaking, for what is today a small benefit, so I don't think it should be done just for struct building support. But we'll come back to this question when we discuss unions.

Maps

Map arrays are basically arrays of lists of (key, value) tuples. Therefore, naive support for them can be provided using the above building blocks, with no extra supporting API.

However, it would be more ergonomic for users if we also provided interoperability with built-in rust map types in the form of a push_map(&mut self, map: impl IntoIterator<Item = (K, V)>) method. With this, users would get an easy way to push a HashMap or a BTreeMap into the map array.

This way may not be optimally efficient from a memory access pattern/SIMD perspective, but it would likely be good enough for storage I/O use cases, and much easier to use. Choices at various points of the ergonomics vs efficiency tradeoff are good!

Unions

What could change my mind on derive macros is the need to also provide support for unions in the TypedBuilder API.

If arrow structs map to Rust structs, then arrow unions map to Rust enums. However, while the Rust type system has a standard anonymous struct type today in the form of tuples, it does not have a standard anonymous enum type. It wouldn't be too hard to create one, but due to lack of variadic generics, it would have an awkward Union<(T, U, V)> signature, and using it would involve horrible visitor APIs in the style of C++'s std::variant.

Much better ergonomics could be achieved if users could instead slap a #[derive(Union)] proc macro on an enum declarations and automatically generate all support code needed for an array of unions whose variants match the Rust enum's variants. Although this does require writing a proc macro...

What the proc macro approach would give us is, of course, the ability to accept enum values as easily as a TypedBuilder<Enum>::push(&mut self, Enum) call.

Alternatives

Statu quo

Even if I hate using it, I will readily admit that there are good reasons why the arrow-rs API currently looks the way it does:

  • It's more general. Very advanced use cases like files with dynamically known numbers of array columns, or structs with dynamically known number of fields, cannot be supported without a dynamically typed approach. Because of this, the proposed TypedBuilder API can only be a higher-level complement of the current APIs, not a full replacement.
  • It's easier to implement. Although much easier from a user's perspective, the proposed TypedBuilder API will require a complex trait-based dispatch machinery in the implementation in order to ultimately offload user requests into the equivalent of today's columnar builders.

As a result, sticking with the statu quo is an obvious alternative to what I'm proposing here. The price to pay is that users will not be able to use arrow in scenarios where it should fit their needs, or only at the expense of experiencing very painful API ergonomics.

Dedicated builder types

Even if we do agree that an API with stronger typing is desirable and worthwhile, then the API design I'm proposing is not the only way to do it.

For example, we could keep the current design principle of having dedicated builder types for most element types (MapBuilder, ListBuilder...) to make implementation easier and avoid the awkward API design of having TypedBuilder<T> methods depend on what T is.

This would, however, come at the expense of having plenty of builder types (which I personally found quite confusing when I started learning arrow) and also at the expense of making structs feel more foreign and awkward compared to other types. As far as I know, there is no way to do strongly typed structs in today's Rust without some variation of the tuple-based design that I am proposing.

More macros

The design that I am proposing purposely uses a minimal amount of procedural macros. Procedural macros are hard to write, hard to test, hard to maintain, hard for users to reason about, and hard on IDEs. In my opinion, they should therefore be a last-resort tool that one only uses for when no good alternative is available.

But with all that being said, there is a clear argument to be made that if we're starting to build a mapping between (a subset of) the Rust type system and the Arrow data model, we could go further an map a larger subset of the Rust type system by allowing for user-defined types, like structs and enums.

I briefly alluded to this possibility above, but only touched the surface of what is possible. We could go a lot further with macros, for example by turning some user struct Foo { bar: Bar } into a dedicated FooBuilder variant of TypedBuilder design, which features tailor-made API adaptations for the Foo struct like a bar_field(&mut self) -> BarBuilder<'_> method.

The benefits for users would be easier manipulation of individual struct fields and avoidance of humonguous deeply nested generic types. The drawbacks would be enormous implementation complexity on the arrow-rs side, and a more difficult API documentation story around macro-generated types.

I'm not fully rejecting this idea, but I think that it is not first-priority because...

  1. Due to orphan rules, we're going to need something like TypedBuilder to cleanly support std types anyway.
  2. Making TypedBuilder work will provide plenty of valuable design and implementation lessons that will come in very handy if the need for a builder generator ever arises.

Additional context

This API design proposal comes with the important disclaimer that I have not started any implementation work, not even a prototype, because I wanted to see how open you were to the general idea before investing significant time and effort into making it work.

As a result, there may be unforeseen implementation difficulties, or design problems that emerge only in more advanced use cases than those which I have thought about while drafting this on virtual paper. If you decide that this feature is worth having, it may evolve significantly from implementation feedback before it is eventually shipped to users.

There are also entire aspects of the proposed API that I am handwaving away as "basically the same idea, but applied in a different context", for example binary vs list or how reading strongly typed data from a RecordBatch would work. I did this so this issue would be shorter, and thus easier for me to write and for you to read. But it may also mean that there are design problems in this area that will only become apparent as I start working on it.

If you're interested, I can try to write a more formal and complete spec to reduce the amount of unexplored design regions. I just preferred not to do so before knowing if there is any interest from your side.

Finally, when it comes to manpower, I'm a lone developer working on my spare time with plenty of others already running personal projects competing for my time. So if I end up being the one who implements this feature, it may take a long while to see the light of the day, if at all. Implementation assistance from other interested people with more time/financial resources would therefore be much appreciated.

@HadrienG2 HadrienG2 added the enhancement Any new improvement worthy of a entry in the changelog label Apr 29, 2024
@tustvold
Copy link
Contributor

I believe the narrow crate has some similar goals and might be worth checking out.

FWIW I likewise found the lack of strong typing annoying when I first encountered the crate but have come to appreciate it for a number of reasons:

  • API compatibility: we still struggle to preserve this, but it gets infinitely harder with generics exposing what are often implementation details
  • Type Erasure: especially when working on query engines, you very often don't know and don't want to have to know what the type of something is. Downcasting to the concrete type is obnoxious, verbose, and macro heavy
  • Compilation Time: by not exposing the generics we can ensure they get instantiated once, and use tricks to reduce the amount of code that gets generated. Moving away from generics significantly improved compilation times
  • Codegen: LLVM is a very finicky beast, and a lot of care has gone into making sure it properly vectorises code. This is very hard to do with generics that are instantiated in the call site

My 2 cents is that arrow makes little sense in statically typed contexts, specialized code will almost always win out both for ergonomics and performance. Perhaps using crates like serde-arrow to convert back and forth where necessary

@tustvold tustvold added question Further information is requested and removed enhancement Any new improvement worthy of a entry in the changelog labels Apr 29, 2024
@HadrienG2
Copy link
Contributor Author

HadrienG2 commented Apr 29, 2024

Thanks for your reply!

I believe the narrow crate has some similar goals and might be worth checking out.

I may eventually be taking this to narrow instead, then, but I thought it might be worthwhile to explore whether the One True Arrow implementation from Apache wants it first. Your implementation has the most visibility, it's what the Apache Arrow web page links to, and it's what people wanting to know what arrow feels like in Rust will end up finding first. It is therefore unfortunate that it has such bad UX, needlessly discouraging people from using arrow in Rust programs.


API compatibility: we still struggle to preserve this, but it gets infinitely harder with generics exposing what are often implementation details

I don't see how the API that I proposed so far leaks more implementation details than the current design of having &dyns that must be downcasted to specific types (which, in the eyes of the users, are very much implementation details).

Just check out the example in the documentation of StructBuilder, specifically this comment:

We can't obtain the ListBuilder with the expected generic types, because under the hood
the StructBuilder was returned as a Box and passed as such to the ListBuilder constructor

If you don't call that implementation details leaking through, I don't know what it is! :)


Type Erasure: especially when working on query engines, you very often don't know and don't want to have to know what the type of something is. Downcasting to the concrete type is obnoxious, verbose, and macro heavy

I agree that having a type erased API is a good thing. What I disagree with is having that as the only option.


Compilation Time: by not exposing the generics we can ensure they get instantiated once, and use tricks to reduce the amount of code that gets generated. Moving away from generics significantly improved compilation times

Generic code that is written with compile-time performance in mind (with dynamic dispatch in non-perf-critical sections) only significantly increase compilation time if they are instantiated a lot, for a lot of different concrete arguments. This is typically a problem for APIs that take impl Fn(), are used with many different callback types, and cannot afford to use &dyn for performance reasons, like iterator adapters.

I don't believe this is as much of a problem for container-like types, however, which are what you are building in arrow. The reason is that container types tend to be often instantiated with the same arguments, or in our case to have recursive instantiations that are themselves instantiated with the same arguments. The compiler knows how to deduplicate such instances.

Also, it's a lot easier to end up with thousands of calls to iterator adapters with different callback types, than it is to end up with thousands of data columns with unrelated data types on your storage.

Further, in a container type whose primary purpose is I/O, like Arrow, you can afford to use a lot more dynamic dispatch internally than a general-purpose container like Vec or BTreeMap would, further reducing the cost of instantiating a generic type with a new type argument.


Codegen: LLVM is a very finicky beast, and a lot of care has gone into making sure it properly vectorises code. This is very hard to do with generics that are instantiated in the call site

Here, you need to bear in mind that the API which I am proposing can be built as a relatively thin layer on top of the existing arrow code. Most complexity would be in the trait machinery required to turn e.g. a push of tuple into multiple pushes of tuple elements followed by a finish.

This would largely be resolved at compile time, with run-time mostly dispatching into the optimized code that you already have. And in cases where it's not sufficient (e.g. complex layouts like list of tuples of lists), I'm proposing lower-level APIs that let you more directly target the underlying ListBuilders and StructBuilders for performance, at the expense of worse API ergonomics.


My 2 cents is that arrow makes little sense in statically typed contexts, specialized code will almost always win out both for ergonomics and performance.

For sure, nothing beats specialized code, but efficient I/O is also a lot of code to write. You obviously put in a lot of care into having an efficient on-disk format and I/O patterns in parquet, and it would be a shame not to be able to reuse it. Or is there an alternative columnar I/O library with support for nested structs and lists that you can suggest which has equally developed Rust support ?

Perhaps using crates like serde-arrow to convert back and forth where necessary

In my opinion, the idea of serde-arrow is elegant, but cannot be implemented efficiently and therefore shouldn't be the only ergonomic option, for multiple reasons:

  • serde works row-wise, and converting row-wise data to columnar data can get expensive very quickly.
  • serde is only easy to use when using owned container types. So readout of e.g. nested lists is going to require tons of allocations, which can become costly if the underlying storage medium is sufficiently fast (think modern SSD).

@tustvold
Copy link
Contributor

tustvold commented Apr 29, 2024

If you wish to just add a strongly typed version of StructBuilder, that would be most welcome and I think well received. The builders have not been a major focus of efforts to this point, few kernels make use of them, and I think if you wanted to make iterative improvements to them that would be most welcome. I have mostly been encouraging people to just use the array constructors.

However, the wording of your initial proposal was suggesting a more wholesale refactoring of the array abstractions and that was what I was responding to, aside from being incredibly disruptive to our downstream dependencies, there are real reasons why they are designed the way they are, which I articulated.

I would also add that taking a very confrontational/derogatory tone is unlikely to do you favors in this endeavor

@HadrienG2
Copy link
Contributor Author

HadrienG2 commented Apr 30, 2024

Please excuse my tone, it was triggered by frustration from using the current API. Instead of pointlessly venting it on you, I'll try to use it as motivation fuel to get the API to the state where I would've liked to find it.

My intent is indeed to build a strong typing facade that lies on top of the builders and array types that you already have. That is the only viable option from an API compatibility point of view, and it is also necessary because some arrow functionality cannot be exposed in a strongly typed manner (e.g. if people want to open a file whose content they know ~nothing about and query basic metadata or run datafusion SQL queries on it).

Basically, in my dream scenario, the arrow-rs user tutorial story would eventually go from the current "use the high-level array API if you can, and the low-level buffer API if you must" advice, to a three-staged "use the high-level strongly typed API if you can, the mid-level dynamically typed array API if you need more dynamism, and the low-level buffers API if you must" advice.

@HadrienG2
Copy link
Contributor Author

HadrienG2 commented May 5, 2024

So, I have a design question.

Currently, the GenericListBuilder enables someone to append a null list with items. This is done by appending a few items using the values() sub-builder, then calling append(false).

Is this considered to be a desirable usage pattern, or should I forbid it in the strongly typed API by only allowing the semantic equivalent of push(Option<&[T]>)?

A possible middle ground is to allow it, but not through the highest level API (i.e. if you really want to do it you can do it, but you will need to go through lower-level APIs to do so).

@tustvold
Copy link
Contributor

tustvold commented May 5, 2024

Its important there is a way to do it, as the spec allows it and certain kernels like nullif could create such constructs, however, I see no reason for a high level API to permit it

@gstvg
Copy link

gstvg commented May 15, 2024

Hi @HadrienG2
Are you working on array builders only, or accessing concrete arrays data too?
I've been playing with typed accessor for concrete arrays for the past few weeks, and pushed part of it as a draft #5767

My current approach is use derive macros on struct of arrays

#[derive(StructArray)]
struct MyStructArray<'a> {
   number: &'a Int32Array,
   text: &'a StringArray,
}

This would implement TryFrom<&'a StructArray> and TypedStructInnerAccessor for MyStructArray and ArrayAccessor for TypedStruct<T: TypedStructInnerAccessor> { type Item = TypedStructInnerAccessor::Item }

where TypedStructInnerAccessor::Item is a generated type with a few methods for each column:

struct MyStructArrayStructAccessorItem<'a> {
   array: TypedStructArray<&'a MyStructArray>,
   index: usize
}

impl<'a> MyStructArrayStructAccessorItem<'a> {

    #[inline]
    fn is_valid(&self) -> ::std::primitive::bool {
        self.struct_.is_valid(self.index)
    }
    #[inline]
    fn is_null(&self) -> ::std::primitive::bool {
        self.struct_.is_null(self.index)
    }
    #[inline]
    fn text(&self) -> <&'a StringArray as ::arrow::array::ArrayAccessor>::Item {
        (&self.struct_.fields().text).value(self.index)
    }
    #[inline]
    fn text_opt(
        &self,
    ) -> ::std::option::Option<
        <&'a StringArray as ::arrow::array::ArrayAccessor>::Item,
    > {
        if (&self.struct_.fields().text).is_valid(self.index) {
            Some((&self.struct_.fields().text).value(self.index))
        } else {
            None
        }
    }
    #[inline]
    fn is_text_valid(&self) -> ::std::primitive::bool {
        self.struct_.fields().text.is_valid(self.index)
    }


... the same methods for text

}

TypedStructArray def:

#[derive(Debug)]
pub struct TypedStructArray<'a, T> {
    fields: T,
    struct_: &'a StructArray,
}

impl<T> TypedStructArray<T> {
    pub fn fields(&self) -> &T {
        &self.fields
    }
}

impl<'a, T: TryFrom<&'a StructArray, Error = ArrowError>> TryFrom<&'a dyn Array>
    for TypedStructArray<T>
{
    type Error = ArrowError;

    fn try_from(value: &'a dyn Array) -> Result<Self, Self::Error> {
        let struct_ = <&'a StructArray>::try_from(value)?;

        Ok(Self {
            fields: T::try_from(&struct_)?,
            struct_: struct,
        })
    }
}

impl<'a, T: TypedStructInnerAccessor<'a>> ArrayAccessor for &'a TypedStructArray<T> {
    type Item = T::Item;

    fn value(&self, index: ::std::primitive::usize) -> Self::Item {
        assert!(
            index < self.len(),
            "Trying to access an element at index {} from a TypedStructArray of length {}",
            index,
            self.len()
        );
        unsafe { self.value_unchecked(index) }
    }

    unsafe fn value_unchecked(&self, index: ::std::primitive::usize) -> Self::Item {
        (*self, index).into()
    }
}
pub trait TypedStructInnerAccessor<'a>: std::fmt::Debug + Send + Sync + Sized + 'a {
    type Item: std::fmt::Debug + Send + Sync + From<(&'a TypedStructArray<Self>, usize)>;
}

impl<T: std::fmt::Debug + Send + Sync> Array for TypedStructArray<T> {
    fn as_any(&self) -> &dyn std::any::Any {
        self.struct_.as_any()
    }
... forward other Array methods to self.struct_
}

The reason for returning a intermediary value with methods instead of a tuple or a struct with values is to not access any memory that the user may not want

I also using the same approach for RecordBatch, Union, Map, GenericList, FixedSizeList and FixedSizeBinary

@HadrienG2
Copy link
Contributor Author

Hi @gstvg .

Right now I am working on array builders, because that seemed like a better starting point since there was already a stronger-typed API available to constrain the problem. But the intent is to eventually cover array access too.


I have strong feelings against using proc macros for this, for reasons that I explained in the OP:

The design that I am proposing purposely uses a minimal amount of procedural macros. Procedural macros are hard to write, hard to test, hard to maintain, hard for users to reason about, and hard on IDEs. In my opinion, they should therefore be a last-resort tool that one only uses for when no good alternative is available.

In retrospect, one point which I should have added, although it was somewhat covered by "hard for users to reason about", is "hard to document". With proc macros (or macros in general indeed), it is hard to explain to users what methods the type generated by a macro will have, what these methods do, what is their usage contract if they are unsafe, etc. This is not to say that the process is super-easy with sufficiently generic code, but I think it's much easier and getting there in my prototype.

For this reason, my intent is that if I provide proc macros at all, it will only be as an optional extension for support of user-defined types like custom structs and enums.


The reason for returning a intermediary value with methods instead of a tuple or a struct with values is to not access any memory that the user may not want.

My preferred way to handle this problem (and a few other issues of row-based access like memory access inefficiency caused by all the transposes and bad compiler optimizations of complex tuples) is to provide slice-based accessors. For example, the data from a builder of tuples TypedBuilder<(T, U, V)> will be accessible as tuple of slices (&[T], &[U], &[V]). In the context of builders, there will also be an inner builder accessor following the same logic (morally an impl AsMut<(&mut TypedBuilder<T>, &mut TypedBuilder<U>, &mut TypedBuilder<V>)> with extra consistency checks), so that users can call type-specific builder methods. This may or may not be needed for array access after building as well, I'll see when I get there.


Overall, my plan so far has been to submit a WIP prototype here once I'm confident that I have a design that can scale to all the current builder/array types that arrow offers. Right now I have nulls, booleans, Option<T>, primitive types are mostly done, and lists are well on the way too although I am cleaning up a few things at the moment. Next I want structs and unions. I believe that after that, all the remaining builders have APIs that can be considered a combination of the primitive, lists, and structs APIs, modulo a few extra/missing constructor parameters, and thus the builder API design that I'm proposing should be robust/stabilized enough to warrant an external review.

But if you want, I can push the code before that, so that you can have an early look at it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants