Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api): Add order_by parameter to the collect() method #9170

Open
1 task done
chelsea-lin opened this issue May 10, 2024 · 3 comments
Open
1 task done

feat(api): Add order_by parameter to the collect() method #9170

chelsea-lin opened this issue May 10, 2024 · 3 comments
Assignees
Labels
feature Features or general enhancements

Comments

@chelsea-lin
Copy link
Contributor

Is your feature request related to a problem?

The collect() method in Ibis is a convenient way to aggregate data into arrays when working with BigQuery backends. It translates to the ARRAY_AGG function in BigQuery SQL. However, there's currently no built-in way to specify an ORDER BY clause within the ARRAY_AGG aggregation. This means that the order of elements in the resulting array is not guaranteed, which can be problematic when the order of elements matters for the downstream analysis.

What is the motivation behind your request?

No response

Describe the solution you'd like

I propose adding an order_by parameter to the collect() method. This parameter would accept either:

  • A string or list of strings representing column names to order by.
  • An Ibis expression indicating the ordering.

This would allow users to explicitly control the ordering of elements within the aggregated array, making the results more predictable and useful.

What version of ibis are you running?

9.0.0

What backend(s) are you using, if any?

BigQuery

Code of Conduct

  • I agree to follow this project's Code of Conduct
@chelsea-lin chelsea-lin added the feature Features or general enhancements label May 10, 2024
@jcrist
Copy link
Member

jcrist commented May 10, 2024

Thanks for opening this! I agree that supporting ordering in collect would be useful (we're also adding an order_by kwarg to Table.distinct for similar reasons).

I'm not sure how well an order_by kwarg would work though in the existing collect() method though - given that collect is on a Column instead of a Table, having the ordering refer to columns on the parent table may feel weird.

t = ibis.table({"a": "int", "b": "int, "c": "int"})

t.a.collect(order_by=("b", "c"))  # here `b` and `c` refer to columns on the parent table of `a`
t.a.collect(order_by=(t.b, t.c))   # explicit column references also work

Typing all that out doesn't feel that odd, but AFAIK we don't have any other columnar operations that support an ordering specification, so we don't have an existing pattern to mirror here.

Edit: actually, we do have some precedent for binding args to a column method to the parent table - the where arg to all column reductions. This currently doesn't accept string arguments, only deferred operators, but given that then I think the spelling here is a logical extension.

@cpcloud
Copy link
Member

cpcloud commented May 10, 2024

group_concat is another operation where this would make sense, along with the first and last aggregate functions. I think we should add them on a case by case basis for now and do an audit of the aggs where ordering has an effect on the output.

@jcrist
Copy link
Member

jcrist commented May 10, 2024

A quick survey of backends and current ops shows those are the main 4 operations. I've taken a quick look through implementing this (needed a fun friday break) and like most things ibis it's turned up some internal cleanups we'll want to do first but I think this should be only a medium-sized lift to implement.

@jcrist jcrist changed the title feat(bigquery): Add order_by parameter to the collect() method feat(api): Add order_by parameter to the collect() method May 13, 2024
@jcrist jcrist self-assigned this May 20, 2024
gforsyth pushed a commit that referenced this issue May 22, 2024
…ss (#9222)

This is an initial refactor moving towards #9170.

Previously every backend implemented their own `_aggregate` function -
many of them copy-pasted (with slight variations) of each other.

To add a new `order_by` kwarg to `_aggregate` would require editing all
16 copies of this function, which would be _annoying_. This PR is an
attempt to centralize their implementations into a common `Gen` class.
The class takes a config flag to handle the common cases, the uncommon
cases are then handled by backend-specific subclasses.

This also extracts the `.agg` attribute out to be a class variable (err,
descriptor) rather than an instance variable.

An alternative implementation would be adding boolean flags directly to
the `SQLGlotCompiler` class, but IIRC we intentionally moved away from
those in the SQLAlchemy -> SQLGlot refactor. Could go either way here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Status: todo
Development

No branches or pull requests

3 participants