Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Aggregation Functions with GROUPS Clause in Window Functions #4474

Open
javier opened this issue May 6, 2024 · 0 comments
Open

Support Aggregation Functions with GROUPS Clause in Window Functions #4474

javier opened this issue May 6, 2024 · 0 comments
Labels
New feature Feature requests

Comments

@javier
Copy link
Contributor

javier commented May 6, 2024

Is your feature request related to a problem?

Background

Currently, the GROUPS window frame specification is implemented in QuestDB but lacks support for aggregation functions like AVG and SUM. As discussed, this feature aligns with PostgreSQL semantics, where GROUPS provides a unique way to define window frames based on logical groups rather than individual rows.

Problem Statement

In its current state, the GROUPS clause is implemented but lacks support for aggregation functions such as AVG and SUM, which limits its usefulness. Without this support, the GROUPS keyword only complicates the code without offering any unique benefits over ROWS.

Expected Behavior

When the GROUPS window frame specification is used, it should support aggregate functions to provide meaningful results.

References

  • Slack conversation regarding GROUPS support: [link to conversation]

Describe the solution you'd like.

Example Implementation

Consider the following data for user_scores:

CREATE TABLE user_scores (
    participant VARCHAR(10),
    age INT,
    score INT
);

INSERT INTO user_scores (participant, age, score)
VALUES 
    ('a', 40, 50),
    ('b', 50, 35),
    ('c', 34, 47),
    ('d', 50, 10),
    ('e', 33, 5),
    ('f', 40, 11);

Query Using GROUPS:

SELECT *,
    SUM(score) OVER (
        ORDER BY age
        GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) AS sum_groups
FROM user_scores
ORDER BY age;

Expected Output:

| participant | age | score | sum_groups |
|-------------|-----|-------|------------|
| e           | 33  | 5     | 52         |
| c           | 34  | 47    | 113        |
| a           | 40  | 50    | 153        |
| f           | 40  | 11    | 153        |
| b           | 50  | 35    | 106        |
| d           | 50  | 10    | 106        |

Comparison with RANGE and ROWS

To understand how GROUPS compares to other window frame specifications, here are the equivalent queries using RANGE and ROWS.

Query Using RANGE:

SELECT *,
    SUM(score) OVER (
        ORDER BY age
        RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) AS sum_range
FROM user_scores
ORDER BY age;

Expected Output:

| participant | age | score | sum_range |
|-------------|-----|-------|-----------|
| e           | 33  | 5     | 52        |
| c           | 34  | 47    | 52        |
| a           | 40  | 50    | 61        |
| f           | 40  | 11    | 61        |
| b           | 50  | 35    | 45        |
| d           | 50  | 10    | 45        |

Query Using ROWS:

SELECT *,
    SUM(score) OVER (
        ORDER BY age
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) AS sum_rows
FROM user_scores
ORDER BY age;

Expected Output:

| participant | age | score | sum_rows |
|-------------|-----|-------|----------|
| e           | 33  | 5     | 52       |
| c           | 34  | 47    | 87       |
| a           | 40  | 50    | 108      |
| f           | 40  | 11    | 96       |
| b           | 50  | 35    | 96       |
| d           | 50  | 10    | 45       |

The explanation here is…

  • When using ROWS, it is very straight forward. I look before/after as many rows as I am told.
  • When using RANGE, I look at as many rows before/after as the range says, but.. interestingly… the RANGE needs to be consecutive. So im the case of rows for age 33 and 34 it is adding them together, as 33 and 34 are consecutive integers, but when aggregating for rows with ages 40 and 50 it is not passing the boundaries as the difference between 35 and 40, or between 40 and 50 is higher than 1
  • When using GROUPS, it is adding to the frame all the score values for all the columns in the group before and after, so for columns in the age 50, it will be adding the scores of the two values in the 40 group. For the rows in group 40, it is adding all the values for group 34 and group 50

Conclusion

  • GROUPS allows logical grouping based on ordered values, offering more meaningful aggregate calculations compared to ROWS and RANGE in some use cases.
  • The current implementation should be enhanced to support aggregation functions like AVG and SUM fully.

Describe alternatives you've considered.

No response

Full Name:

Javier Ramirez

Affiliation:

QuestDB

Additional context

This issue was first raised on public slack https://questdb.slack.com/archives/C0394MS0FMF/p1714323610008919

@javier javier added the New feature Feature requests label May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
New feature Feature requests
Projects
None yet
Development

No branches or pull requests

1 participant