Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windowing] Complex queries utilizing window functions become too slow to reasonably execute #161

Open
ohaibbq opened this issue Feb 5, 2024 · 1 comment · May be fixed by #169
Open

Comments

@ohaibbq
Copy link
Contributor

ohaibbq commented Feb 5, 2024

In a similar vein to #153 I tested out a complex query against the emulator and encountered the following error:

too many references to "recidiviz-bq-emulator-project_dataset_table": max 65535, reason: jobInternalError

I believe that it may be possible to rewrite the zetasqlite implementation of window functions using SQLite3's custom window function feature pending mattn/go-sqlite3#1215.

The new implementation may not require any additional subqueries to be made which would greatly reduce query complexity.

@ohaibbq
Copy link
Contributor Author

ohaibbq commented Feb 7, 2024

Hi @goccy and @totem3. I'm currently investigating how to undergo a rewrite of the windowing module.

It seems like it'd be possible to achieve an exponential performance increase for complex queries by delegating windowing to SQLite, rather than using the custom-built implementation in function_window_option.go.

When formatting window functions we add a new select statement containing the current formattedQuery, which causes some of our queries to end up being ~500,000 characters long and exceed the limit of allowed table references.

Did we re-implement function_window_option.go solely because the SQLite driver does not currently allow custom window functions to be registered? Do you foresee any issues with this approach?

Here's the difference in queries for the lead test in query_test.go. You can take a look at the throw-away implementation here.

Current implementation

WITH finishers AS (SELECT `name#1` AS `name#31`, `finish_time#2` AS `finish_time#32`, `division#3` AS `division#33`
                   FROM (/* ... */))
SELECT `name#34`, `finish_time#38`, `division#36`, `followed_by#39`
FROM (SELECT `name#34`,
             `finish_time#35`,
             `division#36`,
             `finish_time#38`,
             (SELECT zetasqlite_window_lead(`name#34`, zetasqlite_ignore_nulls(),
                                            zetasqlite_window_partition(`division#36`),
                                            zetasqlite_window_order_by(`division#36`, true),
                                            zetasqlite_window_order_by(`finish_time#35`, true),
                                            zetasqlite_window_rowid(`row_id`))
              FROM (SELECT `name#34`,
                           `finish_time#35`,
                           `division#36`,
                           zetasqlite_format_timestamp("eyJoZWFkZXIiOiJzdHJpbmciLCJib2R5IjoiJVgifQ==",
                                                       `finish_time#35`) AS `finish_time#38`
                    FROM (SELECT `name#31`        AS `name#34`,
                                 `finish_time#32` AS `finish_time#35`,
                                 `division#33`    AS `division#36` FROM `finishers`))) AS `followed_by#39`
      FROM (SELECT *, ROW_NUMBER() OVER () AS `row_id`
            FROM (SELECT `name#34`,
                         `finish_time#35`,
                         `division#36`,
                         zetasqlite_format_timestamp("eyJoZWFkZXIiOiJzdHJpbmciLCJib2R5IjoiJVgifQ==",
                                                     `finish_time#35`) AS `finish_time#38`
                  FROM (SELECT `name#31`        AS `name#34`,
                               `finish_time#32` AS `finish_time#35`,
                               `division#33`    AS `division#36` FROM `finishers`)))
      ORDER BY `division#36` COLLATE zetasqlite_collate, `finish_time#35` COLLATE zetasqlite_collate)

New implementation

WITH finishers AS (SELECT `name#1` AS `name#31`, `finish_time#2` AS `finish_time#32`, `division#3` AS `division#33`
                   FROM (/*...*/))
SELECT `name#34`, `finish_time#38`, `division#36`, `followed_by#39`
FROM (SELECT `name#34`,
             `finish_time#35`,
             `division#36`,
             `finish_time#38`,
             lead(`name#34`)
                  OVER (PARTITION BY `division#36` COLLATE zetasqlite_collate ORDER BY `division#36` ASC, `finish_time#35` ASC ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS `followed_by#39`
      FROM (SELECT `name#34`,
                   `finish_time#35`,
                   `division#36`,
                   zetasqlite_format_timestamp("eyJoZWFkZXIiOiJzdHJpbmciLCJib2R5IjoiJVgifQ==",
                                               `finish_time#35`) AS `finish_time#38`
            FROM (SELECT `name#31` AS `name#34`, `finish_time#32` AS `finish_time#35`, `division#33` AS `division#36`
                  FROM `finishers`))
      ORDER BY `division#36` COLLATE zetasqlite_collate, `finish_time#35` COLLATE zetasqlite_collate)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant