Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(api): rewrite union and intersection construction to support more operands #9194

Merged
merged 1 commit into from
May 31, 2024

Conversation

jitingxu1
Copy link
Contributor

@jitingxu1 jitingxu1 commented May 14, 2024

Description of changes

When I am working on the #9139 (comment), I had a RecursionError: maximum recursion depth exceeded while calling a Python object when union large number of tables, I found we had this similar issues before, #7124 and tobymao/sqlglot#2961.

I am trying to rewrite the union in our codebase, It recursively union the tables using a queue, the number of unions is still similar to the original implementation, but it allow us to union more tables than before (It could unblock #9139 and added a test for larger number of tables where the original implementation will fail). Not sure if there is other side effect or not. If it is not a good solution, I could close this PR. Thanks for your review.

I was thinking to open an issue for discussion first, since I had some code ready, just open this for discussion, if it is not the best practice to file a PR, please let me know.


Update from the below comment

The updated implementation generates SQL in a manner that reduces the depth of nesting, thereby delaying the hitting of the recursion error in the SQLglot implementation

sql query: Union 7 tables in the new implementation

 SELECT
*
FROM (
SELECT
  *
FROM (
  SELECT
    *
  FROM "ibis_pandas_memtable_3dqxlaaxrjbrhn6sezkq3jfey4" AS "t0"
  UNION ALL
  SELECT
    *
  FROM (
    SELECT
      *
    FROM "ibis_pandas_memtable_gjf7ximrfnat7ocq7u3snufdwy" AS "t1"
    UNION ALL
    SELECT
      *
    FROM "ibis_pandas_memtable_khxo6jdnpnbc3lz4kdpkoll6ny" AS "t2"
  ) AS "t7"
) AS "t10"
UNION ALL
SELECT
  *
FROM (
  SELECT
    *
  FROM (
    SELECT
      *
    FROM "ibis_pandas_memtable_4wg3k2vsvzf4tophotkj6n6ope" AS "t3"
    UNION ALL
    SELECT
      *
    FROM "ibis_pandas_memtable_zpp6dkndfnhjla473tjklyp3ku" AS "t4"
  ) AS "t8"
  UNION ALL
  SELECT
    *
  FROM (
    SELECT
      *
    FROM "ibis_pandas_memtable_y5w7wpkhkbgr3agcuc3ftnujai" AS "t5"
    UNION ALL
    SELECT
      *
    FROM "ibis_pandas_memtable_ebrddatk6bfgti4x7gwrkgi3gm" AS "t6"
  ) AS "t9"
) AS "t11"
) AS "t12"

SQL query for the same union from the original implementation

 SELECT
  *
FROM (
  SELECT
    *
  FROM (
    SELECT
      *
    FROM (
      SELECT
        *
      FROM (
        SELECT
          *
        FROM (
          SELECT
            *
          FROM (
            SELECT
              *
            FROM "ibis_pandas_memtable_n4qzohbx3nfjdi6crezpa3zkfu" AS "t5"
            UNION ALL
            SELECT
              *
            FROM "ibis_pandas_memtable_3dhomfxvdfb2tezjssmbdkvvam" AS "t6"
          ) AS "t7"
          UNION ALL
          SELECT
            *
          FROM "ibis_pandas_memtable_mqknkr6dr5gv7hpcrri34hxydq" AS "t4"
        ) AS "t8"
        UNION ALL
        SELECT
          *
        FROM "ibis_pandas_memtable_v5wru54hhfabxd7khavq27zruq" AS "t3"
      ) AS "t9"
      UNION ALL
      SELECT
        *
      FROM "ibis_pandas_memtable_m4hgmj33nreopoxvxwhkycicfy" AS "t2"
    ) AS "t10"
    UNION ALL
    SELECT
      *
    FROM "ibis_pandas_memtable_lilxpxko6jh4lplqw2kp6jny3y" AS "t1"
  ) AS "t11"
  UNION ALL
  SELECT
    *
  FROM "ibis_pandas_memtable_mxiz4rs5z5dohdjxror5hm6p5a" AS "t0"
) AS "t12"

Issues closed

@jitingxu1 jitingxu1 requested review from kszucs and cpcloud May 14, 2024 18:43
@kszucs
Copy link
Member

kszucs commented May 15, 2024

So currently both the ibis and sqlglot representations are deeply nested using pairwise union operations. We could change it on the ibis side to be variadic but we would still hit the recursion error in the sqlglot implementation, but at least the error wouldn't come from our side.

@jitingxu1
Copy link
Contributor Author

we would still hit the recursion error in the sqlglot implementation

The updated implementation generates SQL in a manner that reduces the depth of nesting, thereby delaying the hitting of the recursion error in the SQLglot implementation.

See the generated sql query fabove.

@jitingxu1
Copy link
Contributor Author

my current unit test uses ibis.to_sql() or execute(), it does not work under the current path, I am looking for some suggestions too, where or how to do the unit test.

ibis/expr/types/relations.py Outdated Show resolved Hide resolved
ibis/tests/expr/test_table.py Outdated Show resolved Hide resolved
ibis/expr/types/relations.py Outdated Show resolved Hide resolved
@cpcloud cpcloud added this to the 9.1 milestone May 30, 2024
@cpcloud cpcloud added the performance Issues related to ibis's performance label May 30, 2024
@cpcloud cpcloud changed the title fix: rewrite the union to enable the union of large number of tables perf(api): rewrite the union to enable the union of large number of tables May 30, 2024
@cpcloud cpcloud force-pushed the rewrite-union branch 5 times, most recently from 3e6d064 to cfbd7e5 Compare May 31, 2024 13:46
@cpcloud cpcloud changed the title perf(api): rewrite the union to enable the union of large number of tables perf(api): rewrite union and intersection construction to enable a larger number of tables May 31, 2024
@cpcloud cpcloud changed the title perf(api): rewrite union and intersection construction to enable a larger number of tables perf(api): rewrite union and intersection construction to support more operands May 31, 2024
@cpcloud
Copy link
Member

cpcloud commented May 31, 2024

Ok, so the new compile benchmark simply cannot be run on main, because it hits a recursion error.

After this PR, the benchmarks look like this:

-------------------------------------- benchmark 'test_large_union_compile': 1 tests --------------------------------------
Name (time in ms)                 Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------
test_large_union_compile     807.3274  871.2054  825.3349  26.0861  817.4346  21.9607       1;1  1.2116       5           1
---------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'test_large_union_construct': 2 tests ------------------------------------------------------------------------------
Name (time in ms)                                 Min                Max               Mean            StdDev             Median               IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_large_union_construct (0001_4c0ba2c)     79.3308 (1.00)     83.1229 (1.00)     80.6946 (1.00)     1.1855 (1.0)      80.0964 (1.0)      1.7544 (1.0)           4;0  12.3924 (1.00)         13           1
test_large_union_construct (NOW)              79.0677 (1.0)      82.8776 (1.0)      80.6321 (1.0)      1.2461 (1.05)     80.2037 (1.00)     1.8576 (1.06)          4;0  12.4020 (1.0)          12           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

Merging!

@cpcloud cpcloud merged commit 5d7aa55 into ibis-project:main May 31, 2024
74 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Issues related to ibis's performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants