Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aliasing Left Side of Join Breaks Derivation #4299

Open
2 tasks done
jmsmdy opened this issue Mar 4, 2024 · 5 comments
Open
2 tasks done

Aliasing Left Side of Join Breaks Derivation #4299

jmsmdy opened this issue Mar 4, 2024 · 5 comments
Labels
bug Invalid compiler output or panic

Comments

@jmsmdy
Copy link

jmsmdy commented Mar 4, 2024

What happened?

Aliasing the left side of a join magically erases any derivations in that left side.

Without alias (seems to work):

prql target:sql.bigquery
from sales
derive {profit = sell_price - buy_price}
join side:left details (this.item_id == that.item_id)
select !{details.item_id}

With alias (doesn't work):

prql target:sql.bigquery
sales_with_profit = (
    from sales
    derive {profit = sell_price - buy_price}
)
join side:left details (this.item_id == that.item_id)
select !{details.item_id}

With the aliasing, the "profit" derivation completely disappears. That should not happen.

PRQL input

prql target:sql.bigquery
sales_with_profit = (
    from sales
    derive {profit = sell_price - buy_price}
)
join side:left details (this.item_id == that.item_id)
select !{details.item_id}

SQL output

WITH table_0 AS (
  SELECT
    *
  FROM
    sales
)
SELECT
  table_0.*,
  details.*
EXCEPT
  (item_id)
FROM
  table_0
  LEFT JOIN details ON table_0.item_id = details.item_id

-- Generated by PRQL compiler version:0.11.2 (https://prql-lang.org/)

Expected SQL output

WITH table_0 AS (
  SELECT
    *,
    sell_price - buy_price AS profit
  FROM
    sales
)
SELECT
  table_0.*
EXCEPT
  (profit),
  details.*
EXCEPT
  (item_id),
  table_0.profit
FROM
  table_0
  LEFT JOIN details ON table_0.item_id = details.item_id

MVCE confirmation

  • Minimal example
  • New issue

Anything else?

No response

@jmsmdy jmsmdy added the bug Invalid compiler output or panic label Mar 4, 2024
@max-sixty
Copy link
Member

max-sixty commented Mar 4, 2024

Thanks for the issue. Possibly this should raise a proper error message.

The fix is that the aliasing is within the from:

prql target:sql.bigquery
from sales_with_profit=sales
derive {profit = sell_price - buy_price}
join side:left details (this.item_id == that.item_id)
select !{details.item_id}
WITH table_0 AS (
  SELECT
    *,
    sell_price - buy_price AS profit
  FROM
    sales AS sales_with_profit
)
SELECT
  table_0.*
EXCEPT
  (profit),
  details.*
EXCEPT
  (item_id),
  table_0.profit
FROM
  table_0
  LEFT JOIN details ON table_0.item_id = details.item_id

-- Generated by PRQL compiler version:0.11.3 (https://prql-lang.org)

Does that make sense?


Edit: but looks like there's a different bug — the EXCEPT expressions are quite wrong. That looks like it's driven by the join; it's not affected by the aliasing. OK if I change this issue to focus on that?

@jmsmdy
Copy link
Author

jmsmdy commented Mar 4, 2024

About Raising Error

It sounds like the PRQL parser is accepting invalid PRQL, so I think raising an error would be ideal (to prevent silent failure).

About SELECT * EXCEPT

I'm actually not 100% familiar with the "SELECT * EXCEPT" syntax, so I didn't realize that wasn't valid. In any, case, the PRQL compiler is not indenting it sensibly, I believe PRQL compiler is trying to output this (exact same content, just re-indented for clarity):

WITH table_0 AS (
  SELECT
    *,
    sell_price - buy_price AS profit
  FROM
    sales AS sales_with_profit
)
SELECT
  table_0.* EXCEPT (profit),
  details.* EXCEPT (item_id),
  table_0.profit
FROM
  table_0
  LEFT JOIN details ON table_0.item_id = details.item_id

Are you trying to say it should be something like:

WITH table_0 AS (
  SELECT
    *,
    sell_price - buy_price AS profit
  FROM
    sales AS sales_with_profit
)
SELECT * EXCEPT (table_0.profit, details.item_id),
FROM
  table_0
  LEFT JOIN details ON table_0.item_id = details.item_id

?

About Documentation around Aliasing / Simulating Equijoins

BigQuery is not actually one of my ultimate targets, but rather Spark SQL (which is not yet supported). The "sql.generic" target doesn't support "SELECT * EXCEPT" syntax at all, so I am using sql.bigquery as an arbitrary intermediate target and then using sqlglot to convert to Spark SQL. "sql.generic" silently omits "select !{...}" statements without giving a warning it is doing so. But that's not really relevant to this issue.

What is relevant that you don't have support for equijoins (anymore?), so I need to drop the duplicated join columns (either from the left or the right table, depending on if it is left or right join). For dropping from the right table, you can (AFAIK) replace "details" in my example with an arbitrary PRQL pipeline then wrap it in "details = (...)" and it will continue to work, e.g.:

prql target:sql.bigquery
from sales_with_profit=sales
derive {profit = sell_price - buy_price}
join side:left details = (
    <1000 lines of PRQL code>
) (this.item_id == that.item_id)
select !{details.item_id}

This makes it especially suitable for recursive / compositional code-gen (my use case).

For dropping the column from the left side, however, there doesn't seem to be documented an analogous aliasing feature:

<2000 lines of PRQL code>
join side:right details (this.item_id == that.item_id)
select !{???.item_id}

What do you sub in for the ??? to refer to the left table in this join?

Just messing around, it appears that this will work as an alias for the left hand side of join, but I was unable to find anything in the documentation on this. Note that that does NOT work here, which makes sense as PRQL pipelines are not supposed to be passing along the entire history of all the intermediate tables to the next step in the pipeline. It is a little surprising that this does work, but I am also unsure if this can be relied upon into the future.

If it can be relied upon, it would help a lot to have some examples added to the documentation specifically demonstrating how to do right/left equijoins (removing the redundant columns in those cases). There are other more verbose ways of handling this situation (such as aliasing the join columns in one of the tables before the join, then dropping those aliased columns after), but this adds a fair amount of complexity for the reader.

EDIT: It appears that "let ..." or "into" statements might solve my particular code-gen use case (since a let statement or into can be used to alias the left side of the join).

@max-sixty
Copy link
Member

Sorry, I'm confusing things.

The EXCEPT syntax is fine, it's just the indenting that's off. I raised that in shssoichiro/sqlformat-rs#35.

Though are you sure SparkSQL supports EXCEPT here? From two minutes of searching, I only see it works as a set operation, like an INTERSECT / UNION etc.


The equijoins point is valid; that is a big advantage of using USING.

FWIW we're rethinking how name resolution works, and I would love to put USING back as part of that (though it's not on the immediate horizon).

At the moment, I think the main way out is to either use EXCEPT for dialects which support it, or know the columns beforehand and feed them into a select {foo, bar, etc}.


Does that answer your question re the except? I'm not sure how much of that comes from my incorrect claim about EXCEPT not working above...

Feel free to post more details of what you're doing, it sounds quite interesting!

@jmsmdy
Copy link
Author

jmsmdy commented Mar 5, 2024

What I am working on is probably pretty standard: just trying to write some tools to help build unit-testable data pipelines that can run in 3 different languages/dialects: (Py)Spark SQL, Presto/Trino, and Pandas. We have tables with lots of columns (sometimes 100+), so some degree of meta-programming is required just to not lose your sanity. It would also be nice to automate column lineage (which I know is in progress for PRQL). I'm thinking of PRQL as a kind of intermediate language: you have dev tools (which might make extensive use of meta-programming) written in python which generate PRQL, and then PRQL which can be compiled to run in these different environments.

PRQL is great for this purpose. Right now, I have an interface which mimics the PySpark API, but generates PRQL instead. PySpark and PRQL are surprisingly similar (e.g. there is a ".filter" method, a ".groupBy.agg()" method, etc.), and they are both pipelined. So right now I could actually go directly to PySpark without much trouble and support all the features needed, but it would be nicer to have everything go through PRQL.

Spark SQL does in fact support a syntax to drop columns. As long as you have run
SET spark.sql.parser.quotedRegexColumnNames=true, you can do the following:

SELECT
  table_0.`(item_id)?+.+`,
  table_1.*
FROM `table_1`
RIGHT JOIN `table_0`
  ON `table_1`.`item_id` = `table_0`.`item_id`;

This is using a Java-specific regex syntax for the column selector which is a little bit hard to understand, but basically says "Either you start with item_id and have one or more additional characters afterwards, or you don't start with item_id".

Whether some combination of PRQL + SQLGlot can actually automatically convert SELECT * EXCEPT to this syntax is yet to be seen (although I don't see any conceptual blockers). PySpark has a "drop(...)" method so PySpark is easier to target.

@max-sixty
Copy link
Member

Nice, thanks, v interesting to hear what you're working on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Invalid compiler output or panic
Projects
None yet
Development

No branches or pull requests

2 participants