Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issue when using a DuckDb Macro #4150

Open
2 tasks done
maelp opened this issue Jan 30, 2024 · 21 comments
Open
2 tasks done

issue when using a DuckDb Macro #4150

maelp opened this issue Jan 30, 2024 · 21 comments
Labels
bug Invalid compiler output or panic

Comments

@maelp
Copy link

maelp commented Jan 30, 2024

What happened?

I have a macro defined in DuckDb to access Parquet files on s3

CREATE MACRO cse(year, weekA, weekB) AS TABLE SELECT * FROM read_parquet(list_transform(
            generate_series(weekA, weekB),
            week -> format('s3://${BUCKET_PATH}/year={0}/week={1}/0.parquet', year, week)
        ));

in SQL I can do

SELECT count(*) FROM cse(2024, 1, 5);

but if I try to compile the following PRQL

from cse(2024,1,5)
aggregate {
   count this
}

it gives an error

[{"kind":"Error","code":null,"reason":"unexpected , while parsing function call","hints":[],"span":"1:13-14","display":"Error: \n   ╭─[:1:14]\n   │\n 1 │ from cse(2024, 1, 5)\n   │              ┬  \n   │              ╰── unexpected , while parsing function call\n───╯\n","location":{"start":[0,13],"end":[0,14]}}]}

PRQL input

see above

SQL output

See above

Expected SQL output

No response

MVCE confirmation

  • Minimal example
  • New issue

Anything else?

No response

@maelp maelp added the bug Invalid compiler output or panic label Jan 30, 2024
@maelp
Copy link
Author

maelp commented Jan 30, 2024

Right now I seem to be "forced" to use an s-string which is not very neat, eg

from s"SELECT * FROM cse(2024, 1, 4)"
aggregate {
   count this
}

or something, is there a better way?

@maelp
Copy link
Author

maelp commented Jan 30, 2024

similarly if I have a field with a "reserved name" like "timestamp" I cannot directly use it, I need to use s"timestamp", for instance

aggregate {
   min_time = min s"timestamp" -- using "min timestamp" would generate an error because "timestamp" is a type
}

isn't there a way to see that what follows is necessarily a value, and should be coerced to a value?

@max-sixty
Copy link
Member

Hi @maelp , good questions, thanks for opening an issue.

Check out https://prql-lang.org/book/reference/syntax/keywords.html — these are solvable with backticks and this. respectively:

from `cse(2024,1,5)`
aggregate {
   sum this.time
}
SELECT
  COALESCE(SUM(time), 0)
FROM
  "cse(2024,1,5)"

-- Generated by PRQL compiler version:0.11.1 (https://prql-lang.org)

Does that make sense?

@maelp
Copy link
Author

maelp commented Jan 30, 2024

Thanks! However if I use this DuckDB MACRO

CREATE MACRO cse(year, weekA, weekB) AS TABLE SELECT * FROM read_parquet(list_transform(
            generate_series(weekA, weekB),
            week -> format('s3://${BUCKET_PATH}/year={0}/week={1}/0.parquet', year, week)
        ));

I think it doesn't create a "table" per-se, but should be used as SELECT * FROM cse(...)

do you think I should instead create a table? not sure if I can do this as a macro / function?

When I use your code I get the error:

Catalog Error: Table with name recent_cse(4) does not exist!

@maelp
Copy link
Author

maelp commented Jan 30, 2024

The generated SQL is this

WITH table_0 AS (
  SELECT
    bms_id,
    GREATEST(value.temperature_mosfet, value.temperature_ic) AS _expr_0,
    GREATEST(value.temperature_front, value.temperature_back) AS _expr_1,
    LEAST(value.temperature_front, value.temperature_back) AS _expr_2
  FROM
    "recent_cse(4)"
)
SELECT
  bms_id,
  MIN(_expr_2) AS min_temp_cells,
  MAX(_expr_1) AS max_temp_cells,
  MAX(_expr_0) AS max_temp_bms,
  COUNT(*) AS count
FROM
  table_0
WHERE
  _expr_2 < -10
  OR _expr_1 > 40
  OR _expr_0 > 70
GROUP BY
  bms_id

I guess it should not use the " character because I want the macro to be run in duckdb no?

@maelp
Copy link
Author

maelp commented Jan 30, 2024

I verified that if I run the same query without the quotes, eg not "cse" but directly cse it works

eg

WITH table_0 AS (
  SELECT
    bms_id,
    GREATEST(value.temperature_mosfet, value.temperature_ic) AS _expr_0,
    GREATEST(value.temperature_front, value.temperature_back) AS _expr_1,
    LEAST(value.temperature_front, value.temperature_back) AS _expr_2
  FROM
    recent_cse(4)
)
SELECT
  bms_id,
  MIN(_expr_2) AS min_temp_cells,
  MAX(_expr_1) AS max_temp_cells,
  MAX(_expr_0) AS max_temp_bms,
  COUNT(*) AS count
FROM
  table_0
WHERE
  _expr_2 < -10
  OR _expr_1 > 40
  OR _expr_0 > 70
GROUP BY
  bms_id

is there a way to generate this simply from PRQL?

@maelp
Copy link
Author

maelp commented Jan 30, 2024

And if I try

from s"recent_cse(4)"
derive {
  min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
  max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
  max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
}
filter min_temp_cells < -10 || max_temp_cells > 40 || max_temp_bms > 70
group bms_id (
  aggregate {
    min_temp_cells = min min_temp_cells,
    max_temp_cells = max max_temp_cells,
    max_temp_bms = max max_temp_bms,
    count = count this,
  }
)```

it says

s-strings representing a table must start with SELECT

@max-sixty
Copy link
Member

Ah, so we want literally FROM cse(2024,1,5) and not the quoted FROM "cse(2024,1,5)".

In that case, I think we do need to use the from s"SELECT * FROM cse(2024, 1, 4)", since currently PRQL expects an identifier.

We do have prior art for something more with the read_parquet-like functions; so it's not out of the question to do something similar.


This probably isn't going to make this perfect immediately, but it would be possible to make a function so the inline call were elegant:

prql target:sql.duckdb

let cse = x y z -> s"select * from cse({x},{y},{z})"

from (cse 1 2 3)
derive x = 5
WITH table_0 AS (
  SELECT
    *
  from
    cse(1, 2, 3)
)
SELECT
  *,
  5 AS x
FROM
  table_0

-- Generated by PRQL compiler version:0.11.1 (https://prql-lang.org)

@maelp
Copy link
Author

maelp commented Jan 30, 2024

would be interesting if we could have something like this! Or allow macros and duckdb functions in PRQL or something

@maelp
Copy link
Author

maelp commented Jan 30, 2024

BTW is it possible to do a multi-line PRQL function (this does not seem to be shown in the documentation)

I'm trying to do a helper function, then call it like this

let checkTemps = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (from (recent_cse numWeeks)
    derive {
      min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
      max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
      max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
    }
    filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
    group bms_id (
      aggregate {
        min_temp_cells = min min_temp_cells,
        max_temp_cells = max max_temp_cells,
        max_temp_bms = max max_temp_bms,
        count = count this,
      }
    )
)

checkTemps 4 -10 40 70

but it doesn't work...

@max-sixty
Copy link
Member

The -10 needs to be parenthesized...

let checkTemps = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (
from (recent_cse)
    derive {
      min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
      max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
      max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
    }
    filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
    group bms_id (
      aggregate {
        min_temp_cells = min min_temp_cells,
        max_temp_cells = max max_temp_cells,
        max_temp_bms = max max_temp_bms,
        count = count this,
      }
    )
)

checkTemps 4 (-10) 40 70

...works!

(also I changed (recent_cse numWeeks) as that's not a function in the example)

Parentheses can be a bit confusing in some limited circumstances — explained here: https://prql-lang.org/book/reference/syntax/operators.html#parentheses. The main thing we can do is to make the error messages much much better.

@max-sixty
Copy link
Member

would be interesting if we could have something like this! Or allow macros and duckdb functions in PRQL or something

We could have something very very simple like duckdb_macro cse 1 2 3. It's not elegant.

How do you / others think about macros vs functions? To the extent they're overlapping, I would deprioritize work here — having a boilerplate function for each macro doesn't seem too bad. To the extent they'll be coexisting lots, we could think about improving the interface.

@maelp
Copy link
Author

maelp commented Jan 31, 2024

Would be helpful to have a better help message indeed! I thought the issue was something else

yes the boilerplate can work I agree, so not a big priority. It's just that I want to be able to do queries both in SQL and PRQL so I'd like to have macros available in both to avoid duplicating code

would it be possible to rewrite that macro in pure PRQL (eg do you have access to duckdb generate_series and list_transform in PRQL? possibly in order to have those available, having something like a ::duckdb:generate_series and ::duckdb:list_transform as tokens (or a kind of PRQL "module" that we could import like "import duckdb" and then "duckdb.generate_series", "duckdb.list_transform" and "duckdb.call_macro" could be useful?)

@eitsupi
Copy link
Member

eitsupi commented Jan 31, 2024

Maybe you want to see this.
https://eitsupi.github.io/querying-with-prql/method_chaining/

@maelp
Copy link
Author

maelp commented Jan 31, 2024

Sorry to use this issue to ask questions but I find that the documentation doesn't show all the power of PRQL yet, for instance: is there a way to do a ternary operation or an if / then / else in PRQL?

I'd like to do something like

select {
  result = someTemp < -10 ? "below" : "above"
}

is that possible?

@eitsupi
Copy link
Member

eitsupi commented Jan 31, 2024

@maelp
Copy link
Author

maelp commented Jan 31, 2024

Thanks!

Last question (and small "bug report")

when I use this it works fine

let check_temperatures = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (
  from (recent_cse numWeeks)
  derive {
    min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
    max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
    max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
  }
  filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
  group bms_id (
    aggregate {
      min_temp_cells = min min_temp_cells,
      max_temp_cells = max max_temp_cells,
      max_temp_bms = max max_temp_bms,
      count = count this,
    }
  )
)

But if I'd like to "rephrase" the min_temp, max_temp, etc at the end to only display if they are above / under their respective threshold it fails, and the error message is weird (it complains about the "->" in the function definition)

let check_temperatures = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (
  from (recent_cse numWeeks)
  derive {
    min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
    max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
    max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
  }
  filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
  group bms_id (
    aggregate {
      min_temp_cells = min min_temp_cells,
      max_temp_cells = max max_temp_cells,
      max_temp_bms = max max_temp_bms,
      count = count this,
    }
    derive {
      min_temp_cells = case [
        min_temp_cells <= minCellTemp => min_temp_cells
        true => ""
      ],
      max_temp_cells = case [
        max_temp_cells >= maxCellTemp => max_temp_cells
        true => ""
      ],
      max_temp_cells = case [
        max_temp_bms >= maxBMSTemp => max_temp_bms
        true => ""
      ],
    }
  )
)

(another small nitpick: why not using the same "->" for case and function, and introducing another symbol like "=>" ?)

@eitsupi
Copy link
Member

eitsupi commented Jan 31, 2024

I think you forgot the comma in arrays.

case [
  foo # bad line
  bar
]

@maelp
Copy link
Author

maelp commented Jan 31, 2024

ah indeed you're right, but the error message is confusing, @max-sixty would there be a way to show the "correct" error in those cases?

@max-sixty
Copy link
Member

@max-sixty would there be a way to show the "correct" error in those cases?

It's definitely possible. I'd say it's harder than I thought it would be when I originally proposed terser syntax...

For example, in that case we need to realize we're in a list, possibly attempt to execute a function bar with an argument foo & realize foo isn't a function — then add those up to say "maybe foo should have a trailing comma, and is part of a list". And we don't have column names atm.

I would really like to focus on this, and think this will be the focus of our GSOC application.

@max-sixty
Copy link
Member

(another small nitpick: why not using the same "->" for case and function, and introducing another symbol like "=>" ?)

FYI there was some discussion on this in a past issue... I don't think anyone had a really strong view, we ended up favoring consistency with other langs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Invalid compiler output or panic
Projects
None yet
Development

No branches or pull requests

3 participants