Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drop hints support #125

Open
ruslandoga opened this issue Sep 19, 2023 · 17 comments
Open

drop hints support #125

ruslandoga opened this issue Sep 19, 2023 · 17 comments

Comments

@ruslandoga
Copy link
Contributor

ruslandoga commented Sep 19, 2023

It doesn't seem like ClickHouse supports hints. In Plausible query.hints are used as a workaround to pass the SAMPLE clause.

Maybe it can be similar to input/1 helper function:

q = from t in sample("table", 0.1), select: count(t.id) * 10
# select count(t.id) * 10 from table t sample 0.1
@hkrutzer
Copy link
Contributor

I use them to add FINAL as I wrote here https://github.com/plausible/ecto_ch/pull/103/files (which I will update soon :) )

@ruslandoga
Copy link
Contributor Author

ruslandoga commented Sep 19, 2023

It doesn't seem like FINAL is a hint though. I think ClickHouse is using settings in place of hints: ClickHouse/ClickHouse#11719 (comment)

@hkrutzer
Copy link
Contributor

Indeed it's not a hint but it seems to be the only way to pass it into that query from Ecto. I don't think Ecto supports a settings clause?

@ruslandoga
Copy link
Contributor Author

It doesn't have to go into the Ecto query, settings can be passed separately from the query.

@Zarathustra2
Copy link
Contributor

I think final still has some issues because it cannot move always the where clause to prewhere. See ClickHouse/ClickHouse#31411

So I would advise to not use/promote the usage of final too much.

@hkrutzer
Copy link
Contributor

Some queries cannot be performed without FINAL but I think that is tangential to this issue.

@Zarathustra2
Copy link
Contributor

Really? Do you have an example? I am curious because I cannot think of an example maybe because I never encountered such a situation

@hkrutzer
Copy link
Contributor

When using an AggregatingMergeTree and you need the merged result for example

@Zarathustra2
Copy link
Contributor

You can use aggregating functions - sum, avg, argMaxMerge, etc. - + group by to have the same effect as your final query

@hkrutzer
Copy link
Contributor

I don't use it like that, I merge several partial records into one final record, like this (simplified):

`id` FixedString(26),
`started_at` DateTime('UTC') MATERIALIZED ULIDStringToDateTime(id),
`disconnected_at` SimpleAggregateFunction(max, DateTime('UTC')),
`user_responded_at` SimpleAggregateFunction(max, DateTime('UTC')),
`properties` SimpleAggregateFunction(groupArrayArray, Array(String))

Additional properties (in the properties column) can be added to the record at any time, and the disconnected_at and user_responded_at are inserted later as well. They are eventually merged based on the ID column.

@Zarathustra2
Copy link
Contributor

Zarathustra2 commented Sep 19, 2023

But can't you just do

select id, started_at, max(disconnected_at) as disconnected_at, max(user_responded_at) as user_responded_at, groupArrayArray(properties) as properties from <table> group by id, started_at

Maybe I am not really following. Still I don't think there is any query which you can only run with final and not as a group by + aggregation.

@hkrutzer
Copy link
Contributor

hkrutzer commented Sep 19, 2023

I don't know if I can, but why would I do it that way? That is a way longer query. Perhaps you are right and FINAL is not strictly speaking necessary but I don't see the advantage of writing a longer query.

@Zarathustra2
Copy link
Contributor

It depends on your use case. If you have a small dataset final should be fine. If you have a larger dataset final should be avoided. If you don't run into performance issues or don't care about speed then final is also fine.

@hkrutzer
Copy link
Contributor

hkrutzer commented Sep 19, 2023

If I query with a group by, I get

10 rows in set. Elapsed: 6.001 sec. Processed 33.86 million rows, 802.51 MB (5.64 million rows/s., 133.73 MB/s.)

and with the FINAL modifier:

10 rows in set. Elapsed: 0.584 sec. Processed 94.00 thousand rows, 110.76 KB (160.89 thousand rows/s., 189.56 KB/s.)

With GROUP BY, Clickhouse will probably scan every row, while FINAL can be performed in a more efficient way.

When I repeat the query it is even worse:

10 rows in set. Elapsed: 6.421 sec. Processed 33.86 million rows, 802.52 MB (5.27 million rows/s., 124.98 MB/s.)

remains.

With FINAL:

10 rows in set. Elapsed: 0.096 sec. Processed 94.01 thousand rows, 111.21 KB (982.83 thousand rows/s., 1.16 MB/s.)

So this is over 70 times slower and it looks like it will slow down further with additional records.

Group by

select conversation_id, max(disconnected_at) as disconnected_at, max(user_responded_at) as user_responded_at from conversations group by conversation_id limit 10 format Vertical

final

select conversation_id, disconnected_at, user_responded_at from conversations final limit 10 format Vertical

@Zarathustra2
Copy link
Contributor

Zarathustra2 commented Sep 19, 2023

I mean 0.584 isn't bad depending on use case but also I dont see the full query & table so cannot say whether the query without final is performant written or not (but most likely it is not)

@hkrutzer
Copy link
Contributor

If you say so. I will stop derailing this issue, I think I've made a clear case for FINAL.

@Zarathustra2
Copy link
Contributor

Zarathustra2 commented Sep 19, 2023

The case is not clear because I don't see the table definition and the queries you used that result in that huge performance diff

Edit: I am not against the PR but I am just saying that in all cases you can have at least equal performance and in most cases better performance without final modifier

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants