Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sql): push down ORDER BY advice for DISTINCT queries #4180

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

piotrrzysko
Copy link
Contributor

@piotrrzysko piotrrzysko commented Jan 31, 2024

Fixes #4172.

Pushing down ORDER BY advice allows selecting a scanning order that matches the order direction specified in the ORDER BY clause. Consequently, for the query from the issue, we can avoid unnecessary sorting and utilize a record cursor factory available here:

if (orderBy.get(column) == QueryModel.ORDER_DIRECTION_ASCENDING
&& recordCursorFactory.getScanDirection() == RecordCursorFactory.SCAN_DIRECTION_FORWARD) {
return recordCursorFactory;
} else if (orderBy.get(column) == ORDER_DIRECTION_DESCENDING
&& recordCursorFactory.getScanDirection() == RecordCursorFactory.SCAN_DIRECTION_BACKWARD) {
return recordCursorFactory;
}

@puzpuzpuz puzpuzpuz added SQL Issues or changes relating to SQL execution Performance Performance improvements labels Jan 31, 2024
core/src/main/java/io/questdb/griffin/SqlOptimiser.java Outdated Show resolved Hide resolved
assertMemoryLeak(() -> {
ddl("create table tab (x int, y long, ts timestamp) timestamp(ts)");

assertPlan(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be great to assert result sets, not only execution plans in these tests.

ddl("create table tab (x int, y long, ts timestamp) timestamp(ts)");

assertPlan(
"select DISTINCT ts, x, y from tab order by ts DESC",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to add another test for the select DISTINCT x, y from tab order by ts DESC case. The test should verify that we don't push down ORDER BY advice since the timestamp column is not in the SELECT DISTINCT clause.

Other candidates would be a query with positional numbers in ORDER BY (e.g. select DISTINCT ts, x, y from tab order by 1) and a SELECT DISTINCT over a sub-query (e.g. select DISTINCT ts, x, y from (tab WHERE x > y) order by ts or select DISTINCT ts, x, y from (SELECT tab.ts, tab.x, tab2.y FROM tab ASOF JOIN tab2) order by ts).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select DISTINCT x, y from tab order by ts DESC

In fact, this query is invalid; therefore, I initially skipped it (there is a test in OrderByAdviceTest that verifies a similar case). Now, I've added a test to verify that the exact query you suggested is invalid.

select DISTINCT ts, x, y from (tab WHERE x > y) order by ts DESC

This one on the other hand revealed unexpected behaviour:

Sort light
  keys: [ts desc]
    DistinctTimeSeries
      keys: ts,x,y
        Async JIT Filter workers: 1
          filter: y<x
            DataFrame
                Row forward scan
                Frame forward scan on: tab

I'd expect a backward scan.

The reason why it happens is that here:

this.aliasToColumnNameMap.put(name, name);
the columnNameToAliasMap field is not updated. Based on the rest of the code, I'd expect that the following fields are always kept in sync:

  • bottomUpColumnNames
  • bottomUpColumns
  • aliasToColumnNameMap
  • aliasToColumnMap
  • columnNameToAliasMap
  • columnAliasIndexes

I'll investigate this in the evening.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, this query is invalid

Good point! That query was indeed invalid.

I'd expect a backward scan.

Yes, it should be a backward scan. As for the copyColumnsFrom method, indeed it could be the case that it doesn't leave the maps in sync. Look forward to your further findings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @piotrrzysko

Any chances that you'll be able to finish this PR any time soon? If not, not a problem at all - I'm just checking to understand if someone from QuestDB team should pick it up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Andrei,

Based on my research, in order to finish this PR (specifically this thread), we need to solve this issue: #4197.

Additionally, Vlad mentioned to me that a fuzzy test related to HyperLogLog occasionally fails (I suspect it's related to this comment: #4083 (comment)). My plan was, when I have time, to first address the HLL problem and then return to this PR.

Generally, I'm happy to help, but I'm a bit busy right now, and I won't be able to work on any of these issues in the next two or three weeks. Please, let me know what you think about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. No prob with the time aspect: the PR will be kept open as long as needed. Needless to say, we value your contributions a lot.

As for the flaky test, relaxing the check should help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Performance improvements SQL Issues or changes relating to SQL execution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up SELECT DISTINCT + LIMIT queries involving designated timestamp column
4 participants