Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up SQL IN using SCALAR_IN_ARRAY. #16388

Merged
merged 7 commits into from May 14, 2024

Conversation

gianm
Copy link
Contributor

@gianm gianm commented May 3, 2024

Main changes:

  1. DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of
    the IN is above inFunctionThreshold. The default value of inFunctionThreshold
    is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.

  2. SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular
    expression, when the size of the SEARCH is above inFunctionExprThreshold. The default
    value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting
    it to Integer.MAX_VALUE.

  3. ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is
    greater than inFunctionThreshold.

Benchmarks follow. Overall planning for large IN is much faster. Two new ones are marked with DNF on master, where I gave up and canceled them after they ran for a few minutes. Those same test cases completed in 100ms each with the patch.

InPlanningBenchmark
===================

inClauseLiteralsCount = 1000
inSubQueryThreshold = 2147483647
rowsPerSegment = 100

## master

Benchmark                                                        Score    Error  Units
InPlanningBenchmark.queryEqualOrInSql                          734.148 ± 46.319  ms/op
InPlanningBenchmark.queryInSql                                 243.272 ± 59.093  ms/op
InPlanningBenchmark.queryJoinEqualOrInSql                      758.371 ± 60.192  ms/op
InPlanningBenchmark.queryMultiEqualOrInSql                     739.495 ± 21.526  ms/op
InPlanningBenchmark.queryStringFunctionInSql                   484.096 ± 46.358  ms/op
InPlanningBenchmark.queryStringFunctionIsNotNullAndNotInSql        DNF
InPlanningBenchmark.queryStringFunctionIsNullOrInSql               DNF

## patch

Benchmark                                                        Score    Error  Units
InPlanningBenchmark.queryEqualOrInSql                           27.063 ±  2.291  ms/op
InPlanningBenchmark.queryInSql                                  24.686 ±  2.113  ms/op
InPlanningBenchmark.queryJoinEqualOrInSql                       29.158 ±  4.165  ms/op
InPlanningBenchmark.queryMultiEqualOrInSql                      29.845 ±  2.914  ms/op
InPlanningBenchmark.queryStringFunctionInSql                    92.489 ±  6.070  ms/op
InPlanningBenchmark.queryStringFunctionIsNotNullAndNotInSql    104.064 ± 31.440  ms/op
InPlanningBenchmark.queryStringFunctionIsNullOrInSql           100.475 ±  9.404  ms/op

SqlReverseLookupBenchmark
=========================

numKeys = 5000000
keysPerValue = 5000
lookupType = immutable

## master

Benchmark                                                        Score     Error  Units
SqlReverseLookupBenchmark.planEquals                           214.932 ±   5.827  ms/op
SqlReverseLookupBenchmark.planEqualsInsideAndOutsideCase      1613.542 ± 182.853  ms/op
SqlReverseLookupBenchmark.planNotEquals                        224.494 ±  19.920  ms/op

## patch

Benchmark                                                        Score     Error  Units
SqlReverseLookupBenchmark.planEquals                            26.214 ±   1.315  ms/op
SqlReverseLookupBenchmark.planEqualsInsideAndOutsideCase       317.464 ±  19.836  ms/op
SqlReverseLookupBenchmark.planNotEquals                         27.020 ±   1.694  ms/op

Main changes:

1) DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of
   the IN is above inFunctionThreshold. The default value of inFunctionThreshold
   is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.

2) SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular
   expression, when the size of the SEARCH is above inFunctionExprThreshold. The default
   value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting
   it to Integer.MAX_VALUE.

3) ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is
   greater than inFunctionThreshold.
Copy link
Member

@asdf2014 asdf2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, also need to replace || with <code>&#124;&#124;</code> in Markdown table to display it correctly

docs/querying/sql-query-context.md Outdated Show resolved Hide resolved
Co-authored-by: Benedict Jin <asdf2014@apache.org>
Comment on lines +813 to +814
if (valuesNode.size() > plannerContext.queryContext().getInFunctionThreshold()
&& valuesNode.stream().allMatch(node -> node.getKind() == SqlKind.LITERAL && !SqlUtil.isNull(node))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not handle mixed versions as well? literals could be handled with this - but leave the other problematic stuff outside in an OR
the NULL case would be also less problematic - as those will be left outside as well...

or there is something wrong with:
x IN (1,2,3,y,null) => DRUID_IN(x,[1,2,3]) OR x = y OR x = null

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extending this to split the call up into multiple calls would add complexity, and I was trying to keep the logic simple. I figured it would not be common to include NULL or nonliterals in the IN.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure - it can be added later if needed!

reverseLookupKey.negate,

// Use regular equals, or SCALAR_IN_ARRAY, depending on inFunctionThreshold.
reversedMatchValues.size() >= plannerContext.queryContext().getInFunctionThreshold(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would look simpler to pass plannerContext instead or inFunctionThreshold - and let this logic live inside makeIn

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it's like this because different usages of this method use different thresholds. Sometimes it's the inFunctionThreshold, sometimes it's the inFunctionExprThreshold.

cannotVectorize();

testQuery(
"SELECT dim1 NOT IN ('abc', 'def', 'ghi') AND dim1 < 'zzz', COUNT(*)\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more interesting have these tests apply inequality which could have filtered out some IN literal(s)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool idea. I added a test for this as well: testNotInOrEqualToOneOfThemExpression.

@github-actions github-actions bot added Area - Batch Ingestion Area - Dependencies Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels May 10, 2024
Copy link
Member

@kgyrtkirk kgyrtkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like something odd have happened to you branch; there are some pom.xml changes

@gianm
Copy link
Contributor Author

gianm commented May 13, 2024

I think something got messed up when I pulled the commit from github itself: 492c80c.

I just fixed it up and force pushed. The only change since the original patch is the new test testNotInOrEqualToOneOfThemExpression.

@gianm
Copy link
Contributor Author

gianm commented May 14, 2024

@asdf2014 thanks for reviewing!

@kgyrtkirk thanks as well! please let me know if you have any additional comments; if not I'll merge this.

Copy link
Member

@kgyrtkirk kgyrtkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no more comments/questions etc :)

@gianm gianm merged commit 72432c2 into apache:master May 14, 2024
87 checks passed
@gianm gianm deleted the sql-use-scalar-in-array branch May 14, 2024 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants