Speed up SQL IN using SCALAR_IN_ARRAY. #16388

gianm · 2024-05-03T21:54:57Z

Main changes:

- DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of the IN is above inFunctionThreshold. The default value of inFunctionThreshold is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.
- SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular expression, when the size of the SEARCH is above inFunctionExprThreshold. The default value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.
- ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is greater than inFunctionThreshold.
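
The threshold behavior described in the main changes can be illustrated with a small sketch. This is not Druid's code: the class `InRewriteSketch` and the method `renderIn` are hypothetical, the threshold is passed in directly instead of being read from the query context, and the output syntax `SCALAR_IN_ARRAY(expr, ARRAY[...])` is assumed for illustration. The comparison uses `>`, matching the validator condition quoted in the review thread.

```java
import java.util.List;

final class InRewriteSketch {
  // Hypothetical helper (not a Druid API): renders the predicate for
  // "column IN (literals)", switching forms when the list is large.
  static String renderIn(String column, List<String> literals, int inFunctionThreshold) {
    String joined = String.join(", ", literals);
    if (literals.size() > inFunctionThreshold) {
      // Above the threshold: a single function call over an array literal.
      return "SCALAR_IN_ARRAY(" + column + ", ARRAY[" + joined + "])";
    }
    // At or below the threshold: keep the regular IN form.
    return column + " IN (" + joined + ")";
  }

  public static void main(String[] args) {
    List<String> values = List.of("'abc'", "'def'", "'ghi'");
    System.out.println(renderIn("dim1", values, 2));                 // 3 > 2, so rewritten
    System.out.println(renderIn("dim1", values, 100));               // 3 <= 100, so kept as IN
    System.out.println(renderIn("dim1", values, Integer.MAX_VALUE)); // rewrite never triggers
  }
}
```

Because no list can hold more than Integer.MAX_VALUE elements, setting the threshold to that value means the rewrite never triggers, which is why it restores the prior behavior.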

Benchmarks follow. Overall, planning for large IN is much faster. Two new benchmarks are marked DNF on master: I gave up and canceled them after they ran for a few minutes. Those same test cases completed in about 100 ms each with the patch.

InPlanningBenchmark
===================

inClauseLiteralsCount = 1000
inSubQueryThreshold = 2147483647
rowsPerSegment = 100

## master

Benchmark                                                        Score    Error  Units
InPlanningBenchmark.queryEqualOrInSql                          734.148 ± 46.319  ms/op
InPlanningBenchmark.queryInSql                                 243.272 ± 59.093  ms/op
InPlanningBenchmark.queryJoinEqualOrInSql                      758.371 ± 60.192  ms/op
InPlanningBenchmark.queryMultiEqualOrInSql                     739.495 ± 21.526  ms/op
InPlanningBenchmark.queryStringFunctionInSql                   484.096 ± 46.358  ms/op
InPlanningBenchmark.queryStringFunctionIsNotNullAndNotInSql        DNF
InPlanningBenchmark.queryStringFunctionIsNullOrInSql               DNF

## patch

Benchmark                                                        Score    Error  Units
InPlanningBenchmark.queryEqualOrInSql                           27.063 ±  2.291  ms/op
InPlanningBenchmark.queryInSql                                  24.686 ±  2.113  ms/op
InPlanningBenchmark.queryJoinEqualOrInSql                       29.158 ±  4.165  ms/op
InPlanningBenchmark.queryMultiEqualOrInSql                      29.845 ±  2.914  ms/op
InPlanningBenchmark.queryStringFunctionInSql                    92.489 ±  6.070  ms/op
InPlanningBenchmark.queryStringFunctionIsNotNullAndNotInSql    104.064 ± 31.440  ms/op
InPlanningBenchmark.queryStringFunctionIsNullOrInSql           100.475 ±  9.404  ms/op

SqlReverseLookupBenchmark
=========================

numKeys = 5000000
keysPerValue = 5000
lookupType = immutable

## master

Benchmark                                                        Score     Error  Units
SqlReverseLookupBenchmark.planEquals                           214.932 ±   5.827  ms/op
SqlReverseLookupBenchmark.planEqualsInsideAndOutsideCase      1613.542 ± 182.853  ms/op
SqlReverseLookupBenchmark.planNotEquals                        224.494 ±  19.920  ms/op

## patch

Benchmark                                                        Score     Error  Units
SqlReverseLookupBenchmark.planEquals                            26.214 ±   1.315  ms/op
SqlReverseLookupBenchmark.planEqualsInsideAndOutsideCase       317.464 ±  19.836  ms/op
SqlReverseLookupBenchmark.planNotEquals                         27.020 ±   1.694  ms/op
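
To read the tables in relative terms, the speedup for a benchmark is its master score divided by its patch score. The sketch below computes that ratio for a few rows, with the scores copied from the tables above. The class name `SpeedupSketch` is illustrative, and the error margins are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SpeedupSketch {
  // Ratio of planning time on master to planning time with the patch.
  static double speedup(double masterMs, double patchMs) {
    return masterMs / patchMs;
  }

  public static void main(String[] args) {
    // Each entry holds {master ms/op, patch ms/op}, taken from the tables above.
    Map<String, double[]> rows = new LinkedHashMap<>();
    rows.put("InPlanningBenchmark.queryEqualOrInSql", new double[] {734.148, 27.063});
    rows.put("InPlanningBenchmark.queryInSql", new double[] {243.272, 24.686});
    rows.put("InPlanningBenchmark.queryStringFunctionInSql", new double[] {484.096, 92.489});
    rows.put("SqlReverseLookupBenchmark.planEquals", new double[] {214.932, 26.214});
    rows.put("SqlReverseLookupBenchmark.planEqualsInsideAndOutsideCase", new double[] {1613.542, 317.464});

    for (Map.Entry<String, double[]> row : rows.entrySet()) {
      double[] scores = row.getValue();
      System.out.printf("%-58s %.1fx%n", row.getKey(), speedup(scores[0], scores[1]));
    }
  }
}
```

The two DNF rows are left out because they have no master score to divide by.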


asdf2014

Overall LGTM. We also need to replace || with <code>||</code> in the Markdown table so that it displays correctly.

docs/querying/sql-query-context.md

Co-authored-by: Benedict Jin <asdf2014@apache.org>

kgyrtkirk · 2024-05-09T15:11:56Z

sql/src/main/java/org/apache/druid/sql/calcite/planner/DruidSqlValidator.java

+      if (valuesNode.size() > plannerContext.queryContext().getInFunctionThreshold()
+          && valuesNode.stream().allMatch(node -> node.getKind() == SqlKind.LITERAL && !SqlUtil.isNull(node))) {


Why not handle mixed cases as well? The literals could be handled with this, while the other problematic operands are left outside in an OR.
The NULL case would also be less problematic, as those operands would be left outside as well.

Or is there something wrong with:
x IN (1,2,3,y,null) => DRUID_IN(x,[1,2,3]) OR x = y OR x = null

Extending this to split the call up into multiple calls would add complexity, and I was trying to keep the logic simple. I figured it would not be common to include NULL or nonliterals in the IN.

sure - it can be added later if needed!
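
For reference, the split suggested in this thread could look roughly like the sketch below. It is not part of this patch, which only rewrites an IN whose operands are all non-null literals. The `Operand` class and `MixedInSketch` are hypothetical stand-ins for Calcite's `SqlNode` handling, and the output uses `SCALAR_IN_ARRAY` with an assumed `ARRAY[...]` syntax in place of the `DRUID_IN` shorthand. The NULL operand is emitted verbatim as `x = NULL`, as in the example above; the sketch does not model SQL three-valued logic.

```java
import java.util.ArrayList;
import java.util.List;

final class MixedInSketch {
  // Hypothetical operand model; the real validator works on Calcite SqlNode objects.
  static final class Operand {
    final String sql;
    final boolean literal;
    final boolean nullValue;

    Operand(String sql, boolean literal, boolean nullValue) {
      this.sql = sql;
      this.literal = literal;
      this.nullValue = nullValue;
    }
  }

  // Groups non-null literals into one SCALAR_IN_ARRAY call and leaves every
  // other operand as its own equality, all joined with OR.
  // Returns an empty string for an empty operand list.
  static String rewrite(String column, List<Operand> operands) {
    List<String> literals = new ArrayList<>();
    List<String> others = new ArrayList<>();
    for (Operand op : operands) {
      if (op.literal && !op.nullValue) {
        literals.add(op.sql);
      } else {
        others.add(column + " = " + op.sql);
      }
    }
    List<String> terms = new ArrayList<>();
    if (!literals.isEmpty()) {
      terms.add("SCALAR_IN_ARRAY(" + column + ", ARRAY[" + String.join(", ", literals) + "])");
    }
    terms.addAll(others);
    return String.join(" OR ", terms);
  }

  public static void main(String[] args) {
    List<Operand> operands = List.of(
        new Operand("1", true, false),
        new Operand("2", true, false),
        new Operand("3", true, false),
        new Operand("y", false, false),
        new Operand("NULL", true, true));
    System.out.println(rewrite("x", operands));
  }
}
```

The extra bookkeeping for the two groups is the added complexity mentioned in the reply above.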

kgyrtkirk · 2024-05-09T15:19:21Z

sql/src/main/java/org/apache/druid/sql/calcite/rule/ReverseLookupRule.java

              reverseLookupKey.negate,
+
+              // Use regular equals, or SCALAR_IN_ARRAY, depending on inFunctionThreshold.
+              reversedMatchValues.size() >= plannerContext.queryContext().getInFunctionThreshold(),


I wonder if it would be simpler to pass plannerContext instead of inFunctionThreshold, and let this logic live inside makeIn.

Ah, it's like this because different usages of this method use different thresholds. Sometimes it's the inFunctionThreshold, sometimes it's the inFunctionExprThreshold.

kgyrtkirk · 2024-05-09T15:27:12Z

sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java

+    cannotVectorize();
+
+    testQuery(
+        "SELECT dim1 NOT IN ('abc', 'def', 'ghi') AND dim1 < 'zzz', COUNT(*)\n"


I think it would be more interesting to have these tests apply an inequality that could filter out some of the IN literals.

Cool idea. I added a test for this as well: testNotInOrEqualToOneOfThemExpression.

kgyrtkirk

It seems like something odd has happened to your branch; there are some pom.xml changes.

gianm · 2024-05-13T17:10:29Z

I think something got messed up when I pulled the commit from GitHub itself: 492c80c.

I just fixed it up and force pushed. The only change since the original patch is the new test testNotInOrEqualToOneOfThemExpression.

gianm · 2024-05-14T08:11:22Z

@asdf2014 thanks for reviewing!

@kgyrtkirk thanks as well! please let me know if you have any additional comments; if not I'll merge this.

kgyrtkirk

no more comments/questions etc :)

github-actions bot added Area - Documentation Area - Querying labels May 3, 2024

gianm added 3 commits May 3, 2024 15:20

Revert test.

11940ae

Merge branch 'master' into sql-use-scalar-in-array

86bff63

Additional coverage.

d95ef2c

asdf2014 added the Performance label May 9, 2024

asdf2014 approved these changes May 9, 2024

Update docs/querying/sql-query-context.md

492c80c

Co-authored-by: Benedict Jin <asdf2014@apache.org>

kgyrtkirk reviewed May 9, 2024

github-actions bot added Area - Batch Ingestion Area - Dependencies Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels May 10, 2024

kgyrtkirk reviewed May 13, 2024

gianm added 2 commits May 13, 2024 10:08

New test.

7d63105

Merge branch 'master' into sql-use-scalar-in-array

e44a389

gianm force-pushed the sql-use-scalar-in-array branch from 6674c35 to e44a389 Compare May 13, 2024 17:09

kgyrtkirk approved these changes May 14, 2024

gianm merged commit 72432c2 into apache:master May 14, 2024
87 checks passed

gianm deleted the sql-use-scalar-in-array branch May 14, 2024 15:09
