New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up SQL IN using SCALAR_IN_ARRAY. #16388
Conversation
Main changes: 1) DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of the IN is above inFunctionThreshold. The default value of inFunctionThreshold is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE. 2) SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular expression, when the size of the SEARCH is above inFunctionExprThreshold. The default value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting it to Integer.MAX_VALUE. 3) ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is greater than inFunctionThreshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, also need to replace ||
with <code>||</code>
in Markdown table to display it correctly
Co-authored-by: Benedict Jin <asdf2014@apache.org>
if (valuesNode.size() > plannerContext.queryContext().getInFunctionThreshold() | ||
&& valuesNode.stream().allMatch(node -> node.getKind() == SqlKind.LITERAL && !SqlUtil.isNull(node))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not handle mixed versions as well? literals could be handled with this - but leave the other problematic stuff outside in an OR
the NULL
case would be also less problematic - as those will be left outside as well...
or there is something wrong with:
x IN (1,2,3,y,null) => DRUID_IN(x,[1,2,3]) OR x = y OR x = null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extending this to split the call up into multiple calls would add complexity, and I was trying to keep the logic simple. I figured it would not be common to include NULL
or nonliterals in the IN
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure - it can be added later if needed!
reverseLookupKey.negate, | ||
|
||
// Use regular equals, or SCALAR_IN_ARRAY, depending on inFunctionThreshold. | ||
reversedMatchValues.size() >= plannerContext.queryContext().getInFunctionThreshold(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would look simpler to pass plannerContext
instead or inFunctionThreshold
- and let this logic live inside makeIn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it's like this because different usages of this method use different thresholds. Sometimes it's the inFunctionThreshold
, sometimes it's the inFunctionExprThreshold
.
cannotVectorize(); | ||
|
||
testQuery( | ||
"SELECT dim1 NOT IN ('abc', 'def', 'ghi') AND dim1 < 'zzz', COUNT(*)\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more interesting have these tests apply inequality which could have filtered out some IN
literal(s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool idea. I added a test for this as well: testNotInOrEqualToOneOfThemExpression
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like something odd have happened to you branch; there are some pom.xml
changes
6674c35
to
e44a389
Compare
I think something got messed up when I pulled the commit from github itself: 492c80c. I just fixed it up and force pushed. The only change since the original patch is the new test |
@asdf2014 thanks for reviewing! @kgyrtkirk thanks as well! please let me know if you have any additional comments; if not I'll merge this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no more comments/questions etc :)
Main changes:
DruidSqlValidator now includes a rewrite of IN to SCALAR_IN_ARRAY, when the size of
the IN is above inFunctionThreshold. The default value of inFunctionThreshold
is 100. Users can restore the prior behavior by setting it to Integer.MAX_VALUE.
SearchOperatorConversion now generates SCALAR_IN_ARRAY when converting to a regular
expression, when the size of the SEARCH is above inFunctionExprThreshold. The default
value of inFunctionExprThreshold is 2. Users can restore the prior behavior by setting
it to Integer.MAX_VALUE.
ReverseLookupRule generates SCALAR_IN_ARRAY if the set of reverse-looked-up values is
greater than inFunctionThreshold.
Benchmarks follow. Overall planning for large
IN
is much faster. Two new ones are marked withDNF
onmaster
, where I gave up and canceled them after they ran for a few minutes. Those same test cases completed in 100ms each with the patch.