Defer more expressions in vectorized groupBy. #16338
base: master
This patch adds a way for columns to provide GroupByVectorColumnSelectors, which control how the groupBy engine operates on them. ExpressionVirtualColumn uses this mechanism to provide an ExpressionDeferredGroupByVectorColumnSelector that uses the inputs of an expression as the grouping key. The actual expression evaluation is deferred until the grouped ResultRow is created. A new context parameter "deferExpressionDimensions" allows users to control when this deferred selector is used. The default is "fixedWidthNonNumeric", which changes the prior behavior. Users can restore the prior behavior by setting it to "singleString".
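To make the deferral idea concrete, here is a minimal, self-contained sketch. It is not the Druid implementation: the class DeferredGroupBySketch, the Row type, and the evaluation counter are all hypothetical and exist only for illustration. It groups rows on the expression's inputs and evaluates a stand-in for CONCAT(string2, '-', long2) once per distinct input tuple, instead of once per row.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class DeferredGroupBySketch {
    /** Hypothetical input row holding the two columns the expression reads. */
    public static final class Row {
        final String string2;
        final long long2;

        public Row(String string2, long long2) {
            this.string2 = string2;
            this.long2 = long2;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Row)) return false;
            Row other = (Row) o;
            return long2 == other.long2 && Objects.equals(string2, other.string2);
        }

        @Override
        public int hashCode() {
            return Objects.hash(string2, long2);
        }
    }

    /** Counts expression evaluations so the two strategies can be compared. */
    public static int evaluations = 0;

    /** Stand-in for CONCAT(string2, '-', long2). */
    static String concat(Row r) {
        evaluations++;
        return r.string2 + "-" + r.long2;
    }

    /** Eager: evaluate the expression for every row and group on its output. */
    public static Map<String, Long> groupEager(List<Row> rows) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Row r : rows) {
            counts.merge(concat(r), 1L, Long::sum);
        }
        return counts;
    }

    /** Deferred: group on the inputs, then evaluate once per distinct input tuple. */
    public static Map<String, Long> groupDeferred(List<Row> rows) {
        Map<Row, Long> byInputs = new LinkedHashMap<>();
        for (Row r : rows) {
            byInputs.merge(r, 1L, Long::sum);
        }
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Map.Entry<Row, Long> e : byInputs.entrySet()) {
            // merge() also covers distinct inputs that happen to yield the same output
            counts.merge(concat(e.getKey()), e.getValue(), Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("a", 1), new Row("a", 1), new Row("b", 2), new Row("a", 1));

        evaluations = 0;
        Map<String, Long> eager = groupEager(rows);
        int eagerEvals = evaluations;

        evaluations = 0;
        Map<String, Long> deferred = groupDeferred(rows);

        System.out.println(eager.equals(deferred));            // prints "true"
        System.out.println(eagerEvals + " vs " + evaluations); // prints "4 vs 2"
    }
}
```

In this toy model the saving grows with the number of rows per distinct input tuple. It says nothing about the cost of building a wider grouping key, which is the part the real selector has to weigh.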
processing/src/main/java/org/apache/druid/segment/VirtualColumn.java
There's also SqlGroupByBenchmark, which benchmarks the code with various distributions and cardinalities. Maybe we should benchmark the code with the string columns and different parameters.
Are you suggesting adding a new @Benchmark method to SqlGroupByBenchmark that uses a SQL query with expressions?
Benchmarks of a few selected queries from SqlExpressionBenchmark follow. Findings:
GROUP BY CONCAT(string2, '-', long2) speeds up when the expression is deferred.
GROUP BY TIME_FLOOR(TIMESTAMPADD(DAY, -1, __time), GROUP BY long1 * long2, GROUP BY CAST(long1 as BOOLEAN) AND CAST (long2 as BOOLEAN), and GROUP BY long5 IS NULL, long3 IS NOT NULL: all are simple expressions with numeric inputs and outputs.
For these reasons, I think fixedWidthNonNumeric is a good default.
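For anyone who wants to compare against the prior behavior on their own queries, the parameter can be set per query in the query context. The sketch below only builds the JSON request body, using the "query" and "context" fields that the Druid SQL API accepts. The class DeferContextSketch, its helper methods, and the table name foo are hypothetical, and the hand-rolled escaping is deliberately minimal.

```java
public class DeferContextSketch {
    /** Minimal JSON string escaping: backslashes and double quotes only (no control characters). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds a SQL request body that sets deferExpressionDimensions in the query context. */
    public static String sqlPayload(String sql, String deferMode) {
        return "{\"query\":\"" + escape(sql) + "\","
            + "\"context\":{\"deferExpressionDimensions\":\"" + escape(deferMode) + "\"}}";
    }

    public static void main(String[] args) {
        // "foo" is a placeholder table name for illustration only.
        String sql = "SELECT CONCAT(string2, '-', long2), COUNT(*) FROM foo GROUP BY 1";
        // "singleString" is the value the description gives for restoring the prior behavior.
        System.out.println(sqlPayload(sql, "singleString"));
    }
}
```

In real code a JSON library would be a safer way to build the body than string concatenation.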