Join reordering stats11 speedup #605

sopel39 · 2017-07-05T12:59:39Z

Using dedicated cache for join stats

After:

Benchmark                                                  (joinReorderingStrategy)  (numberOfTables)  Mode  Cnt     Score     Error  Units
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins                COST_BASED                10  avgt   30  5814,628 ± 188,457  ms/op

Before:

Benchmark                                                  (joinReorderingStrategy)  (numberOfTables)  Mode  Cnt      Score      Error  Units
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins                COST_BASED                10  avgt   10  25232,545 ± 1046,160  ms/op

The join_distribution_type property has three options: "repartitioned", "replicated", and "automatic". This property replaces the "distributed_joins" property. "Repartitioned" has the same behavior as distributed_joins=true. "Replicated" is equivalent to distributed_joins=false. "Automatic" will use stats to evaluate whether a partitioned or replicated join is better; if no information is available, it will choose partitioned. The default value for join_distribution_type is "repartitioned".
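A minimal sketch of the selection logic described above, not the actual Presto code. The class, enum, and method names are hypothetical, and the cost inputs are assumed to be empty whenever stats are unavailable.

```java
import java.util.OptionalDouble;

// Hypothetical illustration; names do not come from the Presto code base.
final class JoinDistributionChoice
{
    enum JoinDistributionType { REPARTITIONED, REPLICATED, AUTOMATIC }

    enum Distribution { PARTITIONED, REPLICATED }

    private JoinDistributionChoice() {}

    // Assumption: a cost is empty when no stats are available for that alternative.
    static Distribution choose(JoinDistributionType type, OptionalDouble partitionedCost, OptionalDouble replicatedCost)
    {
        switch (type) {
            case REPARTITIONED:
                return Distribution.PARTITIONED;
            case REPLICATED:
                return Distribution.REPLICATED;
            case AUTOMATIC:
                // Fall back to partitioned when either cost is unknown.
                if (!partitionedCost.isPresent() || !replicatedCost.isPresent()) {
                    return Distribution.PARTITIONED;
                }
                return replicatedCost.getAsDouble() < partitionedCost.getAsDouble()
                        ? Distribution.REPLICATED
                        : Distribution.PARTITIONED;
            default:
                throw new IllegalArgumentException("Unknown type: " + type);
        }
    }
}
```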

Create a new property called join_reordering_strategy with the options ELIMINATE_CROSS_JOINS, COST_BASED and NONE. ELIMINATE_CROSS_JOINS is equivalent to the previous reorder_joins=true. COST_BASED will use join enumeration to make a cost-based decision of join order. NONE will maintain the syntactic join order, and is equivalent to the previous reorder_joins=false.
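The mapping from the old boolean to the new strategy can be sketched as follows. The class and method names are hypothetical; only the enum values and the equivalences come from the description above.

```java
// Hypothetical helper; only the three option names come from the description above.
final class JoinReorderingStrategies
{
    enum JoinReorderingStrategy { ELIMINATE_CROSS_JOINS, COST_BASED, NONE }

    private JoinReorderingStrategies() {}

    // reorder_joins=true  -> ELIMINATE_CROSS_JOINS
    // reorder_joins=false -> NONE
    // COST_BASED has no legacy equivalent and must be selected explicitly.
    static JoinReorderingStrategy fromLegacyReorderJoins(boolean reorderJoins)
    {
        return reorderJoins ? JoinReorderingStrategy.ELIMINATE_CROSS_JOINS : JoinReorderingStrategy.NONE;
    }
}
```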

Allowing the LocalQueryRunner to estimate costs using a fake node count allows unit tests to consider network costs and different cluster configurations.

Change ExpressionUtils.binaryExpression to return TRUE on an empty list
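A rough sketch of the empty-list behavior, using plain strings in place of Presto's Expression classes. The class and method names are hypothetical, and the simple left fold is an assumption that may not match how the real method combines its operands. TRUE is the identity element for AND, so an empty conjunction evaluates naturally to TRUE.

```java
import java.util.List;

// Hypothetical stand-in that uses strings instead of Presto Expression nodes.
final class BinaryExpressionSketch
{
    private BinaryExpressionSketch() {}

    static String combine(String operator, List<String> expressions)
    {
        // An empty operand list yields the literal TRUE instead of failing.
        if (expressions.isEmpty()) {
            return "TRUE";
        }
        // Left fold: ((a OP b) OP c) ...
        String result = expressions.get(0);
        for (String expression : expressions.subList(1, expressions.size())) {
            result = "(" + result + " " + operator + " " + expression + ")";
        }
        return result;
    }
}
```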

Add a rule to enumerate join order possibilities for a join graph and choose the least cost option. This does a minimal form of cross join elimination, by only partitioning nodes into groups that have at least one edge between them, which eliminates some unnecessary cross joins from consideration. It also means that necessary cross joins will always be executed as late as possible in the plan (which may be worse).
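The partitioning constraint can be illustrated with a small bitmask sketch. This is not the rule's actual code: the class name, the integer node numbering, and the edge representation are all assumptions made for the example. It lists every split of the join graph into two non-empty sides and keeps only the splits with at least one edge crossing between the sides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; nodes are numbered 0..nodeCount-1 and edges are pairs of node numbers.
final class JoinPartitionSketch
{
    private JoinPartitionSketch() {}

    // Returns the left side of each kept partition as a bitmask; the right side is the complement.
    static List<Integer> connectedPartitions(int nodeCount, int[][] edges)
    {
        int all = (1 << nodeCount) - 1;
        List<Integer> result = new ArrayList<>();
        // Only odd masks are visited, so node 0 is always on the left side
        // and mirrored partitions are not listed twice.
        for (int left = 1; left < all; left += 2) {
            int right = all & ~left;
            if (hasCrossingEdge(left, right, edges)) {
                result.add(left);
            }
        }
        return result;
    }

    // True when some edge has one endpoint on each side, so joining the sides is not a cross join.
    static boolean hasCrossingEdge(int left, int right, int[][] edges)
    {
        for (int[] edge : edges) {
            int a = 1 << edge[0];
            int b = 1 << edge[1];
            if (((left & a) != 0 && (right & b) != 0) || ((left & b) != 0 && (right & a) != 0)) {
                return true;
            }
        }
        return false;
    }
}
```

For a graph with edges 0-1 and 2-3, the split {0, 1} | {2, 3} is skipped because no edge connects the two sides. The sketch does not check that each side is itself connected, which matches the "minimal form" wording above.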

Results run on my development VM:

BenchmarkReorderJoinsConnectedGraph:

Benchmark                                                  (joinReorderingStrategy)  (numberOfTables)  Mode  Cnt      Score     Error  Units
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins     ELIMINATE_CROSS_JOINS                 2  avgt   30     54.610 ±   4.236  ms/op
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins     ELIMINATE_CROSS_JOINS                 4  avgt   30    153.794 ±   9.075  ms/op
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins     ELIMINATE_CROSS_JOINS                 6  avgt   30    326.410 ±  19.912  ms/op
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins     ELIMINATE_CROSS_JOINS                 8  avgt   30    578.028 ±  33.308  ms/op
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins     ELIMINATE_CROSS_JOINS                10  avgt   30    955.494 ±  44.523  ms/op
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins                COST_BASED                 2  avgt   30     54.844 ±   4.256  ms/op
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins                COST_BASED                 4  avgt   30    161.164 ±  11.008  ms/op
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins                COST_BASED                 6  avgt   30    440.007 ±  28.903  ms/op
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins                COST_BASED                 8  avgt   30   2491.240 ±  72.341  ms/op
BenchmarkReorderJoinsConnectedGraph.benchmarkReorderJoins                COST_BASED                10  avgt   30  24026.603 ± 886.696  ms/op

BenchmarkReorderJoinsLinearGraph:

Benchmark                                               (joinReorderingStrategy)  Mode  Cnt     Score    Error  Units
BenchmarkReorderJoinsLinearQuery.benchmarkReorderJoins     ELIMINATE_CROSS_JOINS  avgt   30   944.179 ± 42.406  ms/op
BenchmarkReorderJoinsLinearQuery.benchmarkReorderJoins                COST_BASED  avgt   30  1329.194 ± 71.704  ms/op

Previously there was a lot of map copying (through ImmutableMap.copyOf(...) and new HashMap(...)), which significantly hurt stats code performance. HashTreePMap is much better when individual entries of a base map are modified, which is the common case in stats code.

Not all CostCalculators are thread safe.

Previously JoinEnumerator#setJoinNodeProperties was recomputing join stats for each alternative of join even though actual stats didn't change. Now join stats are memoized by node id. This reduced enumeration time by a factor of 2.
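The memoization idea can be sketched roughly as below. This is a generic stand-in rather than the JoinNodeCachingStatsCalculator from this PR: the class name, the type parameters, and the call counter are hypothetical, and the sketch assumes that two alternatives sharing a node id really do have identical stats.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical generic sketch: I is the node id type, N the node type, S the stats type.
final class MemoizingStatsSketch<I, N, S>
{
    private final Function<N, I> idOf;
    private final Function<N, S> delegate;
    private final Map<I, S> cache = new HashMap<>();
    private int delegateCalls;

    MemoizingStatsSketch(Function<N, I> idOf, Function<N, S> delegate)
    {
        this.idOf = idOf;
        this.delegate = delegate;
    }

    // Computes stats once per node id; later alternatives with the same id reuse the cached value.
    // Not thread safe: the plain HashMap assumes single-threaded use.
    S getStats(N node)
    {
        I id = idOf.apply(node);
        S cached = cache.get(id);
        if (cached == null) {
            delegateCalls++;
            cached = delegate.apply(node);
            cache.put(id, cached);
        }
        return cached;
    }

    // Number of times the underlying calculation actually ran.
    int delegateCalls()
    {
        return delegateCalls;
    }
}
```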

rschlussel-zz · 2017-07-05T14:19:38Z

...o-main/src/test/java/com/facebook/presto/sql/planner/iterative/rule/TestMemoBasedLookup.java

@@ -60,7 +59,7 @@ public void testResolvesGroupReferenceNode()
        PlanNode plan = node(source);
        Memo memo = new Memo(idAllocator, plan);

-        MemoBasedLookup lookup = new MemoBasedLookup(memo, new NodeCountingStatsCalculator(), new CostCalculatorUsingExchanges(1));
+        Lookup lookup = Lookup.from(memo::resolve, new NodeCountingStatsCalculator(), new CostCalculatorUsingExchanges(1));


Are these tests still useful since there's no more MemoBasedLookup?

Caching calculator tests would probably be more useful. I will remove these tests for now.

rschlussel-zz · 2017-07-05T14:58:29Z

presto-main/src/main/java/com/facebook/presto/cost/JoinNodeCachingStatsCalculator.java

+        implements StatsCalculator
+{
+    private final StatsCalculator statsCalculator;
+    private final Map<PlanNodeId, PlanNodeStatsEstimate> stats = new HashMap<>();


Add a comment here about why it caches by node id.

kokosing

some minor comments, more offline

kokosing · 2017-07-06T08:23:53Z

presto-main/pom.xml

+        <dependency>
+            <groupId>org.pcollections</groupId>
+            <artifactId>pcollections</artifactId>
+        </dependency>


you cannot add something to pom.xml that is not used in the code. mvn validate will fail in that case.

kokosing · 2017-07-06T08:26:14Z

presto-main/src/main/java/com/facebook/presto/cost/PlanNodeStatsEstimate.java

+            return this;
+        }
+
+        public Builder removeSymbolStatistics(Symbol symbol)


having this method makes me think that this whole Builder is not needed anymore, as you can simply operate on SymbolStatsEstimate

Maybe, but you need to construct initial PlanNodeStatsEstimate for which the builder seems to be useful

kokosing · 2017-07-06T08:29:24Z

presto-main/src/main/java/com/facebook/presto/sql/planner/iterative/Lookup.java

@@ -102,7 +102,7 @@ public PlanNodeStatsEstimate getStats(PlanNode node, Session session, Map<Symbol
            @Override
            public PlanNodeCostEstimate getCumulativeCost(PlanNode node, Session session, Map<Symbol, Type> types)
            {
-                return costCalculator.calculateCumulativeCost(node, this, session, types);
+                return costCalculator.calculateCumulativeCost(resolve(node), this, session, types);


this does not look related

sopel39 · 2017-07-10T13:56:10Z

Waiting for reorder joins to pass tests then I will merge into reorder joins branch

sopel39 · 2017-07-10T15:24:35Z

presto-main/src/main/java/com/facebook/presto/sql/planner/iterative/rule/ReorderJoins.java

+            // Cross joins can't filter symbols as part of the join
+            // If we're doing a cross join, use all output symbols from the inputs and add a project node
+            // on top
+            List<Symbol> joinOutputSymbols = sortedOutputSymbols;


this is no longer needed since we don't allow cross joins

sopel39 · 2017-07-10T15:26:00Z

presto-main/src/main/java/com/facebook/presto/sql/planner/iterative/rule/ReorderJoins.java

+                    Optional.empty(),
+                    Optional.empty()));
+
+            if (!joinOutputSymbols.equals(sortedOutputSymbols)) {


I don't think it's needed either

sopel39 · 2017-07-10T15:32:28Z

...c/main/java/com/facebook/presto/sql/planner/optimizations/DetermineJoinDistributionType.java

@@ -111,6 +111,9 @@ public PlanNode visitDelete(DeleteNode node, RewriteContext<Void> context)

        private JoinNode.DistributionType getTargetJoinDistributionType(JoinNode node)
        {
+            if (node.getDistributionType().isPresent()) {


this is unrelated change to this commit. Make a (small) commit out of it.

sopel39 · 2017-07-13T15:46:29Z

presto-main/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java

@@ -347,7 +347,7 @@ public PlanOptimizers(
                        stats,
                        statsCalculator,
                        estimatedExchangesCostCalculator,
-                        ImmutableSet.of(new ReorderJoins(costComparator))
+                        ImmutableSet.of(new ReorderJoins(costComparator, statsCalculator, costCalculator))


estimatedExchangesCostCalculator should be used here

rschlussel-zz and others added 15 commits June 30, 2017 11:09

Support matching join distribution type in tests

74c80b2

Support inserting stats for plan unit tests

5454fff

Support passing statsCalculator to RuleAssert

0e609c8

Support using a fake node count for unit tests

6f49608

Allowing the LocalQueryRunner to estimate costs using a fake node count allows unit tests to consider network costs and different cluster configurations.

Make binaryExpression() handle empty list

8d126e1

Change ExpressionUtils.binaryExpression to return TRUE on an empty list

Add methods to flip join and set distribution type

3ba7f24

Add pcollections library dependency

79d2efa

Remove @ThreadSafe annotation from CostCalculator interface

09e0924

Not all CostCalculators are thread safe.

Introduce caching cost and stats calculator

bba57ca

sopel39 assigned kokosing and rschlussel-zz Jul 5, 2017

rschlussel-zz approved these changes Jul 5, 2017

rschlussel-zz force-pushed the join-reordering-stats11 branch from 844fe2d to c3d1a7a Compare July 5, 2017 15:12

kokosing reviewed Jul 10, 2017

sopel39 commented Jul 17, 2017
