
Make sources report their partitioning to Spark #176

Draft · EnricoMi wants to merge 20 commits into spark-3.2 from branch-report-partitioning

Conversation

@EnricoMi (Collaborator) commented on Mar 23, 2022

Having the sources report their partitioning to Spark allows Spark to exploit the existing partitioning and avoid shuffling the data for operations that require exactly that partitioning.

For instance, reading triples with predicate partitioning produces a Dataset that is already partitioned by the "predicate" column, so a groupBy("predicate") does not need to shuffle the data at all:

import uk.co.gresearch.spark.dgraph.connector._

val target = "localhost:9080"

// spark is the active SparkSession
spark.read
  .option(PartitionerOption, PredicatePartitionerOption)
  .dgraph.triples(target)
  .groupBy("predicate")
  .count()

The groupBy("predicate") will not shuffle the graph data after reading from Dgraph.

The Spark plan for this Dataset is:

*(1) HashAggregate(keys=[predicate#3377], functions=[count(1)], output=[predicate#3377, count#3425L])
+- *(1) HashAggregate(keys=[predicate#3377], functions=[partial_count(1)], output=[predicate#3377, count#3440L])
   +- *(1) Project [predicate#3377]
      +- BatchScan[subject#3376L, predicate#3377, objectUid#3378L, objectString#3379, objectLong#3380L, objectDouble#3381, objectTimestamp#3382, objectBoolean#3383, objectGeo#3384, objectPassword#3385, objectType#3386] class uk.co.gresearch.spark.dgraph.connector.TripleScan

Without reporting the existing partitioning, the plan would look like:

*(2) HashAggregate(keys=[predicate#3100], functions=[count(1)], output=[predicate#3100, count#3148L])
+- Exchange hashpartitioning(predicate#3100, 2), true, [id=#1300]
   +- *(1) HashAggregate(keys=[predicate#3100], functions=[partial_count(1)], output=[predicate#3100, count#3163L])
      +- *(1) Project [predicate#3100]
         +- BatchScan[subject#3099L, predicate#3100, objectUid#3101L, objectString#3102, objectLong#3103L, objectDouble#3104, objectTimestamp#3105, objectBoolean#3106, objectGeo#3107, objectPassword#3108, objectType#3109] class uk.co.gresearch.spark.dgraph.connector.TripleScan

By reporting the partitioning, Spark removes the Exchange hashpartitioning(predicate#3100, 2), true, [id=#1300] step, as it becomes redundant.
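
To verify this yourself, print the physical plan and look for an Exchange node (a quick sketch, reusing target from above; spark is the active SparkSession):

val counts = spark.read
  .option(PartitionerOption, PredicatePartitionerOption)
  .dgraph.triples(target)
  .groupBy("predicate")
  .count()

// with the partitioning reported, the printed plan contains
// no "Exchange hashpartitioning" step
counts.explain()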

This refactors `SingletonPartitioner` to extend `PredicatePartitioner` but with a single partition (all predicates in one partition). This allows `NodeSource` in wide node mode to reject any predicate-partitioned partitioner while relying on `SingletonPartitioner` to provide the same behaviour that `PredicatePartitioner` with a single partition provided so far.
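
For context, this is roughly the mechanism in the Spark 3.2 DataSource V2 API that the PR builds on (a minimal sketch; class and column names are illustrative, not the PR's actual code): a Scan implements SupportsReportPartitioning and returns a Partitioning that can satisfy a ClusteredDistribution.

import org.apache.spark.sql.connector.read.partitioning.{ClusteredDistribution, Distribution, Partitioning}
import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}

// Declares that the data is clustered by the given columns
// across `partitions` partitions.
class ColumnClusteredPartitioning(columns: Seq[String], partitions: Int) extends Partitioning {
  override def numPartitions: Int = partitions

  // Spark asks whether this partitioning satisfies a required distribution.
  // A clustered distribution is satisfied when it clusters on (at least) all
  // of our partitioning columns: rows with equal values in those columns then
  // already share a partition, so the planner can drop the Exchange.
  override def satisfy(distribution: Distribution): Boolean = distribution match {
    case c: ClusteredDistribution => columns.forall(c.clusteredColumns.contains)
    case _ => false
  }
}

// A scan opts in by implementing SupportsReportPartitioning, e.g. reporting
// the "predicate" column when reading with the predicate partitioner.
trait ReportsPredicatePartitioning extends Scan with SupportsReportPartitioning {
  def numPartitions: Int
  override def outputPartitioning(): Partitioning =
    new ColumnClusteredPartitioning(Seq("predicate"), numPartitions)
}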

github-actions bot commented Mar 23, 2022

Unit Test Results

   832 files   +26        832 suites   +26      34m 14s ⏱️ +6m 8s
   513 tests   +93        513 ✔️ +93            0 💤 ±0    0 ❌ ±0
13 338 runs  +2 418    13 338 ✔️ +2 418         0 💤 ±0    0 ❌ ±0

Results for commit 40599a5. ± Comparison against base commit 6c4b463.

This pull request removes 88 and adds 181 tests. Note that renamed tests count towards both.
uk.co.gresearch.spark.dgraph.connector.sources.TestEdgeSource ‑ EdgeDataSource should encode Edge
uk.co.gresearch.spark.dgraph.connector.sources.TestEdgeSource ‑ EdgeDataSource should fail without target
uk.co.gresearch.spark.dgraph.connector.sources.TestEdgeSource ‑ EdgeDataSource should load as a predicate partitions
uk.co.gresearch.spark.dgraph.connector.sources.TestEdgeSource ‑ EdgeDataSource should load as a single partition
uk.co.gresearch.spark.dgraph.connector.sources.TestEdgeSource ‑ EdgeDataSource should load edges in chunks
uk.co.gresearch.spark.dgraph.connector.sources.TestEdgeSource ‑ EdgeDataSource should load edges via implicit dgraph target
uk.co.gresearch.spark.dgraph.connector.sources.TestEdgeSource ‑ EdgeDataSource should load edges via implicit dgraph targets
uk.co.gresearch.spark.dgraph.connector.sources.TestEdgeSource ‑ EdgeDataSource should load edges via path
uk.co.gresearch.spark.dgraph.connector.sources.TestEdgeSource ‑ EdgeDataSource should load edges via paths
uk.co.gresearch.spark.dgraph.connector.sources.TestEdgeSource ‑ EdgeDataSource should load edges via target option
…
uk.co.gresearch.spark.dgraph.connector.TestTripleScan ‑ TripleScan with [col1,col2,col3] should not satisfy clustered distribution with unknown column: [col0]
uk.co.gresearch.spark.dgraph.connector.TestTripleScan ‑ TripleScan with [col1,col2,col3] should not satisfy partially overlapping clustered distribution: [col1,col2,col3,col0]
uk.co.gresearch.spark.dgraph.connector.TestTripleScan ‑ TripleScan with [col1,col2,col3] should not satisfy unknown distribution
uk.co.gresearch.spark.dgraph.connector.TestTripleScan ‑ TripleScan with [col1,col2,col3] should satisfy clustered distribution with more columns: [col1,col2,col3,col0]
uk.co.gresearch.spark.dgraph.connector.TestTripleScan ‑ TripleScan with [col1,col2,col3] should satisfy identical clustered distribution: col1,col2,col3
uk.co.gresearch.spark.dgraph.connector.TestTripleScan ‑ TripleScan with [col1,col2] should not satisfy clustered distribution with unknown column: [col0]
uk.co.gresearch.spark.dgraph.connector.TestTripleScan ‑ TripleScan with [col1,col2] should not satisfy partially overlapping clustered distribution: [col1,col2,col0]
uk.co.gresearch.spark.dgraph.connector.TestTripleScan ‑ TripleScan with [col1,col2] should not satisfy unknown distribution
uk.co.gresearch.spark.dgraph.connector.TestTripleScan ‑ TripleScan with [col1,col2] should satisfy clustered distribution with more columns: [col1,col2,col0]
uk.co.gresearch.spark.dgraph.connector.TestTripleScan ‑ TripleScan with [col1,col2] should satisfy identical clustered distribution: col1,col2
…

♻️ This comment has been updated with latest results.

@EnricoMi force-pushed the branch-report-partitioning branch 2 times, most recently from 8a53242 to 7c3cb43 on March 27, 2022 14:43
@EnricoMi (Collaborator, Author) commented:

With Spark 3.3, having the partitioning reported by `SupportsReportPartitioning` considered during planning requires `spark.sql.sources.v2.bucketing.enabled` to be true, which defaults to false:

apache/spark@20ffbf7#diff-13c5b65678b327277c68d17910ae93629801af00117a0e3da007afd95b6c6764R1337-R1341
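
On Spark 3.3 the flag can be enabled per session (a sketch; spark is the active SparkSession):

// assumption based on the linked commit: without this flag, Spark 3.3 ignores
// the partitioning reported by SupportsReportPartitioning during planning
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")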

github-actions bot commented Mar 1, 2023

Test Results

   992 files   +31        992 suites   +31      43m 30s ⏱️ +12m 44s
   513 tests   +93        513 ✔️ +93            0 💤 ±0    0 ❌ ±0
15 903 runs  +2 883    15 903 ✔️ +2 883         0 💤 ±0    0 ❌ ±0

Results for commit 9771957. ± Comparison against base commit 1429b4f.

This pull request removes 88 and adds 181 tests, the same tests as listed in the earlier results comment above. Note that renamed tests count towards both.

@EnricoMi marked this pull request as draft on May 3, 2023 18:55