
Spark Action to Analyze table #10288

Open · wants to merge 3 commits into main
Conversation

karuppayya (Contributor):

This change adds a Spark action to analyze tables.
As part of the analysis, the action generates an Apache DataSketches theta sketch for NDV stats and writes it as a Puffin file.
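For context, a hypothetical invocation of the proposed action might look like the sketch below. The analyzeTable(...) entry point and the blob-type string passed to stats(...) are inferred from the interface snippets quoted later in this review and are assumptions, not necessarily the PR's final API.

```java
import java.util.Set;

import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

// Hypothetical usage sketch of the proposed AnalyzeTable action (names are assumptions):
// compute NDV theta sketches for the table and persist them as a Puffin file
// registered against the table's current snapshot.
public class AnalyzeExample {
  static void analyze(SparkSession spark, Table table) {
    SparkActions.get(spark)
        .analyzeTable(table)                           // entry point added by this PR (assumed name)
        .stats(Set.of("apache-datasketches-theta-v1")) // stats identified by blob type, per the review below
        .execute();
  }
}
```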

@karuppayya (Contributor Author):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Computes the statistic of the given columns and stores it as Puffin files. */
Member:

AnalyzeTableSparkAction is a generic name, and I see that in the future we want to compute partition stats too, which may not be written as Puffin files.

Either we can rename this to computeNDVSketches, or make it generic enough that any kind of stats can be computed from it.

Member:

Thinking more on this, I think we should just call it computeNDVSketches and not mix it with partition stats.

karuppayya (Contributor Author):

I tried to follow the model of RDBMSs and engines like Trino, which use ANALYZE TABLE <tblName> to collect all table-level stats.
With a procedure-per-stat model, the user has to invoke a procedure/action for every stat, and whenever a new stat is added, the user needs to update their code to call the new procedure/action.

> not mix it with partition stats.

I think we could have partition stats as a separate action, since those are per partition, whereas this procedure can collect top-level table stats.

Comment:

@karuppayya
I can see the tests in TestAnalyzeTableAction, and they work fine.
But have we tested in Spark whether it works with a query like
"ANALYZE TABLE table1 COMPUTE STATISTICS"?

Because generally that gives the error
"[NOT_SUPPORTED_COMMAND_FOR_V2_TABLE] ANALYZE TABLE is not supported for v2 tables."

karuppayya (Contributor Author):

Spark does not have the grammar for analyzing these tables.
This PR introduces a Spark action. In a subsequent PR, I plan to introduce an Iceberg procedure to invoke the Spark action.

@rice668 (Contributor), May 30, 2024:

> I see that in the future we want to compute partition stats too, which may not be written as Puffin files.

Hi @ajantha-bhat, I agree with you; otherwise the queries would have a lot of limitations, such as being applicable only for calculating the NDV over the entire table.

For example, Trino might want to read the NDV values written by Spark to respond to queries. However, if the query has partition filter conditions, then Trino would not be able to use the pre-computed NDV information from Spark. So, what do you think?

@jeesou, May 31, 2024:

Hi @karuppayya, as the discussion above suggests, there can be multiple engines like Spark, Presto, Trino, etc. that might want to query the same data. In such a scenario, the sketches generated by Spark or, say, Presto must be readable by the other engines.

This question comes up because I ran an ANALYZE query on Presto, and the Puffin file it created looks like this:

{"blobs":[{"type":"apache-datasketches-theta-v1","fields":[2],"snapshot-id":7724902347602477706,
"sequence-number":1,"offset":44,"length":40,"properties":{"ndv":"3"}}],"properties":{"created-by":"presto-testversion"}}

whereas the one created by Iceberg through the changes in this PR looks like this:

{"blobs":[{"type":"apache-datasketches-theta-v1","fields":[3],
"snapshot-id":5334747061548805461,"sequence-number":1,"offset":4,"length":32}],"properties"

Looking closely, the {"ndv":"3"} portion is missing in the Iceberg output.

So can we make any modifications, or do you have any suggestions?
Because, as I understand it, the sketch file should be universal across all engines.

karuppayya (Contributor Author):

@jeesou
Yes, agreed that the sketch needs to be compatible across all engines.
This PR takes care of using the same library (Apache DataSketches) as Trino does; that was the major concern here.
Do we need to add the ndv property? Shouldn't engines be reading the value from the sketch?

Contributor:

Hm, this discussion makes me wonder if we're under-specified in this regard. According to the spec:

https://iceberg.apache.org/puffin-spec/#blob-types

The blob metadata for this blob may include following properties:
    ndv: estimate of number of distinct values, derived from the sketch.

It really seems like we should take a stance. Either it must be in the sketch or it must be in the properties. "may include" seems a little too loose.
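For illustration, here is a minimal sketch of how the derived ndv estimate could be attached as a blob property so other engines see the same metadata Presto writes. It assumes the Puffin Blob constructor variant that accepts a compression codec and a properties map, as used elsewhere in this PR; the helper name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;

import org.apache.datasketches.theta.CompactSketch;
import org.apache.iceberg.puffin.Blob;
import org.apache.iceberg.puffin.StandardBlobTypes;

// Hypothetical helper: wrap a compacted theta sketch into a Puffin blob that also carries
// the "ndv" estimate derived from the sketch, as the Puffin spec allows.
public class NdvBlobs {
  static Blob ndvBlob(CompactSketch sketch, int fieldId, long snapshotId, long sequenceNumber) {
    return new Blob(
        StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1,
        List.of(fieldId),
        snapshotId,
        sequenceNumber,
        ByteBuffer.wrap(sketch.toByteArray()),
        null, // uncompressed, as discussed later in this review
        Map.of("ndv", String.valueOf((long) sketch.getEstimate())));
  }
}
```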

spark(), table, columnsToBeAnalyzed.toArray(new String[0]));
table
.updateStatistics()
.setStatistics(table.currentSnapshot().snapshotId(), statisticsFile)
Member:

What if the table's current snapshot has been modified concurrently by another client between line 117 and line 120?
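One way to avoid that race, sketched under the assumption that the commit API stays as shown above (setStatistics(snapshotId, statisticsFile)): resolve the snapshot ID once at the start of the action and commit against that same ID, rather than re-reading table.currentSnapshot() at commit time. The writeNDVPuffin helper below is a placeholder for the PR's Puffin-writing step, not its actual method.

```java
import java.util.List;

import org.apache.iceberg.StatisticsFile;
import org.apache.iceberg.Table;

// Sketch only: pin the snapshot once so a concurrent commit cannot change the target
// snapshot between computing the sketches and registering the statistics file.
public class PinnedSnapshotStats {
  static void computeAndCommit(Table table, List<String> columns) {
    long snapshotId = table.currentSnapshot().snapshotId();
    StatisticsFile statisticsFile = writeNDVPuffin(table, snapshotId, columns);
    table.updateStatistics()
        .setStatistics(snapshotId, statisticsFile)
        .commit();
  }

  // Placeholder for the sketch-computation and Puffin-writing step in this PR.
  static StatisticsFile writeNDVPuffin(Table table, long snapshotId, List<String> columns) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }
}
```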


public static Iterator<Tuple2<String, ThetaSketchJavaSerializable>> computeNDVSketches(
SparkSession spark, String tableName, String... columns) {
String sql = String.format("select %s from %s", String.join(",", columns), tableName);
Member:

I think we should also think about incremental updates, i.e. updating the sketches from a previous checkpoint. Querying the whole table may not be efficient.

karuppayya (Contributor Author):

Yes, incremental updates need to be wired into the write paths.
This procedure could exist in parallel and compute stats for the whole table on demand.
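As a rough illustration of the incremental idea: theta sketches are mergeable, so a sketch persisted for an earlier snapshot could be unioned with a sketch built over only the newly added data instead of rescanning the whole table. This sketch assumes a recent datasketches-java where Union.union(Sketch) is available; the method is illustrative and not part of this PR.

```java
import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.Sketches;
import org.apache.datasketches.theta.Union;

// Merge a previously persisted sketch (e.g. read back from a Puffin blob) with a sketch
// computed over only the data appended since the last checkpoint.
public class SketchMerge {
  static CompactSketch merge(byte[] previousSketchBytes, Sketch incrementalSketch) {
    Union union = SetOperation.builder().buildUnion();
    union.union(Sketches.wrapSketch(Memory.wrap(previousSketchBytes)));
    union.union(incrementalSketch);
    return union.getResult();
  }
}
```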

assumeTrue(catalogName.equals("spark_catalog"));
sql(
"CREATE TABLE %s (id int, data string) USING iceberg TBLPROPERTIES"
+ "('format-version'='2')",
Member:

The default format version is v2 now, so specifying it again is redundant.

String path = operations.metadataFileLocation(String.format("%s.stats", UUID.randomUUID()));
OutputFile outputFile = fileIO.newOutputFile(path);
try (PuffinWriter writer =
Puffin.write(outputFile).createdBy("Spark DistinctCountProcedure").build()) {
Member:

I like this name instead of "analyze table procedure".

@ajantha-bhat (Member):

There was an old PR on the same: #6582

@huaxingao (Contributor):

> There was an old PR on the same: #6582

I don't have time to work on this, so karuppayya will take over. Thanks a lot @karuppayya for continuing the work.

@amogh-jahagirdar (Contributor) left a comment:

Thanks @karuppayya @huaxingao @szehon-ho, this is awesome to see! I left a review of the API/implementation; I still have yet to review the tests, which look to be a WIP.

* @param statsToBeCollected set of statistics to be collected
* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);
Contributor:

Should these stats be a Set<StandardBlobType> instead of arbitrary Strings? I feel like the API becomes better defined in that case.

Contributor:

Oh I see, StandardBlobType defines string constants, not enums.

Comment on lines +89 to +98
private void validateColumns() {
validateEmptyColumns();
validateTypes();
}

private void validateEmptyColumns() {
if (columnsToBeAnalyzed == null || columnsToBeAnalyzed.isEmpty()) {
throw new ValidationException("No columns to analyze for the table", table.name());
}
}
Contributor:

Nit: I think this validation should just happen at the time of setting these on the action rather than at execution time.

* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);

Contributor:

I also think this interface should have a snapshot API to allow users to pass in a snapshot to generate the statistics for. If it's not specified, we can generate the statistics for the latest snapshot.
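One possible shape for that suggestion (hypothetical, not what the PR currently has): add a snapshot(long) method alongside stats(...), defaulting to the current snapshot when it is not called.

```java
import java.util.Set;

// Hypothetical extension of the proposed interface; method names are assumptions.
interface AnalyzeTable {
  /** Statistics to collect, identified by blob type. */
  AnalyzeTable stats(Set<String> statsToBeCollected);

  /** Snapshot to analyze; defaults to the table's current snapshot if not set. */
  AnalyzeTable snapshot(long snapshotId);
}
```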

Comment on lines +104 to +106
if (field == null) {
throw new ValidationException("No column with %s name in the table", columnName);
}
Contributor:

Style nit: new line after the if

SparkSession spark, Table table, long snapshotId, String... columnsToBeAnalyzed)
throws IOException {
Iterator<Tuple2<String, ThetaSketchJavaSerializable>> tuple2Iterator =
NDVSketchGenerator.computeNDVSketches(spark, table.name(), snapshotId, columnsToBeAnalyzed);
Contributor:

Does computeNDVSketches need to be public? It seems like it could just be package-private. Also, nit: either way, I don't think you need the fully qualified method name.

import org.apache.datasketches.theta.Sketches;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaSketchJavaSerializable implements Serializable {
Contributor:

Does this need to be public?

Comment on lines +46 to +51
if (sketch == null) {
return null;
}
if (sketch instanceof UpdateSketch) {
return sketch.compact();
}
Contributor:

Style nit: new line after if

null,
ImmutableMap.of()));
}
writer.finish();
Contributor:

Nit: I don't think you need the writer.finish(), because the try-with-resources will close the writer, and close already finishes it.

table.currentSnapshot().snapshotId(),
table.currentSnapshot().sequenceNumber(),
ByteBuffer.wrap(sketchMap.get(columns.get(i)).getSketch().toByteArray()),
null,
Contributor:

null means that the file will be uncompressed. I think it makes sense not to compress these files by default since the sketch will be a single long per column, so it'll be quite small already and not worth paying the price of compression/decompression.

Comment on lines +157 to +165
if (sketch1.getSketch() == null && sketch2.getSketch() == null) {
return emptySketchWrapped;
}
if (sketch1.getSketch() == null) {
return sketch2;
}
if (sketch2.getSketch() == null) {
return sketch1;
}
Contributor:

Style nit: new line after if
