HIVE-28268: Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false #5215

zhangbutao · 2024-04-25T14:47:30Z

What changes were proposed in this pull request?

At present, when iceberg.hive.keep.stats=true and hive.compute.query.using.stats=true, HS2 runs a fetch task that reads the Iceberg table's numRows property from HMS to answer count queries.
If iceberg.hive.keep.stats=false, HS2 always launches a Tez task to compute the table's row count when a count query is run.

However, an Iceberg table's metadata already carries stats information. When iceberg.hive.keep.stats=false, or when no stats are stored in HMS, we can instead start a fetch task that retrieves the row count from the Iceberg snapshot summary. This avoids launching a Tez task to compute the table's row count.

BTW, time travel and branch/tag queries have different stats from the current snapshot, so we need to resolve the specific snapshot ID for the table version being queried. Otherwise, we get the wrong stats when querying with time travel or a branch/tag.
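
As a rough illustration of the idea above, the sketch below picks the snapshot a query targets (named ref first, then an as-of timestamp, otherwise the current snapshot) and reads the row count from that snapshot's summary. It is a self-contained stand-in model, not the code in this PR: `Snap`, `resolve`, and `rowCount` are hypothetical names, and the only detail taken from Iceberg is the `total-records` summary key.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

/** Stand-in model for illustration only; these are not the Iceberg or Hive classes. */
class SnapshotRowCountSketch {

  /** Hypothetical minimal snapshot: id, commit time, and a summary map. */
  static final class Snap {
    final long id;
    final long timestampMillis;
    final Map<String, String> summary;

    Snap(long id, long timestampMillis, Map<String, String> summary) {
      this.id = id;
      this.timestampMillis = timestampMillis;
      this.summary = summary;
    }
  }

  // Summary key under which Iceberg records the table-wide record count.
  static final String TOTAL_RECORDS = "total-records";

  /** Resolves the targeted snapshot: branch/tag ref, then as-of timestamp, else current. */
  static Snap resolve(List<Snap> history, Map<String, Snap> refs, Snap current,
                      String refName, Long asOfMillis) {
    if (refName != null) {
      return refs.get(refName); // null when the ref is unknown
    }
    if (asOfMillis != null) {
      Snap best = null;
      for (Snap s : history) {
        // Latest snapshot committed at or before the requested time.
        if (s.timestampMillis <= asOfMillis
            && (best == null || s.timestampMillis > best.timestampMillis)) {
          best = s;
        }
      }
      return best;
    }
    return current;
  }

  /** Returns the row count from the summary, or empty so the caller can fall back to a scan. */
  static OptionalLong rowCount(Snap snap) {
    if (snap == null || snap.summary == null) {
      return OptionalLong.empty();
    }
    String value = snap.summary.get(TOTAL_RECORDS);
    if (value == null) {
      return OptionalLong.empty();
    }
    try {
      return OptionalLong.of(Long.parseLong(value));
    } catch (NumberFormatException e) {
      return OptionalLong.empty();
    }
  }
}
```

Returning an empty result rather than zero when the summary has no usable count leaves room for the caller to fall back to the regular Tez-based count.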

Why are the changes needed?

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Qtest

zhangbutao · 2024-04-28T09:27:59Z

iceberg/iceberg-handler/src/test/results/positive/write_iceberg_branch.q.out

@@ -237,19 +237,19 @@ STAGE PLANS:
                  alias: ice01
                  filterExpr: (a = 22) (type: boolean)
                  Snapshot ref: branch_test1
-                  Statistics: Num rows: 3 Data size: 291 Basic stats: COMPLETE Column stats: COMPLETE
+                  Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: COMPLETE


Before this PR, the row count for branch/tag/time-travel queries was always taken from the current snapshot summary, which is not right.

sonarcloud · 2024-04-30T10:44:30Z

Quality Gate passed

Issues
5 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

deniskuzZ · 2024-05-27T10:17:05Z

Hi @zhangbutao, Hive has an optimization for Iceberg's count(*) - HIVE-27347, where stats are supposed to be taken from the snapshot.

deniskuzZ · 2024-05-27T10:20:48Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

@@ -442,10 +446,11 @@ public Map<String, String> getBasicStatistics(Partish partish) {
    org.apache.hadoop.hive.ql.metadata.Table hmsTable = partish.getTable();
    // For write queries where rows got modified, don't fetch from cache as values could have changed.
    Table table = getTable(hmsTable);
+    Snapshot snapshot = getSpecificSnapshot(partish.getTable(), table);


Should we move the snapshot fetch under the if?

zhangbutao · 2024-05-27T10:25:20Z

> Hi @zhangbutao, Hive has an optimization for Iceberg's count(*) - HIVE-27347, where stats are supposed to be taken from the snapshot.

Actually, HIVE-27347 gets the row count from the table parameter StatsSetupConst.ROW_COUNT (numRows) stored in HMS, not from the snapshot.

Check the code:

hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java

Lines 940 to 942 in a57e580

    
           } 
        
           long partRowCnt = Long.parseLong(part.getParameters().get(StatsSetupConst.ROW_COUNT)); 
        
           rowCnt += partRowCnt;

deniskuzZ · 2024-05-27T10:28:26Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

@@ -2173,4 +2179,32 @@ public List<FileStatus> getMergeTaskInputFiles(Properties properties) throws IOE
  public MergeTaskProperties getMergeTaskProperties(Properties properties) {
    return new IcebergMergeTaskProperties(properties);
  }
+
+  private Snapshot getSpecificSnapshot(org.apache.hadoop.hive.ql.metadata.Table hmsTable, Table table) {


Could you please move this function to IcebergTableUtil?

deniskuzZ · 2024-05-27T10:32:48Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

+    Snapshot snapshot;
+    if (refName != null) {
+      snapshot = table.snapshot(refName);
+    } else if (hmsTable.getAsOfTimestamp() != null) {


how about as of tag?

deniskuzZ · 2024-05-27T10:35:05Z

iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java

@@ -2173,4 +2179,32 @@ public List<FileStatus> getMergeTaskInputFiles(Properties properties) throws IOE
  public MergeTaskProperties getMergeTaskProperties(Properties properties) {
    return new IcebergMergeTaskProperties(properties);
  }
+
+  private Snapshot getSpecificSnapshot(org.apache.hadoop.hive.ql.metadata.Table hmsTable, Table table) {


getTableSnapshot()

deniskuzZ · 2024-05-27T10:39:06Z

ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveUtils.java

-    Matcher ref = SNAPSHOT_REF.matcher(refName);
-    if (ref.matches()) {
-      return ref.group(1);
+    if (refName != null && !refName.isEmpty()) {


    if (StringUtils.isEmpty(refName)) {
      return null;
    }
    Matcher ref = SNAPSHOT_REF.matcher(refName);
    return ref.matches() ? ref.group(1) : null;
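
For reference, a standalone version of the suggested early-return shape might look like the sketch below. The regex is an assumption (refs encoded as `branch_<name>` or `tag_<name>`, in line with the `Snapshot ref: branch_test1` plan output); the real `SNAPSHOT_REF` pattern in HiveUtils may differ, and `extractRefName` is a hypothetical method name. A plain null/empty check stands in for `StringUtils.isEmpty` to keep the example dependency-free.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SnapshotRefSketch {

  // Assumed pattern for illustration; not copied from HiveUtils.
  private static final Pattern SNAPSHOT_REF = Pattern.compile("(?:branch_|tag_)(.*)");

  /** Returns the bare ref name, or null when the input is empty or not a branch/tag ref. */
  static String extractRefName(String refName) {
    if (refName == null || refName.isEmpty()) {
      return null;
    }
    Matcher ref = SNAPSHOT_REF.matcher(refName);
    return ref.matches() ? ref.group(1) : null;
  }
}
```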

deniskuzZ · 2024-05-27T10:48:49Z

ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java

@@ -943,6 +944,11 @@ private Long getRowCnt(
        }
      } else { // unpartitioned table
        if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, tbl.getParameters())) {
+          if (MetaStoreUtils.isNonNativeTable(tbl.getTTable())
+                  && tbl.getStorageHandler().canComputeQueryUsingStats(tbl)) {
+            return Long.valueOf(tbl.getStorageHandler().getBasicStatistics(Partish.buildFor(tbl))


This can throw a NullPointerException when statsSource != ICEBERG.
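
One way to guard against this, sketched under the assumption that the handler may return no stats map (or a map without a row count) when the stats source is not Iceberg, is to return null and let the optimizer fall through to its normal path. `rowCountOrNull` is a hypothetical helper, not part of StatsOptimizer; `numRows` is the value of StatsSetupConst.ROW_COUNT mentioned earlier in this thread.

```java
import java.util.Map;

class RowCountGuardSketch {

  // Same key as StatsSetupConst.ROW_COUNT (numRows).
  static final String ROW_COUNT = "numRows";

  /** Returns the parsed row count, or null when no usable value is available. */
  static Long rowCountOrNull(Map<String, String> basicStats) {
    if (basicStats == null) {
      return null; // handler returned no stats at all
    }
    String value = basicStats.get(ROW_COUNT);
    if (value == null) {
      return null; // stats present but without a row count
    }
    try {
      return Long.valueOf(value);
    } catch (NumberFormatException e) {
      return null; // malformed value; treat as unavailable
    }
  }
}
```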

deniskuzZ · 2024-05-27T10:57:12Z

@zhangbutao, do you know if iceberg provides partition row_count stats?
https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk
if not, maybe we can get it from meta table:

SELECT record_count FROM prod.db.table.partitions where spec_id in (....)

zhangbutao · 2024-05-29T08:56:54Z

> @zhangbutao, do you know if iceberg provides partition row_count stats? https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk if not, maybe we can get it from meta table:
> SELECT record_count FROM prod.db.table.partitions where spec_id in (....)

IMO, Iceberg's partition stats feature is still in development; see, for example, https://github.com/apache/iceberg/pull/9170/files.
In addition, the partition stats feature starts from Iceberg 1.5.0, so we would need to upgrade the Iceberg dependency.

I will try using the meta table to get partition stats.

Iceberg: Retrieve row count from iceberg SnapshotSummary in case of i…

d31edef

…ceberg.hive.keep.stats=false

zhangbutao marked this pull request as draft April 25, 2024 14:47

asf-ci-hive added tests pending tests unstable and removed tests pending labels Apr 25, 2024

code minor optimization

96266a9

asf-ci-hive added tests pending tests failed and removed tests unstable tests pending labels Apr 26, 2024

zhangbutao force-pushed the iceberg_count_optimize branch from 0c14207 to 1a953e8 Compare April 26, 2024 04:45

asf-ci-hive added tests pending tests unstable and removed tests failed tests pending labels Apr 26, 2024

zhangbutao force-pushed the iceberg_count_optimize branch from 1a953e8 to 77d9a7e Compare April 28, 2024 09:22

asf-ci-hive added tests pending and removed tests unstable labels Apr 28, 2024

zhangbutao commented Apr 28, 2024

zhangbutao force-pushed the iceberg_count_optimize branch from 77d9a7e to 441db00 Compare April 28, 2024 09:44

asf-ci-hive added tests failed tests pending and removed tests pending tests failed labels Apr 28, 2024

zhangbutao force-pushed the iceberg_count_optimize branch from 441db00 to 9971db5 Compare April 29, 2024 05:15

asf-ci-hive added tests pending and removed tests failed tests pending labels Apr 29, 2024

asf-ci-hive added the tests unstable label Apr 29, 2024

Get stats based on specific snapshot

0ffc9df

zhangbutao force-pushed the iceberg_count_optimize branch from 9971db5 to 0ffc9df Compare April 30, 2024 02:34

asf-ci-hive added tests pending tests failed and removed tests unstable tests pending tests failed labels Apr 30, 2024

asf-ci-hive added tests passed and removed tests pending labels Apr 30, 2024

zhangbutao mentioned this pull request May 20, 2024

HIVE-28266: Iceberg: select count(*) from data_files metadata tables … #5253

Merged

zhangbutao marked this pull request as ready for review May 21, 2024 10:20

zhangbutao changed the title ~~Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false~~ HIVE-28268: Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false May 21, 2024

zhangbutao requested review from SourabhBadhya and deniskuzZ May 21, 2024 10:22

deniskuzZ reviewed May 27, 2024
