Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-28268: Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false #5215

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

zhangbutao
Copy link
Contributor

@zhangbutao zhangbutao commented Apr 25, 2024

What changes were proposed in this pull request?

At present, in case of iceberg.hive.keep.stats=true & hive.compute.query.using.stats=true, HS2 will do a fetch task to get iceberg table's numRows property from HMS to optimize count query.
If iceberg.hive.keep.stats=false, HS2 will always launch tez task to compute table's row count when filing a count query.

However, as we know, iceberg table's metadata has some stats information, we can also just start a fetch task to retrieve the row count from iceberg's snapshot summary when iceberg.hive.keep.stats=false or no stats stored in hms. This can avoid launching tez task to compute the table's row count.

BTW, timetravel or branch/tag has different stats from current snapshot, so we need to get the specified snapshotid based on the different iceberg version. Otherwise, we will get the wrong stats when querying the time travel/branch/tag.

Why are the changes needed?

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Qtest

@@ -237,19 +237,19 @@ STAGE PLANS:
alias: ice01
filterExpr: (a = 22) (type: boolean)
Snapshot ref: branch_test1
Statistics: Num rows: 3 Data size: 291 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: COMPLETE
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this PR, we always get row count of branch/tag/timetravel by the current snapshot summary, which is not right.

Copy link

sonarcloud bot commented Apr 30, 2024

Quality Gate Passed Quality Gate passed

Issues
5 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@zhangbutao zhangbutao marked this pull request as ready for review May 21, 2024 10:20
@zhangbutao zhangbutao changed the title Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false HIVE-28268: Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false May 21, 2024
@deniskuzZ
Copy link
Member

Hi @zhangbutao, Hive has an optimization for Iceberg's count(*) - HIVE-27347, where stats supposed to be taken from snapshot

@@ -442,10 +446,11 @@ public Map<String, String> getBasicStatistics(Partish partish) {
org.apache.hadoop.hive.ql.metadata.Table hmsTable = partish.getTable();
// For write queries where rows got modified, don't fetch from cache as values could have changed.
Table table = getTable(hmsTable);
Snapshot snapshot = getSpecificSnapshot(partish.getTable(), table);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move snapshot fetch under the if

@zhangbutao
Copy link
Contributor Author

Hi @zhangbutao, Hive has an optimization for Iceberg's count(*) - HIVE-27347, where stats supposed to be taken from snapshot

Acutually, HIVE-27347 get row count stats from table's parameterStatsSetupConst.ROW_COUNT (numRows) stored in HMS. not from snapshot.

Check the code:

}
long partRowCnt = Long.parseLong(part.getParameters().get(StatsSetupConst.ROW_COUNT));
rowCnt += partRowCnt;

@@ -2173,4 +2179,32 @@ public List<FileStatus> getMergeTaskInputFiles(Properties properties) throws IOE
public MergeTaskProperties getMergeTaskProperties(Properties properties) {
return new IcebergMergeTaskProperties(properties);
}

private Snapshot getSpecificSnapshot(org.apache.hadoop.hive.ql.metadata.Table hmsTable, Table table) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please move this function to IcebergTableUtil

Snapshot snapshot;
if (refName != null) {
snapshot = table.snapshot(refName);
} else if (hmsTable.getAsOfTimestamp() != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about as of tag?

@@ -2173,4 +2179,32 @@ public List<FileStatus> getMergeTaskInputFiles(Properties properties) throws IOE
public MergeTaskProperties getMergeTaskProperties(Properties properties) {
return new IcebergMergeTaskProperties(properties);
}

private Snapshot getSpecificSnapshot(org.apache.hadoop.hive.ql.metadata.Table hmsTable, Table table) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getTableSnapshot()

Matcher ref = SNAPSHOT_REF.matcher(refName);
if (ref.matches()) {
return ref.group(1);
if (refName != null && !refName.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (StringUtils.isEmpty(refName)) {
  return null;
}
Matcher ref = SNAPSHOT_REF.matcher(refName);
return ref.matches()? ref.group(1) : null;

@@ -943,6 +944,11 @@ private Long getRowCnt(
}
} else { // unpartitioned table
if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, tbl.getParameters())) {
if (MetaStoreUtils.isNonNativeTable(tbl.getTTable())
&& tbl.getStorageHandler().canComputeQueryUsingStats(tbl)) {
return Long.valueOf(tbl.getStorageHandler().getBasicStatistics(Partish.buildFor(tbl))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can get NullPointer when statsSource != ICEBERG

@deniskuzZ
Copy link
Member

deniskuzZ commented May 27, 2024

@zhangbutao, do you know if iceberg provides partition row_count stats?
https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk
if not, maybe we can get it from meta table:

SELECT record_count FROM prod.db.table.partitions where spec_id in (....)

@zhangbutao
Copy link
Contributor Author

@zhangbutao, do you know if iceberg provides partition row_count stats? https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk if not, maybe we can get it from meta table:

SELECT record_count FROM prod.db.table.partitions where spec_id in (....)

IMO, iceberg's partition stats feature is in development. Such as https://github.com/apache/iceberg/pull/9170/files.
In addition, partition stats feature started from Iceberg1.5.0, so we need to upgrade icenerg dependency.

I will try to play meta table to get partition stats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants