-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28268: Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false #5215
base: master
Are you sure you want to change the base?
Conversation
…ceberg.hive.keep.stats=false
0c14207
to
1a953e8
Compare
1a953e8
to
77d9a7e
Compare
@@ -237,19 +237,19 @@ STAGE PLANS: | |||
alias: ice01 | |||
filterExpr: (a = 22) (type: boolean) | |||
Snapshot ref: branch_test1 | |||
Statistics: Num rows: 3 Data size: 291 Basic stats: COMPLETE Column stats: COMPLETE | |||
Statistics: Num rows: 5 Data size: 485 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this PR, we always get row count
of branch/tag/timetravel by the current snapshot summary, which is not right.
77d9a7e
to
441db00
Compare
441db00
to
9971db5
Compare
9971db5
to
0ffc9df
Compare
Quality Gate passedIssues Measures |
Hi @zhangbutao, Hive has an optimization for Iceberg's count(*) - HIVE-27347, where stats supposed to be taken from snapshot |
@@ -442,10 +446,11 @@ public Map<String, String> getBasicStatistics(Partish partish) { | |||
org.apache.hadoop.hive.ql.metadata.Table hmsTable = partish.getTable(); | |||
// For write queries where rows got modified, don't fetch from cache as values could have changed. | |||
Table table = getTable(hmsTable); | |||
Snapshot snapshot = getSpecificSnapshot(partish.getTable(), table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we move snapshot fetch under the if
Acutually, HIVE-27347 get row count stats from table's parameter Check the code: hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/StatsOptimizer.java Lines 940 to 942 in a57e580
|
@@ -2173,4 +2179,32 @@ public List<FileStatus> getMergeTaskInputFiles(Properties properties) throws IOE | |||
public MergeTaskProperties getMergeTaskProperties(Properties properties) { | |||
return new IcebergMergeTaskProperties(properties); | |||
} | |||
|
|||
private Snapshot getSpecificSnapshot(org.apache.hadoop.hive.ql.metadata.Table hmsTable, Table table) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please move this function to IcebergTableUtil
Snapshot snapshot; | ||
if (refName != null) { | ||
snapshot = table.snapshot(refName); | ||
} else if (hmsTable.getAsOfTimestamp() != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about as of tag
?
@@ -2173,4 +2179,32 @@ public List<FileStatus> getMergeTaskInputFiles(Properties properties) throws IOE | |||
public MergeTaskProperties getMergeTaskProperties(Properties properties) { | |||
return new IcebergMergeTaskProperties(properties); | |||
} | |||
|
|||
private Snapshot getSpecificSnapshot(org.apache.hadoop.hive.ql.metadata.Table hmsTable, Table table) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getTableSnapshot()
Matcher ref = SNAPSHOT_REF.matcher(refName); | ||
if (ref.matches()) { | ||
return ref.group(1); | ||
if (refName != null && !refName.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (StringUtils.isEmpty(refName)) {
return null;
}
Matcher ref = SNAPSHOT_REF.matcher(refName);
return ref.matches()? ref.group(1) : null;
@@ -943,6 +944,11 @@ private Long getRowCnt( | |||
} | |||
} else { // unpartitioned table | |||
if (!StatsUtils.areBasicStatsUptoDateForQueryAnswering(tbl, tbl.getParameters())) { | |||
if (MetaStoreUtils.isNonNativeTable(tbl.getTTable()) | |||
&& tbl.getStorageHandler().canComputeQueryUsingStats(tbl)) { | |||
return Long.valueOf(tbl.getStorageHandler().getBasicStatistics(Partish.buildFor(tbl)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can get NullPointer when statsSource != ICEBERG
@zhangbutao, do you know if iceberg provides partition row_count stats?
|
IMO, iceberg's partition stats feature is in development. Such as https://github.com/apache/iceberg/pull/9170/files. I will try to play meta table to get partition stats. |
What changes were proposed in this pull request?
At present, in case of
iceberg.hive.keep.stats=true
&hive.compute.query.using.stats=true
, HS2 will do a fetch task to get iceberg table'snumRows
property from HMS to optimizecount
query.If
iceberg.hive.keep.stats=false
, HS2 will always launch tez task to compute table's row count when filing acount
query.However, as we know, iceberg table's metadata has some stats information, we can also just start a fetch task to retrieve the row count from iceberg's snapshot summary when
iceberg.hive.keep.stats=false
or no stats stored in hms. This can avoid launching tez task to compute the table's row count.BTW, timetravel or branch/tag has different stats from current snapshot, so we need to get the specified snapshotid based on the different iceberg version. Otherwise, we will get the wrong stats when querying the time travel/branch/tag.
Why are the changes needed?
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
Qtest