
Upgrade HadoopTableOperations.version from int32 to long64 #10277

Open
jkolash opened this issue May 6, 2024 · 4 comments
Labels
improvement PR that improves existing functionality

Comments

jkolash (Contributor) commented May 6, 2024

Feature Request / Improvement

We are using the Hadoop catalog and have encountered tables written by a third party that encode the version-hint.text value as a number larger than int32 supports.

I can provide a PR if desired; the changes are all isolated to HadoopTableOperations.

The only issue I encountered was that if the Spark driver and worker Iceberg jars were not the same version, we'd hit serialization issues, but that is very often the case anyway when upgrading libraries.
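A minimal sketch of the kind of change being proposed (the method name `parseVersionHint` is hypothetical, not the actual Iceberg code; HadoopTableOperations reads the version hint from version-hint.text):

```java
// Sketch: parse the version hint as a long instead of an int.
// Long.parseLong accepts values up to 2^63 - 1, which covers
// nanosecond-epoch versions like 1715003877288000000, while
// Integer.parseInt would throw NumberFormatException on them.
public class VersionHintParser {
    static long parseVersionHint(String text) {
        return Long.parseLong(text.trim());
    }

    public static void main(String[] args) {
        long v = parseVersionHint("1715003877288000000");
        System.out.println(v > Integer.MAX_VALUE); // prints "true"
    }
}
```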

Query engine

Spark

@jkolash jkolash added the improvement PR that improves existing functionality label May 6, 2024
nastra (Contributor) commented May 7, 2024

@jkolash can you share a few more details about the 3rd party that is writing this? It would be good to know why this 3rd party writes this as a long instead of an int.

jkolash (Contributor, Author) commented May 9, 2024

In this case it is data written by Snowflake. It looks like the version is a timestamp rather than an auto-incrementing counter.

test_snowflake_table
test_snowflake_table/data
test_snowflake_table/data/snow_CYr21sbt9Ps_ALiJR-PqzBc_0_2_002.parquet
test_snowflake_table/metadata
test_snowflake_table/metadata/version-hint.text
test_snowflake_table/metadata/v1715003877288000000.metadata.json
test_snowflake_table/metadata/1715003877288000000-_0LSkJly75ls9-Mfg7ymhA.avro
test_snowflake_table/metadata/snap-1715003877288000000.avro
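The version `1715003877288000000` in the listing above looks like nanoseconds since the Unix epoch; a quick check (assuming that interpretation) decodes it to a date in May 2024, consistent with when the table was written:

```java
import java.time.Instant;

public class SnowflakeVersionCheck {
    public static void main(String[] args) {
        long version = 1715003877288000000L; // from the file listing above

        // Interpreting the version as nanoseconds since the Unix epoch:
        Instant ts = Instant.ofEpochSecond(version / 1_000_000_000L,
                                           version % 1_000_000_000L);
        System.out.println(ts); // a 2024-05-06 timestamp

        // Far too large for a 32-bit int:
        System.out.println(version > Integer.MAX_VALUE); // prints "true"
    }
}
```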

We aren't particularly interested in writing data to snowflake. But we are interested in using the hadoop catalog to read data after it has landed on s3. Our goal is to be able to simply have snowflake write the data to s3 without needing to connect to the snowflake catalog. Then just use s3 after data has been delivered so we don't have to "know" it is from snowflake.

I've verified that I can query the table via Spark 3.4 once I switch the version from int32 to long64.

nastra (Contributor) commented May 10, 2024

@jkolash you might want to report this to Snowflake, as the version should currently be an int rather than a long to comply with the implementation in Iceberg.

jkolash (Contributor, Author) commented May 14, 2024

I'm looking at how other implementations read version-hint.text and whether values wider than int32 would break them.

duckdb

https://github.com/duckdb/duckdb_iceberg/blob/main/src/common/iceberg.cpp#L220C25-L220C40 reads the version as a string and appears to make no assumptions about its numeric type.
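The string-based approach can be sketched as follows (a hypothetical helper, not duckdb's actual code; the table location is illustrative): by treating the hint as an opaque string and splicing it directly into the metadata file name, any numeric width works without ever parsing the value.

```java
public class StringVersionHint {
    // Treat the version hint as an opaque string and build the
    // metadata path directly, so int32 vs int64 never matters.
    static String metadataFile(String tableLocation, String versionHint) {
        return tableLocation + "/metadata/v" + versionHint + ".metadata.json";
    }

    public static void main(String[] args) {
        // Matches the Snowflake-written file name seen earlier in the thread.
        System.out.println(metadataFile("s3://bucket/test_snowflake_table",
                                        "1715003877288000000"));
    }
}
```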
