
Upgrade HadoopTableOperations.version from int32 to long64 #10277

Open
jkolash opened this issue May 6, 2024 · 4 comments
Labels
improvement PR that improves existing functionality

Comments

jkolash (Contributor) commented May 6, 2024

Feature Request / Improvement

We are using the Hadoop catalog and have encountered tables written by a third party that encode the version-hint.text value as a number larger than int32 supports.

I can provide a PR if desired; the changes are all isolated to HadoopTableOperations.

The only issue I encountered was that if the Spark driver and worker Iceberg jars were not the same version, we'd hit serialization issues, but that is very often the case anyway when upgrading libraries.
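A minimal sketch of the kind of change being proposed (the method name `parseVersionHint` is hypothetical, not the actual Iceberg code; HadoopTableOperations reads the version hint from version-hint.text):

```java
// Sketch: parse the version hint as a long instead of an int.
// Long.parseLong accepts values up to 2^63 - 1, which covers
// nanosecond-epoch versions like 1715003877288000000, while
// Integer.parseInt would throw NumberFormatException on them.
public class VersionHintParser {
    static long parseVersionHint(String text) {
        return Long.parseLong(text.trim());
    }

    public static void main(String[] args) {
        long v = parseVersionHint("1715003877288000000");
        System.out.println(v > Integer.MAX_VALUE); // prints "true"
    }
}
```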

Query engine

Spark

@jkolash jkolash added the improvement PR that improves existing functionality label May 6, 2024
nastra (Contributor) commented May 7, 2024

@jkolash can you share a few more details about the 3rd party that is writing this? It would be good to know why this 3rd party writes this as a long instead of an int.

jkolash (Contributor, Author) commented May 9, 2024

In this case it is data written by Snowflake. It looks like the version is a timestamp rather than an auto-incrementing counter.

test_snowflake_table
test_snowflake_table/data
test_snowflake_table/data/snow_CYr21sbt9Ps_ALiJR-PqzBc_0_2_002.parquet
test_snowflake_table/metadata
test_snowflake_table/metadata/version-hint.text
test_snowflake_table/metadata/v1715003877288000000.metadata.json
test_snowflake_table/metadata/1715003877288000000-_0LSkJly75ls9-Mfg7ymhA.avro
test_snowflake_table/metadata/snap-1715003877288000000.avro
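The version `1715003877288000000` in the listing above looks like nanoseconds since the Unix epoch; a quick check (assuming that interpretation) decodes it to a date in May 2024, consistent with when the table was written:

```java
import java.time.Instant;

public class SnowflakeVersionCheck {
    public static void main(String[] args) {
        long version = 1715003877288000000L; // from the file listing above

        // Interpreting the version as nanoseconds since the Unix epoch:
        Instant ts = Instant.ofEpochSecond(version / 1_000_000_000L,
                                           version % 1_000_000_000L);
        System.out.println(ts); // a 2024-05-06 timestamp

        // Far too large for a 32-bit int:
        System.out.println(version > Integer.MAX_VALUE); // prints "true"
    }
}
```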

We aren't particularly interested in writing data to snowflake. But we are interested in using the hadoop catalog to read data after it has landed on s3. Our goal is to be able to simply have snowflake write the data to s3 without needing to connect to the snowflake catalog. Then just use s3 after data has been delivered so we don't have to "know" it is from snowflake.

I've verified that I can query the table via Spark 3.4 once I switch the version from int32 to long64.

nastra (Contributor) commented May 10, 2024

@jkolash you might want to report this to Snowflake, as the version should currently be an int rather than a long to comply with the implementation in Iceberg.

jkolash (Contributor, Author) commented May 14, 2024

I'm looking at how other implementations read version-hint.text and whether values wider than int32 would break them.

duckdb

https://github.com/duckdb/duckdb_iceberg/blob/main/src/common/iceberg.cpp#L220C25-L220C40 reads the version as a string and appears to make no assumptions about its numeric type.
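The string-based approach can be sketched as follows (a hypothetical helper, not duckdb's actual code; the table location is illustrative): by treating the hint as an opaque string and splicing it directly into the metadata file name, any numeric width works without ever parsing the value.

```java
public class StringVersionHint {
    // Treat the version hint as an opaque string and build the
    // metadata path directly, so int32 vs int64 never matters.
    static String metadataFile(String tableLocation, String versionHint) {
        return tableLocation + "/metadata/v" + versionHint + ".metadata.json";
    }

    public static void main(String[] args) {
        // Matches the Snowflake-written file name seen earlier in the thread.
        System.out.println(metadataFile("s3://bucket/test_snowflake_table",
                                        "1715003877288000000"));
    }
}
```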
