unable to create target format Delta with source format as Iceberg when the source table is on S3 #431

rajender07 · 2024-04-30T15:40:18Z

I followed the documentation "Creating your first interoperable table" and was able to build utilities-0.1.0-SNAPSHOT-bundled.jar successfully.

I started a pyspark session with the command below. The Spark version is 3.4.1, running on Amazon EMR 6.14.

pyspark --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog" --conf "spark.sql.catalog.spark_catalog.type=hive"

Created an Iceberg table using the commands below:

# Sample rows: (firstname, lastname, dob, gender, salary), all stored as strings.
data = [
    ("James", "Smith", "01012020", "M", "3000"),
    ("Michael", "", "02012021", "M", "4000"),
    ("Robert", "Williams", "03012023", "M", "4000"),
    ("Maria", "Jones", "04012024", "F", "4000"),
    ("Jen", "Brown", "05012025", "F", "-1"),
]

columns = ["firstname", "lastname", "dob", "gender", "salary"]

df = spark.createDataFrame(data, columns)

# Create the table through the session catalog (Hive type, per the pyspark command above).
spark.sql("""CREATE TABLE IF NOT EXISTS iceberg_table (firstname string,lastname string,dob string,gender string,salary string) USING iceberg""")

# Append the sample rows to the Iceberg table.
df.writeTo("iceberg_table").append()

I see the data and metadata directories under the table name on S3.

Created my_config.yaml as mentioned in the documentation:
my_config.txt

Executed the command below; it fails because metadata/version-hint.text is not found:
sudo java -jar ./utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml

2024-04-30 10:24:25 INFO org.apache.xtable.conversion.ConversionController:240 - No previous InternalTable sync for target. Falling back to snapshot sync.
2024-04-30 10:24:25 WARN org.apache.iceberg.hadoop.HadoopTableOperations:325 - Error reading version hint file s3:///iceberg_table_1/metadata/version-hint.text
java.io.FileNotFoundException: No such file or directory: s3:////iceberg_table_1/metadata/version-hint.text
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3801) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3652) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
at org.apache.hadoop.fs.s3a.S3AFileSystem.extractOrFetchSimpleFileStatus(S3AFileSystem.java:5288) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$executeOpen$6(S3AFileSystem.java:1578) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499) ~[utilities-0.1.0-SNAPSHOT-bundled.jar:0.1.0-SNAPSHOT]

rajender07 · 2024-04-30T18:35:51Z

I looked into the documentation and understand that if the source is an Iceberg table, I need to include a catalog.yaml as well.
However, I am not sure what the value of catalogImpl should be in my case. Any insights on this would be very helpful.

catalogImpl: io.my.CatalogImpl
catalogName: name
catalogOptions: # all other options are passed through in a map
  key1: value1
  key2: value2

dipankarmazumdar · 2024-05-10T15:52:26Z

Hi @rajender07! The error clarifies the problem. It says the version-hint.text file was not found in the source table format (Iceberg). Do you see it on S3?
This is the metadata pointer file that Iceberg writes when the table is managed by a Hadoop catalog. XTable needs this file to translate the table into the target Delta format.
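
To make the role of version-hint.text concrete, here is a minimal sketch of how a Hadoop-catalog table is resolved: the hint file holds a version number N, and the current metadata is metadata/vN.metadata.json. The sketch works on a throwaway local directory instead of S3, and the function name `current_metadata_file` is made up for illustration; it is not XTable's or Iceberg's actual code.

```python
import tempfile
from pathlib import Path


def current_metadata_file(table_dir: Path) -> Path:
    """Return the metadata JSON file that version-hint.text points at."""
    hint = table_dir / "metadata" / "version-hint.text"
    if not hint.exists():
        # Same situation as the FileNotFoundException in the log above.
        raise FileNotFoundError(f"No such file or directory: {hint}")
    version = int(hint.read_text().strip())
    return table_dir / "metadata" / f"v{version}.metadata.json"


# Build a temporary directory that mimics a Hadoop-catalog table layout.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "iceberg_table"
    (table / "metadata").mkdir(parents=True)
    (table / "metadata" / "v1.metadata.json").write_text("{}")
    (table / "metadata" / "v2.metadata.json").write_text("{}")
    (table / "metadata" / "version-hint.text").write_text("2")
    resolved_name = current_metadata_file(table).name
    print(resolved_name)
```

A table created through a Hive catalog keeps its current-metadata pointer in the metastore instead, so no version-hint.text is written and a path-only lookup like this one has nothing to read.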

The important part to understand here is that Iceberg needs a catalog to get started. Your config currently connects Iceberg to a Hive catalog, but I don't see a Thrift URL or anything similar here:
pyspark --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog" --conf "spark.sql.catalog.spark_catalog.type=hive"

Can you instead use a Hadoop catalog & configure with something like this:

spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_prod.type = hadoop
spark.sql.catalog.hadoop_prod.warehouse = s3a://your-bucket
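
If it helps, the three properties above can be turned into a pyspark launch command with a small helper. This is only a sketch: the helper names are made up, `hadoop_prod` and `s3a://your-bucket` are the placeholders from the properties above, and an EMR setup may need additional settings.

```python
import shlex


def hadoop_catalog_conf(catalog_name: str, warehouse: str) -> dict:
    """Spark properties for an Iceberg Hadoop catalog (hypothetical helper)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse,
    }


def pyspark_command(conf: dict) -> str:
    """Render the properties as a shell-safe `pyspark --conf ...` command line."""
    args = ["pyspark"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return shlex.join(args)


conf = hadoop_catalog_conf("hadoop_prod", "s3a://your-bucket")
command = pyspark_command(conf)
print(command)
```

Tables then need to be created under that catalog name (for example `hadoop_prod.<table>`) so that they land in the configured warehouse.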

rajender07 · 2024-05-11T13:17:37Z

@dipankarmazumdar , Thank you for looking into the issue.
No, I do not see the version-hint.text file on S3. From the documentation, I understand that this file is created only when using a Hadoop catalog. Since I was using the Iceberg session catalog, it was not generated.

I will try the Hadoop catalog as you suggested and let you know the findings.

Could you please guide me on how to solve the issue while using the Iceberg catalog? Should I use a catalog.yaml file? If yes, I am confused about which catalogName should be used. FYI, I have added Thrift-related properties under /etc/spark/conf/spark-default.conf and /etc/spark/conf/hive-site.xml. I have no issues connecting to my metastore and reading/writing data from it.

the-other-tim-brown · 2024-05-13T02:52:30Z

@rajender07 Which catalog are you using? If it is HMS, the implementation is org.apache.iceberg.hive.HiveCatalog. The catalog name and the other options are used to supply whatever configuration the catalog requires, such as the URI of your Thrift server.
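
As a rough illustration of what that could look like, the sketch below renders a catalog.yaml for HMS using the three keys from the template earlier in the thread. The catalog name `hms` and the Thrift address `thrift://metastore-host:9083` are hypothetical placeholders, `uri` is the standard Iceberg catalog property for the metastore address, and the helper `render_catalog_yaml` is made up for this example.

```python
def render_catalog_yaml(catalog_impl: str, catalog_name: str, options: dict) -> str:
    """Render the catalog config as YAML text (hypothetical helper, stdlib only)."""
    lines = [
        f"catalogImpl: {catalog_impl}",
        f"catalogName: {catalog_name}",
        "catalogOptions:",
    ]
    # Every option is passed through to the catalog as a key/value pair.
    lines += [f"  {key}: {value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


catalog_yaml = render_catalog_yaml(
    "org.apache.iceberg.hive.HiveCatalog",
    "hms",  # hypothetical catalog name
    {"uri": "thrift://metastore-host:9083"},  # hypothetical metastore address
)
print(catalog_yaml)
```

The XTable documentation describes passing such a file with `--icebergCatalogConfig catalog.yaml` next to `--datasetConfig`; please verify the flag against the build you are running.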

rajender07 · 2024-05-14T00:04:21Z

@dipankarmazumdar @the-other-tim-brown

I used a Hadoop catalog as you suggested and created a new Iceberg table. Now I can see the version-hint.text file as well.

However, when I executed the sync command, it failed with the error below. Could you please advise how to resolve this issue?
sudo java -jar ./utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml

2024-05-13 13:43:04 INFO org.apache.xtable.conversion.ConversionController:240 - No previous InternalTable sync for target. Falling back to snapshot sync.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(Lorg/apache/hadoop/fs/statistics/DurationTracker;Lorg/apache/hadoop/util/functional/CallableRaisingIOE;)Ljava/lang/Object;

Here is my my_config.yaml

sourceFormat: ICEBERG
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: s3:////x4_iceberg_table
    tableName: x4_iceberg_table

dipankarmazumdar · 2024-05-14T15:57:58Z

@rajender07 - I am not really sure about this particular error. However, I tried reproducing this on my end and I was able to translate from ICEBERG to DELTA using the setup I suggested.

ICEBERG TABLE CONFIG & CREATION:

import pyspark
from pyspark.sql import SparkSession
import os
conf = (
    pyspark.SparkConf()
        .setAppName('app_name')
        .set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.3.4,org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178')
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
        .set('spark.sql.catalog.hdfs_catalog', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.hdfs_catalog.type', 'hadoop')
        .set('spark.sql.catalog.hdfs_catalog.warehouse', 's3a://my-bucket/new_iceberg/')
        .set('spark.sql.catalog.hdfs_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")
spark.sql("CREATE TABLE hdfs_catalog.table1 (name string) USING iceberg")
spark.sql("INSERT INTO hdfs_catalog.table1 VALUES ('Alex'), ('Dipankar'), ('Mary')")

my_config.yaml

sourceFormat: ICEBERG
targetFormats:
  - DELTA
datasets:
  -
    tableBasePath: s3://my-bucket/new_iceberg/table1/
    tableDataPath: s3://my-bucket/new_iceberg/table1/data
    tableName: table1

Run Sync

java -jar utilities/target/utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml
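
For reference, the relationship between the warehouse location and the paths in my_config.yaml above can be sketched as follows: tableBasePath is the warehouse plus the table name, and tableDataPath is the data directory under it. The helper `dataset_config` is made up for illustration, and it assumes the table sits directly under the warehouse root as in my setup.

```python
def dataset_config(warehouse: str, table_name: str, targets=("DELTA",)) -> str:
    """Render a dataset config for an Iceberg source table (hypothetical helper)."""
    # Assumes the table directory sits directly under the warehouse root.
    base = f"{warehouse.rstrip('/')}/{table_name}"
    lines = ["sourceFormat: ICEBERG", "targetFormats:"]
    lines += [f"  - {target}" for target in targets]
    lines += [
        "datasets:",
        "  -",
        f"    tableBasePath: {base}/",
        f"    tableDataPath: {base}/data",
        f"    tableName: {table_name}",
    ]
    return "\n".join(lines) + "\n"


config_text = dataset_config("s3://my-bucket/new_iceberg/", "table1")
print(config_text)
```

Note that the example passes the `s3://` form of the warehouse URI, matching the config above, while the Spark session was configured with `s3a://`.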

amnchauhan · 2024-05-16T14:53:59Z

Quoting @the-other-tim-brown: "@rajender07 Which catalog are you using? If it is HMS, the implementation is org.apache.iceberg.hive.HiveCatalog, the other args and name are going to be used to configure any required configurations for using this catalog like a uri for your thrift server."
@the-other-tim-brown Regarding this: when I use a Hive catalog and pass catalogImpl: org.apache.iceberg.hive.HiveCatalog, I get java.lang.NoSuchMethodException: Cannot find constructor for interface org.apache.iceberg.catalog.Catalog, whereas with org.apache.iceberg.hadoop.HadoopCatalog I get no such error. Is there anything else we need to implement when using a Hive catalog for our Iceberg tables?

dipankarmazumdar · 2024-05-23T19:39:30Z

@rajender07 - LMK if you were able to get past the error with the recommendation.
