Describe the problem you faced

We've had a Hudi pipeline running for about a year with the cleaner disabled for ~150 tables. After enabling cleaning, all but one of the tables ran the cleaning operation successfully; the remaining table fails consistently with an OutOfMemoryError while serializing the cleaning plan.

Table dimensions in storage:

Async cleaner job:

spark-submit --master yarn --deploy-mode cluster --class org.apache.hudi.utilities.HoodieCleaner --jars /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar /usr/lib/hudi/hudi-utilities-bundle.jar --target-base-path s3://bucket/table.all_hudi/ --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS --hoodie-conf hoodie.cleaner.commits.retained=30 --hoodie-conf hoodie.cleaner.parallelism=640 --hoodie-conf hoodie.keep.min.commits=40 --hoodie-conf hoodie.keep.max.commits=50 --spark-master yarn

Cluster configs:

Spark History Server:
Shortly after this stage completes, the job fails and the stacktrace below is logged on the driver.
Ganglia shows the nodes are under-utilized, with memory usage peaking at roughly a quarter of the total allocated memory.
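One possible reading of the under-utilization (an assumption on my part, inferred only from the line numbers in the trace): OutOfMemoryError thrown with no message from ByteArrayOutputStream.hugeCapacity is the JDK's max-array-size guard, not heap exhaustion. It fires when the buffer-doubling in grow() overflows int, i.e. when the serialized payload approaches 2 GB, regardless of how much driver memory is free. A minimal sketch mirroring the JDK 8 logic:

```java
// A sketch (not Hudi code) of the JDK 8 ByteArrayOutputStream growth guard that
// appears at the top of the stacktrace. An OutOfMemoryError here means the
// requested buffer size overflowed int (the payload passed ~2 GB), not that the
// heap was exhausted -- which would explain Ganglia showing plenty of free memory.
public class HugeCapacityDemo {
    private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    // Mirrors java.io.ByteArrayOutputStream.hugeCapacity (JDK 8).
    static int hugeCapacity(int minCapacity) {
        if (minCapacity < 0) // int overflow: more than 2^31 - 1 bytes requested
            throw new OutOfMemoryError();
        return (minCapacity > MAX_ARRAY_SIZE) ? Integer.MAX_VALUE : MAX_ARRAY_SIZE;
    }

    public static void main(String[] args) {
        // grow() doubles the buffer; doubling a ~1.2 GB buffer overflows int.
        int oldCapacity = 1_200_000_000;
        int newCapacity = oldCapacity << 1; // negative after overflow
        try {
            hugeCapacity(newCapacity);
            System.out.println("grew buffer");
        } catch (OutOfMemoryError e) {
            System.out.println("OutOfMemoryError regardless of free heap");
        }
    }
}
```

If this reading is right, raising driver memory alone would not change the outcome; only shrinking the serialized plan would.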
To Reproduce
Steps to reproduce the behavior:
1. Generate a table of similar dimensions with many updates, producing a large cleaner plan.
2. Run the cleaner asynchronously.
3. Observe an OOM when the cleaner plan is serialized.
Expected behavior
The table is cleaned, or at least the driver uses all of its allocated memory before OOMing.
Environment Description
Hudi version : 0.13.1-amzn-0
Spark version : 3.4.0
Hive version : 3.1.3
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
Larger tables with more partitions generated their cleaning plans fine, which we found strange.
We also tried shrinking the plan by retaining more commits (60 retained), but received the same error.
Running the cleaner synchronously with the ingestion job also produced driver OOM errors.
Stacktrace
24/05/15 19:01:31 ERROR HoodieCleaner: Fail to run cleaning for s3://bucket/table.all_hudi/
java.lang.OutOfMemoryError: null
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) ~[?:1.8.0_412]
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) ~[?:1.8.0_412]
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) ~[?:1.8.0_412]
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) ~[?:1.8.0_412]
at org.apache.avro.io.DirectBinaryEncoder.writeFixed(DirectBinaryEncoder.java:124) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:57) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.io.Encoder.writeString(Encoder.java:130) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:392) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.specific.SpecificDatumWriter.writeString(SpecificDatumWriter.java:76) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:165) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:159) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:108) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:234) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.specific.SpecificDatumWriter.writeRecord(SpecificDatumWriter.java:92) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:145) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:288) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:151) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:347) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:154) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:159) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:108) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:234) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.specific.SpecificDatumWriter.writeRecord(SpecificDatumWriter.java:92) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:145) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82) ~[avro-1.11.1.jar:1.11.1]
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:314) ~[avro-1.11.1.jar:1.11.1]
at org.apache.hudi.common.table.timeline.TimelineMetadataUtils.serializeAvroMetadata(TimelineMetadataUtils.java:159) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.common.table.timeline.TimelineMetadataUtils.serializeCleanerPlan(TimelineMetadataUtils.java:114) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:158) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.execute(CleanPlanActionExecutor.java:176) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleCleaning(HoodieSparkCopyOnWriteTable.java:198) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.client.BaseHoodieTableServiceClient.scheduleTableServiceInternal(BaseHoodieTableServiceClient.java:433) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:546) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:766) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:738) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:770) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.utilities.HoodieCleaner.run(HoodieCleaner.java:69) ~[__app__.jar:0.13.1-amzn-0]
at org.apache.hudi.utilities.HoodieCleaner.main(HoodieCleaner.java:111) ~[__app__.jar:0.13.1-amzn-0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_412]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_412]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_412]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_412]
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:760) ~[spark-yarn_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
The error seems to originate in org.apache.hudi.common.table.timeline.TimelineMetadataUtils.serializeCleanerPlan (TimelineMetadataUtils.java:114).
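For scale intuition (all numbers below are hypothetical illustrations, not taken from this table): the cleaner plan carries roughly one entry per file to delete, so the Avro payload grows with file count times path length, and a never-cleaned year-old table could plausibly cross the 2 GB byte-array limit:

```java
// Back-of-the-envelope estimate of a serialized cleaner-plan size.
// The file count and path length are made-up illustration values.
public class PlanSizeEstimate {
    public static void main(String[] args) {
        long filesToDelete = 15_000_000L; // hypothetical: a year of uncleaned commits
        long bytesPerEntry = 150L;        // hypothetical: average S3 key length per file
        long approxBytes = filesToDelete * bytesPerEntry;
        System.out.println("approx plan size: " + approxBytes + " bytes");
        System.out.println(approxBytes > Integer.MAX_VALUE
                ? "exceeds the 2 GB byte-array limit"
                : "fits in one byte array");
    }
}
```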
Looking for assistance in properly configuring the memory settings for this. Thanks so much!
Briefly chatted in office hours: this is likely caused by clean planning loading archived commits, which are about 500 MB each on storage, since the clean had never run before. The active timeline only has about 70-80 commits.