[SUPPORT] Async cleaner OOM when serializing cleaner plan #11248

Open
noahtaite opened this issue May 16, 2024 · 2 comments

noahtaite commented May 16, 2024

Describe the problem you faced

We've had a Hudi pipeline covering ~150 tables running for about a year without the cleaner enabled. After enabling cleaning, all but one of my tables ran the cleaning operation successfully, but this table fails consistently with an OutOfMemoryError when serializing the cleaning plan.

Table dimensions in storage:

  • 2200 partitions
  • 1.5 TB
  • 2M S3 objects

Async cleaner job:

spark-submit --master yarn --deploy-mode cluster --class org.apache.hudi.utilities.HoodieCleaner --jars /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-spark-bundle.jar /usr/lib/hudi/hudi-utilities-bundle.jar --target-base-path s3://bucket/table.all_hudi/ --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS --hoodie-conf hoodie.cleaner.commits.retained=30 --hoodie-conf hoodie.cleaner.parallelism=640 --hoodie-conf hoodie.keep.min.commits=40 --hoodie-conf hoodie.keep.max.commits=50 --spark-master yarn

Cluster configs:

spark.driver.memory	219695M
spark.executor.cores	32
spark.executor.memory	218880M
spark.executor.memoryOverheadFactor	0.1
spark.executor.instances	10

Spark History Server: [screenshot]

After this stage completes, the job fails almost immediately, with the stacktrace below logged on the driver.

Ganglia shows my nodes being under-utilized, with memory maxing out around 1/4 of the total allocated memory: [screenshot]

To Reproduce

Steps to reproduce the behavior:

  1. Generate a table with similar dimensions and many updates, so the cleaner plan is large.
  2. Run the cleaner asynchronously.
  3. Observe the OOM when the cleaner plan is serialized.

Expected behavior

The clean should succeed, or at least use all of the driver memory before OOMing.

Environment Description

  • Hudi version : 0.13.1-amzn-0

  • Spark version : 3.4.0

  • Hive version : 3.1.3

  • Hadoop version : 3.3.3

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

Larger tables with more partitions were able to generate the cleaning plan fine, which we thought was strange.

We also tried reducing the size of the plan by retaining more commits (60 retained) but still received the same error.

Note that I also tried running cleaner synchronously with my ingestion job but also received driver OOM errors.

Stacktrace

24/05/15 19:01:31 ERROR HoodieCleaner: Fail to run cleaning for s3://bucket/table.all_hudi/
java.lang.OutOfMemoryError: null
	at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) ~[?:1.8.0_412]
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) ~[?:1.8.0_412]
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) ~[?:1.8.0_412]
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) ~[?:1.8.0_412]
	at org.apache.avro.io.DirectBinaryEncoder.writeFixed(DirectBinaryEncoder.java:124) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:57) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.io.Encoder.writeString(Encoder.java:130) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:392) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.specific.SpecificDatumWriter.writeString(SpecificDatumWriter.java:76) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:165) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:159) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:108) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:234) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.specific.SpecificDatumWriter.writeRecord(SpecificDatumWriter.java:92) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:145) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:288) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:151) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:347) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:154) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:159) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:108) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:234) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.specific.SpecificDatumWriter.writeRecord(SpecificDatumWriter.java:92) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:145) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:95) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:82) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:314) ~[avro-1.11.1.jar:1.11.1]
	at org.apache.hudi.common.table.timeline.TimelineMetadataUtils.serializeAvroMetadata(TimelineMetadataUtils.java:159) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.common.table.timeline.TimelineMetadataUtils.serializeCleanerPlan(TimelineMetadataUtils.java:114) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:158) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.execute(CleanPlanActionExecutor.java:176) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleCleaning(HoodieSparkCopyOnWriteTable.java:198) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.client.BaseHoodieTableServiceClient.scheduleTableServiceInternal(BaseHoodieTableServiceClient.java:433) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:546) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:766) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:738) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:770) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.utilities.HoodieCleaner.run(HoodieCleaner.java:69) ~[__app__.jar:0.13.1-amzn-0]
	at org.apache.hudi.utilities.HoodieCleaner.main(HoodieCleaner.java:111) ~[__app__.jar:0.13.1-amzn-0]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_412]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_412]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_412]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_412]
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:760) ~[spark-yarn_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]

The error seems to be happening at org.apache.hudi.common.table.timeline.TimelineMetadataUtils.serializeCleanerPlan(TimelineMetadataUtils.java:114) ~[__app__.jar:0.13.1-amzn-0]; a rough sketch of that code path is below.
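
For reference, that serialization path buffers the entire Avro-encoded plan in a single in-memory byte array before writing it to the timeline. A rough paraphrase of TimelineMetadataUtils.serializeAvroMetadata as it appears around 0.13.x (a simplified sketch, not the exact source; T is the generated Avro class, HoodieCleanerPlan in this case):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.avro.specific.SpecificRecordBase;

public class SerializeSketch {
    // Simplified paraphrase: the whole metadata record is Avro-encoded into a
    // single ByteArrayOutputStream, so the serialized plan must fit in one
    // byte[], which the JVM caps near Integer.MAX_VALUE (~2 GB).
    public static <T extends SpecificRecordBase> byte[] serializeAvroMetadata(
            T metadata, Class<T> clazz) throws IOException {
        DatumWriter<T> datumWriter = new SpecificDatumWriter<>(clazz);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (DataFileWriter<T> fileWriter = new DataFileWriter<>(datumWriter)) {
            fileWriter.create(metadata.getSchema(), baos); // Avro container header
            fileWriter.append(metadata); // DataFileWriter.append, the frame in the stacktrace
            fileWriter.flush();
        }
        return baos.toByteArray();
    }
}

With 2,200 partitions and ~2M objects, a plan listing every file to delete could plausibly exceed that single-array limit even though the driver heap is huge.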

Looking for assistance in properly configuring the memory settings for this. Thanks so much!

@xushiyan
Member

Briefly chatted in office hours: this is likely caused by clean planning loading archived commits, which are about 500 MB each on storage, since the clean has never run before. The active timeline only has about 70-80 commits. A quick way to check the archived-timeline size is sketched below.
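
A hypothetical diagnostic sketch for that check, using the Hadoop FileSystem API (the table path is the one from this issue; .hoodie/archived is Hudi's standard archived-timeline folder, and the s3:// scheme assumes an S3 filesystem implementation such as EMRFS is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchivedTimelineSize {
    public static void main(String[] args) throws Exception {
        // Hudi keeps archived commit metadata under <base-path>/.hoodie/archived.
        Path archived = new Path("s3://bucket/table.all_hudi/.hoodie/archived");
        FileSystem fs = archived.getFileSystem(new Configuration());
        long totalBytes = 0L;
        int files = 0;
        for (FileStatus status : fs.listStatus(archived)) {
            totalBytes += status.getLen();
            files++;
        }
        System.out.printf("%d archived files, %.2f GB total%n", files, totalBytes / 1e9);
    }
}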

@noahtaite
Author

Is this related? https://stackoverflow.com/questions/53462161/java-lang-outofmemoryerror-when-plenty-of-memory-left-94gb-200gb-xmx

I've tried changing my off-heap memory, memory overhead, and instance size, but the driver still OOMs at ~12 GB of memory usage.
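
The stacktrace suggests it is: it ends in ByteArrayOutputStream.hugeCapacity, which in JDK 8 throws a message-less OutOfMemoryError (hence "java.lang.OutOfMemoryError: null" in the log) once a write would push the buffer past Integer.MAX_VALUE bytes, regardless of how much heap is free. A minimal standalone sketch reproducing that failure mode (hypothetical demo, not Hudi code; run with a heap large enough, e.g. -Xmx8g, that plain heap exhaustion doesn't hit first):

import java.io.ByteArrayOutputStream;

public class HugeCapacityRepro {
    public static void main(String[] args) {
        // ByteArrayOutputStream backs its buffer with a single byte[]. When a
        // write pushes the required capacity past Integer.MAX_VALUE, the JDK 8
        // grow()/hugeCapacity() path throws a message-less OutOfMemoryError,
        // no matter how much heap remains free.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] chunk = new byte[64 * 1024 * 1024]; // write 64 MB at a time
        long written = 0;
        while (true) {
            baos.write(chunk, 0, chunk.length); // OOMs near the 2 GB mark
            written += chunk.length;
            System.out.println("buffered bytes: " + written);
        }
    }
}

If that's what is happening here, it would explain OOMing at ~12 GB of usage on a ~200 GB driver: the limit being hit is the maximum size of a single Java array backing the clean-plan buffer, not the heap itself, so adding driver memory wouldn't help.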
