[SUPPORT] RLI index slowing down #11243
Spark UI files: Uploading DOC-20240516-WA0005.zip…
@manishgaurav84 Not sure why I couldn't download the event logs. Can you ping me on Slack and share them there as well?
@ad1happy2go I have provided the logs in a Slack message.
Have you tried the async way?
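The "async way" presumably refers to Hudi's asynchronous metadata indexing, which builds index partitions (such as the record index) out of band instead of inline with every write. A minimal sketch of the relevant options, assuming Hudi 0.14 config key names; the surrounding job wiring is hypothetical and not taken from this issue:

```python
# Hedged sketch: options to move metadata-index maintenance off the
# write path. Keys are from the Hudi 0.14 configuration reference;
# how they are merged into the job's options is an assumption.
async_index_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.metadata.record.index.enable": "true",
    # Build/maintain metadata indexes asynchronously (e.g. via the
    # HoodieIndexer) rather than synchronously inside each commit.
    "hoodie.metadata.index.async": "true",
}

# The writer would merge these into its existing Hudi options, e.g.:
# df.write.format("hudi").options(**async_index_options)...
```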
Tips before filing an issue
Have you gone through our FAQs?
Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
A MongoDB table is synced to S3 by an AWS DMS CDC pipeline using a Glue job.
The job execution time increases by about 50% after a few runs.
Table Stats:

Jars used:
- hudi-spark3.3-bundle_2.12-0.14.0.jar
- hudi-aws-0.14.0.jar
- httpclient-4.5.14.jar
- spark-avro_2.12-3.5.0.jar
To Reproduce
Steps to reproduce the behavior:
HUDI table configuration
```python
'hoodie.table.name': 'appsflyerevents',
'hoodie.datasource.write.precombine.field': 'upsert_ts',
'hoodie.datasource.write.recordkey.field': 'oid__id',
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.table': 'appsflyerevents',
'hoodie.datasource.hive_sync.database': 'origin',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.datasource.hive_sync.partition_fields': 'creation_month',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.write.partitionpath.field': 'creation_month',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
'hoodie.cleaner.fileversions.retained': 1,
'hoodie.upsert.shuffle.parallelism': 152,
'hoodie.index.type': 'RECORD_INDEX',
'hoodie.metadata.record.index.enable': 'true',
'hoodie.metadata.record.index.growth.factor': 10,
'hoodie.metadata.record.index.max.filegroup.count': 20000,
'hoodie.metadata.record.index.min.filegroup.count': 1000,
'hoodie.metadata.record.index.max.filegroup.size': 536870912,
'hoodie.metadata.enable': 'true',
'hoodie.parquet.small.file.limit': -1,
'hoodie.metadata.clean.async': 'true',
'hoodie.metadata.keep.min.commits': '4',
'hoodie.metadata.keep.max.commits': '5',
'hoodie.datasource.meta.sync.glue.metadata_file_listing': 'true'
```
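For context, a minimal sketch of how options like those above are typically passed to a Hudi upsert write from a Glue PySpark job. This is an assumption about the job's shape, not its actual code; `cdc_df` and the S3 path are hypothetical placeholders:

```python
# Hedged sketch of the write path (assumption: a Glue PySpark job
# writing DMS CDC batches). Only a representative subset of the
# configuration is shown; keys/values are copied from the issue.
hudi_options = {
    "hoodie.table.name": "appsflyerevents",
    "hoodie.datasource.write.recordkey.field": "oid__id",
    "hoodie.datasource.write.precombine.field": "upsert_ts",
    "hoodie.datasource.write.partitionpath.field": "creation_month",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.metadata.record.index.enable": "true",
    "hoodie.metadata.enable": "true",
}

# Hypothetical write call; cdc_df and the bucket are placeholders:
# cdc_df.write.format("hudi") \
#     .options(**hudi_options) \
#     .mode("append") \
#     .save("s3://<bucket>/origin/appsflyerevents")
```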
Expected behavior
The execution time should remain consistent and is not expected to increase significantly.
Environment Description
Hudi version : 0.14
Spark version : 3.3
Hive version : NA
Hadoop version : NA
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : No
Additional context
Please find the spark UI attached
Stacktrace : NA