Currently, optimize is an all-or-nothing operation over every file in the table, unless limited by a partition filter. The partition filter lets you manually batch subsets of the table, but for tables that use clustering there are no partitions to filter on. We should add batch support inside optimize, so that chunks of optimized files can be committed to the transaction log incrementally.
Motivation
Today you could rewrite an entire petabyte of data, fail on the last file, and have all of that work be for naught, wasting a lot of compute time and storage space. With automatic batching, nearly all of the results would be committed along the way, and only the batch that failed would need to be retried.
Further details
I think this can be fairly straightforward: group the existing bins into another layer of batches, and commit each batch to the transaction log as it completes instead of committing everything at the end.
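To make the idea concrete, here is a minimal sketch of the batching loop being proposed. All names here (`chunk`, `optimize_in_batches`, the `rewrite` and `commit` callables) are hypothetical placeholders, not part of any Delta Lake API; the point is only the control flow of committing per batch so a failure loses at most one batch of work.

```python
# Illustrative sketch only -- all names are hypothetical, not Delta Lake APIs.

def chunk(items, size):
    """Group a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def optimize_in_batches(bins, batch_size, rewrite, commit):
    """Rewrite bins batch by batch, committing each batch to the
    transaction log before starting the next, so a mid-run failure
    only loses the in-flight batch rather than the whole operation."""
    committed = []
    for batch in chunk(bins, batch_size):
        actions = [rewrite(b) for b in batch]  # compact the files in each bin
        commit(actions)                        # one transaction per batch
        committed.extend(actions)
    return committed
```

If `commit` raises partway through, every batch committed before the failure is already durable in the log, which is the incremental behavior the feature request is after.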
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
- [ ] Yes. I can contribute this feature independently.
- [ ] Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
- [ ] No. I cannot contribute this feature at this time.
I think this is already a thing in Databricks, so it would be great to know if there are any plans to open-source that before I spend a bunch of time on this! @scottsand-db