-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: improve merge performance by using predicate non-partition columns min/max for prefiltering #2513
base: main
Are you sure you want to change the base?
feat: improve merge performance by using predicate non-partition columns min/max for prefiltering #2513
Conversation
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
@JonasDev1 why did you make the advanced filtering optional? If this provides better performance across the board, we should enable it always (so also for python bindings) |
My concern was if you want to do e.g. merges via columns with null this would not work, but I think that it would not work without the advanced filtering either as Spark has an extra null safe operator |
What about the review? I can of course also remove the flag again |
My main issue is that it might work or not work based on the contents of the data. I think that's a bit tricky because a person needs to be aware of the contents of their data they are trying to write.
|
Description
This pr improves the merging performance by adding min/max filters to the early filter.
The number of files scanned from the target file table is reduced by using the table statistics.
I have extended the early filter for this purpose. This filter is responsible for pre-filtering the target table.
Previously, the early filter only consisted of partition columns by filtering for all unique values from the source. Now the non-partition columns are also used by aggregating the min/max values from the source and adding a between expression to the early filter.
It is also automatically part of the conflict detection based on the predicate.
I added a property
extended_early_filter
to make this advanced filtering optional. I don't know if this is important, and maybe we can replace the bool with an enum. What do you think about this?Example:
Merge into table t with partition date
Predicate: source.date = target.date and source.timestamp = target.timestamp and source.id = target.id and frob > 42
Early filter before:
date = '2024-‚05-14' and frob > 42
Early filter now:
date = '2024-05-14' and timestamp BETWEEN '…15:00' AND '…15:05' and id BETWEEN 'A' AND 'B' and frob > 42