Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce CDC write-side support for the Update operations #2486

Merged
merged 6 commits into from
Jun 4, 2024

Conversation

rtyler
Copy link
Member

@rtyler rtyler commented May 7, 2024

This change introduces a CDCTracker which helps collect changes during merges and update. This is admittedly rather inefficient, but my hope is that this provides a place to start iterating and improving upon the writer code

There is still additional work which needs to be done to handle table features properly for other code paths (see the middleware discussion we have had in Slack) but this produces CDC files for Update operations

Fixes #604
Fixes #2095

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label May 7, 2024
@rtyler rtyler added this to the Rust v0.18 milestone May 7, 2024
@rtyler rtyler marked this pull request as ready for review May 7, 2024 16:21
@rtyler rtyler enabled auto-merge (rebase) May 7, 2024 16:21
@rtyler rtyler requested a review from ion-elgreco May 8, 2024 22:24
@ion-elgreco
Copy link
Collaborator

I think it's better to disable it until all operations (delete and merge) are supported, otherwise we cannot push any python releases until those are added

@rtyler
Copy link
Member Author

rtyler commented May 12, 2024

I think it's better to disable it until all operations (delete and merge) are supported, otherwise we cannot push any python releases until those are added

How would you disable this? It doesn't make sense to me to include short-term configuration or feature flags to me. The protocol states that when the enable change data feed table-feature is enabled, that writers can optionally produce CDC files. Our writers just optionally will only create them on updates for now 😆

#[cfg(feature = "cdf")]
{
writer_features.insert(WriterFeatures::ChangeDataFeed);
writer_features.insert(WriterFeatures::GeneratedColumns);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? I don't remember generated columns being required

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ion-elgreco writer versions 4-6 require this 😦 [see here](If the table is on a Writer Version starting from 4 up to 6, Generated Columns are always supported.)

Writer versions before 7 are annoying , they're annoying after 7 too 🤣

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we are not adding the support here, right?

Protocol states this: "The value of delta.generationExpression SHOULD be parsed as a SQL expression.
Writers MUST enforce that any data writing to the table satisfy the condition ( <=> ) IS TRUE. <=> is the NULL-safe equal operator which performs an equality comparison like the = operator but returns TRUE rather than NULL if both operands are NULL"

We might only support v7 with CDF at this stage

@rtyler
Copy link
Member Author

rtyler commented Jun 1, 2024

Doing further acceptance testing I have identified what I believe to be a bug in DataFusion and will put this into Draft until I can figure out the path forward

apache/datafusion#10749

@rtyler rtyler marked this pull request as draft June 1, 2024 17:56
auto-merge was automatically disabled June 1, 2024 17:56

Pull request was converted to draft

@rtyler
Copy link
Member Author

rtyler commented Jun 3, 2024

In discussion with @ion-elgreco , due to apache/datafusion#10749 which is really an issue with arrow-rs. We decided that we can move forward without struct/list CDC working with the following conditions:

  • warn if the schema has a struct or a list in it when attempting to create CDC batches, to try to make it clear to the user that this is not yet supported
  • generated column support doesn't exist yet, but as long as delta.generatedExpression is not set, it's safe to allow that table feature to exist (code must be modified).

@rtyler rtyler marked this pull request as ready for review June 4, 2024 05:33
@rtyler rtyler force-pushed the cdc-update-only branch 2 times, most recently from 8e4b248 to 58e4d60 Compare June 4, 2024 06:04
Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix that failing python test, just replace the pandas code with pyarrow equivalent code.

Then we can merge!

@rtyler rtyler enabled auto-merge (rebase) June 4, 2024 13:19
This change introduces a `CDCTracker` which helps collect changes during
merges and update. This is admittedly rather inefficient, but my hope is
that this provides a place to start iterating and improving upon the
writer code

There is still additional work which needs to be done to handle table
features properly for other code paths (see the middleware discussion we
have had in Slack) but this produces CDC files for Update operations

Fixes delta-io#604
Fixes delta-io#2095
This test has highlighted an apparent race condition when handling
structs or lists in how excerpt() is treated by the CDCObserver.
Basically for older minWriterVersions we don't have to really worry
about generated columns unless an expression has been set, in which case
we must fail to write since we cannot honor generationExpression
Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉🎉

@ion-elgreco ion-elgreco disabled auto-merge June 4, 2024 14:28
@ion-elgreco ion-elgreco enabled auto-merge (rebase) June 4, 2024 14:28
@ion-elgreco ion-elgreco merged commit 2715db2 into delta-io:main Jun 4, 2024
22 of 23 checks passed
@rtyler rtyler deleted the cdc-update-only branch June 4, 2024 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change Data Feed in Delta Support CDC
2 participants