[FLINK-35272][cdc][runtime] Pipeline Transform job supports omitting / renaming calculation column #3285

yuxiqian · 2024-04-30T10:55:14Z

Currently, pipeline jobs with transform (including projection and filtering) are constructed with the following topology:

SchemaTransformOp --> DataTransformOp --> SchemaOp

where schema projections are applied in SchemaTransformOp, and data projection & filtering are applied in DataTransformOp. The idea is that SchemaTransformOp might be embedded in Sources in the future, to reduce the size of the payload transferred within the Flink job.

However, the current implementation has a known defect: it omits unused columns too early, so some columns that downstream operators rely on have already been removed by the time records arrive in DataTransformOp. For example:

# Schema is (ID INT NOT NULL, NAME STRING, AGE INT)
transform:
  - source-table: employee
    projection: id, upper(name) as newname
    filter: age > 18

Such transformation rules will fail because the name and age columns are removed in SchemaTransformOp, so their values cannot be retrieved in DataTransformOp, where the actual expression evaluation and filtering take place.

This PR introduces a new design, renaming the transform topology as follows:

PreTransformOp --> PostTransformOp --> SchemaOp

where PreTransformOp filters out a column only if both of the following hold:

- The column is not present in the projection rules
- The column is not indirectly referenced by calculation or filtering expressions

Referenced columns will be emitted in exactly the same order as in the original schema. Those temporarily referenced columns will be removed from all schema and data events after PostTransformOp. For example, given the following transform rule:

# Schema is (ID INT NOT NULL, NAME STRING, AGE INT)
transform:
  - source-table: employee
    projection: id, age + 4 as newage
    filter: age > 4

PreTransformOp will yield an intermediate schema (ID INT NOT NULL, AGE INT) and correspondingly trimmed data records to downstream. Calculated columns (newage here) are not created at this stage, since they have not been evaluated yet; unused columns (name here) are removed as early as possible.
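The pruning rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the code in this PR: `referenced_columns` is a hypothetical helper, identifiers are found with a regular expression instead of by properly parsing the expressions, and wildcards, quoted identifiers, string literals, and function calls containing commas are not handled.

```python
import re

# Matches bare identifiers such as column names, function names and aliases.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def referenced_columns(schema, projection, filter_expr=None):
    """Return the schema columns a pre-transform step would have to keep.

    Hypothetical helper for illustration only. A column is kept if it
    appears in the projection or is referenced by a calculation or filter
    expression. The result preserves the original schema order.
    """
    used = set()
    for item in projection.split(","):
        # Drop the "AS alias" part so that only the expression is scanned.
        expr = re.split(r"\s+as\s+", item.strip(), flags=re.IGNORECASE)[0]
        used.update(token.lower() for token in IDENTIFIER.findall(expr))
    if filter_expr:
        used.update(token.lower() for token in IDENTIFIER.findall(filter_expr))
    # Tokens that are not schema columns (e.g. function names) are ignored here.
    return [column for column in schema if column.lower() in used]
```

For the rule above, `referenced_columns(["ID", "NAME", "AGE"], "id, age + 4 as newage", "age > 4")` returns `["ID", "AGE"]`, which matches the intermediate schema described: name is dropped, and newage does not appear because it is not a source column.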

If a column is explicitly written in the projection, it will be passed downstream as-is. Referenced columns, however, get a special prefix added to their names. For the first example above (projection: id, upper(name) as newname; filter: age > 18), a schema like [id, newname, __PREFIX__name, __PREFIX__age] will be sent downstream. Note that expression evaluation and filtering have not taken effect at this point, so a DataChangeEvent would look like [1, null, 'Alice', 19].

~~Adding prefix is meant to deal with such cases:~~

# Schema is (ID INT NOT NULL, NAME STRING, AGE INT)
transform:
  - source-table: employee
    projection: id, upper(name) as name

Here we need to distinguish the calculated column (new) name and the referenced original column (old) name. So after the name mangling process the schema would be like: [id, name, __PREFIX__name].

Also, filtering is still done in PostTransformOp, since a user could write a filter expression that references a calculated column, whose value is not available until PostTransformOp evaluates it. This also matters in the following somewhat ambiguous case:

# Schema is (ID INT NOT NULL, NAME STRING, AGE INT)
transform:
  - source-table: employee
    projection: id, age * 2 as age
    filter: age > 18

~~The filtering expression is applied to the calculated age column (doubled!) instead of the original one.~~

Now, any calculated column referenced in a filter expression will be replaced by its original definition. For example, the following transform rule:

transform:
  - source-table: employee
    projection: id, age * 2 as newage
    filter: newage > 18

...will be rewritten as follows:

transform:
  - source-table: employee
    projection: id, age * 2 as newage
    filter: age * 2 > 18

Hence, no calculated column needs to be evaluated before the filtering process.
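The rewrite above amounts to substituting each alias in the filter with the expression that defines it. A minimal Python sketch of that idea follows, assuming a hypothetical `rewrite_filter` helper and plain regular-expression matching; it is not the implementation in this PR. Unlike the example above, the sketch wraps each substituted expression in parentheses so that operator precedence is preserved when the filter applies further operators to the alias. It does not handle string literals, quoted identifiers, or function calls containing commas.

```python
import re

# Matches bare identifiers such as column names and aliases.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def rewrite_filter(projection, filter_expr):
    """Replace calculated-column aliases in a filter with their definitions.

    Hypothetical helper for illustration only.
    """
    definitions = {}
    for item in projection.split(","):
        parts = re.split(r"\s+as\s+", item.strip(), flags=re.IGNORECASE)
        if len(parts) == 2:
            expression, alias = parts
            definitions[alias.strip().lower()] = expression.strip()

    def substitute(match):
        name = match.group(0)
        definition = definitions.get(name.lower())
        # Identifiers that are not aliases are left untouched.
        return name if definition is None else "(" + definition + ")"

    # re.sub scans the filter once, so substituted text is not rewritten again.
    return IDENTIFIER.sub(substitute, filter_expr)
```

With the rule above, `rewrite_filter("id, age * 2 as newage", "newage > 18")` returns `"(age * 2) > 18"`, so the filter only depends on source columns.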

yuxiqian · 2024-04-30T11:00:31Z

This PR is still at a very early stage; looking for comments from @aiwenmo & @lvyanquan.

yuxiqian · 2024-05-06T07:19:18Z

Updated based on previous comments, cc @aiwenmo

yuxiqian · 2024-05-07T08:38:14Z

Thanks @aiwenmo for the kind review; I've addressed the comments above.

yuxiqian · 2024-05-11T06:38:49Z

Thanks @aiwenmo for reviewing, I've addressed your comments.

aiwenmo

LGTM

yuxiqian · 2024-05-20T02:01:52Z

cc @PatrickRen @lvyanquan

github-actions bot added composer runtime e2e-tests labels Apr 30, 2024

yuxiqian marked this pull request as draft April 30, 2024 10:55

yuxiqian force-pushed the FLINK-35272 branch 2 times, most recently from c448e89 to 31a7c1d Compare May 6, 2024 06:06

yuxiqian marked this pull request as ready for review May 6, 2024 07:19

aiwenmo reviewed May 7, 2024

flink-cdc-runtime/src/main/java/org/apache/flink/cdc/runtime/parser/TransformParser.java (outdated)

...src/test/java/org/apache/flink/cdc/runtime/operators/transform/PreTransformOperatorTest.java (outdated)

yuxiqian added 3 commits May 7, 2024 16:36

[FLINK-35272][cdc][runtime] Transform projection & filter feature overhaul

139c79d

Fix CI

c2a1425

Cleanup: remove redundant reference column prefix

019869f

yuxiqian force-pushed the FLINK-35272 branch from 31a7c1d to 615547f Compare May 7, 2024 08:37

Clarify DDL parsing methods & unify filter-project execution flow

35ab5f8

yuxiqian force-pushed the FLINK-35272 branch from 615547f to 35ab5f8 Compare May 7, 2024 08:41

Add unified transform operator tests & Fix metadata & wildcard related issues

a31a225

github-actions bot added the common label May 10, 2024

yuxiqian requested a review from aiwenmo May 10, 2024 07:44

Add wildcard tests at different position

07324de

aiwenmo reviewed May 11, 2024

yuxiqian force-pushed the FLINK-35272 branch from ae087a5 to ec700df Compare May 11, 2024 04:21

Fix transformed schema merging logic & Fix missing rewrite in CASE WHEN statement & Remove unused `containFilteredComputedColumn` field

eb1dac3

yuxiqian force-pushed the FLINK-35272 branch from ec700df to eb1dac3 Compare May 11, 2024 04:32

yuxiqian added 2 commits May 11, 2024 13:10

Add E2e tests for multiple transform rule applied on one single table

ef2c654

Fix SqlCase in DDL parsing

9b7c313

aiwenmo approved these changes May 12, 2024

yuxiqian changed the title ~~[FLINK-35272][cdc][runtime] Transform projection & filter feature overhaul~~ [FLINK-35272][cdc][runtime] Pipeline Transform job supports omitting / renaming calculation column May 24, 2024
