Add concurrent writes reconciliation for UPDATE/MERGE/DELETE in Delta Lake #21727

Open
wants to merge 5 commits into
base: master

findinpath
Contributor

Description

Spin-off from #18521

Allow operations based on the merge mechanism (regular UPDATE/DELETE/MERGE
statements) to be committed in a concurrent context by placing them right after
any write operations that completed concurrently in the meantime.

Disallow committing the operation in any of the following cases:

  • a table schema change has been committed in the meantime
  • a table protocol change has been committed in the meantime
  • add files committed in the meantime should have been read by
    the current operation
  • remove files committed in the meantime conflict with the
    add files read by the current operation
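
The four rules above can be sketched as a single check. This is an illustrative model only, not the code in this PR: `Commit`, `CurrentOperation`, and `conflict_reason` are hypothetical names, and file matching is reduced to partition and path membership as a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """A commit that landed after the current operation started (hypothetical model)."""
    schema_changed: bool = False
    protocol_changed: bool = False
    added_partitions: frozenset = frozenset()  # partitions of the files it added
    removed_files: frozenset = frozenset()     # paths of the files it removed


@dataclass(frozen=True)
class CurrentOperation:
    """What the in-flight UPDATE/DELETE/MERGE has read (hypothetical model)."""
    read_partitions: frozenset = frozenset()
    read_files: frozenset = frozenset()


def conflict_reason(op, concurrent_commits):
    """Return None if `op` may be placed after the commits, else a reason string."""
    for commit in concurrent_commits:
        if commit.schema_changed:
            return "schema change"
        if commit.protocol_changed:
            return "protocol change"
        # Assumption: a file "should have been read" if it lands in a partition
        # the current operation scanned.
        if commit.added_partitions & op.read_partitions:
            return "added files should have been read"
        if commit.removed_files & op.read_files:
            return "removed files were read"
    return None
```

With this model, an operation that read partition `p=1` can be committed after a concurrent write to `p=2`, but is rejected if the concurrent write added files to `p=1` or removed a file the operation read.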

The current changes also take into consideration the delta.isolationLevel
table property of the Delta Lake table for UPDATE/DELETE/MERGE operations.
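
As a rough sketch of how the property might be resolved, the helper below looks up `delta.isolationLevel` in a table's configuration map. The function name is hypothetical, only the two levels named in this PR are accepted, and the fallback value is passed in by the caller because the default is an assumption of this sketch rather than something stated here.

```python
VALID_LEVELS = {"Serializable", "WriteSerializable"}


def isolation_level(table_configuration, default):
    """Resolve delta.isolationLevel from a table's configuration map.

    `default` is supplied by the caller; this sketch does not assume which
    level applies when the property is absent.
    """
    value = table_configuration.get("delta.isolationLevel", default)
    if value not in VALID_LEVELS:
        raise ValueError(f"Unsupported isolation level: {value}")
    return value
```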

Relevant example taken from the Databricks documentation regarding the
distinction between the WriteSerializable and Serializable isolation levels:

For example, consider txn1, a long running delete and txn2,
which inserts blindly data into the table.
txn2 and txn1 complete and they are recorded in the order
txn2, txn1
into the history of the table.
According to the history, the data inserted in txn2 should not exist
in the table. For Serializable level, a reader would never see data
inserted by txn2. However, for the WriteSerializable level, a reader
could at some point see the data inserted by txn2.
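
The txn1/txn2 example can be modelled in a few lines. The sketch below assumes that, under WriteSerializable, files added by concurrent blind appends are not checked against what the current operation read, while under Serializable they are. All names are hypothetical and the model is a simplification; the actual check in this PR may differ.

```python
from enum import Enum


class IsolationLevel(Enum):
    SERIALIZABLE = "Serializable"
    WRITE_SERIALIZABLE = "WriteSerializable"


def added_files_to_check(level, changed_data_files, blind_append_files):
    """Pick which concurrently added files are checked against the current reads."""
    if level is IsolationLevel.WRITE_SERIALIZABLE:
        # Assumption: blind appends may be reordered relative to the current operation.
        return list(changed_data_files)
    return list(changed_data_files) + list(blind_append_files)


def delete_can_commit(level, delete_read_partitions, txn2_partitions):
    """txn1 is the long-running delete; txn2 is a blind insert that committed first."""
    to_check = added_files_to_check(level, [], txn2_partitions)
    return not any(p in delete_read_partitions for p in to_check)
```

In this model, when txn2 inserts into a partition that txn1 scanned, txn1 still commits under `WRITE_SERIALIZABLE` (so a reader can see the rows from txn2) and is rejected under `SERIALIZABLE`.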

A few words about the WriteSerializable isolation level, taken from the delta.io javadocs:

This isolation level will ensure snapshot isolation consistency guarantee
between write operations only.
In other words, if only the write operations are considered, then
there exists a serializable sequence between them that would produce the same
result as seen in the table.

Additional context and related issues

INSERT scaffolding PRs:

Spin-off from #18521

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Delta Lake
* Add support for concurrent `UPDATE`, `MERGE`, and `DELETE` queries. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Apr 26, 2024
@github-actions github-actions bot added the delta-lake Delta Lake connector label Apr 26, 2024
@findinpath findinpath force-pushed the findinpath/delta-concurrent-update-merge-delete-reconciliation-follow-up branch from 10b2d55 to 1c8674b Compare May 13, 2024 20:24
@findinpath findinpath marked this pull request as ready for review May 13, 2024 20:24
@ebyhr ebyhr force-pushed the findinpath/delta-concurrent-update-merge-delete-reconciliation-follow-up branch from 1c8674b to edcb69c Compare May 13, 2024 22:07
@findinpath findinpath force-pushed the findinpath/delta-concurrent-update-merge-delete-reconciliation-follow-up branch 2 times, most recently from 0c968e0 to e577846 Compare May 14, 2024 06:18
Add the possibility to analyze the dependencies
of a statement that uses the merge mechanism.
Specifically, a connector can then determine
whether concurrent UPDATE/DELETE/MERGE operations that add
data to the same table from which data is
being selected collide with each other.
… Lake

Allow committing operations based on the merge mechanism in
a concurrent context by placing these operations right after
any other previously concurrently completed write operations.

Disallow committing the operation in any of the following cases:

- a table schema change has been committed in the meantime
- a table protocol change has been committed in the meantime
- add files committed in the meantime should have been read by
the current operation
- remove files committed in the meantime conflict with the
add files read by the current operation

The current changes also take into consideration the `delta.isolationLevel`
table property of the Delta Lake table for UPDATE/DELETE/MERGE operations.

 Relevant example taken from the Databricks documentation regarding the
 distinction between the `WriteSerializable` and `Serializable` isolation levels:

 > For example, consider `txn1`, a long running delete and `txn2`,
 > which inserts blindly data into the table.
 > `txn2` and `txn1` complete and they are recorded in the order
 > `txn2, txn1`
 > into the history of the table.
 > According to the history, the data inserted in `txn2` should not exist
 > in the table. For `Serializable` level, a reader would never see data
 > inserted by `txn2`. However, for the `WriteSerializable` level, a reader
 > could at some point see the data inserted by `txn2`.

 A few words about the `WriteSerializable` isolation level, taken from the delta.io javadocs:

 > This isolation level will ensure snapshot isolation consistency guarantee
 > between write operations only.
 > In other words, if only the write operations are considered, then
 > there exists a serializable sequence between them that would produce the same
 > result as seen in the table.
@findinpath findinpath force-pushed the findinpath/delta-concurrent-update-merge-delete-reconciliation-follow-up branch from cfc7ebc to 72dc45a Compare May 28, 2024 20:37
@findinpath findinpath requested a review from pajaks May 28, 2024 20:44
@findinpath findinpath requested a review from ebyhr May 29, 2024 15:52
@ebyhr
Member

ebyhr commented May 29, 2024

/test-with-secrets sha=04d7fde3bbb30b242b07acbbba1a00b0f04dd3ee

github-actions bot commented May 29, 2024

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/9294322023

Labels
cla-signed delta-lake Delta Lake connector

3 participants