Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DYOD] Maintain UCCs in face of potentially changing data #2599

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

j-hellenberg
Copy link
Contributor

@j-hellenberg j-hellenberg commented Aug 2, 2023

UCCs (Unique Column Combinations) indicate the uniqueness across all entries of a column (or a set of columns). UCCs can be given by the schema (a primary key guarantees uniqueness), but also happen "incidentally" in real-world data.

Because UCCs can be used for query optimization, we want to detect these "incidental" UCCs as well, because, for optimization purposes, we can use them just as we would use a primary key constraint.

Before this PR, hyrise was already capable to detect such UCCs. It did so by actually generating UNIQUE constraints on a table, which are translated into UCCs during query optimization. However, this discovery happened under the assumption that the data of the table never changes. This means, if the data of the table were to change and violate the previously discovered "incidental" UCC by creating a duplicate value in the targeted column, this using the UCC for query optimization would lead to incorrect query results.

Therefore, in this MR, we assign a validation commit ID to the TableKeyConstraint (UCC) such that it may be only used for optimization if the table has not seen any changes since this commit ID (note that every modifying transaction increments the global commit ID). We will then regularly revalidate the UCC and see if we can expand the constraint to larger commit IDs.

For optimization purposes, we also make use of the MVCC (Multi-Version Concurrency Control) data of chunks, which includes but is not limited to the commit IDs of when any data was last inserted to or deleted from the chunk (note that hyrise models modifications as delete+insert).

@j-hellenberg j-hellenberg marked this pull request as ready for review August 2, 2023 16:12
@j-hellenberg j-hellenberg added the FullCI Run all CI tests (slow, but required for merge) label Aug 2, 2023
@SanJSp SanJSp self-requested a review August 4, 2023 16:36
Copy link
Contributor

@SanJSp SanJSp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey all, I reviewed your code in the name of Team 3.
First of all: Good job with the implementation!

I checked mainly for hyrise coding standards, which were all pretty well followed. I already started to look inside your comments to find stuff I could pick on 😁 So I think this is a pretty good sign. Likewise, I've tried to add some comments regarding const in between, sometimes I was not too sure if it's applicable. Feel free to try it out and if it works implement it, if not, so be it 😊

I will take a look at the other PR as well 😊

src/lib/storage/constraints/table_key_constraint.hpp Outdated Show resolved Hide resolved
src/lib/storage/constraints/table_key_constraint.hpp Outdated Show resolved Hide resolved
src/lib/storage/constraints/table_key_constraint.cpp Outdated Show resolved Hide resolved
src/lib/storage/table.hpp Outdated Show resolved Hide resolved
src/lib/storage/table.cpp Show resolved Hide resolved
src/test/plugins/ucc_discovery_plugin_test.cpp Outdated Show resolved Hide resolved
src/test/plugins/ucc_discovery_plugin_test.cpp Outdated Show resolved Hide resolved
src/test/plugins/ucc_discovery_plugin_test.cpp Outdated Show resolved Hide resolved
src/test/plugins/ucc_discovery_plugin_test.cpp Outdated Show resolved Hide resolved
src/test/plugins/ucc_discovery_plugin_test.cpp Outdated Show resolved Hide resolved
Base automatically changed from dey4ss/set_cids to master August 17, 2023 12:26
j-hellenberg and others added 8 commits August 22, 2023 11:20
Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
… become invalid

Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
…id using MVCC data

Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
…Cs becoming invalid and vice versa

Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
@j-hellenberg j-hellenberg force-pushed the j-hellenberg/maintain_data_dependencies branch from d214251 to b3b721c Compare August 22, 2023 09:21
Copy link
Member

@dey4ss dey4ss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few suggestions, but nothing deal-breaking. Nice :)

@@ -21,11 +21,22 @@ class TableKeyConstraint final : public AbstractTableConstraint {
* voilating the set semantics of the constraint.
*/
TableKeyConstraint(const std::set<ColumnID>& columns, const KeyConstraintType key_type);
TableKeyConstraint(const std::set<ColumnID>& columns, const KeyConstraintType key_type, const CommitID last_validated_on);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in your other PR, just use a default parameter here:

TableKeyConstraint(const std::set<ColumnID>& columns, const KeyConstraintType key_type,
                   const CommitID last_validated_on = INVALID_COMMIT_ID);

and explain the semantics of the special default value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is there a reason not to use MAX_COMMIT_ID (and maybe move it's definition to types.hpp?)

CommitID last_validated_on() const;

/**
* Whether this key constraint can become invalid if the table data changes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to fill up 120 chars.

KeyConstraintType _key_type;

/**
* Commit ID during which this constraint was last validated. Note that the constraint will still be valid during
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on which snapshot?

const auto chunk_count = this->chunk_count();
// Iterate through the chunks backwards as inserts are more likely to happen in later chunks, potentially enabling us
// to return faster.
// Subtract 1 from the chunk_id only inside the loop to avoid underflows.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather start with chunk_count - 1 then and loop to >= 0?

_table_key_constraints.erase(constraint);
}

bool Table::constraint_guaranteed_to_be_valid(const TableKeyConstraint& table_key_constraint) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why this is a member of Table, but I'd prefer this to be a helper function in a separate file taking the table and the constraint as arguments.

Comment on lines +279 to +281
// Check all chunks for duplicates that we haven't yet in the first loop above.
// Note that the loop below will only contain these chunks, as we have excluded the others when executing the GetTable
// operator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit weird formulation. Also, fill 120 chars.

// operator.
const auto validated_chunk_count = table_view->chunk_count();
for (auto chunk_id = ChunkID{0}; chunk_id < validated_chunk_count; ++chunk_id) {
// No need to do null checks here as we already did so above
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you mean chunk nullptr and not NULL values. There's no need to check after the GetTable operator (it only forwards chunks with actual data).

@@ -73,6 +76,29 @@ class UccDiscoveryPluginTest : public BaseTest {
ChunkEncoder::encode_all_chunks(table, chunk_encoding_spec);
}

void _duplicate_table(const std::string table_name) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take a reference? Not measurable in this test case though.

// Delete row of _table_A that had a duplicate value regarding column 1 such that column 1 is unique afterwards.
_delete_row(_table_A, 3);

// We are only interested in column 1, since it was not unique before the deletion but should be now
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pick here: end w/ full stop.

const auto& constraints_B = _table_A->soft_key_constraints();
EXPECT_FALSE(constraints_B.contains({{ColumnID{0}}, KeyConstraintType::UNIQUE}));
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pick here: I'd love to see some more test cases, e.g., no CommitID ipdate for schema-given UCC, Updates (since it's delete + insert w/ and w/o duplicate), validation while transactions are active (insert/delete ops are executed, but not committed), ...

@dey4ss
Copy link
Member

dey4ss commented Apr 30, 2024

Put this on hold until end of August. We plan to incorporate this feature, but there is currently no capacity for it. Will work on it when back from the internship.

@dey4ss dey4ss closed this Apr 30, 2024
@dey4ss dey4ss reopened this Apr 30, 2024
@dey4ss dey4ss marked this pull request as draft April 30, 2024 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FullCI Run all CI tests (slow, but required for merge)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants