-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DYOD] Maintain UCCs in face of potentially changing data #2599
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey all, I reviewed your code in the name of Team 3.
First of all: Good job with the implementation!
I checked mainly for hyrise coding standards, which were all pretty well followed. I already started to look inside your comments to find stuff I could pick on 😁 So I think this is a pretty good sign. Likewise, I've tried to add some comments regarding const
in between, sometimes I was not too sure if it's applicable. Feel free to try it out and if it works implement it, if not, so be it 😊
I will take a look at the other PR as well 😊
Co-authored-by: kalathen <kalathen@jyu.fi> Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
… become invalid Co-authored-by: kalathen <kalathen@jyu.fi> Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi> Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
…id using MVCC data Co-authored-by: kalathen <kalathen@jyu.fi> Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi> Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
…Cs becoming invalid and vice versa Co-authored-by: kalathen <kalathen@jyu.fi> Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi> Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi> Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
d214251
to
b3b721c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few suggestions, but nothing deal-breaking. Nice :)
@@ -21,11 +21,22 @@ class TableKeyConstraint final : public AbstractTableConstraint { | |||
* voilating the set semantics of the constraint. | |||
*/ | |||
TableKeyConstraint(const std::set<ColumnID>& columns, const KeyConstraintType key_type); | |||
TableKeyConstraint(const std::set<ColumnID>& columns, const KeyConstraintType key_type, const CommitID last_validated_on); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in your other PR, just use a default parameter here:
TableKeyConstraint(const std::set<ColumnID>& columns, const KeyConstraintType key_type,
const CommitID last_validated_on = INVALID_COMMIT_ID);
and explain the semantics of the special default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is there a reason not to use MAX_COMMIT_ID
(and maybe move it's definition to types.hpp?)
CommitID last_validated_on() const; | ||
|
||
/** | ||
* Whether this key constraint can become invalid if the table data changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try to fill up 120 chars.
KeyConstraintType _key_type; | ||
|
||
/** | ||
* Commit ID during which this constraint was last validated. Note that the constraint will still be valid during |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on which snapshot?
const auto chunk_count = this->chunk_count(); | ||
// Iterate through the chunks backwards as inserts are more likely to happen in later chunks, potentially enabling us | ||
// to return faster. | ||
// Subtract 1 from the chunk_id only inside the loop to avoid underflows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather start with chunk_count - 1 then and loop to >= 0?
_table_key_constraints.erase(constraint); | ||
} | ||
|
||
bool Table::constraint_guaranteed_to_be_valid(const TableKeyConstraint& table_key_constraint) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why this is a member of Table
, but I'd prefer this to be a helper function in a separate file taking the table and the constraint as arguments.
// Check all chunks for duplicates that we haven't yet in the first loop above. | ||
// Note that the loop below will only contain these chunks, as we have excluded the others when executing the GetTable | ||
// operator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit weird formulation. Also, fill 120 chars.
// operator. | ||
const auto validated_chunk_count = table_view->chunk_count(); | ||
for (auto chunk_id = ChunkID{0}; chunk_id < validated_chunk_count; ++chunk_id) { | ||
// No need to do null checks here as we already did so above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you mean chunk nullptr and not NULL values. There's no need to check after the GetTable operator (it only forwards chunks with actual data).
@@ -73,6 +76,29 @@ class UccDiscoveryPluginTest : public BaseTest { | |||
ChunkEncoder::encode_all_chunks(table, chunk_encoding_spec); | |||
} | |||
|
|||
void _duplicate_table(const std::string table_name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take a reference? Not measurable in this test case though.
// Delete row of _table_A that had a duplicate value regarding column 1 such that column 1 is unique afterwards. | ||
_delete_row(_table_A, 3); | ||
|
||
// We are only interested in column 1, since it was not unique before the deletion but should be now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pick here: end w/ full stop.
const auto& constraints_B = _table_A->soft_key_constraints(); | ||
EXPECT_FALSE(constraints_B.contains({{ColumnID{0}}, KeyConstraintType::UNIQUE})); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pick here: I'd love to see some more test cases, e.g., no CommitID ipdate for schema-given UCC, Updates (since it's delete + insert w/ and w/o duplicate), validation while transactions are active (insert/delete ops are executed, but not committed), ...
Put this on hold until end of August. We plan to incorporate this feature, but there is currently no capacity for it. Will work on it when back from the internship. |
UCCs
(Unique Column Combinations) indicate the uniqueness across all entries of a column (or a set of columns). UCCs can be given by the schema (a primary key guarantees uniqueness), but also happen "incidentally" in real-world data.Because UCCs can be used for query optimization, we want to detect these "incidental" UCCs as well, because, for optimization purposes, we can use them just as we would use a primary key constraint.
Before this PR, hyrise was already capable to detect such UCCs. It did so by actually generating
UNIQUE
constraints on a table, which are translated into UCCs during query optimization. However, this discovery happened under the assumption that the data of the table never changes. This means, if the data of the table were to change and violate the previously discovered "incidental" UCC by creating a duplicate value in the targeted column, this using the UCC for query optimization would lead to incorrect query results.Therefore, in this MR, we assign a validation
commit ID
to the TableKeyConstraint (UCC) such that it may be only used for optimization if the table has not seen any changes since thiscommit ID
(note that every modifying transaction increments the globalcommit ID
). We will then regularly revalidate the UCC and see if we can expand the constraint to larger commit IDs.For optimization purposes, we also make use of the
MVCC
(Multi-Version Concurrency Control) data of chunks, which includes but is not limited to thecommit IDs
of when any data was last inserted to or deleted from the chunk (note that hyrise models modifications as delete+insert).