-
Notifications
You must be signed in to change notification settings - Fork 854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Occupancy improvement for Hash table build #15700
base: branch-24.08
Are you sure you want to change the base?
Conversation
I think the approach of specializing the type dispatcher is very cumbersome and will lead to a lot of code replication. Currently, I have the conditional dispatch working for |
/ok to test |
1 similar comment
/ok to test |
@tgujar I've updated the docs to unblock CI. Have you noticed any performance regressions for other use cases? It seems that it improves the performance for mixed join but the performance drops significantly in other cases using row hasher. |
id_to_type<type_id::DECIMAL128>, | ||
id_to_type<type_id::DECIMAL64>, | ||
id_to_type<type_id::DECIMAL32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think decimal types are complex type. They are just a wrapper around some integer type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Equality operator for Decimal will perform scaling which uses exponentiation.
CUDF_HOST_DEVICE inline bool operator==(fixed_point<Rep1, Rad1> const& lhs, |
I see a reduction in register usage if I comment out decimal types in #15502. I think we can still decide on the types excluded in the branches later on
/ok to test |
@tgujar Could you take a look at the failing tests? |
/ok to test |
1 similar comment
/ok to test |
* @throw cudf::logic_error if the input tables were preprocessed to transform any nested children | ||
* columns into integer columns but `PhysicalElementComparator` is not | ||
* @throw cudf::logic_error if the input tables were preprocessed to transform any nested | ||
* children columns into integer columns but `PhysicalElementComparator` is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that a significant number of changes to this file are due to reformatting comments.
Would it be possible to undo those changes? This particular change is certainly not desirable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, will fix. This was caused because of the clang-format extension on vscode
This PR needs to be rebased on branch-24.08. |
Specializing both the comparator and the hasher drops the register usage to 54 instead of the expected 46 for the mixed semi join case. Investigating why the register pressure is different from commenting out the code paths. |
__device__ bool operator()(size_type const lhs_element_index, | ||
size_type const rhs_element_index) const noexcept | ||
{ | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to express here that this path is unreachable since for types which are typedef-ed to void in the Id to type map are known to be not present in the row. But for some reason, which I don't know why yet, if I instead use CUDF_UNREACHABLE(...)
the register usage increases to 54, instead of the expected 46 in the mixed_semi_join case for integer keys and values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A naive question: have you quantified the performance impact by reducing the register usage from 54 to 46?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't done this yet. this would be different for GPU and for joins as well. For e.g on T4, looking at the occupancy calculator, I dont expect any change in perf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks for the prompt reply. We can proceed with the current and revisit this when wrapping up the PR, if it reduces register usage but has no performance impact, I'm inclined to use CUDF_UNREACHABLE
for better readability.
build_column_types.insert(col.type().id()); | ||
} | ||
for (auto col : probe) { | ||
probe_column_types.insert(col.type().id()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure if separation between probe and build column types is required? If I understand correctly left_equality
and right_equality
should have the same column types. This function previously did not have a check using CUDF_EXPECTS(cudf::have_same_types ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure if separation between probe and build column types is required?
Probably not needed. If two tables don't match, build time failures will occur when constructing the two-table comparator.
I have a question here. Is it preferable that I make the changes to all the join operations in this PR or break them up into different ones? |
We could just focus on mixed join for this PR. The goal is mainly to evaluate the performance impact and design of the new dispatching method. |
Benchmark results. MR adds specialized dispatch for build and probe in case of hash joins, and only for build in case of mixed semi/anti joins. Other joins are not modified
|
/ok to test |
I think as the PR is currently, this should have the |
Description
Implements specialized template dispatch for hash joins and mixed semi joins to fix issue describes in #15502.
At a high level, this PR typedef's some types to void depending on the column types in the row's to avoid high register usage for comparator and hasher operations associated with more involved types (lists, structs, string, ...). This is done by dynamic dispatch on CPU side using std::variant+std::visit and dispatching with a specialized template.
This pattern can later be extended to other joins and also to groupby operation. Any operator using row hasher and row comparator should be able to see and improvement in occupancy for hash table build/probe operation.
Checklist