-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41233: [C++] Added an are_cols_sorted option to RowTableMetadata for control column sorted #41234
base: main
Are you sure you want to change the base?
GH-41233: [C++] Added an are_cols_sorted option to RowTableMetadata for control column sorted #41234
Conversation
|
When I wanted to unify this parameter in swiss-join (building |
cd0245a
to
2ce1313
Compare
@@ -26,8 +26,10 @@ namespace arrow { | |||
namespace util { | |||
|
|||
template <typename T> | |||
void CheckAlignment(const void* ptr) { | |||
ARROW_DCHECK(reinterpret_cast<uint64_t>(ptr) % sizeof(T) == 0); | |||
void CheckAlignment(const void* ptr, bool do_check = true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the columns of the RowTable are not sorted during comparing(that is are_cols_sorted=false
), then there is no need to check memory alignment.
2ce1313
to
2858e2d
Compare
I have a question about the necessity of this fix. IIUC, though Is there a reason that Thanks. |
This PR try to support column "sorted" and "not-sorted" modes for grouper.
The performance of grouper in some "not-sorted" scenarios tested by benchmark(#41036) is better than that of "sorted" mode.
not-sorted GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10000 31779 us 31771 us 22 items_per_second=1.03138M/s null_percent=0.01 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/100 32104 us 32095 us 22 items_per_second=1.02096M/s null_percent=1 num_groups=32.044k size=32.768k uniqueness=0.0305595
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10 34213 us 34205 us 21 items_per_second=957.999k/s null_percent=10 num_groups=31.015k size=32.768k uniqueness=0.0295782
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/2 40191 us 40179 us 18 items_per_second=815.542k/s null_percent=50 num_groups=21.011k size=32.768k uniqueness=0.0200377
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/1 31498 us 31492 us 22 items_per_second=1.04052M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/0 28087 us 28081 us 25 items_per_second=1.16689M/s null_percent=0 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10000 14359 us 14356 us 47 items_per_second=2.28251M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/100 14625 us 14622 us 48 items_per_second=2.24103M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10 15353 us 15349 us 46 items_per_second=2.13491M/s null_percent=10 num_groups=32.765k size=32.768k uniqueness=0.0312471
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/2 18351 us 18347 us 38 items_per_second=1.786M/s null_percent=50 num_groups=30.796k size=32.768k uniqueness=0.0293694
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/1 16070 us 16067 us 44 items_per_second=2.03942M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/0 11155 us 11153 us 63 items_per_second=2.93807M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.03125 sorted-mode: GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10000 32762 us 32749 us 22 items_per_second=1.00057M/s null_percent=0.01 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/100 33081 us 33068 us 21 items_per_second=990.941k/s null_percent=1 num_groups=32.044k size=32.768k uniqueness=0.0305595
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10 35627 us 35612 us 19 items_per_second=920.136k/s null_percent=10 num_groups=31.015k size=32.768k uniqueness=0.0295782
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/2 41915 us 41892 us 17 items_per_second=782.197k/s null_percent=50 num_groups=21.011k size=32.768k uniqueness=0.0200377
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/1 31307 us 31300 us 22 items_per_second=1.04692M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/0 28466 us 28460 us 25 items_per_second=1.15138M/s null_percent=0 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10000 14997 us 14993 us 47 items_per_second=2.1856M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/100 14937 us 14933 us 46 items_per_second=2.19428M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10 15596 us 15592 us 45 items_per_second=2.10155M/s null_percent=10 num_groups=32.765k size=32.768k uniqueness=0.0312471
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/2 18663 us 18658 us 37 items_per_second=1.75624M/s null_percent=50 num_groups=30.796k size=32.768k uniqueness=0.0293694
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/1 16221 us 16217 us 43 items_per_second=2.0206M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/0 11065 us 11062 us 63 items_per_second=2.96212M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.03125 Other scenarios require column sorting, and the grouper operation performance after sorting is better. GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10000 29802 us 29793 us 23 items_per_second=1.09987M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/100 30674 us 30659 us 24 items_per_second=1.06878M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10 31777 us 31765 us 22 items_per_second=1.03159M/s null_percent=10 num_groups=32.764k size=32.768k uniqueness=0.0624924
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/2 34246 us 34232 us 21 items_per_second=957.24k/s null_percent=50 num_groups=30.393k size=32.768k uniqueness=0.05797
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/1 17872 us 17867 us 39 items_per_second=1.83396M/s null_percent=100 num_groups=1 size=32.768k uniqueness=1.90735u
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/0 28033 us 28017 us 26 items_per_second=1.16959M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.0625 sorted GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10000 24494 us 24485 us 28 items_per_second=1.33829M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/100 25044 us 25036 us 29 items_per_second=1.30883M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10 25805 us 25795 us 27 items_per_second=1.27032M/s null_percent=10 num_groups=32.764k size=32.768k uniqueness=0.0624924
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/2 29097 us 29084 us 24 items_per_second=1.12666M/s null_percent=50 num_groups=30.393k size=32.768k uniqueness=0.05797
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/1 17850 us 17845 us 39 items_per_second=1.83627M/s null_percent=100 num_groups=1 size=32.768k uniqueness=1.90735u
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/0 22352 us 22341 us 31 items_per_second=1.46673M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.0625 In general, in scenarios where the sorted mode is better, the grouper performance is improved significantly after sorting (15% improvement); in scenarios where the not-sorted mode is better, the grouper performance is improved by 2%-4%. Do you think we should support not-sorted mode in grouper for user? |
Try to summarize:
IIUC, then I think it might be more proper to treat this PR as an enhancement or even a small feature than a bug fix, because IMO it's not actually a bug if And the question comes to: do we need this new "non-sorted" approach for |
Could you explain:
|
Q1Sorted mode implementation has two main parts:
Non-soretd mode is the opposite of the above logic, that is: column sorting is not performed in the part 1, and the Q2I understand that the application scenario of non-sorted mode is that the input columns of [1]
[2] arrow/cpp/src/arrow/compute/row/compare_internal.cc Lines 366 to 368 in e254c43
|
So, to my question "are the two modes functionally different", is the answer "no"? |
The interface functions of the grouper are the same in both modes, except that another internal implementation of non-sorted mode is added. |
Is it worth keeping the two modes if the only difference is performance? Usually, it's better if users don't have to make low-level performance decisions themselves. |
Yes, this is what I have been thinking about recently. I will mark this PR as DRAFT first, and then start reviewing it when I can provide users with a more elegant way to use it (for example, if internal determines whether column sorting has been completed, the caller can directly use non-sorted mode). |
… we enable column-sorted in RowTableEncoder
2858e2d
to
b3e8438
Compare
// Check whether the column_order has changed due to sorting, | ||
// and the sorted column order will be used first for better | ||
// performance in grouper's compare. | ||
if (inverse_column_order[i] != column_order[i]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When constructing RowTableMetadata
, whether 'the column order after sort changes' will be used as a mark to access column data using sorted-mode or non-sorted mode.
b3e8438
to
83155e9
Compare
83155e9
to
596fde8
Compare
Rationale for this change
Fix the bug in grouper when set are_cols_in_encoding_order=false in below codes:
arrow/cpp/src/arrow/compute/row/grouper.cc
Line 582 in 2979d69
It will cause the num_group different with
are_cols_in_encoding_order=true
condition.What changes are included in this PR?
Added an are_cols_sorted option to RowTableMetadatato control whether we enable column-sorted in RowTableEncoder.
Are these changes tested?
Will be added after #40998 merged(grouper_test has added in this pr).
Are there any user-facing changes?
No.