GH-41233: [C++] Added an are_cols_sorted option to RowTableMetadata for control column sorted #41234

ZhangHuiGui · 2024-04-16T14:41:58Z

Rationale for this change

Fix the bug in grouper when set are_cols_in_encoding_order=false in below codes:

arrow/cpp/src/arrow/compute/row/grouper.cc

Line 582 in 2979d69

/* are_cols_in_encoding_order=*/true);

It will cause the num_group different with are_cols_in_encoding_order=true condition.

What changes are included in this PR?

Added an are_cols_sorted option to RowTableMetadatato control whether we enable column-sorted in RowTableEncoder.

Are these changes tested?

Will be added after #40998 merged(grouper_test has added in this pr).

Are there any user-facing changes?

No.

GitHub Issue: [C++] Grouper produce a wrong num_groups when set are_cols_in_encoding_order to false #41233

github-actions · 2024-04-16T14:42:26Z

⚠️ GitHub issue #41233 has been automatically assigned in GitHub to PR creator.

ZhangHuiGui · 2024-04-16T14:48:56Z

When I wanted to unify this parameter in swiss-join (building RowTableImpl and RowTableEncoder will sort columns, and the order of sorted-columns will not be used by default in CompareColumnsToRows), I found that the test cannot be passed. It seems that swiss-join internally relies on the current use method.

ZhangHuiGui · 2024-05-18T13:29:42Z

cpp/src/arrow/compute/util_internal.h

@@ -26,8 +26,10 @@ namespace arrow {
 namespace util {

 template <typename T>
-void CheckAlignment(const void* ptr) {
-  ARROW_DCHECK(reinterpret_cast<uint64_t>(ptr) % sizeof(T) == 0);
+void CheckAlignment(const void* ptr, bool do_check = true) {


If the columns of the RowTable are not sorted during comparing(that is are_cols_sorted=false), then there is no need to check memory alignment.

zanmato1984 · 2024-05-20T05:18:01Z

I have a question about the necessity of this fix.

IIUC, though RowTableImpl supports both usages of columns "in encoding order" and "not in encoding order", the user (e.g. Grouper or SwissJoin) is free to choose either, in other words, the user is not mandatory to support both. For example, the current SwissJoin is using it the way that all columns are assumed "not in encoding order" and it is perfectly fine because there isn't a case that requires SwissJoin to do it the other way. The same goes to Grouper as well.

Is there a reason that Grouper must assume the columns are not in encoding order, or that Grouper can benefit in terms of performance/complexity from treating the columns not in encoding order?

Thanks.

ZhangHuiGui · 2024-05-21T02:45:01Z

I have a question about the necessity of this fix.

IIUC, though RowTableImpl supports both usages of columns "in encoding order" and "not in encoding order", the user (e.g. Grouper or SwissJoin) is free to choose either, in other words, the user is not mandatory to support both. For example, the current SwissJoin is using it the way that all columns are assumed "not in encoding order" and it is perfectly fine because there isn't a case that requires SwissJoin to do it the other way. The same goes to Grouper as well.

Is there a reason that Grouper must assume the columns are not in encoding order, or that Grouper can benefit in terms of performance/complexity from treating the columns not in encoding order?

Thanks.

This PR try to support column "sorted" and "not-sorted" modes for grouper.

"sorted" mode means do column sort for RowTable and compare with "are_cols_in_encoding_order=true"
"not-sorted" mode, contrary to the above logic, do not sort for RowTable and compare with "are_cols_in_encoding_order=false"

The performance of grouper in some "not-sorted" scenarios tested by benchmark(#41036) is better than that of "sorted" mode.

The columns in this scenario are already sorted. If you use sorted-mode again, the performance will be worse.

not-sorted

GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10000      31779 us        31771 us           22 items_per_second=1.03138M/s null_percent=0.01 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/100        32104 us        32095 us           22 items_per_second=1.02096M/s null_percent=1 num_groups=32.044k size=32.768k uniqueness=0.0305595
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10         34213 us        34205 us           21 items_per_second=957.999k/s null_percent=10 num_groups=31.015k size=32.768k uniqueness=0.0295782
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/2          40191 us        40179 us           18 items_per_second=815.542k/s null_percent=50 num_groups=21.011k size=32.768k uniqueness=0.0200377
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/1          31498 us        31492 us           22 items_per_second=1.04052M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/0          28087 us        28081 us           25 items_per_second=1.16689M/s null_percent=0 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10000        14359 us        14356 us           47 items_per_second=2.28251M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/100          14625 us        14622 us           48 items_per_second=2.24103M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10           15353 us        15349 us           46 items_per_second=2.13491M/s null_percent=10 num_groups=32.765k size=32.768k uniqueness=0.0312471
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/2            18351 us        18347 us           38 items_per_second=1.786M/s null_percent=50 num_groups=30.796k size=32.768k uniqueness=0.0293694
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/1            16070 us        16067 us           44 items_per_second=2.03942M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/0            11155 us        11153 us           63 items_per_second=2.93807M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.03125

sorted-mode:

GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10000      32762 us        32749 us           22 items_per_second=1.00057M/s null_percent=0.01 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/100        33081 us        33068 us           21 items_per_second=990.941k/s null_percent=1 num_groups=32.044k size=32.768k uniqueness=0.0305595
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/10         35627 us        35612 us           19 items_per_second=920.136k/s null_percent=10 num_groups=31.015k size=32.768k uniqueness=0.0295782
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/2          41915 us        41892 us           17 items_per_second=782.197k/s null_percent=50 num_groups=21.011k size=32.768k uniqueness=0.0200377
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/1          31307 us        31300 us           22 items_per_second=1.04692M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{boolean, boolean, utf8, utf8}"/32768/0          28466 us        28460 us           25 items_per_second=1.15138M/s null_percent=0 num_groups=32.063k size=32.768k uniqueness=0.0305777
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10000        14997 us        14993 us           47 items_per_second=2.1856M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/100          14937 us        14933 us           46 items_per_second=2.19428M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.03125
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/10           15596 us        15592 us           45 items_per_second=2.10155M/s null_percent=10 num_groups=32.765k size=32.768k uniqueness=0.0312471
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/2            18663 us        18658 us           37 items_per_second=1.75624M/s null_percent=50 num_groups=30.796k size=32.768k uniqueness=0.0293694
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/1            16221 us        16217 us           43 items_per_second=2.0206M/s null_percent=100 num_groups=1 size=32.768k uniqueness=953.674n
GrouperWithMultiTypes/"{int32, int32, int64, int64}"/32768/0            11065 us        11062 us           63 items_per_second=2.96212M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.03125

Other scenarios require column sorting, and the grouper operation performance after sorting is better.
not-sorted

GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10000      29802 us        29793 us           23 items_per_second=1.09987M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/100        30674 us        30659 us           24 items_per_second=1.06878M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10         31777 us        31765 us           22 items_per_second=1.03159M/s null_percent=10 num_groups=32.764k size=32.768k uniqueness=0.0624924
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/2          34246 us        34232 us           21 items_per_second=957.24k/s null_percent=50 num_groups=30.393k size=32.768k uniqueness=0.05797
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/1          17872 us        17867 us           39 items_per_second=1.83396M/s null_percent=100 num_groups=1 size=32.768k uniqueness=1.90735u
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/0          28033 us        28017 us           26 items_per_second=1.16959M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.0625

sorted

GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10000      24494 us        24485 us           28 items_per_second=1.33829M/s null_percent=0.01 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/100        25044 us        25036 us           29 items_per_second=1.30883M/s null_percent=1 num_groups=32.768k size=32.768k uniqueness=0.0625
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/10         25805 us        25795 us           27 items_per_second=1.27032M/s null_percent=10 num_groups=32.764k size=32.768k uniqueness=0.0624924
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/2          29097 us        29084 us           24 items_per_second=1.12666M/s null_percent=50 num_groups=30.393k size=32.768k uniqueness=0.05797
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/1          17850 us        17845 us           39 items_per_second=1.83627M/s null_percent=100 num_groups=1 size=32.768k uniqueness=1.90735u
GrouperWithMultiTypes/"{utf8, int32, int64, fixed_size_binary(128), boolean}"/32768/0          22352 us        22341 us           31 items_per_second=1.46673M/s null_percent=0 num_groups=32.768k size=32.768k uniqueness=0.0625

In general, in scenarios where the sorted mode is better, the grouper performance is improved significantly after sorting (15% improvement); in scenarios where the not-sorted mode is better, the grouper performance is improved by 2%-4%.

Do you think we should support not-sorted mode in grouper for user?

zanmato1984 · 2024-05-21T04:49:08Z

Try to summarize:

The current Grouper (before this PR) uses the "sorted" approach, has no bug, and performs 15% faster than the "non-sorted" approach (introduced in this PR) for "sorted"-friendly scenarios.
The current Grouper (before this PR) doesn't support the "non-sorted" approach (or in other words, has bug if force enabled).
This PR enables the "non-sorted" approach (or in other words, fixes the bug in 2), as an user faced option, and performs 2%~4% faster than the original "sorted" approach for "non-sorted"-friendly scenarios.

IIUC, then I think it might be more proper to treat this PR as an enhancement or even a small feature than a bug fix, because IMO it's not actually a bug if Grouper is not designed to use "non-sorted" approach from the beginning.

And the question comes to: do we need this new "non-sorted" approach for Grouper with the given performance number? Personally I think we can have it, but I'm not confident (nor authorized) to make the call. So I'll just leave it to others :) cc @pitrou @felipecrv @westonpace

pitrou · 2024-05-21T14:04:22Z

Could you explain:

is "sorted" mode functionally different from "non-sorted" mode?
what explains the perf difference between the two?

ZhangHuiGui · 2024-05-21T15:56:21Z

Q1

Sorted mode implementation has two main parts:

Add the input cols to the RowTable using the RowTableEncoder pair according to the column sort in [1]. The main purpose of column sorting is to construct arrays of memory-alignment column_offsets to facilitate memory-aligned comparisons in subsequent column-Compare comparisons.
The [2] logic in CompareColumnsToRows decides, according to are_cols_in_encoding_order, whether to take the offset of the comparing columns from RowTable (rows) in order of part1.

Non-soretd mode is the opposite of the above logic, that is: column sorting is not performed in the part 1, and the column_offsets of the input original columns are used in the part 2.

Q2

I understand that the application scenario of non-sorted mode is that the input columns of RowTableEncoder are already sorted (can be accessed by memory alignment). When comparing, the input columns can be compared in the order of the input columns [2] without memory alignment check. This is also the reason for the improved performance in the non-sorted mode in the benchmark above.

[1]

arrow/cpp/src/arrow/compute/row/row_internal.cc

Line 88 in e254c43

std::sort(

[2]

arrow/cpp/src/arrow/compute/row/compare_internal.cc

Lines 366 to 368 in e254c43

    
           uint32_t offset_within_row = 
        
               rows.metadata().encoded_field_offset(ColIdInEncodingOrder( 
        
                   rows, static_cast<uint32_t>(icol), are_cols_in_encoding_order));

pitrou · 2024-05-21T16:40:20Z

So, to my question "are the two modes functionally different", is the answer "no"?

ZhangHuiGui · 2024-05-22T01:26:10Z

So, to my question "are the two modes functionally different", is the answer "no"?

The interface functions of the grouper are the same in both modes, except that another internal implementation of non-sorted mode is added.

pitrou · 2024-05-23T09:41:11Z

Is it worth keeping the two modes if the only difference is performance? Usually, it's better if users don't have to make low-level performance decisions themselves.

ZhangHuiGui · 2024-05-24T01:45:44Z

Usually, it's better if users don't have to make low-level performance decisions themselves.

Yes, this is what I have been thinking about recently. The RowTableEncoder and CompareColumnsToRows interfaces are too low-level and are currently only used in grouper and swiss-join. Choosing the optimal performance mode still requires some internal automatic decision-making, mainly because It is not clear to the user whether the order of the input columns has been sorted.

I will mark this PR as DRAFT first, and then start reviewing it when I can provide users with a more elegant way to use it (for example, if internal determines whether column sorting has been completed, the caller can directly use non-sorted mode).

… we enable column-sorted in RowTableEncoder

ZhangHuiGui · 2024-05-25T06:27:05Z

cpp/src/arrow/compute/row/row_internal.cc

+    // Check whether the column_order has changed due to sorting,
+    // and the sorted column order will be used first for better
+    // performance in grouper's compare.
+    if (inverse_column_order[i] != column_order[i]) {


When constructing RowTableMetadata, whether 'the column order after sort changes' will be used as a mark to access column data using sorted-mode or non-sorted mode.

github-actions bot added Component: C++ awaiting review Awaiting review labels Apr 16, 2024

ZhangHuiGui force-pushed the fix-num-groups-wrong-with-column-sorted-off branch from cd0245a to 2ce1313 Compare May 18, 2024 13:26

ZhangHuiGui commented May 18, 2024

View reviewed changes

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 18, 2024

ZhangHuiGui requested review from zanmato1984 and felipecrv May 18, 2024 13:30

ZhangHuiGui force-pushed the fix-num-groups-wrong-with-column-sorted-off branch from 2ce1313 to 2858e2d Compare May 19, 2024 04:40

ZhangHuiGui marked this pull request as draft May 24, 2024 01:45

Added an are_cols_sorted option to RowTableMetadatato control whether…

ff8fd7d

… we enable column-sorted in RowTableEncoder

ZhangHuiGui force-pushed the fix-num-groups-wrong-with-column-sorted-off branch from 2858e2d to b3e8438 Compare May 25, 2024 06:25

ZhangHuiGui commented May 25, 2024

View reviewed changes

ZhangHuiGui marked this pull request as ready for review May 25, 2024 07:07

ZhangHuiGui force-pushed the fix-num-groups-wrong-with-column-sorted-off branch from b3e8438 to 83155e9 Compare May 27, 2024 01:32

Support internal setting sorted-mode

596fde8

ZhangHuiGui force-pushed the fix-num-groups-wrong-with-column-sorted-off branch from 83155e9 to 596fde8 Compare May 30, 2024 06:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-41233: [C++] Added an are_cols_sorted option to RowTableMetadata for control column sorted #41234

GH-41233: [C++] Added an are_cols_sorted option to RowTableMetadata for control column sorted #41234

ZhangHuiGui commented Apr 16, 2024 •

edited by github-actions bot

github-actions bot commented Apr 16, 2024

ZhangHuiGui commented Apr 16, 2024

ZhangHuiGui May 18, 2024

zanmato1984 commented May 20, 2024

ZhangHuiGui commented May 21, 2024

zanmato1984 commented May 21, 2024

pitrou commented May 21, 2024

ZhangHuiGui commented May 21, 2024

pitrou commented May 21, 2024

ZhangHuiGui commented May 22, 2024 •

edited

pitrou commented May 23, 2024

ZhangHuiGui commented May 24, 2024

ZhangHuiGui May 25, 2024

GH-41233: [C++] Added an are_cols_sorted option to RowTableMetadata for control column sorted #41234

Are you sure you want to change the base?

GH-41233: [C++] Added an are_cols_sorted option to RowTableMetadata for control column sorted #41234

Conversation

ZhangHuiGui commented Apr 16, 2024 • edited by github-actions bot

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

github-actions bot commented Apr 16, 2024

ZhangHuiGui commented Apr 16, 2024

ZhangHuiGui May 18, 2024

Choose a reason for hiding this comment

zanmato1984 commented May 20, 2024

ZhangHuiGui commented May 21, 2024

zanmato1984 commented May 21, 2024

pitrou commented May 21, 2024

ZhangHuiGui commented May 21, 2024

Q1

Q2

pitrou commented May 21, 2024

ZhangHuiGui commented May 22, 2024 • edited

pitrou commented May 23, 2024

ZhangHuiGui commented May 24, 2024

ZhangHuiGui May 25, 2024

Choose a reason for hiding this comment

ZhangHuiGui commented Apr 16, 2024 •

edited by github-actions bot

ZhangHuiGui commented May 22, 2024 •

edited