GH-41301: [C++] Extract the kernel loops used for PrimitiveTakeExec and generalize to any fixed-width type #41373
Conversation
I'm skeptical that you want to reuse this for Filter, unless you add Gather methods for batch selection. For Filter performance, it is essential to write out ranges of selected values at a time, not one value at a time. I don't know if that's what you have in mind.
Other than that, some assorted comments.
```cpp
template <typename IndexCType, typename ValueBitWidthConstant,
          typename OutputIsZeroInitialized = std::false_type,
          typename WithFactor = std::false_type>
```
Why not `bool kOutputIsZeroInitialized` and `bool kWithFactor = false`?
No specific reason. I will change it, as I prefer to use `bool` directly myself. I was trying to be consistent with the use of `std::integral_constant<>` for `ValueBitWidthConstant` that you added in a previous PR :)
What's the reasoning for using `std::integral_constant` instead of an integer-typed template parameter?
> What's the reasoning for using `std::integral_constant` instead of an integer-typed template parameter?
I forgot to respond to this, sorry. I don't remember exactly, but using an integer probably failed for some reason, perhaps because it would prevent partial specialization? (but my memory is fuzzy)
I will change everything to `bool` and `int` and see what happens.
The reason these tparams are types and not values is that they are populated from a pack of template type parameters:

```cpp
template <template <typename...> class TakeImpl, typename... Args>
void TakeIndexDispatch(KernelContext* ctx, const ArraySpan& values,
                       const ArraySpan& indices, ArrayData* out, int64_t factor = 1) {
  TakeIndexDispatch<FixedWidthTakeImpl, ...all params here should be types...>
```
Ah, of course. Let's add a comment explaining this for future readers?
Done.
I want to expand the set of
Some assorted comments below. The code looks good in principle.
```cpp
// See derived Gather classes below for the meaning of the parameters, pre- and
// post-conditions.
template <bool kOutputIsZeroInitialized, typename IndexCType>
ARROW_FORCE_INLINE int64_t ExecuteWithNulls(const ArraySpan& src_validity,
```
Is `ARROW_FORCE_INLINE` desirable here? Generally, the compiler is able to make more informed heuristics than us (though admittedly not always). This function is rather complex, so I'm not sure force-inlining it would actually be beneficial (and it could blow up code-generation size).
Yes. I checked the code and benchmarks (a month or more ago).

Here's the overall approach with this header. `GatherBaseCRTP` is a template that generates these templates:

```
template <typename Global, typename Params>
class Gather {
  a simple;
  b scalar;
  c members;

  void WriteValue(pos) { ... }
  void WriteZero(pos) { ... }
  void WriteZeroSegment(pos, len) { ... }

  int64_t ExecuteNoNulls(len) { ... }

  template <bool OutputIsZeroInitialized>
  int64_t Execute(src_validity, idx_validity, out_is_valid) { ... }
};
```
The usage should be as follows: in the same function, the specific `Gather` instance is instantiated and the methods are called on it:

```cpp
const bool out_has_validity = values.MayHaveNulls() || indices.MayHaveNulls();

const uint8_t* src;
int64_t src_offset;
std::tie(src_offset, src) = util::OffsetPointerOfFixedBitWidthValues(values);
uint8_t* out = util::MutableFixedWidthValuesPointer(out_arr);
int64_t valid_count = 0;
arrow::internal::Gather<kValueWidthInBits, IndexCType, WithFactor::value> gather{
    /*src_length=*/values.length,
    src,
    src_offset,
    /*idx_length=*/indices.length,
    /*idx=*/indices.GetValues<IndexCType>(1),
    out,
    factor};
if (out_has_validity) {
  DCHECK_EQ(out_arr->offset, 0);
  // out_is_valid must be zero-initialized, because Gather::Execute
  // saves time by not having to ClearBit on every null element.
  auto out_is_valid = out_arr->GetMutableValues<uint8_t>(0);
  memset(out_is_valid, 0, bit_util::BytesForBits(out_arr->length));
  valid_count = gather.template Execute<OutputIsZeroInitialized::value>(
      /*src_validity=*/values, /*idx_validity=*/indices, out_is_valid);
} else {
  valid_count = gather.Execute();
}
```
What does this achieve? `Gather<kValueWidthInBits, ...> gather{...}` isn't really the initialization of a portion of memory on the stack, but of registers. Because the lifetime of `gather` is confined to a single function and all the method calls on it are inlined, the members of `gather` can live in registers from initialization to the end of the kernel loop. I did this to achieve the same performance we had when these loops were hand-written (more than once!) in `vector_selection_take_internal.cc`.

This idea of keeping variables in registers is mentioned in database implementation reports ("Efficiently Compiling Efficient Query Plans for Modern Hardware"), and it's not just compiler-brain perfectionism (https://llvm.org/docs/Passes.html#sroa-scalar-replacement-of-aggregates).
By force-inlining the `Execute` methods I'm actually reducing code size (together with the fact that I instantiate `Gather` and these specific `Execute` calls from a single place). Without inlining `Execute`, there would be more binary code to deal with construction/destruction and loads to/from memory.

To make sure we only ever instantiate these templates once, I could wrap this code in templates outside the inline namespace. The templates would contain the initialization of `Gather<...> gather` and the inlined calls to the two inlined `Execute` functions.
Do you want me to build that functional façade to these classes before we concretely have more than two instantiations of these templates? I could create an issue to track this concern.
I know using registers is critical for performance. I'm just surprised that the compiler wouldn't cache member variables in registers in a (non-inlined) method's body. Adding a couple of memory lookups at the start and end of such a heavy function should not yield a noticeable slowdown.

Just because a compiler feels better with some particular organization of the source code doesn't mean we've unlocked a robust performance improvement. We may just have hit a sweet spot in this optimizer's behaviour.

In any case, we can proceed with this. This is helper code that should only be used in a couple of kernels, so the potential code-size increase is hopefully (!) minor.
I will remove the inline namespace annotation as well, as it seems to be a red herring.
> I'm just surprised that the compiler wouldn't cache member variables in registers in a (non-inlined) method's body. Adding a couple of memory lookups at the start and end of such a heavy function should not yield a noticeable slowdown.

Without inlining `Execute`, there has to be a `this` pointer pointing to the struct laid out in memory, and calls to `Write*` have to load members through the `this` pointer. Not an impossible optimization, but it wasn't happening.
> Without inlining `Execute`, there has to be a `this` pointer pointing to the struct laid out in memory, and calls to `Write*` have to load members through the `this` pointer. Not an impossible optimization, but it wasn't happening.
I see. Compilers are annoying :-)
Proof that this PR reduces binary size:

```shell
~/code/arrow/cpp/release $ wc -c ./src/arrow/CMakeFiles/arrow_compute.dir/compute/kernels/vector_selection_internal.cc.o
278808 ./src/arrow/CMakeFiles/arrow_compute.dir/compute/kernels/vector_selection_internal.cc.o
$ git co origin/main
HEAD is now at 7f0c4070dd GH-41397: [C#] Downgrade macOS test runner to avoid infrastructure bug (#41934)
~/code/arrow/cpp/release $ ninja ./src/arrow/CMakeFiles/arrow_compute.dir/compute/kernels/vector_selection_internal.cc.o
[1/1] Building CXX object src/arrow/CMakeFiles/arrow_compute.dir/compute/kernels/vector_selection_internal.cc.o
~/code/arrow/cpp/release $ wc -c ./src/arrow/CMakeFiles/arrow_compute.dir/compute/kernels/vector_selection_internal.cc.o
289832 ./src/arrow/CMakeFiles/arrow_compute.dir/compute/kernels/vector_selection_internal.cc.o
```

-11024 bytes 🎉
Nice! Can you also post numbers obtained with the
Looking at both:

```shell
bloaty -d symbols -C full -n 0 \
  HEAD-vector_selection_take_internal.cc.o HEAD-vector_selection_internal.cc.o -- \
  MAIN-vector_selection_take_internal.cc.o MAIN-vector_selection_internal.cc.o
```

    FILE SIZE        VM SIZE
 --------------  --------------
[NEW] +32.8Ki [NEW] +32.7Ki arrow::compute::internal::FixedWidthTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*)
[NEW] +204 [NEW] +168 GCC_except_table272
[NEW] +189 [NEW] +104 arrow::Status arrow::Status::NotImplemented<char const (&) [38], arrow::DataType const&>(char const (&&&) [38], arrow::DataType const&&&)
+183% +168 +300% +168 GCC_except_table170
+43% +164 +47% +164 GCC_except_table20
[NEW] +156 [NEW] +120 GCC_except_table302
[NEW] +146 [NEW] +76 GCC_except_table99
+40% +109 +71% +144 GCC_except_table24
+72% +143 +66% +108 GCC_except_table13
[NEW] +126 [NEW] +56 GCC_except_table60
[NEW] +108 [NEW] +72 GCC_except_table138
+169% +108 +257% +72 GCC_except_table172
[NEW] +108 [NEW] +72 GCC_except_table225
[NEW] +107 [NEW] +72 GCC_except_table64
+195% +107 +360% +72 GCC_except_table87
[NEW] +96 [NEW] +60 GCC_except_table111
+160% +96 +250% +60 GCC_except_table151
+25% +95 +18% +60 GCC_except_table17
[NEW] +95 [NEW] +60 GCC_except_table72
[NEW] +95 [NEW] +60 GCC_except_table81
+116% +88 +220% +88 GCC_except_table192
[NEW] +84 [NEW] +48 GCC_except_table188
[NEW] +76 [NEW] +40 GCC_except_table107
+96% +68 +189% +68 GCC_except_table10
[NEW] +68 [NEW] +32 GCC_except_table102
[NEW] +68 [NEW] +32 GCC_except_table114
[NEW] +68 [NEW] +32 GCC_except_table144
[NEW] +68 [NEW] +32 GCC_except_table197
[NEW] +68 [NEW] +32 GCC_except_table223
[NEW] +67 [NEW] +32 GCC_except_table43
[NEW] +67 [NEW] +32 GCC_except_table53
[NEW] +67 [NEW] +32 GCC_except_table90
[NEW] +64 [NEW] +28 GCC_except_table164
[NEW] +64 [NEW] +28 GCC_except_table199
[NEW] +64 [NEW] +28 GCC_except_table207
[NEW] +64 [NEW] +28 GCC_except_table213
[NEW] +63 [NEW] +28 GCC_except_table28
[NEW] +60 [NEW] +24 GCC_except_table128
[NEW] +60 [NEW] +24 GCC_except_table155
[NEW] +60 [NEW] +24 GCC_except_table298
[NEW] +59 [NEW] +24 GCC_except_table36
[NEW] +56 [NEW] +20 GCC_except_table182
[NEW] +56 [NEW] +20 GCC_except_table190
[NEW] +56 [NEW] +20 GCC_except_table209
[NEW] +52 [NEW] +16 GCC_except_table133
[NEW] +52 [NEW] +16 GCC_except_table134
[NEW] +52 [NEW] +16 GCC_except_table162
[NEW] +52 [NEW] +16 GCC_except_table168
[NEW] +52 [NEW] +16 GCC_except_table179
[NEW] +52 [NEW] +16 GCC_except_table194
+100% +52 +100% +16 GCC_except_table202
[NEW] +52 [NEW] +16 GCC_except_table218
[NEW] +52 [NEW] +16 GCC_except_table221
[NEW] +52 [NEW] +16 GCC_except_table277
[NEW] +52 [NEW] +16 GCC_except_table279
[NEW] +51 [NEW] +16 GCC_except_table38
[NEW] +50 [NEW] +16 GCC_except_table4
+92% +48 +300% +48 GCC_except_table132
[NEW] +47 [NEW] +12 GCC_except_table54
[NEW] +42 [NEW] +16 lCPI223_1
[NEW] +40 [NEW] +16 lJTI2_8
[NEW] +40 [NEW] +16 lJTI2_9
+25% +32 +53% +32 GCC_except_table84
+0.4% +32 +0.4% +32 ltmp9
+30% +31 -5.9% -4 GCC_except_table85
[NEW] +31 [NEW] +5 lJTI170_0
[NEW] +30 [NEW] +4 lJTI170_1
+44% +28 +100% +28 GCC_except_table171
-6.4% -7 +70% +28 GCC_except_table35
+32% +24 +60% +24 GCC_except_table177
+46% +24 +150% +24 GCC_except_table178
+38% +20 +125% +20 GCC_except_table217
+26% +15 +100% +16 GCC_except_table100
+19% +12 +43% +12 GCC_except_table206
+19% +12 +43% +12 GCC_except_table212
+14% +8 +40% +8 GCC_except_table181
+5.6% +4 +11% +4 GCC_except_table216
+2.9% +2 +7.4% +2 typeinfo name for arrow::StructArray
+1.2% +2 +2.6% +2 typeinfo name for std::__1::__shared_ptr_emplace<arrow::ChunkedArray, std::__1::allocator<arrow::ChunkedArray> >
-6.8% -4 -16.7% -4 GCC_except_table83
-13.3% -8 -33.3% -8 GCC_except_table149
-17.9% -12 -37.5% -12 GCC_except_table41
-20.7% -12 -50.0% -12 GCC_except_table8
-23.9% -16 -50.0% -16 GCC_except_table52
-26.3% -20 -50.0% -20 GCC_except_table215
[DEL] -30 [DEL] -4 lJTI169_1
-27.3% -30 -46.9% -30 lJTI2_0
[DEL] -31 [DEL] -5 lJTI169_0
+2.9% +3 -47.1% -32 GCC_except_table34
[DEL] -41 [DEL] -16 lJTI25_0
[DEL] -42 [DEL] -16 lCPI222_1
-40.7% -44 -61.1% -44 GCC_except_table173
[DEL] -46 [DEL] -12 GCC_except_table7
[DEL] -47 [DEL] -12 GCC_except_table55
-0.6% -48 -0.6% -48 arrow::compute::internal::FSLTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*)
[DEL] -51 [DEL] -16 GCC_except_table37
[DEL] -51 [DEL] -16 GCC_except_table51
[DEL] -51 [DEL] -16 GCC_except_table61
[DEL] -52 [DEL] -16 GCC_except_table135
[DEL] -52 [DEL] -16 GCC_except_table148
[DEL] -52 [DEL] -16 GCC_except_table161
[DEL] -52 [DEL] -16 GCC_except_table203
[DEL] -52 [DEL] -16 GCC_except_table285
[DEL] -52 [DEL] -16 GCC_except_table287
[DEL] -52 [DEL] -16 GCC_except_table288
[DEL] -55 [DEL] -20 GCC_except_table33
[DEL] -55 [DEL] -20 GCC_except_table40
[DEL] -56 [DEL] -20 GCC_except_table208
[DEL] -56 [DEL] -20 GCC_except_table214
-48.3% -58 -46.2% -24 GCC_except_table2
[DEL] -60 [DEL] -24 GCC_except_table127
[DEL] -60 [DEL] -24 GCC_except_table306
[DEL] -63 [DEL] -28 GCC_except_table29
[DEL] -64 [DEL] -28 GCC_except_table163
[DEL] -64 [DEL] -28 GCC_except_table180
[DEL] -68 [DEL] -32 GCC_except_table113
[DEL] -68 [DEL] -32 GCC_except_table145
-22.1% -72 -28.1% -72 GCC_except_table11
[DEL] -75 [DEL] -40 GCC_except_table59
[DEL] -76 [DEL] -40 GCC_except_table108
[DEL] -76 [DEL] -40 GCC_except_table205
[DEL] -76 [DEL] -40 GCC_except_table211
-61.3% -92 -79.3% -92 GCC_except_table9
-20.2% -95 -15.0% -60 GCC_except_table16
[DEL] -95 [DEL] -60 GCC_except_table71
[DEL] -95 [DEL] -60 GCC_except_table82
[DEL] -95 [DEL] -60 GCC_except_table98
[DEL] -96 [DEL] -60 GCC_except_table110
-61.5% -96 -71.4% -60 GCC_except_table152
-64.3% -99 -76.2% -64 GCC_except_table86
[DEL] -100 [DEL] -64 GCC_except_table131
[DEL] -100 [DEL] -64 GCC_except_table174
[DEL] -103 [DEL] -24 typeinfo for arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl
[DEL] -107 [DEL] -72 GCC_except_table65
[DEL] -108 [DEL] -72 GCC_except_table139
[DEL] -108 [DEL] -72 GCC_except_table226
[DEL] -113 [DEL] -28 arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl::Finish()
[DEL] -120 [DEL] -48 GCC_except_table222
[DEL] -127 [DEL] -48 vtable for arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl
[DEL] -132 [DEL] -60 GCC_except_table198
-24.1% -132 -25.8% -132 GCC_except_table21
[DEL] -133 [DEL] -8 arrow::compute::internal::(anonymous namespace)::Selection<arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl, arrow::FixedSizeBinaryType>::Init()
[DEL] -134 [DEL] -64 GCC_except_table89
[DEL] -136 [DEL] -64 GCC_except_table101
-26.2% -136 -28.1% -136 GCC_except_table18
[DEL] -137 [DEL] -16 typeinfo for arrow::compute::internal::(anonymous namespace)::Selection<arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl, arrow::FixedSizeBinaryType>
[DEL] -137 [DEL] -58 typeinfo name for arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl
[DEL] -140 [DEL] -68 GCC_except_table189
-64.8% -140 -72.2% -104 GCC_except_table193
[DEL] -149 [DEL] -96 arrow::FixedSizeBinaryArray::~FixedSizeBinaryArray()
-74.5% -152 -90.5% -152 GCC_except_table280
[DEL] -156 [DEL] -120 GCC_except_table310
[DEL] -169 [DEL] -48 vtable for arrow::compute::internal::(anonymous namespace)::Selection<arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl, arrow::FixedSizeBinaryType>
-50.1% -208 -54.7% -208 GCC_except_table22
[DEL] -221 [DEL] -100 typeinfo name for arrow::compute::internal::(anonymous namespace)::Selection<arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl, arrow::FixedSizeBinaryType>
[DEL] -252 [DEL] -8 arrow::compute::internal::(anonymous namespace)::Selection<arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl, arrow::FixedSizeBinaryType>::~Selection()
-2.1% -256 -2.1% -256 ltmp5
[DEL] -312 [DEL] -240 GCC_except_table169
-82.5% -316 -90.8% -316 GCC_except_table25
-8.9% -352 -9.2% -352 arrow::compute::internal::PopulateTakeKernels(std::__1::vector<arrow::compute::internal::SelectionKernelData, std::__1::allocator<arrow::compute::internal::SelectionKernelData> >*)
[DEL] -464 [DEL] -304 arrow::compute::internal::(anonymous namespace)::FSBSelectionImpl::~FSBSelectionImpl()
-0.8% -677 [ = ] 0 [Unmapped]
[DEL] -5.83Ki [DEL] -5.72Ki arrow::compute::internal::FSBTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*)
[DEL] -23.9Ki [DEL] -23.8Ki arrow::compute::internal::PrimitiveTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*)
-0.1% -504 +0.5% +1.45Ki TOTAL

UPDATE: The last 3 commits lead to:

Impact on origin/main:
  Before this commit:  -0.1% -504  +0.5% +1.45Ki  TOTAL
  After this commit:   -0.1% -392  +0.5% +1.27Ki  TOTAL

In a follow-up PR I'm adding a Status return type out of necessity, so this change is not only for binary-size reasons:
  -0.1% -528  +0.5% +1.21Ki  TOTAL
Rationale for this change

I want to instantiate this primitive operation in other scenarios (e.g. the optimized version of `Take` that handles chunked arrays) and extend the sub-classes of `GatherCRTP` with different member functions that re-use the `WriteValue` function generically (any fixed-width type and even bit-wide booleans).

When taking these improvements to `Filter`, I will also re-use the "gather" concept and parameterize it by bitmaps/boolean-arrays instead of selection vectors (`indices`) like `take` does. So gather is not a "renaming of take" but rather a generalization of what `take` and `filter` do in Arrow, with different representations of what should be gathered from the `values` array.

What changes are included in this PR?

- `Take` implementation for values/indices without nulls
- unification of the `Take` kernels into a single one

Are these changes tested?

- `Take` guarantees null values are zeroed out