
Enable better vectorization for generic convolution #692

Open — wants to merge 1 commit into master
Conversation

@heshpdx commented Apr 24, 2024

FLAC is being considered for the next version of the SPEC CPU benchmark suite, currently dubbed CPUv8. As such, we are interested in the generic (non-intrinsic) code paths, which can run on all current and future architectures. As part of the intense scrutiny that benchmarks undergo, we spotted a performance improvement opportunity in the generic code path in libFLAC/lpc.c and would like to share it with the community.

In the loop that computes the residual from the qlp coefficients, breaking the single dependence chain into two parallel sub-chains allows vector instructions to be interleaved during execution. This yields a 2-4% performance uplift, as measured on modern ARM systems, thanks to the increased instruction-level parallelism. Our two benchmark workloads encode podcasts using -7ep --replay-gain and -8p. We also tried breaking the chain into four sub-chains, but saw no additional gains.

Break the single dependence chain into two parallel sub-chains.
Provides 2-4% performance uplift as measured on modern ARM systems
when using the generic codepath.
@ktmf01 (Collaborator) commented Apr 26, 2024

Thanks for providing this information and sharing this optimization. I do have some questions though.

You mention this change is an improvement for "modern ARM systems". Did you do testing on other systems and architectures too? Can you provide some numbers? I guess the intrinsics parts are stripped for the benchmark, so performance of this bit of code is important on x86-64 too?

Also, can you explain how you tested this? By default, FLAC is configured to compile with the GCC options -fassociative-math -fno-signed-zeros -fno-trapping-math -freciprocal-math, which should already let the compiler break the dependency chains you mention. Did you test with these options enabled or disabled?

@heshpdx (Author) commented Apr 26, 2024

Sure thing. Since we run within the SPEC CPU harness, all files are built with the same compiler switches (one of the benchmark's requirements). We used gcc-14.0.1-9c7cf5d, and our baseline switches are -g -O3 -flto=32 -funroll-loops -ffast-math --param max-completely-peeled-insns=600.

Here are the measured runtimes in seconds. one-sum is the original baseline, two-sum is with the new code. You are right, intrinsics are stripped so we are comparing apples to apples. It does help x86 as well.

machine / vectors          one-sum   two-sum   delta
AMD Genoa / AVX-128          97.8      96.0     1.8%
AmpereOne / NEON-128        141       134       5.2%
Ampere Altra / NEON-128     252       247       2.0%

After seeing your comment above, we rebuilt with the FLAC default compiler options added, and the results were the same. Those switches don't benefit us here: -fassociative-math could help, but it only applies to floating-point operations, and this function uses integer types. We believe reassociation of integer operations is on by default, at least in GCC.

@ktmf01 (Collaborator) commented May 2, 2024

> After seeing your comment above, we rebuilt with the FLAC default compiler options added, and the results were the same.

Just to be clear, you added those options to the options you mentioned above?

The thing is, I've tried GCC's LTO in the past, and without exception it produced slower builds. So perhaps this optimization only works with LTO, because I am seeing no gains at all. My best guess is that without LTO these functions are already very well optimized, while with LTO they are less so (because LTO results in quite a large binary), and your change drives the compiler to optimize a little more.

@heshpdx (Author) commented May 2, 2024

No, we replaced our flags with your flags. I confirmed that the gains are not coming from LTO by removing -flto and rerunning; the assembly changes are isolated to one file. Which machine are you testing on? My 7-year-old x86 desktop did not show the gains I listed above.

@ktmf01 (Collaborator) commented May 3, 2024

I've tested with an Intel Xeon E-2224G and an Intel Core i5 7200U. Results for the latter are graphed here.

No difference in speed, at the cost of a bigger binary. Compiled with GCC 13.2.0.

Just to be sure, are you using 16-bit, 24-bit input or something else entirely?

@heshpdx (Author) commented May 3, 2024

That's a cool graph! Can you explain the axes? The title says "CPU-time vs compression", which I took to mean "Y-axis vs X-axis", but the axes are labeled differently (compression is on the Y-axis, and CPU-time is not plotted, but CPU-usage is?). Is this from one run or twelve? If you share your input and command lines, I can work to produce the same data on my side and send it over.

It's good that you reproduced my Intel results :-) The desktop machine I mentioned above is an i7-8809G. The likely reason for the lack of improvement is that cores in older x86 machines don't have as many execution units as newer ones, so they cannot take advantage of the increased instruction-level parallelism that the two-sum patch exposes.

My inputs are 32-bit WAV files. If it helps, I have placed them on Dropbox here. I have full rights to the audio content; I produced it myself and am giving it away freely for any use.
