
Enable better vectorization for generic convolution #692

Open — wants to merge 1 commit into master
Conversation

@heshpdx commented Apr 24, 2024

FLAC is being considered for the next version of the SPEC CPU benchmark suite, currently dubbed CPUv8. As such, we are interested in the generic (non-intrinsic) code paths, which can run on all current and future architectures. As part of the intense scrutiny that benchmarks undergo, we spotted a performance improvement opportunity in the generic code path in libFLAC/lpc.c and would like to share it with the community.

In the loop that computes the residual from the qlp coefficients, breaking the single dependence chain into two parallel sub-chains allows vector instructions to be interleaved during execution. This yields a 2-4% performance uplift, as measured on modern ARM systems, thanks to the increased instruction-level parallelism. Our two benchmark workloads encode podcasts using -7ep --replay-gain and -8p. We also tried breaking the chain into four sub-chains, but saw no additional gains.

Break the single dependence chain into two parallel sub-chains.
Provides 2-4% performance uplift as measured on modern ARM systems
when using the generic codepath.
@ktmf01 (Collaborator) commented Apr 26, 2024

Thanks for providing this information and sharing this optimization. I do have some questions though.

You mention this change is an improvement for "modern ARM systems". Did you do testing on other systems and architectures too? Can you provide some numbers? I guess the intrinsics parts are stripped for the benchmark, so performance of this bit of code is important on x86-64 too?

Also, can you explain how you tested this? By default, FLAC is configured to compile with the GCC options -fassociative-math -fno-signed-zeros -fno-trapping-math -freciprocal-math, which should already let the compiler break the dependency chains you mention. Did you test with these options enabled or disabled?

@heshpdx (Author) commented Apr 26, 2024

Sure thing. Since we run within the SPEC CPU harness, all files are built with the same compiler switches (one of the benchmark's requirements). We used gcc-14.0.1-9c7cf5d, and our baseline switches are -g -O3 -flto=32 -funroll-loops -ffast-math --param max-completely-peeled-insns=600.

Here are the measured runtimes in seconds. one-sum is the original baseline, two-sum is with the new code. You are right, intrinsics are stripped so we are comparing apples to apples. It does help x86 as well.

machine / vectors          one-sum   two-sum   delta
AMD Genoa / AVX-128          97.8      96.0     1.8%
AmpereOne / NEON-128        141       134       5.2%
Ampere Altra / NEON-128     252       247       2.0%

After seeing your comment above, we rebuilt with the FLAC default compiler options added, and the results were the same. Those switches don't benefit us here: -fassociative-math could help, but it only applies to floating-point operations, and this function uses integer types. We believe reassociation of integer operations is on by default, at least in GCC.

@ktmf01 (Collaborator) commented May 2, 2024

> After seeing your comment above, we rebuilt with the FLAC default compiler options added, and the results were the same.

Just to be clear, you added those options to the options you mentioned above?

The thing is, I've tried GCC's LTO in the past, and without exception it produced slower builds. So perhaps this optimization only works with LTO, because I am seeing no gains at all. My best guess is that without LTO these functions are already very well optimized, while with LTO they are less so (because LTO results in quite a large binary), and your change drives the compiler to optimize a little more.

@heshpdx (Author) commented May 2, 2024

No, we replaced our flags with your flags. I confirmed that the gains are not coming from LTO by removing -flto and rerunning; the assembly changes are isolated to one file. Which machine are you testing on? My 7-year-old x86 desktop did not show the gains I listed above.

@ktmf01 (Collaborator) commented May 3, 2024

I've tested with an Intel Xeon E-2224G and an Intel Core i5 7200U. Results for the latter are graphed here.

No difference in speed, at the cost of a bigger binary. Compiled with GCC 13.2.0.

Just to be sure, are you using 16-bit, 24-bit input or something else entirely?

@heshpdx (Author) commented May 3, 2024

That's a cool graph! Can you explain the axes? The title says "CPU-time vs compression", which I took to mean "Y-axis vs X-axis", but the axes are labeled differently (compression is on the Y-axis, and CPU-time is not plotted, but CPU-usage is?). Is this from one run or twelve? If you share your input and command lines, I can work to produce the same data on my side and send it over.

It's good that you reproduced my Intel results :-) The desktop machine I mentioned above is an i7-8809G. The likely reason for the lack of improvement is that cores in older x86 machines don't have as many execution units as newer ones, so they cannot take advantage of the increased instruction-level parallelism that the two-sum patch exposes.

My inputs are 32-bit WAV files. If it helps, I have placed them on Dropbox here. I have full rights to the audio content; I produced it myself and am giving it away freely for any use.
