Refactor to improve code size, fix GCC 12 #802

Draft
wants to merge 11 commits into base: dev

Conversation

@easyaspi314 (Contributor) commented Feb 23, 2023

Overview

Kinda a dump of (mostly) related changes I've been sitting on and never bothered to commit.

This is a set of changes to:

  • Reduce code size on -O3
    • Especially on 32-bit, where -O3 could get well over 100 kB .text
    • I feel this could still be improved if hashLong constant propagation can be reduced
  • Improve -Os/XXH_SIZE_OPT performance
  • Make things more clean and modular

These haven't been micro-benchmarked yet. The "what is worth inlining" is a rough estimate, and I want to go through the various recent compilers to find what works best.

Major changes:

  • GCC 12 will now be forced to -O3 to work around the -Og bug. Fixes #800 ("GCC-12 fails to inline XXX_FORCED_INLINE functions, XXH3_accumulate_512_sse2(), and XXH3_scrambleAcc_sse2() with -Og flag") and #720 ("Pb when building with gcc-12 -Og").
    • Only applies when __OPTIMIZE__ is defined, __NO_INLINE__ is not, and XXH_SIZE_OPT <= 0 (i.e. not __OPTIMIZE_SIZE__), so both -O2 and -Og are covered
    • I tried adjusting the inline hints instead, but that tanked performance.
  • XXH_INLINE: Acts like the existing XXH_FORCE_INLINE, i.e. it is disabled when XXH_SIZE_OPT is used (see the sketch after this list).
    • XXH_FORCE_INLINE now actually force-inlines unconditionally (when inline hints are on) and is used on the core loops and utilities of XXH3 and XXH32.
    • Greatly improves GCC -Os performance with only a small code size overhead
  • XXH_32BIT_OPT, XXH_INLINE_64BIT: These are used to reduce code bloat on 32-bit
    • This, together with the other related optimizations, reduces clang armv4t -O3 code size by about half, to under 100 kB
  • XXH32 and XXH64 have been rewritten to reuse the algorithm steps
  • XXH3_update has been rewritten to be much easier to understand and much smaller.
  • XXH3_128bits_update now tail calls XXH3_64bits_update, greatly reducing code size at the cost of one branch
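
As a rough sketch of how these hint macros relate (a simplification for this summary, not the literal definitions from the PR; the real conditions also involve __OPTIMIZE__, __NO_INLINE__ and compiler detection):

/* Sketch only: approximate relationship between the inline-hint macros. */
#ifndef XXH_SIZE_OPT
#  define XXH_SIZE_OPT 0  /* assumed default */
#endif

/* Unconditional force-inline, reserved for the hot core loops and small
 * utilities (accumulate_512, scrambleAcc, ...). */
#define XXH_FORCE_INLINE static __attribute__((always_inline)) inline

/* The old behavior: force-inline, but back off to a plain static function
 * when size optimization is requested. */
#if XXH_SIZE_OPT >= 1
#  define XXH_INLINE static
#else
#  define XXH_INLINE XXH_FORCE_INLINE
#endif

/* Inline only where 64-bit arithmetic is native, to keep 32-bit builds small. */
#if defined(XXH_32BIT_OPT) && XXH_32BIT_OPT
#  define XXH_INLINE_64BIT static
#else
#  define XXH_INLINE_64BIT XXH_INLINE
#endif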

Minor changes:

  • GCC 11 and higher no longer force -O2 on AVX2; that bug has been fixed.
  • XXH32_state_t and XXH64_state_t have some fields renamed and now use unsigned char [N] for their buffer
  • XXH_OLD_NAMES has been removed; it's been long enough
  • Fixed a few inline hints
  • Some reused snippets in XXH3 have been extracted into functions.
  • Fixed a theoretical integer overflow in the update code

 - Fix the state structs to use unsigned char, update names
 - Extract the algorithm steps into inline subroutines
 - Fix a theoretical integer overflow bug
 - Reroll XXH64 on 32-bit (it is going to run like crap anyway)
It's been long enough
 - Add `XXH_32BIT_OPT` which is enabled for targets without 64-bit arithmetic
 - Add a 64-bit only inline hint, apply it to the functions that need them
   - TODO benchmark them more
 - Tail call XXH3_64bits_update instead of inlining XXH3_update
 - Reroll midrange on 32-bit (ARM is the only one that doesn't choke on the 128-bit
   multiplies)
 - Extract the copy-pasted midrange finalization code in XXH128
XXH3_consumeStripes can now process multiple blocks, greatly simplifying
the logic and making it resemble XXH32 and XXH64.

This also reduces code size by a solid amount.
 - XXH_SIZE_OPT now, instead of disabling inline hints, turns off
   *most* inline hints
    - Things like acc512 and utilities remain force inline
 - XXH_FORCE_INLINE now always force inlines for functions that need it.
Not the best solution, but not force inlining the accumulate functions
is more of a problem than forcing GCC to -O3.
@easyaspi314 (Contributor, Author):

Seems like some apt repos are down which is causing some things to fail.

@easyaspi314 (Contributor, Author) commented Feb 23, 2023

I also have an idea that seems to get full performance (at least at large scale; I'll have to test the latency) with only 2 copies of XXH3_hashLong_internal_loop.

This manages to get clang aarch64 -O3 to 24 kB compared to 36 kB, and clang armv4t -O3 from 92 kB to 61 kB (far less than half of the 160 kB before this PR)

typedef struct {
    XXH_ALIGN_MEMBER(64, xxh_u64 acc[8]);
} XXH3_acc_t;

XXH_FORCE_INLINE XXH3_acc_t
XXH3_hashLong_internal_loop_impl(...)
{
    XXH3_acc_t acc = { XXH3_INIT_ACC };
    // normal internal_loop stuff
    return acc;
}

XXH_NO_INLINE XXH3_acc_t
XXH3_hashLong(const xxh_u8* XXH_RESTRICT input, size_t len,
              const xxh_u8* XXH_RESTRICT secret, size_t secretSize)
{
    return XXH3_hashLong_internal_loop_impl(input, len, secret, secretSize,
                                            XXH3_accumulate, XXH3_scrambleAcc);
}
#if XXH_SIZE_OPT <= 0
/* Constant folded for the default secret size */
XXH_NO_INLINE XXH3_acc_t
XXH3_hashLong_defaultSecretSize(const xxh_u8* XXH_RESTRICT input, size_t len,
              const xxh_u8* XXH_RESTRICT secret, size_t secretSize)
{
    (void) secretSize;
    return XXH3_hashLong_internal_loop_impl(input, len, secret, sizeof(XXH3_kSecret),
                                            XXH3_accumulate, XXH3_scrambleAcc);
}
#else
/* Don't constant fold on size opt */
#  define XXH3_hashLong_defaultSecretSize XXH3_hashLong
#endif

Returning a struct like this is functionally identical to memcpying into an out pointer at the ABI level, but the compiler knows more about it and can optimize it better. (Just using an output pointer without inlining is 40% slower.)
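
For illustration only, a minimal contrast of the two styles (hypothetical names, reusing the XXH3_acc_t type from the snippet above; this is not code from the PR):

/* Out-pointer style: the callee writes through a caller-provided pointer.
 * Without inlining, the compiler must assume *out may alias other memory. */
XXH_NO_INLINE void hashLong_out_sketch(XXH3_acc_t* out)
{
    int i;
    for (i = 0; i < 8; i++) out->acc[i] = 0;  /* placeholder body */
}

/* Return-by-value style: the ABI still passes a hidden destination pointer,
 * but that destination is known to be a fresh, non-aliased temporary
 * (LLVM marks it noalias), so the result can be built in place without
 * defensive copies. */
XXH_NO_INLINE XXH3_acc_t hashLong_ret_sketch(void)
{
    XXH3_acc_t acc = { { 0 } };  /* placeholder body */
    return acc;
}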

I would have to clean it up, rename it, and test it some more, though, so I would probably keep this for another PR.

Edit: This could also be a step towards making dispatching on x86 standard, since only 2-3 functions would then need to be overridden (and the feature test is 99% logging; it normally compiles to something like 50 instructions).

@easyaspi314 (Contributor, Author) commented Feb 23, 2023

How do the rest of these changes look, aside from that one (which, since it would affect dispatching, is a huge change that will need some more testing)? The changes shouldn't affect 64-bit codegen that much (although XXH3_update may be worth some benchmarking).

The biggest changes codegen-wise are 32-bit no longer being forced to inline and unroll ugly functions as much (they have more latency than a call frame/loop and aren't worth the code size increase), and -Os being faster.

@easyaspi314 (Contributor, Author):

I'm going to leave the hashLong rework for another PR, because it would require reworking dispatching, and that would be a perfect time to redo that mess.

@Cyan4973 (Owner):

I'm trying to get a sense of the scope of the changes in this PR.

To rephrase it:

  • It's mostly focused on reducing the binary size, though speed might also be affected as a side effect
  • It mostly concerns the -Os compilation mode; the performance modes -O2 and -O3 should not be negatively impacted
  • It mostly concerns 32-bit compilation mode; 64-bit should not be negatively impacted
  • Speed change is expected to be roughly neutral, except for 32-bit -Os, where it's expected to be positive
  • Not sure whether the ARM target is impacted differently than the x86/x64 one

If that's the correct scope, then there are a number of measurements to do and publish to document this PR.

Longer term, I wonder if there is something that could / should be done to ensure these properties are preserved in the future. Otherwise, it would be easy for a future PR to inflate the binary size without anyone noticing it.

@Cyan4973 (Owner) commented Feb 27, 2023

I made a series of measurements, leading to results that surprised me, comparing this pr802 to the dev branch.

Let's start with x64 compilation mode. The compiler is gcc v9.4.0, the CPU an i7-9700k (turbo off). Only performance flags are used, meaning the vector extension is the default SSE2. Let's measure the size of the produced libxxhash.

| compiler flags | dev lib size | pr802 lib size | difference |
|---|---|---|---|
| -O3 | 76392 | 60080 | -16312 |
| -O2 | 51936 | 43912 | -8024 |
| -Os | 32208 | 36600 | +4392 |

So that's the first surprise: the changes are beneficial to binary size in the performance modes, but in the size-optimized -Os mode, binary size actually increases. It's not necessarily a bad deal, but since it seems to achieve the inverse of what's advertised, I figured it's worth underlining.

Now let's have a look at performance. The scenario measured is the default -b, aka 100 KB (long hash). The numbers are for XXH32, XXH64 and XXH128, in that order.

| compiler flags | dev GB/s | pr802 GB/s | note |
|---|---|---|---|
| -O3 | 6.9, 13.7, 20.1 | 6.9, 13.7, 20.1 | identical |
| -O2 | 6.9, 13.7, 14.0 | 6.9, 13.7, 14.0 | identical |
| -Os | 6.9, 13.7, 12.6 | 6.9, 13.7, 14.9 | XXH128 is improved, even faster than -O2 |

To begin with, at -O3 and -O2, pr802 features no speed change. So, combined with the lower binary size, pr802 is a global win there. Nicely done.
Note how the performance of XXH128 plummets at -O2, probably because gcc needs -O3 to generate efficient vector code. XXH32 and XXH64, in contrast, remain unaffected.
Finally, at -Os, XXH32 and XXH64 remain performance-neutral, while XXH128 improves with pr802, to the point of running slightly faster than at -O2 (yet far from -O3). I wonder if there is a useful lesson there.
That makes -Os plainly better than -O2 on pr802, since it wins on both speed and binary size.

So, some surprises, but all in all, this PR has positives.
One could still argue that, for XXH32 and XXH64, dev + -Os is actually the better choice, since performance is unaffected while binary size is the smallest.
But for XXH128, note that the larger binary size at -Os comes with better performance too, so it's a reasonable trade-off. In any case, pr802 + -Os is superior in all metrics to dev + -O2, since it provides both a smaller binary size and faster XXH128. That's worth mentioning.

Now, it was also mentioned that this PR is supposed to be even more useful in 32-bit mode, so let's try that. Once again, the compiler is gcc v9.4.0 and only performance flags are changed, meaning that vector extensions are likely disabled by default.

| compiler flags | dev lib size | pr802 lib size | difference |
|---|---|---|---|
| -m32 -O3 | 111860 | 73174 | -38686 |
| -m32 -O2 | 70924 | 46968 | -23956 |
| -m32 -Os | 34876 | 34920 | +44 |

The first thing to note is how big these binaries are compared to the x64 ones. This might be because there is no vector extension by default. It could be tested in more detail by separating the XXH3 variants from the rest.
The binary size decrease from pr802 is huge for the performance modes, especially at -O3.
Once again, -Os goes against the trend, though the difference is minimal in this case; it's essentially a tie.

Now, what about performance?

| compiler flags | dev GB/s | pr802 GB/s | note |
|---|---|---|---|
| -m32 -O3 | 6.9, 3.1, 5.7 | 6.9, 2.9, 5.6 | |
| -m32 -O2 | 6.9, 2.8, 2.8 | 6.9, 2.7, 2.7 | |
| -m32 -Os | 6.9, 2.7, 2.6 | 6.9, 2.6, 2.4 | |

Once again, XXH32 proves very resilient, offering the same performance in all compilation modes; note that it's the same performance as in x64 mode, despite the lower number of registers.
We already knew that XXH64 is a bad fit for 32-bit mode, and this is confirmed again in these measurements, no surprise. Note though that in this case, -O3 does deliver some speed benefit (almost +10%), so it's no longer clear that -Os is the better choice for XXH64.
XXH128 performance is also much lower, most likely due to the absence of vector extensions. If SSE2 were still there, it would be a very different story. Anyway, in this case, -O3 improves XXH128 performance by a whopping +100%, so it really matters.
Finally, in these measurements, the performance of pr802 seems slightly lower than dev. Nothing dramatic, but worth mentioning.

It's difficult to draw conclusions from this compilation exercise. pr802 delivers significant size benefits at -O3 and -O2, and the minor performance losses are probably acceptable. What's more concerning is the -Os mode, for which pr802 is both slightly larger and slightly slower. Nothing serious, but still not in the right direction. And I initially expected this mode to be the most beneficial for pr802, so clearly there was some misunderstanding.

There are other axes that could be analyzed: for example, adding an aarch64 test, and maybe a 32-bit armv7 one too; measuring the binary size impact of XXH3 separately from the legacy XXH32 and XXH64; adding a few different vector extensions; adding a few different compilers. The test matrix quickly becomes overwhelming, and it's likely going to be difficult to produce a change that is beneficial across the board.

One potential idea: is it possible to break this PR into smaller stages? Maybe some of these stages produce uncontroversial benefits that could be quickly merged? Maybe others require pondering some trade-offs (which can still be acceptable)? Finally, maybe there are useful learnings that could be used to improve other parts of the PR?
Open for discussion.

Also, I would recommend rebasing this PR on top of dev, just to make sure that observed performance differences come solely from this PR and are not confounded with some parallel PR that was merged into dev but not into pr802.

@Cyan4973 (Owner) commented Feb 27, 2023

Since the focus of this PR is on binary size, I wanted to take a more in-depth look at these size changes.
In the following tables, I'm separating the XXH32, XXH64 and XXH128 variants by producing limited dynamic library binaries (note: maybe I should have used the static library size instead).
For XXH32, the produced libxxhash only contains XXH32 and some common functions.
For XXH64, the produced libxxhash has the previous content plus XXH64, but no XXH3 variant.
Finally, for XXH128, it's the full library.
The hope is that this makes it possible to observe the binary size changes of XXH32 and XXH64 separately, without mixing in the (probably larger) changes due to the vector extensions exploited by the XXH3 variants.
In the same vein, it will be more natural to add more vector extensions for comparison.
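
(The exact build commands aren't shown above; as a reference point, one way to produce such reduced builds is through xxHash's exclusion macros, assuming that's roughly what was used here:)

# XXH32 only: XXH_NO_LONG_LONG removes XXH64 and all XXH3/XXH128 code
export CFLAGS="-O3 -DXXH_NO_LONG_LONG" && make clean all
# XXH32 + XXH64, but no XXH3 variant
export CFLAGS="-O3 -DXXH_NO_XXH3" && make clean all
# full library
export CFLAGS="-O3" && make clean all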

Let's start this exercise with XXH32:

| algo | target | compiler | flags | dev size | pr802 size | lib size diff |
|---|---|---|---|---|---|---|
| XXH32 | x64 | gcc v9.4.0 | -O3 | 16576 | 16556 | -20 |
| XXH32 | x64 | gcc v9.4.0 | -O2 | 16616 | 16616 | 0 |
| XXH32 | x64 | gcc v9.4.0 | -Os | 16600 | 16664 | +64 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O3 | 15456 | 15456 | 0 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O2 | 15488 | 15488 | 0 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -Os | 15440 | 15532 | +92 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O3 | 33934 | 33934 | 0 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O2 | 33934 | 33934 | 0 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -Os | 33966 | 34014 | +48 |

A few interesting details here.
To begin with, we know that XXH32 performance remains largely unaffected, staying at 6.9 GB/s on the i7-9700k (no turbo) target CPU, whatever the compilation mode and flags. The story is essentially the same on M1, with XXH32 consistently delivering 7.0 GB/s, and a small (unimportant) degradation with pr802 at -Os.
We also note that the library size differences between modes and commits remain extremely small, which is consistent with that.

Two remarkable trends are worth mentioning though:

  • Modes expected to produce smaller binaries actually produce larger ones. This trend is clearer for pr802. At the end of the day the differences are not large, but it still goes against expectations.
  • pr802 is especially worse at binary size in the -Os mode. This is also the reverse of my initial expectations.

It could be that these binary size differences are in fact unrelated to XXH32 proper, and due to minor differences in other generic symbols still compiled into the library, such as XXH_versionNumber() for example.

In conclusion, when it comes to XXH32, this pr802 is roughly neutral.

Now, onto XXH64:

| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH64 | x64 | gcc v9.4.0 | -O3 | 17072 | 17072 | 0 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -O2 | 17016 | 17016 | 0 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -Os | 17040 | 17144 | +104 | 13.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O3 | 19840 | 19872 | +32 | 3.1 -> 2.9 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O2 | 19900 | 15924 | -3976 | 2.8 -> 2.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -Os | 19852 | 15916 | -3936 | 2.7 -> 2.6 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O3 | 34414 | 34414 | 0 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O2 | 34414 | 34414 | 0 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -Os | 34478 | 34718 | +304 | 14.0 -> 13.7 GB/s |

Some learnings:
In x64 mode, the library size difference is very small compared to the XXH32-only library: less than ~0.5 KB is added. It's a similar story for aarch64. This seems to show that XXH64 is compact enough at all compilation settings.
As we already know, speed remains unaffected, at 13.7 GB/s on the i7-9700k. It is only slightly degraded at -Os with pr802 on the M1 Pro, still offering roughly double the XXH32 speed.
Unfortunately, pr802 doesn't offer any binary size savings here. Actually, at -Os it is slightly bigger than dev.

In x86 32-bit mode, there is a more substantial binary size difference, and the dev branch requires more than 4 KB to add XXH64. This suggests that the XXH64 algorithm gets expanded significantly, probably by some loop unrolling.
Binary size is improved by pr802, showing that it's possible to add XXH64 within a budget of roughly ~0.5 KB, similar to x64 mode.

We also know that XXH64 performance is much lower in 32-bit mode, due to its reliance on 64-bit multiply operations.
And unfortunately, pr802 makes this situation slightly worse.
That would probably remain an acceptable trade-off when it saves almost ~4 KB of binary size, but strangely, at -O3 there is no such binary size saving, yet the performance difference is still very much present (checked several times).

It's unclear to me why there is a (small) performance difference between dev and pr802 in 32-bit x86 mode.
It's also unclear to me why the binary size savings seem ineffective at -O3 specifically, while -O2 seems to benefit from them.

In conclusion, when it comes to XXH64, this pr802 is slightly negative for performance. There is an interesting binary size saving on x86 at the -O2 and -Os settings, which could potentially compensate for the small speed degradation. But for 64-bit modes, there is no positive to report.

XXH128:

For this last one, we can simply reuse the results from the previous measurement exercise.

| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O3 | 76392 | 60080 | -16312 | 20.1 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O2 | 51936 | 43912 | -8024 | 14.0 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -Os | 32208 | 36600 | +4392 | 12.6 -> 14.9 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O3 | 111860 | 75436 | -36424 | 5.7 -> 5.6 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O2 | 70924 | 46968 | -23956 | 2.8 -> 2.7 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -Os | 34876 | 34920 | +44 | 2.2 -> 2.4 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O3 | 69678 | 53246 | -16432 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O2 | 69678 | 53246 | -16432 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -Os | 37198 | 37198 | 0 | 10.9 GB/s |

To begin with, it's clear that library size increases significantly with the XXH3 variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, OK. Yet I'm nonetheless surprised that it requires so much. The -Os compilation mode is the only one that seems to be in the "acceptable" range.

Here, pr802 brings pretty big binary size savings at -O2, and especially at -O3.
At -Os though, it doesn't; it actually increases size in x64 mode.

-O3 remains an important compilation flag for the best XXH128 performance on gcc: speed drops very significantly with -O2 and -Os.
On clang, the picture is slightly different: speed is exactly the same at -O3 and -O2, but at -Os the performance drop is very steep.

The performance impact of pr802 is all over the place: from neutral, to slightly negative, to positive.
The most interesting aspect is seeing the -Os mode receive a speed boost on gcc + Intel CPU.
For x64, this speed boost is significant enough to beat -O2, which is really nice since the binary is also smaller.
For x86, the speed boost is not as large, but still a hefty +10% compared to dev.
So there's probably something worth investigating there.

To be continued: add more targets, more compilers, etc.

@Cyan4973 (Owner) commented Feb 27, 2023

The previous analysis was using dynamic library sizes to compare generated binary sizes. It was assumed that this wouldn't make much difference compared to static libraries, with maybe just a constant factor added.
This is incorrect: static and dynamic library sizes are loosely correlated, but can evolve somewhat differently.
This makes it difficult to decide which one is worth monitoring.
It could be argued that the static library is closer to the concept of "generated binary size", which would justify preferring static library sizes. But this disregards the fact that these fragments of code, once integrated into a real executable (library or program), have to respect additional rules, such as code and data alignment. As a consequence, the impact of a change will differ depending on the target and on the linking process. Therefore, it could also be argued that the dynamic library size gives a better sense of the final impact of a change.

This is relatively difficult to make sense of, and difficult to decide, so I left both analyses in the thread. We can continue from either the static or the dynamic library size analysis, depending on preference.


Since the focus of this PR is on binary size, I wanted to take a more in-depth look at these size changes.
In the following tables, I'm separating the XXH32, XXH64 and XXH128 variants by producing limited static library binaries.
For XXH32, the produced libxxhash only contains XXH32 and some common functions.
For XXH64, the produced libxxhash has the previous content plus XXH64, but no XXH3 variant.
Finally, for XXH128, it's the full library.
The hope is that this makes it possible to observe the binary size changes of XXH32 and XXH64 separately, without mixing in the (probably larger) changes due to the vector extensions exploited by the XXH3 variants.
In the same vein, it will be more natural to add more vector extensions for comparison.

Let's start this exercise with XXH32:

| algo | target | compiler | flags | dev size | pr802 size | lib size diff |
|---|---|---|---|---|---|---|
| XXH32 | x64 | gcc v9.4.0 | -O3 | 4268 | 4404 | +136 |
| XXH32 | x64 | gcc v9.4.0 | -O2 | 4108 | 4228 | +120 |
| XXH32 | x64 | gcc v9.4.0 | -Os | 3780 | 3836 | +56 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O3 | 4412 | 4460 | +48 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O2 | 4200 | 4248 | +48 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -Os | 3594 | 3590 | -4 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O3 | 3416 | 3432 | +16 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O2 | 3400 | 3416 | +16 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -Os | 3392 | 3432 | +40 |

We know that XXH32 speed remains largely unaffected, staying at 6.9 GB/s on the i7-9700k (no turbo) target CPU, whatever the compilation mode and flags. The story is essentially the same on M1, with XXH32 consistently delivering 7.0 GB/s, and a small (unimportant) degradation with pr802 at -Os.

The library size differences between modes and commits remain small, which is unsurprising.
But we can also notice that pr802's impact on binary size is generally negative.
Not by a large amount, so it's not a big deal.

Since there is no corresponding speed benefit, pr802 looks globally slightly negative for the XXH32-only scenario.
But as mentioned, this is minor, and can probably be ignored if there is a much larger benefit in other scenarios.

Now, onto XXH64:

| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH64 | x64 | gcc v9.4.0 | -O3 | 7526 | 7886 | +360 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -O2 | 7126 | 7446 | +320 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -Os | 6326 | 6406 | +80 | 13.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O3 | 10638 | 9866 | -772 | 3.1 -> 2.9 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O2 | 9110 | 7814 | -1296 | 2.8 -> 2.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -Os | 8048 | 6472 | -1576 | 2.7 -> 2.6 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O3 | 5896 | 5912 | +16 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O2 | 5832 | 5848 | +16 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -Os | 5848 | 6128 | +280 | 14.0 -> 13.7 GB/s |

Some learnings:
As we already know, in 64-bit modes (x64, aarch64), speed remains unaffected, at 13.7 GB/s on the i7-9700k. It is only slightly degraded at -Os with pr802 on the M1 Pro, still offering roughly double the XXH32 speed.
Unfortunately, pr802 doesn't offer any binary size savings for this scenario. It is actually slightly bigger than dev.

In x86 32-bit mode, there is a more substantial binary size difference, and the dev branch requires a 2x binary size budget to add XXH64. This suggests that the XXH64 algorithm gets expanded significantly, probably by some loop unrolling.
Binary size is improved by pr802, showing that it's possible to add XXH64 within a budget roughly similar to x64 mode.

We also know that XXH64 performance is much lower in 32-bit mode, due to its reliance on 64-bit multiply operations.
And unfortunately, pr802 makes this situation slightly worse.
That would probably remain an acceptable trade-off when it saves binary size, but strangely, at -O3 the binary size savings are not great, yet the performance difference is very much present (checked multiple times).

It's unclear to me why there is a (small) performance difference between dev and pr802 in 32-bit x86 mode.
It's also unclear to me why the binary size saving is less effective at -O3 specifically, while -O2 seems to benefit more from it.

In conclusion, when it comes to XXH64, this pr802 seems slightly negative for performance. There is an interesting binary size saving on x86, especially at the -O2 and -Os settings, which could potentially compensate for the small speed degradation. But for 64-bit modes, there is unfortunately no positive outcome to report.

XXH128:

Let's redo the binary size measurements, using static library sizes instead.

| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O3 | 67232 | 53200 | -14032 | 20.1 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O2 | 40880 | 36864 | -4016 | 14.0 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -Os | 19752 | 20432 | +680 | 12.6 -> 14.9 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O3 | 107280 | 73174 | -34106 | 5.7 -> 5.6 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O2 | 60796 | 39188 | -21608 | 2.8 -> 2.7 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -Os | 25152 | 23278 | -1878 | 2.2 -> 2.4 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O3 | 50456 | 44408 | -6048 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O2 | 48768 | 42248 | -6520 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -Os | 21600 | 24336 | +2736 | 10.9 GB/s |

To begin with, it's clear that library size increases significantly with the XXH3 variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, OK. Yet I'm nonetheless surprised that it requires so much. The -Os compilation mode is the only one that lies in the "acceptable" range.

Here, pr802 brings pretty big binary size savings at -O2, and especially at -O3.
At -Os though, it's less clear, and it can actually increase size.

-O3 remains an important compilation flag for the best XXH128 performance on gcc: speed drops very significantly with -O2 and -Os.
On clang, the picture is slightly different: speed is exactly the same at -O3 and -O2, but at -Os the performance drop is very steep.

The performance impact of pr802 is all over the place: from neutral, to slightly negative, to positive.
The most interesting aspect is seeing the -Os mode receive a speed boost on gcc + Intel CPU.
For x64, this speed boost is significant enough to beat -O2, which is really nice since binary size is also smaller.
For x86, the speed boost is not as large, but still a hefty +10% compared to dev.
So there's probably something worth investigating there.

Note: some of these results are surprising. Ensuring that pr802 is rebased on top of dev is also important, so that this comparison doesn't produce wrong conclusions.

To be continued: add more targets, more compilers, etc.

@easyaspi314 (Contributor, Author) commented Feb 27, 2023

> is it possible to break this PR into smaller stages?

Sure, I can do that. Sorry I adhd'd again 😵‍💫

I think that the isolated changes that are not going to be controversial are:

  1. Force GCC 12 to -O3 because there doesn't seem to be a solution for -Og without ruining everything else
  2. Kill XXH_OLD_NAMES
  3. Rework the XXH3_update code to not dupe itself so much
  4. Force inline the lowest level functions on -Os. However, perhaps don't do it for anything above round or acc512, and don't use noinline either.
  5. Reduce the unrolling and inlining on 32-bit (but with some more benchmarking)
  6. Outline (and perhaps label) the individual steps better

The future changes could be:
a. Implement the hashLong refactor
b. Improve the selections in items 4 and 5 based on testing
c. If we are going to reduce inlining, perhaps we could sneak in an __attribute__((regparm(3))) on x86 to make tail calls better? (rough sketch below)
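
(A rough illustration of (c), not actual xxHash code: on 32-bit x86, regparm(3) passes the first three integer arguments in registers instead of on the stack, which makes calls and tail calls cheaper.)

#if defined(__GNUC__) && defined(__i386__)
#  define XXH_REGPARM __attribute__((regparm(3)))  /* hypothetical macro name */
#else
#  define XXH_REGPARM
#endif

XXH_REGPARM static unsigned regparm_round_sketch(unsigned acc, unsigned input)
{
    return (acc + input) * 2654435761u;  /* placeholder arithmetic */
}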

> the changes are beneficial to binary sizes for the performance mode, but when it comes to the size optimized -Os mode, binary size is actually increased.

Yes. That is expected.

On 64-bit targets, the size decrease at -O3 comes from XXH3_update, which has been refactored and is no longer inlined twice. There haven't been many other changes.

The size increase at -Os is expected too, as I have changed it to inline the utilities and core functions. However, 4 kB is a little more than I want; perhaps that comes from the XXH_NO_INLINE clones.

This change is because -Os is a pretty common default optimization level in various projects, and the aggressive size optimizations should be more opt-in. This is actually the reason Clang added -Oz; -Os was originally an alias for -O2 because people liked to overuse it.

The performance issues were more noticeable on GCC ARMv4, where it outlined XXH_memcpy() for unaligned reads.

I will have to look into fixing the GCC x86 performance issues, though. This might just be XXH_INLINE_64BIT not inlining the right functions, since the performance drop seems to be consistent.

> To begin with, it's clear that library size increases significantly with XXH3 variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, ok. Yet, I'm nonetheless surprised that it requires so much. The -Os compilation mode is the only one that seems in the "acceptable" range.

The way it is force-inlined, there are effectively 8.

You can easily see the size of each function if you compile with -ffunction-sections and use size -A xxhash.o | sort -nk2.
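
For example (the exact flags are illustrative; the key parts are -ffunction-sections and the size | sort pipe):

cc -O3 -ffunction-sections -c xxhash.c -o xxhash.o
size -A xxhash.o | sort -nk2   # each .text.<function> section is one function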

The short-input hashing code (XXH3_[64,128]bits_internal) is deceptively heavy on 32-bit. When compiled with XXH_NO_INLINE_HINTS, it is usually outlined and can be 1-2 kB by itself on 64-bit targets, and 5-10 kB on 32-bit.

However, arguably there is some benefit to this, as the short hash is more sensitive to latency, and the constant propagation of kSecret and seed==0 can be beneficial (especially on platforms with limited unaligned access, since all secret reads are aligned).

Additionally, as I mentioned before, the way hashLong is set up causes inline clones as well, which AFAIK no longer seems beneficial now that the code paths are identical.

Another note: on x86, the 64->128-bit multiply is a lot of instructions because of how disruptive the EAX:EDX register requirements are. Both GCC and Clang seem to be unaware of this code weight and inline/unroll 60+ bytes of code for each multiply. (I might add an XXH_NO_INLINE for all 32-bit targets except ARM with DSP.) Also, GCC 10+ on x86 emits a literal imul reg, reg, 0, which means I might need to hack mult32to64 for that too.

@easyaspi314 (Contributor, Author):

The first thing I'll do is find the minimum amount of #pragma GCC optimize("-O3") or __attribute__((optimize("-O3"))) needed to shut up GCC.
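
(For reference, the two forms look like this when applied to a made-up function; which xxHash functions actually need the override is exactly what remains to be determined.)

/* Per-function override: this function alone is optimized as if -O3 were given. */
__attribute__((optimize("-O3")))
static void hot_function_sketch(void) { /* ... */ }

/* Region override: everything between push_options and pop_options gets -O3. */
#pragma GCC push_options
#pragma GCC optimize("-O3")
static void another_hot_function_sketch(void) { /* ... */ }
#pragma GCC pop_options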

easyaspi314 mentioned this pull request Feb 27, 2023
@easyaspi314 (Contributor, Author) commented Mar 8, 2023

I have a rough draft of the hashLong refactor that is designed to be dispatched, but update() performance is pretty bad due to the function call latency (since consumeStripes() would need an __attribute__((target))). I might have to extract the entire second half of XXH3_update(). Changes start at ~line 5000.
https://gist.github.com/easyaspi314/795084ca1c04d4c884e6cff14cab80a4

@Cyan4973 (Owner) commented Jul 2, 2023

I believe several fixes for the gcc-12 -Og issue have already been merged.
Therefore, is this PR still useful? Or can it be closed?

@Cyan4973 (Owner):

The "fix gcc-12 -Og compilation" issue seems fixed now, thanks to other contributions.

However, the PR also lists "improve code size" as an objective.
Is there still something worth extracting from it ?

@t-mat (Contributor) commented Jul 10, 2023

I compared dev (258351b) and this PR (98be0b9).

Note that this report contains all XXH algorithms.

| algo | target | compiler | flags | dev size | pr802 size | lib size diff (pr802 - dev) |
|---|---|---|---|---|---|---|
| all | x64 + SSE2 | gcc v11.3.0 | -O3 | 51176 | 52888 | +1712 |
| all | x64 + SSE2 | gcc v11.3.0 | -O2 | 36752 | 37568 | +816 |
| all | x64 + SSE2 | gcc v11.3.0 | -Os | 19384 | 20432 | +1048 |
| all | x86 + scalar | gcc v11.3.0 | -m32 -O3 | 89016 | 73770 | -15246 |
| all | x86 + scalar | gcc v11.3.0 | -m32 -O2 | 52940 | 40494 | -12446 |
| all | x86 + scalar | gcc v11.3.0 | -m32 -Os | 24396 | 22734 | -1662 |

We can still see some size difference in x86 mode.

CSV
commit-hash,compiler,flags,filename,size
258351b,gcc-11,-O2,libxxhash.a,36752
258351b,gcc-11,-O3,libxxhash.a,51176
258351b,gcc-11,-Os,libxxhash.a,19384
258351b,gcc-11,-m32 -O2,libxxhash.a,52940
258351b,gcc-11,-m32 -O3,libxxhash.a,89016
258351b,gcc-11,-m32 -Os,libxxhash.a,24396
98be0b9,gcc-11,-O2,libxxhash.a,37568
98be0b9,gcc-11,-O3,libxxhash.a,52888
98be0b9,gcc-11,-Os,libxxhash.a,20432
98be0b9,gcc-11,-m32 -O2,libxxhash.a,40494
98be0b9,gcc-11,-m32 -O3,libxxhash.a,73770
98be0b9,gcc-11,-m32 -Os,libxxhash.a,22734
Procedure
cd
git clone https://github.com/Cyan4973/xxHash.git xxhash-pr-802
cd xxhash-pr-802

mkdir report
export CC="gcc-11"


git checkout dev
export CH=$(git --no-pager log -1 --format="%h")

export CFLAGS="-O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"

export CFLAGS="-m32 -O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"


git pull origin pull/802/head
git checkout FETCH_HEAD
git log -1
export CH=$(git --no-pager log -1 --format="%h")

export CFLAGS="-O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"

export CFLAGS="-m32 -O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"

# Show CSV
cat report/*

@t-mat (Contributor) commented Jul 10, 2023

As for code size, c6dc92f also reduces it and may implement more robust dispatch functionality.

@easyaspi314 (Contributor, Author):

> As for code size, c6dc92f also reduces it and may implement more robust dispatch functionality.

I should go back and finish that. I kinda got distracted by another project 😅

@Cyan4973 (Owner) commented Jul 12, 2023

> I should go back and finish that. I kinda got distracted by another project 😅

Ideally, I would like to publish release v0.8.2 in the very near future.

I presume completing this objective is not trivial, and therefore it should rather be expected as a potential contribution to a later version? (either v0.8.3 or v0.9.0)

@easyaspi314 (Contributor, Author):

> > I should go back and finish that. I kinda got distracted by another project 😅
>
> Ideally, I would like to publish release v0.8.2 in the very near future.
>
> I presume completing this objective is not trivial, and therefore it should rather be expected as a potential contribution to a later version? (either v0.8.3 or v0.9.0)

Looking over the code, I think it just needs some cleanup, documentation, and benchmarking.

Cyan4973 added this to the v0.8.3 milestone Jul 16, 2023