Refactor to improve code size, fix GCC 12 #802

Draft
wants to merge 11 commits into base: dev

Conversation

@easyaspi314 (Contributor) commented Feb 23, 2023

Overview

Kinda a dump of (mostly) related changes I've been sitting on and never bothered to commit.

This is a set of changes to:

  • Reduce code size on -O3
    • Especially on 32-bit, where -O3 could get well over 100 kB .text
    • I feel this could still be improved if hashLong constant propagation can be reduced
  • Improve -Os/XXH_SIZE_OPT performance
  • Make things more clean and modular

These haven't been micro-benchmarked yet. The "what is worth inlining" is a rough estimate, and I want to go through the various recent compilers to find what works best.

Major changes:

  • GCC 12 will now be forced to -O3 to work around the -Og bug. Fixes #800 ("GCC-12 fails to inline XXX_FORCED_INLINE functions, XXH3_accumulate_512_sse2(), and XXH3_scrambleAcc_sse2() with -Og flag") and #720 ("Pb when building with gcc-12 -Og").
    • Only applies when __OPTIMIZE__ is defined, __NO_INLINE__ is not, and XXH_SIZE_OPT <= 0 (i.e. not __OPTIMIZE_SIZE__), so both -O2 and -Og are covered
    • I tried adjusting the inline hints instead, but that tanked performance.
  • XXH_INLINE: Acts like the existing XXH_FORCE_INLINE, i.e. it is disabled when XXH_SIZE_OPT is used (see the sketch after this list).
    • XXH_FORCE_INLINE now actually force-inlines unconditionally (when inline hints are on) and is used on the core loops and utilities of XXH3 and XXH32.
    • Greatly improves GCC -Os performance with only a small code size overhead
  • XXH_32BIT_OPT, XXH_INLINE_64BIT: These are used to reduce code bloat on 32-bit
    • This, together with the other related optimizations, reduces clang armv4t -O3 code size by about half, to under 100 kB
  • XXH32 and XXH64 have been rewritten to reuse the algorithm steps
  • XXH3_update has been rewritten to be much easier to understand and much smaller.
  • XXH3_128bits_update now tail calls XXH3_64bits_update, greatly reducing code size at the cost of one branch
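
As a rough sketch of how these hint macros relate (a simplification for this summary, not the literal definitions from the PR; the real conditions also involve __OPTIMIZE__, __NO_INLINE__ and compiler detection):

/* Sketch only: approximate relationship between the inline-hint macros. */
#ifndef XXH_SIZE_OPT
#  define XXH_SIZE_OPT 0  /* assumed default */
#endif

/* Unconditional force-inline, reserved for the hot core loops and small
 * utilities (accumulate_512, scrambleAcc, ...). */
#define XXH_FORCE_INLINE static __attribute__((always_inline)) inline

/* The old behavior: force-inline, but back off to a plain static function
 * when size optimization is requested. */
#if XXH_SIZE_OPT >= 1
#  define XXH_INLINE static
#else
#  define XXH_INLINE XXH_FORCE_INLINE
#endif

/* Inline only where 64-bit arithmetic is native, to keep 32-bit builds small. */
#if defined(XXH_32BIT_OPT) && XXH_32BIT_OPT
#  define XXH_INLINE_64BIT static
#else
#  define XXH_INLINE_64BIT XXH_INLINE
#endif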

Minor changes:

  • GCC 11 and higher no longer force -O2 on AVX2; that bug has been fixed.
  • XXH32_state_t and XXH64_state_t have some fields renamed and now use unsigned char [N] for their buffer
  • XXH_OLD_NAMES has been removed; it's been long enough
  • Fixed a few inline hints
  • Some reused snippets in XXH3 have been extracted into functions.
  • Fixed a theoretical integer overflow in the update code

 - Fix the state structs to use unsigned char, update names
 - Extract the algorithm steps into inline subroutines
 - Fix a theoretical integer overflow bug
 - Reroll XXH64 on 32-bit (it is going to run like crap anyway)
It's been long enough
 - Add `XXH_32BIT_OPT` which is enabled for targets without 64-bit arithmetic
 - Add a 64-bit only inline hint, apply it to the functions that need them
   - TODO benchmark them more
 - Tail call XXH3_64bits_update instead of inlining XXH3_update
 - Reroll midrange on 32-bit (ARM is the only one that doesn't choke on the 128-bit
   multiplies)
 - Extract the copy-pasted midrange finalization code in XXH128
XXH3_consumeStripes can now process multiple blocks, greatly simplifying
the logic and making it resemble XXH32 and XXH64.

This also reduces code size by a solid amount.
 - XXH_SIZE_OPT now, instead of disabling inline hints, turns off
   *most* inline hints
    - Things like acc512 and utilities remain force inline
 - XXH_FORCE_INLINE now always force inlines for functions that need it.
Not the best solution, but not force inlining the accumulate functions
is more of a problem than forcing GCC to -O3.
@easyaspi314 (Contributor, Author):

Seems like some apt repos are down which is causing some things to fail.

@easyaspi314 (Contributor, Author) commented Feb 23, 2023

I also have an idea that seems to get full performance (at least at large scale; I'll have to test the latency) with only 2 copies of XXH3_hashLong_internal_loop.

This manages to get clang aarch64 -O3 to 24 kB compared to 36 kB, and clang armv4t -O3 from 92 kB to 61 kB (far less than half of the 160 kB before this PR)

typedef struct {
    XXH_ALIGN_MEMBER(64, xxh_u64 acc[8]);
} XXH3_acc_t;

XXH_FORCE_INLINE XXH3_acc_t
XXH3_hashLong_internal_loop_impl(...)
{
    XXH3_acc_t acc = { XXH3_INIT_ACC };
    // normal internal_loop stuff
    return acc;
}

XXH_NO_INLINE XXH3_acc_t
XXH3_hashLong(const xxh_u8* XXH_RESTRICT input, size_t len,
              const xxh_u8* XXH_RESTRICT secret, size_t secretSize)
{
    return XXH3_hashLong_internal_loop_impl(input, len, secret, secretSize,
                                            XXH3_accumulate, XXH3_scrambleAcc);
}
#if XXH_SIZE_OPT <= 0
/* Constant folded for the default secret size */
XXH_NO_INLINE XXH3_acc_t
XXH3_hashLong_defaultSecretSize(const xxh_u8* XXH_RESTRICT input, size_t len,
              const xxh_u8* XXH_RESTRICT secret, size_t secretSize)
{
    (void) secretSize;
    return XXH3_hashLong_internal_loop_impl(input, len, secret, sizeof(XXH3_kSecret),
                                            XXH3_accumulate, XXH3_scrambleAcc);
}
#else
/* Don't constant fold on size opt */
#  define XXH3_hashLong_defaultSecretSize XXH3_hashLong
#endif

Returning a struct like this is functionally identical to memcpying into an out pointer at the ABI level, but the compiler knows more about it and can optimize it better. (Just using an output pointer without inlining is 40% slower.)
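
For illustration only, a minimal contrast of the two styles (hypothetical names, reusing the XXH3_acc_t type from the snippet above; this is not code from the PR):

/* Out-pointer style: the callee writes through a caller-provided pointer.
 * Without inlining, the compiler must assume *out may alias other memory. */
XXH_NO_INLINE void hashLong_out_sketch(XXH3_acc_t* out)
{
    int i;
    for (i = 0; i < 8; i++) out->acc[i] = 0;  /* placeholder body */
}

/* Return-by-value style: the ABI still passes a hidden destination pointer,
 * but that destination is known to be a fresh, non-aliased temporary
 * (LLVM marks it noalias), so the result can be built in place without
 * defensive copies. */
XXH_NO_INLINE XXH3_acc_t hashLong_ret_sketch(void)
{
    XXH3_acc_t acc = { { 0 } };  /* placeholder body */
    return acc;
}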

I would have to clean it up, rename it, and test it some more, though, so I would probably keep this for another PR.

Edit: This could also be a step towards making dispatching on x86 standard, since only 2-3 functions would then need to be overridden (and the feature test is 99% logging; it normally compiles to something like 50 instructions).

@easyaspi314 (Contributor, Author) commented Feb 23, 2023

How do the rest of these changes look, aside from that one (which, since it would affect dispatching, is a huge change that will need some more testing)? The changes shouldn't affect 64-bit codegen that much (although XXH3_update may be worth some benchmarking).

The biggest changes codegen-wise are 32-bit no longer being forced to inline and unroll ugly functions as much (they have more latency than a call frame/loop and aren't worth the code size increase), and -Os being faster.

@easyaspi314 (Contributor, Author):

I'm going to leave the hashLong rework for another PR, because it would require reworking dispatching, and that would be a perfect time to redo that mess.

@Cyan4973 (Owner):

I'm trying to get a sense of the scope of the changes in this PR.

To rephrase it:

  • It's mostly focused on reducing the binary size, though speed might also be affected as a side effect
  • It mostly concerns the -Os compilation mode; the performance modes -O2 and -O3 should not be negatively impacted
  • It mostly concerns 32-bit compilation mode; 64-bit should not be negatively impacted
  • Speed change is expected to be roughly neutral, except for 32-bit -Os, where it's expected to be positive
  • Not sure whether the ARM target is impacted differently than the x86/x64 one

If that's the correct scope, then there are a number of measurements to do and publish to document this PR.

Longer term, I wonder if there is something that could / should be done to ensure these properties are preserved in the future. Otherwise, it would be easy for a future PR to inflate the binary size without anyone noticing it.

@Cyan4973 (Owner) commented Feb 27, 2023

I made a series of measurements, leading to results that surprised me, comparing this pr802 to the dev branch.

Let's start with x64 compilation mode. The compiler is gcc v9.4.0, the CPU an i7-9700k (turbo off). Only performance flags are used, meaning the vector extension is the default SSE2. Let's measure the size of the produced libxxhash.

| compiler flags | dev lib size | pr802 lib size | difference |
|---|---|---|---|
| -O3 | 76392 | 60080 | -16312 |
| -O2 | 51936 | 43912 | -8024 |
| -Os | 32208 | 36600 | +4392 |

So that's the first surprise: the changes are beneficial to binary size in the performance modes, but in the size-optimized -Os mode, binary size actually increases. It's not necessarily a bad deal, but since it seems to achieve the inverse of what's advertised, I figured it's worth underlining.

Now let's have a look at performance. The scenario measured is the default -b, aka 100 KB (long hash). The numbers are for XXH32, XXH64 and XXH128, in that order.

| compiler flags | dev GB/s | pr802 GB/s | note |
|---|---|---|---|
| -O3 | 6.9, 13.7, 20.1 | 6.9, 13.7, 20.1 | identical |
| -O2 | 6.9, 13.7, 14.0 | 6.9, 13.7, 14.0 | identical |
| -Os | 6.9, 13.7, 12.6 | 6.9, 13.7, 14.9 | XXH128 is improved, even faster than -O2 |

To begin with, at -O3 and -O2, pr802 features no speed change. So, combined with the lower binary size, pr802 is a global win there. Nicely done.
Note how the performance of XXH128 plummets at -O2, probably because gcc needs -O3 to generate efficient vector code. XXH32 and XXH64, in contrast, remain unaffected.
Finally, at -Os, XXH32 and XXH64 remain performance-neutral, while XXH128 improves with pr802, to the point of running slightly faster than at -O2 (yet far from -O3). I wonder if there is a useful lesson there.
That makes -Os plainly better than -O2 on pr802, since it wins on both speed and binary size.

So, some surprises, but all in all, this PR has positives.
One could still argue that, for XXH32 and XXH64, dev + -Os is actually the better choice, since performance is unaffected while binary size is the smallest.
But for XXH128, note that the larger binary size at -Os comes with better performance too, so it's a reasonable trade-off. In any case, pr802 + -Os is superior in all metrics to dev + -O2, since it provides both a smaller binary size and faster XXH128. That's worth mentioning.

Now, it was also mentioned that this PR is supposed to be even more useful in 32-bit mode, so let's try that. Once again, the compiler is gcc v9.4.0 and only performance flags are changed, meaning that vector extensions are likely disabled by default.

| compiler flags | dev lib size | pr802 lib size | difference |
|---|---|---|---|
| -m32 -O3 | 111860 | 73174 | -38686 |
| -m32 -O2 | 70924 | 46968 | -23956 |
| -m32 -Os | 34876 | 34920 | +44 |

The first thing to note is how big these binaries are compared to the x64 ones. This might be because there is no vector extension by default. It could be tested in more detail by separating the XXH3 variants from the rest.
The binary size decrease from pr802 is huge for the performance modes, especially at -O3.
Once again, -Os goes against the trend, though the difference is minimal in this case; it's essentially a tie.

Now, what about performance?

| compiler flags | dev GB/s | pr802 GB/s | note |
|---|---|---|---|
| -m32 -O3 | 6.9, 3.1, 5.7 | 6.9, 2.9, 5.6 | |
| -m32 -O2 | 6.9, 2.8, 2.8 | 6.9, 2.7, 2.7 | |
| -m32 -Os | 6.9, 2.7, 2.6 | 6.9, 2.6, 2.4 | |

Once again, XXH32 proves very resilient, offering the same performance in all compilation modes; note that it's the same performance as in x64 mode, despite the lower number of registers.
We already knew that XXH64 is a bad fit for 32-bit mode, and this is confirmed again in these measurements, no surprise. Note though that in this case, -O3 does deliver some speed benefit (almost +10%), so it's no longer clear that -Os is the better choice for XXH64.
XXH128 performance is also much lower, most likely due to the absence of vector extensions. If SSE2 were still there, it would be a very different story. Anyway, in this case, -O3 improves XXH128 performance by a whopping +100%, so it really matters.
Finally, in these measurements, the performance of pr802 seems slightly lower than dev. Nothing dramatic, but worth mentioning.

It's difficult to draw conclusions from this compilation exercise. pr802 delivers significant size benefits at -O3 and -O2, and the minor performance losses are probably acceptable. What's more concerning is the -Os mode, for which pr802 is both slightly larger and slightly slower. Nothing serious, but still not in the right direction. And I initially expected this mode to be the most beneficial for pr802, so clearly there was some misunderstanding.

There are other axes that could be analyzed: for example, adding an aarch64 test, and maybe a 32-bit armv7 one too; measuring the binary size impact of XXH3 separately from the legacy XXH32 and XXH64; adding a few different vector extensions; adding a few different compilers. The test matrix quickly becomes overwhelming, and it's likely going to be difficult to produce a change that is beneficial across the board.

One potential idea: is it possible to break this PR into smaller stages? Maybe some of these stages produce uncontroversial benefits that could be quickly merged? Maybe others require pondering some trade-offs (which can still be acceptable)? Finally, maybe there are useful learnings that could be used to improve other parts of the PR?
Open for discussion.

Also, I would recommend rebasing this PR on top of dev, just to make sure that observed performance differences come solely from this PR and are not confounded with some parallel PR that was merged into dev but not into pr802.

@Cyan4973 (Owner) commented Feb 27, 2023

Since the focus of this PR is on binary size, I wanted to take a more in-depth look at these size changes.
In the following tables, I'm separating the XXH32, XXH64 and XXH128 variants by producing limited dynamic library binaries (note: maybe I should have used the static library size instead).
For XXH32, the produced libxxhash only contains XXH32 and some common functions.
For XXH64, the produced libxxhash has the previous content plus XXH64, but no XXH3 variant.
Finally, for XXH128, it's the full library.
The hope is that this makes it possible to observe the binary size changes of XXH32 and XXH64 separately, without mixing in the (probably larger) changes due to the vector extensions exploited by the XXH3 variants.
In the same vein, it will be more natural to add more vector extensions for comparison.
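
(The exact build commands aren't shown above; as a reference point, one way to produce such reduced builds is through xxHash's exclusion macros, assuming that's roughly what was used here:)

# XXH32 only: XXH_NO_LONG_LONG removes XXH64 and all XXH3/XXH128 code
export CFLAGS="-O3 -DXXH_NO_LONG_LONG" && make clean all
# XXH32 + XXH64, but no XXH3 variant
export CFLAGS="-O3 -DXXH_NO_XXH3" && make clean all
# full library
export CFLAGS="-O3" && make clean all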

Let's start this exercise with XXH32:

| algo | target | compiler | flags | dev size | pr802 size | lib size diff |
|---|---|---|---|---|---|---|
| XXH32 | x64 | gcc v9.4.0 | -O3 | 16576 | 16556 | -20 |
| XXH32 | x64 | gcc v9.4.0 | -O2 | 16616 | 16616 | 0 |
| XXH32 | x64 | gcc v9.4.0 | -Os | 16600 | 16664 | +64 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O3 | 15456 | 15456 | 0 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O2 | 15488 | 15488 | 0 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -Os | 15440 | 15532 | +92 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O3 | 33934 | 33934 | 0 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O2 | 33934 | 33934 | 0 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -Os | 33966 | 34014 | +48 |

A few interesting details here.
To begin with, we know that XXH32 performance remains largely unaffected, staying at 6.9 GB/s on the i7-9700k (no turbo) target CPU, whatever the compilation mode and flags. The story is essentially the same on M1, with XXH32 consistently delivering 7.0 GB/s, and a small (unimportant) degradation with pr802 at -Os.
We also note that the library size differences between modes and commits remain extremely small, which is consistent with that.

Two remarkable trends are worth mentioning though:

  • Modes expected to produce smaller binaries actually produce larger ones. This trend is clearer for pr802. At the end of the day the differences are not large, but it still goes against expectations.
  • pr802 is especially worse at binary size in the -Os mode. This is also the reverse of my initial expectations.

It could be that these binary size differences are in fact unrelated to XXH32 proper, and due to minor differences in other generic symbols still compiled into the library, such as XXH_versionNumber() for example.

In conclusion, when it comes to XXH32, this pr802 is roughly neutral.

Now, onto XXH64:

| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH64 | x64 | gcc v9.4.0 | -O3 | 17072 | 17072 | 0 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -O2 | 17016 | 17016 | 0 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -Os | 17040 | 17144 | +104 | 13.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O3 | 19840 | 19872 | +32 | 3.1 -> 2.9 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O2 | 19900 | 15924 | -3976 | 2.8 -> 2.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -Os | 19852 | 15916 | -3936 | 2.7 -> 2.6 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O3 | 34414 | 34414 | 0 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O2 | 34414 | 34414 | 0 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -Os | 34478 | 34718 | +304 | 14.0 -> 13.7 GB/s |

Some learnings:
In x64 mode, the library size difference is very small compared to the XXH32-only library: less than ~0.5 KB is added. It's a similar story for aarch64. This seems to show that XXH64 is compact enough at all compilation settings.
As we already know, speed remains unaffected, at 13.7 GB/s on the i7-9700k. It is only slightly degraded at -Os with pr802 on the M1 Pro, still offering roughly double the XXH32 speed.
Unfortunately, pr802 doesn't offer any binary size savings here. Actually, at -Os it is slightly bigger than dev.

In x86 32-bit mode, there is a more substantial binary size difference, and the dev branch requires more than 4 KB to add XXH64. This suggests that the XXH64 algorithm gets expanded significantly, probably by some loop unrolling.
Binary size is improved by pr802, showing that it's possible to add XXH64 within a budget of roughly ~0.5 KB, similar to x64 mode.

We also know that XXH64 performance is much lower in 32-bit mode, due to its reliance on 64-bit multiply operations.
And unfortunately, pr802 makes this situation slightly worse.
That would probably remain an acceptable trade-off when it saves almost ~4 KB of binary size, but strangely, at -O3 there is no such binary size saving, yet the performance difference is still very much present (checked several times).

It's unclear to me why there is a (small) performance difference between dev and pr802 in 32-bit x86 mode.
It's also unclear to me why the binary size savings seem ineffective at -O3 specifically, while -O2 seems to benefit from them.

In conclusion, when it comes to XXH64, this pr802 is slightly negative for performance. There is an interesting binary size saving on x86 at the -O2 and -Os settings, which could potentially compensate for the small speed degradation. But for 64-bit modes, there is no positive to report.

XXH128:

For this last one, we can simply reuse the results from the previous measurement exercise.

| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O3 | 76392 | 60080 | -16312 | 20.1 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O2 | 51936 | 43912 | -8024 | 14.0 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -Os | 32208 | 36600 | +4392 | 12.6 -> 14.9 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O3 | 111860 | 75436 | -36424 | 5.7 -> 5.6 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O2 | 70924 | 46968 | -23956 | 2.8 -> 2.7 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -Os | 34876 | 34920 | +44 | 2.2 -> 2.4 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O3 | 69678 | 53246 | -16432 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O2 | 69678 | 53246 | -16432 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -Os | 37198 | 37198 | 0 | 10.9 GB/s |

To begin with, it's clear that library size increases significantly with the XXH3 variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, OK. Yet I'm nonetheless surprised that it requires so much. The -Os compilation mode is the only one that seems to be in the "acceptable" range.

Here, pr802 brings pretty big binary size savings at -O2, and especially at -O3.
At -Os though, it doesn't; it actually increases size in x64 mode.

-O3 remains an important compilation flag for the best XXH128 performance on gcc: speed drops very significantly with -O2 and -Os.
On clang, the picture is slightly different: speed is exactly the same at -O3 and -O2, but at -Os the performance drop is very steep.

The performance impact of pr802 is all over the place: from neutral, to slightly negative, to positive.
The most interesting aspect is seeing the -Os mode receive a speed boost on gcc + Intel CPU.
For x64, this speed boost is significant enough to beat -O2, which is really nice since the binary is also smaller.
For x86, the speed boost is not as large, but still a hefty +10% compared to dev.
So there's probably something worth investigating there.

To be continued: add more targets, more compilers, etc.

@Cyan4973 (Owner) commented Feb 27, 2023

The previous analysis was using dynamic library sizes to compare generated binary sizes. It was assumed that this wouldn't make much difference compared to static libraries, with maybe just a constant factor added.
This is incorrect: static and dynamic library sizes are loosely correlated, but can evolve somewhat differently.
This makes it difficult to decide which one is worth monitoring.
It could be argued that the static library is closer to the concept of "generated binary size", which would justify preferring static library sizes. But this disregards the fact that these fragments of code, once integrated into a real executable (library or program), have to respect additional rules, such as code and data alignment. As a consequence, the impact of a change will differ depending on the target and on the linking process. Therefore, it could also be argued that the dynamic library size gives a better sense of the final impact of a change.

This is relatively difficult to make sense of, and difficult to decide, so I left both analyses in the thread. We can continue from either the static or the dynamic library size analysis, depending on preference.


Since the focus of this PR is on binary size, I wanted to take a more in-depth look at these size changes.
In the following tables, I'm separating the XXH32, XXH64 and XXH128 variants by producing limited static library binaries.
For XXH32, the produced libxxhash only contains XXH32 and some common functions.
For XXH64, the produced libxxhash has the previous content plus XXH64, but no XXH3 variant.
Finally, for XXH128, it's the full library.
The hope is that this makes it possible to observe the binary size changes of XXH32 and XXH64 separately, without mixing in the (probably larger) changes due to the vector extensions exploited by the XXH3 variants.
In the same vein, it will be more natural to add more vector extensions for comparison.

Let's start this exercise with XXH32:

| algo | target | compiler | flags | dev size | pr802 size | lib size diff |
|---|---|---|---|---|---|---|
| XXH32 | x64 | gcc v9.4.0 | -O3 | 4268 | 4404 | +136 |
| XXH32 | x64 | gcc v9.4.0 | -O2 | 4108 | 4228 | +120 |
| XXH32 | x64 | gcc v9.4.0 | -Os | 3780 | 3836 | +56 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O3 | 4412 | 4460 | +48 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -O2 | 4200 | 4248 | +48 |
| XXH32 | x86 | gcc v9.4.0 | -m32 -Os | 3594 | 3590 | -4 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O3 | 3416 | 3432 | +16 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -O2 | 3400 | 3416 | +16 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | -Os | 3392 | 3432 | +40 |

We know that XXH32 speed remains largely unaffected, staying at 6.9 GB/s on the i7-9700k (no turbo) target CPU, whatever the compilation mode and flags. The story is essentially the same on M1, with XXH32 consistently delivering 7.0 GB/s, and a small (unimportant) degradation with pr802 at -Os.

The library size differences between modes and commits remain small, which is unsurprising.
But we can also notice that pr802's impact on binary size is generally negative.
Not by a large amount, so it's not a big deal.

Since there is no corresponding speed benefit, pr802 looks globally slightly negative for the XXH32-only scenario.
But as mentioned, this is minor, and can probably be ignored if there is a much larger benefit in other scenarios.

Now, onto XXH64:

| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH64 | x64 | gcc v9.4.0 | -O3 | 7526 | 7886 | +360 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -O2 | 7126 | 7446 | +320 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | -Os | 6326 | 6406 | +80 | 13.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O3 | 10638 | 9866 | -772 | 3.1 -> 2.9 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -O2 | 9110 | 7814 | -1296 | 2.8 -> 2.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | -m32 -Os | 8048 | 6472 | -1576 | 2.7 -> 2.6 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O3 | 5896 | 5912 | +16 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -O2 | 5832 | 5848 | +16 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | -Os | 5848 | 6128 | +280 | 14.0 -> 13.7 GB/s |

Some learnings:
As we already know, in 64-bit modes (x64, aarch64), speed remains unaffected, at 13.7 GB/s on the i7-9700k. It is only slightly degraded at -Os with pr802 on the M1 Pro, still offering roughly double the XXH32 speed.
Unfortunately, pr802 doesn't offer any binary size savings for this scenario. It is actually slightly bigger than dev.

In x86 32-bit mode, there is a more substantial binary size difference, and the dev branch requires a 2x binary size budget to add XXH64. This suggests that the XXH64 algorithm gets expanded significantly, probably by some loop unrolling.
Binary size is improved by pr802, showing that it's possible to add XXH64 within a budget roughly similar to x64 mode.

We also know that XXH64 performance is much lower in 32-bit mode, due to its reliance on 64-bit multiply operations.
And unfortunately, pr802 makes this situation slightly worse.
That would probably remain an acceptable trade-off when it saves binary size, but strangely, at -O3 the binary size savings are not great, yet the performance difference is very much present (checked multiple times).

It's unclear to me why there is a (small) performance difference between dev and pr802 in 32-bit x86 mode.
It's also unclear to me why the binary size saving is less effective at -O3 specifically, while -O2 seems to benefit more from it.

In conclusion, when it comes to XXH64, this pr802 seems slightly negative for performance. There is an interesting binary size saving on x86, especially at the -O2 and -Os settings, which could potentially compensate for the small speed degradation. But for 64-bit modes, there is unfortunately no positive outcome to report.

XXH128:

Let's redo the binary size measurements, using static library sizes instead.

| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O3 | 67232 | 53200 | -14032 | 20.1 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -O2 | 40880 | 36864 | -4016 | 14.0 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | -Os | 19752 | 20432 | +680 | 12.6 -> 14.9 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O3 | 107280 | 73174 | -34106 | 5.7 -> 5.6 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -O2 | 60796 | 39188 | -21608 | 2.8 -> 2.7 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | -m32 -Os | 25152 | 23278 | -1878 | 2.2 -> 2.4 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O3 | 50456 | 44408 | -6048 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -O2 | 48768 | 42248 | -6520 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | -Os | 21600 | 24336 | +2736 | 10.9 GB/s |

To begin with, it's clear that library size increases significantly with the XXH3 variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, OK. Yet I'm nonetheless surprised that it requires so much. The -Os compilation mode is the only one that lies in the "acceptable" range.

Here, pr802 brings pretty big binary size savings at -O2, and especially at -O3.
At -Os though, it's less clear, and it can actually increase size.

-O3 remains an important compilation flag for the best XXH128 performance on gcc: speed drops very significantly with -O2 and -Os.
On clang, the picture is slightly different: speed is exactly the same at -O3 and -O2, but at -Os the performance drop is very steep.

The performance impact of pr802 is all over the place: from neutral, to slightly negative, to positive.
The most interesting aspect is seeing the -Os mode receive a speed boost on gcc + Intel CPU.
For x64, this speed boost is significant enough to beat -O2, which is really nice since binary size is also smaller.
For x86, the speed boost is not as large, but still a hefty +10% compared to dev.
So there's probably something worth investigating there.

Note: some of these results are surprising. Ensuring that pr802 is rebased on top of dev is also important, so that this comparison doesn't produce wrong conclusions.

To be continued: add more targets, more compilers, etc.

@easyaspi314 (Contributor, Author) commented Feb 27, 2023

> is it possible to break this PR into smaller stages?

Sure, I can do that. Sorry I adhd'd again 😵‍💫

I think that the isolated changes that are not going to be controversial are:

  1. Force GCC 12 to -O3 because there doesn't seem to be a solution for -Og without ruining everything else
  2. Kill XXH_OLD_NAMES
  3. Rework the XXH3_update code to not dupe itself so much
  4. Force inline the lowest level functions on -Os. However, perhaps don't do it for anything above round or acc512, and don't use noinline either.
  5. Reduce the unrolling and inlining on 32-bit (but with some more benchmarking)
  6. Outline (and perhaps label) the individual steps better

The future changes could be:
a. Implement the hashLong refactor
b. Improve the selections in items 4 and 5 based on testing
c. If we are going to reduce inlining, perhaps we could sneak in an __attribute__((regparm(3))) on x86 to make tail calls better? (rough sketch below)
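
(A rough illustration of (c), not actual xxHash code: on 32-bit x86, regparm(3) passes the first three integer arguments in registers instead of on the stack, which makes calls and tail calls cheaper.)

#if defined(__GNUC__) && defined(__i386__)
#  define XXH_REGPARM __attribute__((regparm(3)))  /* hypothetical macro name */
#else
#  define XXH_REGPARM
#endif

XXH_REGPARM static unsigned regparm_round_sketch(unsigned acc, unsigned input)
{
    return (acc + input) * 2654435761u;  /* placeholder arithmetic */
}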

> the changes are beneficial to binary sizes for the performance mode, but when it comes to the size optimized -Os mode, binary size is actually increased.

Yes. That is expected.

On 64-bit targets, the size decrease at -O3 comes from XXH3_update, which has been refactored and is no longer inlined twice. There haven't been many other changes.

The size increase at -Os is expected too, as I have changed it to inline the utilities and core functions. However, 4 kB is a little more than I want; perhaps that comes from the XXH_NO_INLINE clones.

This change is because -Os is a pretty common default optimization level in various projects, and the aggressive size optimizations should be more opt-in. This is actually the reason Clang added -Oz; -Os was originally an alias for -O2 because people liked to overuse it.

The performance issues were more noticeable on GCC ARMv4, where it outlined XXH_memcpy() for unaligned reads.

I will have to look into fixing the GCC x86 performance issues, though. This might just be XXH_INLINE_64BIT not inlining the right functions, since the performance drop seems to be consistent.

> To begin with, it's clear that library size increases significantly with XXH3 variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, ok. Yet, I'm nonetheless surprised that it requires so much. The -Os compilation mode is the only one that seems in the "acceptable" range.

The way it is force-inlined, there are effectively 8.

You can easily see the size of each function if you compile with -ffunction-sections and use size -A xxhash.o | sort -nk2.
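
For example (the exact flags are illustrative; the key parts are -ffunction-sections and the size | sort pipe):

cc -O3 -ffunction-sections -c xxhash.c -o xxhash.o
size -A xxhash.o | sort -nk2   # each .text.<function> section is one function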

The short-input hashing code (XXH3_[64,128]bits_internal) is deceptively heavy on 32-bit. When compiled with XXH_NO_INLINE_HINTS, it is usually outlined and can be 1-2 kB by itself on 64-bit targets, and 5-10 kB on 32-bit.

However, arguably there is some benefit to this, as the short hash is more sensitive to latency, and the constant propagation of kSecret and seed==0 can be beneficial (especially on platforms with limited unaligned access, since all secret reads are aligned).

Additionally, as I mentioned before, the way hashLong is set up causes inline clones as well, which AFAIK no longer seems beneficial now that the code paths are identical.

Another note: on x86, the 64->128-bit multiply is a lot of instructions because of how disruptive the EAX:EDX register requirements are. Both GCC and Clang seem to be unaware of this code weight and inline/unroll 60+ bytes of code for each multiply. (I might add an XXH_NO_INLINE for all 32-bit targets except ARM with DSP.) Also, GCC 10+ on x86 emits a literal imul reg, reg, 0, which means I might need to hack mult32to64 for that too.

@easyaspi314 (Contributor, Author):

The first thing I'll do is find the minimum amount of #pragma GCC optimize("-O3") or __attribute__((optimize("-O3"))) needed to shut up GCC.
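
(For reference, the two forms look like this when applied to a made-up function; which xxHash functions actually need the override is exactly what remains to be determined.)

/* Per-function override: this function alone is optimized as if -O3 were given. */
__attribute__((optimize("-O3")))
static void hot_function_sketch(void) { /* ... */ }

/* Region override: everything between push_options and pop_options gets -O3. */
#pragma GCC push_options
#pragma GCC optimize("-O3")
static void another_hot_function_sketch(void) { /* ... */ }
#pragma GCC pop_options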

easyaspi314 mentioned this pull request Feb 27, 2023
@easyaspi314 (Contributor, Author) commented Mar 8, 2023

I have a rough draft of the hashLong refactor that is designed to be dispatched, but update() performance is pretty bad due to the function call latency (since consumeStripes() would need an __attribute__((target))). I might have to extract the entire second half of XXH3_update(). Changes start at ~line 5000.
https://gist.github.com/easyaspi314/795084ca1c04d4c884e6cff14cab80a4

@Cyan4973 (Owner) commented Jul 2, 2023

I believe several fixes for the gcc-12 -Og issue have already been merged.
Therefore, is this PR still useful? Or can it be closed?

@Cyan4973 (Owner):

The "fix gcc-12 -Og compilation" issue seems fixed now, thanks to other contributions.

However, the PR also lists "improve code size" as an objective.
Is there still something worth extracting from it ?

@t-mat (Contributor) commented Jul 10, 2023

I compared dev (258351b) and this PR (98be0b9).

Note that this report contains all XXH algorithms.

| algo | target | compiler | flags | dev size | pr802 size | lib size diff (pr802 - dev) |
|---|---|---|---|---|---|---|
| all | x64 + SSE2 | gcc v11.3.0 | -O3 | 51176 | 52888 | +1712 |
| all | x64 + SSE2 | gcc v11.3.0 | -O2 | 36752 | 37568 | +816 |
| all | x64 + SSE2 | gcc v11.3.0 | -Os | 19384 | 20432 | +1048 |
| all | x86 + scalar | gcc v11.3.0 | -m32 -O3 | 89016 | 73770 | -15246 |
| all | x86 + scalar | gcc v11.3.0 | -m32 -O2 | 52940 | 40494 | -12446 |
| all | x86 + scalar | gcc v11.3.0 | -m32 -Os | 24396 | 22734 | -1662 |

We can still see some size difference in x86 mode.

CSV
commit-hash,compiler,flags,filename,size
258351b,gcc-11,-O2,libxxhash.a,36752
258351b,gcc-11,-O3,libxxhash.a,51176
258351b,gcc-11,-Os,libxxhash.a,19384
258351b,gcc-11,-m32 -O2,libxxhash.a,52940
258351b,gcc-11,-m32 -O3,libxxhash.a,89016
258351b,gcc-11,-m32 -Os,libxxhash.a,24396
98be0b9,gcc-11,-O2,libxxhash.a,37568
98be0b9,gcc-11,-O3,libxxhash.a,52888
98be0b9,gcc-11,-Os,libxxhash.a,20432
98be0b9,gcc-11,-m32 -O2,libxxhash.a,40494
98be0b9,gcc-11,-m32 -O3,libxxhash.a,73770
98be0b9,gcc-11,-m32 -Os,libxxhash.a,22734
Procedure
cd
git clone https://github.com/Cyan4973/xxHash.git xxhash-pr-802
cd xxhash-pr-802

mkdir report
export CC="gcc-11"


git checkout dev
export CH=$(git --no-pager log -1 --format="%h")

export CFLAGS="-O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"

export CFLAGS="-m32 -O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"


git pull origin pull/802/head
git checkout FETCH_HEAD
git log -1
export CH=$(git --no-pager log -1 --format="%h")

export CFLAGS="-O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"

export CFLAGS="-m32 -O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"

# Show CSV
cat report/*

@t-mat (Contributor) commented Jul 10, 2023

As for code size, c6dc92f also reduces it and may implement more robust dispatch functionality.

@easyaspi314 (Contributor, Author):

> As for code size, c6dc92f also reduces it and may implement more robust dispatch functionality.

I should go back and finish that. I kinda got distracted by another project 😅

@Cyan4973 (Owner) commented Jul 12, 2023

> I should go back and finish that. I kinda got distracted by another project 😅

Ideally, I would like to publish release v0.8.2 in the very near future.

I presume completing this objective is not trivial, and therefore it should rather be expected as a potential contribution to a later version? (either v0.8.3 or v0.9.0)

@easyaspi314 (Contributor, Author):

> > I should go back and finish that. I kinda got distracted by another project 😅
>
> Ideally, I would like to publish release v0.8.2 in the very near future.
>
> I presume completing this objective is not trivial, and therefore it should rather be expected as a potential contribution to a later version? (either v0.8.3 or v0.9.0)

Looking over the code, I think it just needs some cleanup, documentation, and benchmarking.

Cyan4973 added this to the v0.8.3 milestone Jul 16, 2023