Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: XXH_INLINE_ALL does not always force inline? How to force inline? #902

Open
simonhf opened this issue Dec 2, 2023 · 4 comments · May be fixed by #903
Open

Bug: XXH_INLINE_ALL does not always force inline? How to force inline? #902

simonhf opened this issue Dec 2, 2023 · 4 comments · May be fixed by #903

Comments

@simonhf
Copy link

simonhf commented Dec 2, 2023

Here's a simple program:

$ cat mystery.c
#include <stdio.h>
#include <assert.h>
#include <sys/time.h>

#include "xxHash/xxhash.h"

// export USE_FOO2=0; gcc -DUSE_FOO2=$USE_FOO2 -DXXH_INLINE_ALL -O1 -S -o mystery$USE_FOO2.s mystery.c && gcc -c -o mystery$USE_FOO2.o mystery$USE_FOO2.s && gcc -o mystery$USE_FOO2.exe mystery$USE_FOO2.o && ./mystery$USE_FOO2.exe; cat mystery$USE_FOO2.s | egrep -i "(^[a-z].*:|call)"
// export USE_FOO2=1; gcc -DUSE_FOO2=$USE_FOO2 -DXXH_INLINE_ALL -O1 -S -o mystery$USE_FOO2.s mystery.c && gcc -c -o mystery$USE_FOO2.o mystery$USE_FOO2.s && gcc -o mystery$USE_FOO2.exe mystery$USE_FOO2.o && ./mystery$USE_FOO2.exe; cat mystery$USE_FOO2.s | egrep -i "(^[a-z].*:|call)"

double get_time_in_seconds(void) {
    struct timeval tv;
    assert(gettimeofday(&tv, NULL) >= 0);
    return (double)tv.tv_sec + 1.e-6 * (double)tv.tv_usec;
}

uint64_t loop = 500000000;

void __attribute__ ((noinline)) foo1(void) {
    XXH64_hash_t hash_xor = 0;
    double t1 = get_time_in_seconds();
    for(uint64_t i = 0; i < loop ; i ++) {
        XXH64_hash_t hash = XXH64(&i, sizeof(i), 123456 /* seed */);
        hash_xor = hash_xor ^ hash;
    }
    double t2 = get_time_in_seconds();
    printf("- %lu 64bit hashes created xor hash 0x%lx in %f seconds or %.0f per second\n", loop, hash_xor, t2 - t1, loop / (t2 - t1));
}

#if USE_FOO2
void __attribute__ ((noinline)) foo2(void) {
    XXH64_hash_t hash_xor = 0;
    double t1 = get_time_in_seconds();
    for(uint64_t i = 0; i < loop ; i ++) {
        XXH64_hash_t hash = XXH64(&i, sizeof(i), 123456 /* seed */);
        hash_xor = hash_xor ^ hash;
    }
    double t2 = get_time_in_seconds();
    printf("- %lu 64bit hashes created xor hash 0x%lx in %f seconds or %.0f per second\n", loop, hash_xor, t2 - t1, loop / (t2 - t1));
}
#endif

void main(void) {
    foo1();
    //foo2();
}

Compiling and running it twice, the 1st time XXH64() is inlined, but the 2nd time it is not inlined:

$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

$ git clone https://github.com/Cyan4973/xxHash.git

$ export USE_FOO2=0; gcc -DUSE_FOO2=$USE_FOO2 -DXXH_INLINE_ALL -O1 -S -o mystery$USE_FOO2.s mystery.c && gcc -c -o mystery$USE_FOO2.o mystery$USE_FOO2.s && gcc -o mystery$USE_FOO2.exe mystery$USE_FOO2.o && ./mystery$USE_FOO2.exe; cat mystery$USE_FOO2.s | egrep -i "(^[a-z].*:|call)"
- 500000000 64bit hashes created xor hash 0xde5d248747c84e34 in 1.452864 seconds or 344147851 per second
get_time_in_seconds:
	call	gettimeofday@PLT
	call	__assert_fail@PLT
	call	__stack_chk_fail@PLT
foo1:
	call	get_time_in_seconds
	call	get_time_in_seconds
	call	__printf_chk@PLT
main:
	call	foo1
loop:

$ export USE_FOO2=1; gcc -DUSE_FOO2=$USE_FOO2 -DXXH_INLINE_ALL -O1 -S -o mystery$USE_FOO2.s mystery.c && gcc -c -o mystery$USE_FOO2.o mystery$USE_FOO2.s && gcc -o mystery$USE_FOO2.exe mystery$USE_FOO2.o && ./mystery$USE_FOO2.exe; cat mystery$USE_FOO2.s | egrep -i "(^[a-z].*:|call)"
- 500000000 64bit hashes created xor hash 0xde5d248747c84e34 in 2.790860 seconds or 179156250 per second
XXH64_finalize:
get_time_in_seconds:
	call	gettimeofday@PLT
	call	__assert_fail@PLT
	call	__stack_chk_fail@PLT
foo1:
	call	get_time_in_seconds
	call	XXH64_finalize
	call	get_time_in_seconds
	call	__printf_chk@PLT
	call	__stack_chk_fail@PLT
foo2:
	call	get_time_in_seconds
	call	XXH64_finalize
	call	get_time_in_seconds
	call	__printf_chk@PLT
	call	__stack_chk_fail@PLT
main:
	call	foo1
loop:

$ wc --bytes mystery*
16280 mystery0.exe
 3088 mystery0.o
 3508 mystery0.s
16344 mystery1.exe
 4064 mystery1.o
 6976 mystery1.s
 3314 mystery.c

The xxHash README says "By default, xxHash uses attribute((always_inline)) and __forceinline to improve performance at the cost of code size." but does it really? Or is this a compiler bug? Or should I be compiling the above example differently somehow?

P.S. This also goes against my understanding of how the compiler decides to inline or not inline, if force inline is not specified. I always thought the compiler decides on code size in a function. But in this example, the code size of the function stays the same. We just add another function which is never actually called. So... ?!

@simonhf
Copy link
Author

simonhf commented Dec 2, 2023

Looking at the code, it looks like XXH_FORCE_INLINE is correctly defined but not used for XXH64_finalize() for some reason?

static XXH_PUREF xxh_u64
XXH64_finalize(xxh_u64 hash, const xxh_u8* ptr, size_t len, XXH_alignment align)
{
...
    return  XXH64_avalanche(hash);
}

XXH_FORCE_INLINE XXH_PUREF xxh_u64
XXH64_endian_align(const xxh_u8* input, size_t len, xxh_u64 seed, XXH_alignment align)
{
...
    return XXH64_finalize(h64, input, len, align);
}

Because it is not used then the compiler decides itself whether to inline XXH64_finalize() or not...

I tried the following and it worked for my example, but I don't know if it breaks anything else:

XXH_FORCE_INLINE XXH_PUREF xxh_u64
XXH64_finalize(xxh_u64 hash, const xxh_u8* ptr, size_t len, XXH_alignment align)
{
...
    return  XXH64_avalanche(hash);
}

Suggestion: Maybe it have some kind of test / CI test to compile to assembler and then to object file -- like I did above -- and then have a separate test which scans the generated assembler files and fails if xxHash functions are not inlined? Presumably such a mechanism would have found the above bug? And also, maybe some of the benchmarks are incorrect, because without such a mechanism, how can you be 100% sure all functions are inlined as desired? And as the above results show, not inlining can easily ~ halve performance...

@Cyan4973
Copy link
Owner

Cyan4973 commented Dec 2, 2023

Great investigation @simonhf !

While the fix seems simple, a key point here will be to create a CI test able to catch such oversight, as you suggest, to avoid a repetition of such issue in future updates.

@Cyan4973
Copy link
Owner

Cyan4973 commented Dec 8, 2023

Looking at this issue again,
I'm trying to create a test which would be able to catch such issue.

Within the repository, there is already a target called xxhsum_inlinedXXH which could be used for that.
It's main purpose is to check that compiling xxhsum with inlined libxxhash works as intended.
It's then possible to add a symbol extractor such as nm as a follow up test.

First surprise: on my macos laptop, I don't see this issue.

But then, I realize that the report uses gcc-11 as a compiler,
and reproducing the test with gcc-11 specifically
the _XXH64_finalize symbol is now visible.

It shows that the inlining issue is compiler (and possibly version) dependent. As expected btw.

All the more reasons to create a CI test to catch it.

@simonhf
Copy link
Author

simonhf commented Dec 9, 2023

@Cyan4973 sorry, I was planning to come up with a PR for this over the holidays because I thought it would be a great little project to work on with my son who is in his first year studying computer science! But you beat me to it! :-) Anyway, I left some comments and suggestions on the PR. HTH!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants