Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use function pointers for runtime dispatching #303

Open
Oppen opened this issue Jun 13, 2021 · 6 comments
Open

Use function pointers for runtime dispatching #303

Oppen opened this issue Jun 13, 2021 · 6 comments

Comments

@Oppen
Copy link
Collaborator

Oppen commented Jun 13, 2021

Rather than branching each time, consider doing something like this:

int (*run_container_cardinality)(const run_container_t *run) = run_container_cardinality_dispatch;

int run_container_cardinality_dispatch(const run_container_t *run) {
    if (croaring_avx2()) {
        run_container_cardinality = _avx2_run_container_cardinality;
    } else {
        run_container_cardinality = _scalar_run_container_cardinality;
    }
    return run_container_cardinality(run);
}

This way after the first call all calls will be direct. It may give a tiny performance gain.

@lemire
Copy link
Member

lemire commented Jun 13, 2021

Is it your expectation that the code you are proposing is thread safe?

I think you need atomic pointers.

Otherwise: pull requests are invited.

@Oppen
Copy link
Collaborator Author

Oppen commented Jun 13, 2021

Good point. Are indirect calls atomic in general? In ARM? x86?

@lemire
Copy link
Member

lemire commented Jun 13, 2021

Good point. Are indirect calls atomic in general? In ARM? x86?

Even if the underlying hardware does it for free, you will still get flagged by the sanitizers.

The croaring_avx2() does rely on atomics for that reason. But it is not nearly as pretty as I would have hoped.

@lemire
Copy link
Member

lemire commented Jun 13, 2021

See

#if defined(__x86_64__) || defined(_M_AMD64) // x64
#if defined(__cplusplus)
#include <atomic>
static inline uint32_t croaring_detect_supported_architectures() {
static std::atomic<int> buffer{CROARING_UNINITIALIZED};
if(buffer == CROARING_UNINITIALIZED) {
buffer = dynamic_croaring_detect_supported_architectures();
}
return buffer;
}
#elif defined(_MSC_VER) && !defined(__clang__)
// Visual Studio does not support C11 atomics.
static inline uint32_t croaring_detect_supported_architectures() {
static int buffer = CROARING_UNINITIALIZED;
if(buffer == CROARING_UNINITIALIZED) {
buffer = dynamic_croaring_detect_supported_architectures();
}
return buffer;
}
#else // defined(__cplusplus) and defined(_MSC_VER) && !defined(__clang__)
#include <stdatomic.h>
static inline uint32_t croaring_detect_supported_architectures() {
static _Atomic int buffer = CROARING_UNINITIALIZED;
if(buffer == CROARING_UNINITIALIZED) {
buffer = dynamic_croaring_detect_supported_architectures();
}
return buffer;
}

@lemire
Copy link
Member

lemire commented Jun 13, 2021

Note that we don't do runtime dispatching under ARM. With ARM, NEON is usually available by default so you would not need it now. Even so, we may hope that, for that reason, compilers do a good job using NEON via autovectorization if the proper optimization flags are provided under ARM.

We only do runtime dispatching for AVX2 support. It is where the gains are most important since compilers won't emit AVX2 by default. But most people have AVX2 support at this point in time. So there is huge performance gap there.

@lemire
Copy link
Member

lemire commented Jul 14, 2021

Note that I had a terrible performance bug in my initial implementation.

Fixed by 6403d44

I would call dynamic_croaring_detect_supported_architectures() on dispatching. We now call croaring_detect_supported_architectures().

The roaring_detect_supported_architectures() is much more efficient because it only does the CPU detection the first time it is called.

static inline uint32_t croaring_detect_supported_architectures() {
    static std::atomic<int> buffer{CROARING_UNINITIALIZED};
    if(buffer == CROARING_UNINITIALIZED) {
      buffer = dynamic_croaring_detect_supported_architectures();
    }
    return buffer;
}

It should be quite cheap.

Note that, in CRoaring, we use runtime dispatch strategically. It is not used for processing small blocks of data, it is always to process, e.g., a whole container or aggregate two containers. Of course, if you a ton of small containers, there is some overhead, but then you have other problems then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants