Use function pointers for runtime dispatching #303

Oppen · 2021-06-13T03:43:28Z

Rather than branching each time, consider doing something like this:

int (*run_container_cardinality)(const run_container_t *run) = run_container_cardinality_dispatch;

int run_container_cardinality_dispatch(const run_container_t *run) {
    if (croaring_avx2()) {
        run_container_cardinality = _avx2_run_container_cardinality;
    } else {
        run_container_cardinality = _scalar_run_container_cardinality;
    }
    return run_container_cardinality(run);
}

This way after the first call all calls will be direct. It may give a tiny performance gain.

lemire · 2021-06-13T19:03:36Z

Is it your expectation that the code you are proposing is thread safe?

I think you need atomic pointers.

Otherwise: pull requests are invited.

Oppen · 2021-06-13T19:19:36Z

Good point. Are indirect calls atomic in general? In ARM? x86?

lemire · 2021-06-13T20:07:18Z

Good point. Are indirect calls atomic in general? In ARM? x86?

Even if the underlying hardware does it for free, you will still get flagged by the sanitizers.

The croaring_avx2() does rely on atomics for that reason. But it is not nearly as pretty as I would have hoped.

lemire · 2021-06-13T20:10:42Z

See

CRoaring/include/roaring/isadetection.h

Lines 170 to 198 in a5656ae

    
           #if defined(__x86_64__) || defined(_M_AMD64) // x64 
        
           #if defined(__cplusplus) 
        
           #include <atomic> 
        
           static inline uint32_t croaring_detect_supported_architectures() { 
        
               static std::atomic<int> buffer{CROARING_UNINITIALIZED}; 
        
               if(buffer == CROARING_UNINITIALIZED) { 
        
                 buffer = dynamic_croaring_detect_supported_architectures(); 
        
               } 
        
               return buffer; 
        
           } 
        
           #elif defined(_MSC_VER) && !defined(__clang__) 
        
           // Visual Studio does not support C11 atomics. 
        
           static inline uint32_t croaring_detect_supported_architectures() { 
        
               static int buffer = CROARING_UNINITIALIZED; 
        
               if(buffer == CROARING_UNINITIALIZED) { 
        
                 buffer = dynamic_croaring_detect_supported_architectures(); 
        
               } 
        
               return buffer; 
        
           } 
        
           #else // defined(__cplusplus) and defined(_MSC_VER) && !defined(__clang__) 
        
           #include <stdatomic.h> 
        
           static inline uint32_t croaring_detect_supported_architectures() { 
        
               static _Atomic int buffer = CROARING_UNINITIALIZED; 
        
               if(buffer == CROARING_UNINITIALIZED) { 
        
                 buffer = dynamic_croaring_detect_supported_architectures(); 
        
               } 
        
               return buffer; 
        
           }

lemire · 2021-06-13T20:11:41Z

Note that we don't do runtime dispatching under ARM. With ARM, NEON is usually available by default so you would not need it now. Even so, we may hope that, for that reason, compilers do a good job using NEON via autovectorization if the proper optimization flags are provided under ARM.

We only do runtime dispatching for AVX2 support. It is where the gains are most important since compilers won't emit AVX2 by default. But most people have AVX2 support at this point in time. So there is huge performance gap there.

lemire · 2021-07-14T16:52:13Z

Note that I had a terrible performance bug in my initial implementation.

Fixed by 6403d44

I would call dynamic_croaring_detect_supported_architectures() on dispatching. We now call croaring_detect_supported_architectures().

The roaring_detect_supported_architectures() is much more efficient because it only does the CPU detection the first time it is called.

static inline uint32_t croaring_detect_supported_architectures() {
    static std::atomic<int> buffer{CROARING_UNINITIALIZED};
    if(buffer == CROARING_UNINITIALIZED) {
      buffer = dynamic_croaring_detect_supported_architectures();
    }
    return buffer;
}

It should be quite cheap.

Note that, in CRoaring, we use runtime dispatch strategically. It is not used for processing small blocks of data, it is always to process, e.g., a whole container or aggregate two containers. Of course, if you a ton of small containers, there is some overhead, but then you have other problems then.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use function pointers for runtime dispatching #303

Use function pointers for runtime dispatching #303

Oppen commented Jun 13, 2021

lemire commented Jun 13, 2021

Oppen commented Jun 13, 2021

lemire commented Jun 13, 2021

lemire commented Jun 13, 2021

lemire commented Jun 13, 2021

lemire commented Jul 14, 2021

Use function pointers for runtime dispatching #303

Use function pointers for runtime dispatching #303

Comments

Oppen commented Jun 13, 2021

lemire commented Jun 13, 2021

Oppen commented Jun 13, 2021

lemire commented Jun 13, 2021

lemire commented Jun 13, 2021

lemire commented Jun 13, 2021

lemire commented Jul 14, 2021