Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the AVX-512 versions of the following functions #454

Open
1 of 9 tasks
lemire opened this issue Mar 27, 2023 · 13 comments
Open
1 of 9 tasks

Implement the AVX-512 versions of the following functions #454

lemire opened this issue Mar 27, 2023 · 13 comments

Comments

@lemire
Copy link
Member

lemire commented Mar 27, 2023

Starting with version 1.0.0 of the library we have AVX-512 routines, but we are still missing many optimizations.

The following functions related to array containers may be upgraded to versions using AVX-512 instructions:

  • xor_vector16
  • union_vector16
  • difference_vector16
  • int32_t intersect_vector16
  • intersect_vector16_inplace
  • intersect_vector16_cardinality
  • bitset_set_list (also need AVX)
  • array_container_to_uint32_array
  • run_container_to_uint32_array (also need AVX)

This is an open issue and we are inviting pull requests.

cc @huihan365

@huihan365
Copy link
Contributor

I try to use avx2 to optimize union_vector16. but there is no improvement. Actually, i run the test comparing union_vector16 to union_uint16. the result is the same. Have you noticed this?

@CharlesChen888
Copy link
Contributor

CharlesChen888 commented Mar 28, 2023

i run the test comparing union_vector16 to union_uint16. the result is the same.

@huihan365 In our production environment, they have obvious performance gap. Are you sure you've tested it properly?

@lemire
Copy link
Member Author

lemire commented Mar 28, 2023

I need to write a better benchmark harness for this project. Give me a bit of time.

@huihan365
Copy link
Contributor

huihan365 commented Mar 29, 2023

i run the test comparing union_vector16 to union_uint16. the result is the same.

@huihan365 In our production environment, they have obvious performance gap. Are you sure you've tested it properly?

@CharlesChen888 I use the latest version and run the benchmark.(icelake Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz)
(base) root@icx-07:/data_ext/bak/hanhui/CRoaring/buildnoavx# taskset -c 0 benchmarks/array_container_benchmark
==intersection and union test 1
input 1 cardinality = 21846, input 2 cardinality = 13108
union cardinality = 30584
B1 card = 21846 B2 card = 13108
union_test(B1, B2, BO): 0.87 cycles per operation
==intersection and union test 2
input 1 cardinality = 4096, input 2 cardinality = 16
union cardinality = 4100
B1 card = 4096 B2 card = 16
union_test(B1, B2, BO): 0.08 cycles per operation
(base) root@icx-07:/data_ext/bak/hanhui/CRoaring/buildavx2# taskset -c 0 benchmarks/array_container_benchmark
==intersection and union test 1
input 1 cardinality = 21846, input 2 cardinality = 13108
union cardinality = 30584
B1 card = 21846 B2 card = 13108
union_test(B1, B2, BO): 0.99 cycles per operation
input 1 cardinality = 4096, input 2 cardinality = 16
union cardinality = 4100
B1 card = 4096 B2 card = 16
union_test(B1, B2, BO): 0.08 cycles per operation

@CharlesChen888
Copy link
Contributor

CharlesChen888 commented Mar 29, 2023

@huihan365 OK... I just did a benchmark on my mac (i7-9750H), and the results are similar. We need to test it with more different datasets.

AVX2 disabled:
==intersection and union test 1
union_test(B1, B2, BO): 0.72 cycles per operation
==intersection and union test 2
union_test(B1, B2, BO): 0.06 cycles per operation

AVX2 enabled:
==intersection and union test 1
union_test(B1, B2, BO): 0.53 cycles per operation
==intersection and union test 2
union_test(B1, B2, BO): 0.06 cycles per operation

@lemire
Copy link
Member Author

lemire commented Mar 29, 2023

The benchmarks that come with the library were always so-so. I wrote most of them very quickly and we did not rely on them to optimize the library. We should scrap them and do better.

I will prepare something, a first step forward, soon...

@lemire
Copy link
Member Author

lemire commented Mar 30, 2023

Note that my PR will come 'soon'. (Not weeks or months.)

@lemire
Copy link
Member Author

lemire commented Apr 1, 2023

@huihan365 @CharlesChen888 Have a look at PR
#460

I will merge it as soon as the tests are green. This gives you a much better way to benchmark the code. If you have privileged access to the system (via sudo) you will get performance counters.

@lemire
Copy link
Member Author

lemire commented Apr 1, 2023

Ok. I merged the PR and we now have sensible benchmarks with Google Benchmarks.

The instructions are...

Running microbenchmarks

We have microbenchmarks constructed with the Google Benchmarks.
Under Linux or macOS, you may run them as follows:

cmake --build build
./build/microbenchmarks/bench

By default, the benchmark tools picks one data set (e.g., CRoaring/benchmarks/realdata/census1881).
We have several data sets and you may pick others:

./build/microbenchmarks/bench benchmarks/realdata/wikileaks-noquotes 

You may disable some functionality for the purpose of benchmarking. For example, you could
benchmark the code without AVX-512 even if both your processor and compiler supports it:

cmake -B buildnoavx512  -D ROARING_DISABLE_AVX512=ON
cmake --build buildnoavx512
./buildnoavx512/microbenchmarks/bench

@lemire
Copy link
Member Author

lemire commented Apr 3, 2023

@huihan365 One can do some profiling using the new microbenchmarks, to identify the functions worth optimizing.

./build/microbenchmarks/bench --benchmark_filter=SuccessiveIntersectionCardinality

  43.72%  bench    libroaring.so.1.0.1   [.] intersect_skewed_uint16_cardinality
  35.35%  bench    libroaring.so.1.0.1   [.] intersect_vector16_cardinality
  11.83%  bench    libroaring.so.1.0.1   [.] roaring_bitmap_and_cardinality

./build/microbenchmarks/bench --benchmark_filter=SuccessiveUnionCardinality

  28.57%  bench    libroaring.so.1.0.1   [.] intersect_skewed_uint16_cardinality
  24.45%  bench    libroaring.so.1.0.1   [.] intersect_vector16_cardinality
  22.58%  bench    libroaring.so.1.0.1   [.] roaring_bitmap_get_cardinality
   8.06%  bench    libroaring.so.1.0.1   [.] roaring_bitmap_and_cardinality
   4.52%  bench    libroaring.so.1.0.1   [.] _avx512_run_container_cardinality.isra.0

./build/microbenchmarks/bench --benchmark_filter=ToArray

  93.92%  bench    libroaring.so.1.0.1   [.] array_container_to_uint32_array
   2.29%  bench    libroaring.so.1.0.1   [.] run_container_to_uint32_array
   1.73%  bench    libroaring.so.1.0.1   [.] ra_to_uint32_array

./build/microbenchmarks/bench --benchmark_filter="TotalUnion$"

  82.96%  bench    libroaring.so.1.0.1   [.] bitset_set_list
   4.74%  bench    libroaring.so.1.0.1   [.] bitset_container_from_array
   2.21%  bench    libroaring.so.1.0.1   [.] roaring_bitmap_lazy_or_inplace
   1.55%  bench    libc.so.6             [.] ____wcstod_l_internal
   1.45%  bench    libc.so.6             [.] ____wcstold_l_internal
   1.26%  bench    bench                 [.] load
   0.85%  bench    libroaring.so.1.0.1   [.] avx512_vpopcount.constprop.0

./build/microbenchmarks/bench --benchmark_filter="SuccessiveUnion$"

  21.74%  bench    libc.so.6             [.] ____wcstold_l_internal
  11.01%  bench    libroaring.so.1.0.1   [.] union_uint16
   9.00%  bench    libc.so.6             [.] ____wcstof_l_internal
   5.98%  bench    libc.so.6             [.] ____wcstod_l_internal
   4.47%  bench    libroaring.so.1.0.1      [.] array_run_container_union
   4.19%  bench    libroaring.so.1.0.1      [.] convert_run_to_efficient_container
   4.19%  bench    libroaring.so.1.0.1      [.] union_vector16

./build/microbenchmarks/bench --benchmark_filter="SuccessiveIntersection$"

  28.75%  bench    libroaring.so.1.0.1   [.] intersect_skewed_uint16
  19.10%  bench    libroaring.so.1.0.1   [.] intersect_vector16
  13.75%  bench    libc.so.6             [.] ____wcstof_l_internal

@lemire
Copy link
Member Author

lemire commented Apr 3, 2023

Note that you can disable AVX (and AVX-512) entirely if you want...

cmake -B buildnoavx -D ROARING_DISABLE_AVX=ON
cmake --build buildnoavx
./buildnoavx/microbenchmarks/bench

@lemire
Copy link
Member Author

lemire commented Apr 3, 2023

I have committed a PR that vectorizes array_container_to_uint32_array.

@lemire
Copy link
Member Author

lemire commented Apr 3, 2023

My commit is in the main branch. The difference between a vectorized and a non-vectorized approach is rather obvious:

$ ./build/microbenchmarks/bench --benchmark_filter=ToArray
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
ToArray        137763 ns       137104 ns         5276


$ ./noavx/microbenchmarks/bench --benchmark_filter=ToArray
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
ToArray        570715 ns       570574 ns         1221

Roughly a 2x gain.

The AVX-512 routine is not faster in this instance, but SIMD definitively helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants