
Support SVE with assembly implementation #762

Open
wants to merge 4 commits into base: dev

Conversation

hzhuang1
Contributor

With an assembly implementation, SVE performance can be improved further.

@hzhuang1 hzhuang1 mentioned this pull request Nov 24, 2022
@Cyan4973
Owner

Cyan4973 commented Dec 1, 2022

I presume we'll wait for the currently ongoing issues to be fixed (CI tests, notably for SVE mode, broken dispatch) before reviewing this PR.

Rename xxh_x86dispatch.h to xxh_dispatch.h since it will be shared with
other architectures.

Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>
Dispatch SVE, NEON and SCALAR implementations on arm64 by selecting
different macros.

Since compilers do not yet support the SVE implementation well, use
assembly code instead.

In the intrinsic SVE implementation:
  * Avoid frequent memory accesses to the ACC array in the accumulation
    routine.

In the assembly SVE implementation (dispatcher):
  * Avoid frequent memory accesses to the ACC array in the accumulation
    routine.
  * Use assembly code in the scramble routine.
  * Since both the accumulation and scramble routines are in the internal
    loop, convert the internal loop to an assembly version; here too,
    avoid frequent memory accesses to the ACC array.

Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>
Make bench tests support the assembly SVE routine.

With the SVE intrinsic implementation enabled, the build commands and
performance data are below.

  $ export CPP_FLAGS="-DXXH_VECTOR=XXH_SVE"
  $ export CFLAGS="-O3 -march=armv8-a+sve -fPIC -DXXH_VECTOR=XXH_SVE"
  $ make

 ===  benchmarking 4 hash functions  ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   ,  3679,  6019,  7807,  8945,  9862, 10343, 10622, 10604, 10782, 10697, 10763, 10900, 10913,  9959,  6374,  5979,  6057,  6076,  6108
XXH32  ,  1326,  1440,  1495,  1523,  1534,  1541,  1545,  1534,  1505,  1506,  1507,  1506,  1508,  1456,  1248,  1195,  1199,  1201,  1200
XXH64  ,  2510,  2803,  2978,  3072,  3121,  3139,  3155,  3127,  3051,  3046,  3059,  3060,  3059,  2899,  2117,  1983,  1991,  1993,  1991
XXH128 ,  3421,  5791,  7501,  8891,  9787, 10363, 10646, 10435, 10809, 10935, 10974, 10999, 11002,  9916,  6099,  5773,  6110,  6109,  6119

With the SVE assembly implementation enabled, the build commands and
performance data are below.

  $ export CPP_FLAGS="-DXXH_VECTOR=XXH_SVE"
  $ export CFLAGS="-O3 -march=armv8-a+sve -fPIC -DXXH_VECTOR=XXH_SVE"
  $ make DISPATCH=1

 ===  benchmarking 4 hash functions  ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   ,  4142,  6663,  9745, 12327, 13990, 15064, 15631, 15515, 15412, 14055, 14105, 14135, 14126, 11953,  4585,  4000,  4013,  4042,  4033
XXH32  ,  1326,  1440,  1495,  1523,  1535,  1543,  1547,  1536,  1500,  1503,  1503,  1502,  1485,  1452,  1243,  1192,  1199,  1199,  1197
XXH64  ,  2499,  2760,  2975,  3071,  3122,  3137,  3153,  3133,  3041,  3044,  3015,  3051,  3030,  2897,  2124,  1977,  1988,  1989,  1967
XXH128 ,  3903,  6454,  9485, 11954, 13807, 15135, 15891, 15381, 15376, 15442, 15678, 15677, 15728, 13096,  4698,  4132,  4046,  4044,  4051

Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>
If an assembler source contains no GNU-stack note, the system by default
assumes that an executable stack may be required. GCC generates code to
be executed on the stack when it implements a trampoline for nested
functions. The default behavior introduces a security issue.

Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>
@hzhuang1
Contributor Author

Rebased the patch set since the CI issue has been fixed.

@Cyan4973
Owner

Cyan4973 commented Feb 2, 2023

This is a more complex PR, it will take some time to get through it.

To begin with, we may have another opportunity to divide and conquer here, and that would make each part easier to review. It seems there are 2 combined efforts that could be isolated :

  • SVE assembly : if I'm not mistaken, we already have an SVE code path in C. I presume the assembly version is probably more efficient. I also presume it neatly substitutes for the pure C code path, without any other change of logic.
    This would make it possible to analyse this part on its own, separately.
    The main issues will be around building and testing.
    It's not yet clear if it can be merged into mainline, or is the territory of specialized forks.

  • arm dispatch : my understanding is that it extends the current x86dispatch.
    It's unclear to me if this is a stand-alone capability that can be selected manually,
    or if it's tied to some kind of automatic target determination system.
    Here also, if it's possible to split that scope into smaller steps, this will increase the chances to pass review and get merged.
    The main issues will be around complexity, both code and build complexities.
    It has a decent chance to make it into mainline, but this outcome directly depends on the complexity introduced.

@Cyan4973
Owner

Cyan4973 commented Feb 3, 2023

Speaking of Assembly version :

I was lucky enough to access a server with SVE support, a c7g unit from aws.
Now, this SoC supports SVE, but that's really all that can be said.
Performance wasn't much different from NEON, so I assume the vector unit is still limited to 128-bit.

Nonetheless, this was a good opportunity to compare the C SVE code path with the Assembly one.

And the difference was quite small, approximately +5% in favor of assembly.

Now, this could be because c7g is not a great SVE platform.
So a general question is : have you observed larger performance differences when benchmarking assembly SVE ? If yes, on which platforms and which scenarios ?

This matters, because assembly introduces a substantial build difficulty, so it should be matched by some corresponding benefit. If +5% is about the right expectation, then mainline is probably not the best target for it (though it is still a good reason to create a specialized SVE fork with this code).

@easyaspi314
Contributor

From a small amount of digging, c7g (AWS Graviton 3) seems to be based on the Neoverse V1, which is SVE-256.

Looking at the optimization guide:

  • Interleaving the two vector operations is probably best, given how ARM cores work, as much as ACCRND helps readability
  • AND is better than UXTW since it has higher throughput and can be executed on both pipelines (whyyyy, it is literally the same thing 😕)
  • As a side note, XXH3_NEON_LANES=8 is probably optimal on this processor since it has 4 NEON decoders.

However, I question how much of the C performance gap is just compilers not being quite ready yet, and whether in a few months the entire code will be obsolete (especially since SVE-compatible machines are now more readily available).

@Cyan4973
Owner

Cyan4973 commented Feb 4, 2023

So your guess is that SVE performance on Graviton3 could have been better,
but it's a matter of correctly optimizing for this architecture (manually or via compiler).

@easyaspi314
Contributor

easyaspi314 commented Feb 4, 2023

So your guess is that SVE performance on Graviton3 could have been better,
but it's a matter of correctly optimizing for this architecture (manually or via compiler).

Yes. If Graviton3 is an ARM design and not designed in-house, it is likely just as sensitive to instruction ordering and pipelining as the Cortexes I've been fiddling with.

@easyaspi314
Contributor

easyaspi314 commented Feb 4, 2023

As for the interleaved SVE, try replacing the sve256 loop block (L294 to L306) with this. Disclaimer: I haven't tested this, and my ordering might not be ideal.

10:
    // since vector length is known, avoid predicates for more freedom
    ldr    z19, [x1]
    ldr    z20, [x1, #1, mul vl]
    ldr    z21, [x2]
    ldr    z22, [x2, #1, mul vl]
    prfd   pldl1strm, p7, [x1, #31, mul vl]
    eor    z21.d, z21.d, z19.d
    eor    z22.d, z22.d, z20.d
    tbl    z23.d, {z19.d}, z7.d
    tbl    z24.d, {z20.d}, z7.d
    lsr    z25.d, z21.d, #32
    and    z21.d, z21.d, #0xFFFFFFFF
    // Pretty sure this will forward to the add
    mad    z23.d, p7/m, z25.d, z21.d
      // Encourage V1 usage
      lsr    z26.d, z22.d, #32
    add    z17.d, z17.d, z23.d
      and    z22.d, z22.d, #0xFFFFFFFF
    mad    z24.d, p7/m, z26.d, z22.d
    add    z18.d, z18.d, z24.d
    add    x1, x1, #64
    add    x2, x2, #8
    add    x10, x10, #1
    cmp    x10, x3
    b.lt   10b

@Cyan4973
Owner

Cyan4973 commented Feb 4, 2023

As for the interleaved SVE, try replacing the sve256 loop block (L294 to L306) with this. Disclaimer, I haven't tested this.

Unfortunately, my access to this platform was very temporary.
I don't have it anymore...

@easyaspi314
Contributor

As a side note I wonder if it is beneficial to just use NEON on SVE-128.

@easyaspi314
Contributor

As for the interleaved SVE, try replacing the sve256 loop block (L294 to L306) with this. Disclaimer, I haven't tested this.

Unfortunately, my access to this platform was very temporary.
I don't have it anymore...

No biggie.

@hzhuang1
Contributor Author

hzhuang1 commented Feb 5, 2023

Speaking of Assembly version :

I was lucky enough to access a server with SVE support, a c7g unit from aws. Now, this SoC supports SVE, but that's really all that can be said. Performance wasn't much different from NEON, so I assume the vector unit is still limited to 128-bit.

Let me check whether I can get aws c7g.

Nonetheless, this was a good opportunity to compare the C SVE code path with the Assembly one.

And the difference was quite small, approximately +5% in favor of assembly.

I mainly worked on the Fujitsu A64FX platform. The performance is improved a lot with assembly code. As I remember, A64FX is based on ARMv8.2, and AWS c7g may be based on ARMv8.4.

Now, this could be because c7g is not a great SVE platform. So a general question is : have you observed larger performance differences when benchmarking assembly SVE ? If yes, on which platforms and which scenarios ?

Yes, I mainly verified it on Fujitsu A64FX.

Implementation | Performance (descending)
SVE assembly   | Best (reaches 15 GB/s in the best case)
SVE C          | Better (reaches 10 GB/s in the best case)
Scalar         | Good (less than 4 GB/s in the best case)
NEON           | Worse than scalar

(attached: xxhash_sve benchmark chart)

@easyaspi314
Contributor

easyaspi314 commented Feb 5, 2023

By the way, I did some research on the A64FX, and from what it appears, NEON has been severely performance-deprecated (as in, everything but the trivial instructions has 6-12 cycles of latency) to put more power into SVE. This is an understandable design choice because it is designed for specialized uses and not general purpose.

So NEON being bad on this particular platform is unavoidable.

@hzhuang1
Contributor Author

hzhuang1 commented Feb 15, 2023

I tried it on c7g.
(screenshot of benchmark results attached)

SVE doesn't work as expected. I'll keep investigating. In my understanding, SVE should work a bit better than NEON.

@easyaspi314 There's an issue in your code snippet: the predicated form is "LSR <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>". If I switch from UXTW to the AND instruction, I need one more instruction, so I gain no benefit.

@easyaspi314
Contributor

easyaspi314 commented Feb 19, 2023

I think that for now we should only do SVE-512.

Looking at the optimization guide, c7g is a tradeoff because while SVE can process 2x the data, NEON always has at least 2x the IPC. (Also NEON already has 256-bit loads).

Also scalar can still be executed in the background, and we lose that benefit unless we mix in NEON which would just be messy and not worth the complexity.

However, with 100% optimal code, SVE will lose because the multiply is significantly slower and can't be parallelized:

Instruction | Latency | Throughput | Pipelines
NEON umlal  | 4 (1)   | 2          | V0, V2
SVE mad     | 5 (2)   | 1/2 (!!!)  | V0

(Parentheses are if the result can't be forwarded to another add instruction, which isn't the case in XXH3)

The ARM pipeline is so fun 💀

SVE2 may have better performance because it can use umlalb instead of and/uxtw + mad, but currently all SVE2-compatible processors are 128-bit, and the only reason SVE is viable is bandwidth.

@hzhuang1
Contributor Author

hzhuang1 commented Feb 20, 2023

I think that for now we should only do SVE-512.

Looking at the optimization guide, c7g is a tradeoff because while SVE can process 2x the data, NEON always has at least 2x the IPC. (Also NEON already has 256-bit loads).

Yes, I agree. SVE doesn't improve performance much on SVE-128 & SVE-256.

On SVE-256 (V1 core), I tried to tune assembly code. The latest performance is attached.
(screenshot of benchmark results attached)

But it also drops as buffer size increases. It's very similar to what happened on A64FX. I'll continue to check whether it can be improved.

I can briefly list what I did.

  • Run similar SVE instructions at the same time, to fully use pipeline throughput.
  • Since the scalar pipeline is separate from the vector pipeline, interleave scalar and vector instructions.
  • Do not prefetch data in every loop iteration. Prefetch once, covering several iterations.

The debug code is in branch debug_sve_03 of my repo.

Also scalar can still be executed in the background, and we lose that benefit unless we mix in NEON which would just be messy and not worth the complexity.

However, with 100% optimal code, SVE will lose because the multiply is significantly slower and can't be parallelized:

Instruction | Latency | Throughput | Pipelines
NEON umlal  | 4 (1)   | 2          | V0, V2
SVE mad     | 5 (2)   | 1/2 (!!!)  | V0

(Parentheses are if the result can't be forwarded to another add instruction, which isn't the case in XXH3)

The ARM pipeline is so fun 💀

I found something interesting on different chips.

We know that NEON works well on Neoverse N1 & V1 core. But I also find that NEON works worse than scalar on Fujitsu A64FX and Hisilicon Kirin920. These two chips are older than N1 & V1.

My point is that vector instructions may not bring great performance initially. But the cores are always evolving; they may become better in the future.

SVE2 may have better performance because it can use umlalb instead of and/uxtw + mad, but currently all SVE2-compatible processors are 128-bit, and the only reason SVE is viable is bandwidth.

Good point. Let me check whether it could bring benefits.

@easyaspi314
Contributor

easyaspi314 commented Feb 21, 2023

Yes, I agree. SVE doesn't improve performance much on SVE-128 & SVE-256.

On SVE-256 (V1 core), I tried to tune assembly code. The latest performance is attached.
(Snip)
But it also drops while buffer size is increased. It's very similar as what happened on A64FX. I'll continue to check whether it could be improved.

I assume that SVE is more sensitive to cache.

I can simply list what I did.

  • Run similar SVE instructions at the same time, to fully use pipeline throughput.
  • Since the scalar pipeline is separate from the vector pipeline, interleave scalar and vector instructions.
  • Do not prefetch data in every loop iteration. Prefetch once, covering several iterations.

The debug code is in branch debug_sve_03 of my repo.

Will take a look, thanks.

I found something interesting on different chips.

We know that NEON works well on Neoverse N1 & V1 core. But I also find that NEON works worse than scalar on Fujitsu A64FX and Hisilicon Kirin920. These two chips are older than N1 & V1.

There are some issues with 32-bit ARM that I want to work out which would cause issues on the latter. A lot of it seems to be issues with load-store.

If I write XXH3_accumulate_512_neon in 32-bit ARM inline asm, I can get a significant performance boost of up to 12 GB/s vs 7 GB/s. However, this is at a tradeoff of practicality — Google and Apple require all apps to have 64-bit support, so this will only really benefit Android devices that are at least 6 years old and armv7l Linux. I want the main target to be AArch64 and only go for low-footprint optimizations for 32-bit ARM.

My point is that vector instructions may not bring great performance initially. But the cores are always evolving; they may become better in the future.

Hopefully. It unfortunately seems like ARM has gone back to 128-bit SVE for its newest designs, including the V2. This might be because, as we see on c7g, the throughput penalty makes it not that much better than 2 NEON instructions.

Processor    | SVE version | No. units | Width
A64FX        | 1           | 2?        | 512
Neo-V1 (c7g) | 1           | 2         | 256
Neo-N2       | 2           | 2         | 128
Ctx-X2       | 2           | 4?        | 128
Neo-V2       | 2           | 4         | 128

SVE2 may have better performance because it can use umlalb instead of and/uxtw + mad, but currently all SVE2-compatible processors are 128-bit, and the only reason SVE is viable is bandwidth.

Good point. Let me check whether it could bring benefits.

c7g does not have SVE2.

However, if you have access to an SVE2 machine with a known 128-bit vector size, try replacing XXH3_accumulate_512_neon with this. This is just NEON but uses rev64+umlalb for a significantly faster shuffle+multiply. It requires manually setting the vector length though.

Also make sure to put #include <arm_neon.h> above the line that includes <arm_sve.h>.

typedef svuint64_t xxh_u64x2 __attribute__((arm_sve_vector_bits(128)));
typedef svuint32_t xxh_u32x4 __attribute__((arm_sve_vector_bits(128)));

XXH_FORCE_INLINE void XXH3_accumulate_512_neon(
    void *XXH_RESTRICT acc,
    const void *XXH_RESTRICT input,
    const void *XXH_RESTRICT secret
)
{
    size_t i;
    uint64x2_t *xacc = (uint64x2_t *) acc;
    const uint8_t *xinput = (const uint8_t *)input;
    const uint8_t *xsecret = (const uint8_t *)secret;
    XXH_ASSERT(svcntd() == 2);

    for (i = 6; i < 8; i++) {
        XXH3_scalarRound(acc, input, secret, i);
    }
    for (i = 0; i < 3; i++) {
        uint64x2_t data_vec = XXH_vld1q_u64(xinput + 16 * i);
        uint64x2_t key_vec = XXH_vld1q_u64(xsecret + 16 * i);
        uint64x2_t swapped = vextq_u64(data_vec, data_vec, 1);
        uint64x2_t mixed_lo = veorq_u64(data_vec, key_vec);
        /* (x << 32) | (x >> 32) */
        uint32x4_t mixed_hi = vrev64q_u32(vreinterpretq_u32_u64(mixed_lo));
        uint64x2_t mul = (uint64x2_t)(xxh_u64x2)
        svmlalb_u64(
            (xxh_u64x2)swapped,
            (xxh_u32x4)mixed_lo,
            (xxh_u32x4)mixed_hi
        );
        xacc[i] = vaddq_u64(mul, xacc[i]);
    }
}

When compiling, use these flags (also try both GCC and Clang):

make CFLAGS="-O3 -DXXH_VECTOR=XXH_NEON -march=armv8-a+sve2 -msve-vector-bits=128 -fno-tree-vectorize"

@hzhuang1
Contributor Author

hzhuang1 commented Feb 25, 2023

@easyaspi314 Although both NEON and SVE2 support multiplying two 32-bit inputs into a 64-bit result, they're totally different.

     ---------------------
     D1L   D1H   D2L   D2H
     ---------------------
     D3L   D3H   D4L   D4H
     ---------------------
    
    SVE-128 (UMULLB):
     ---------------------
     D1L * D3L | D2L * D4L
     ---------------------
     D1H * D3H | D2H * D4H
     ---------------------
    
    NEON (UMULL):
     ---------------------
     D1L * D3L | D1H * D3H
     ---------------------
     D2L * D4L | D2H * D4H
     ---------------------

@easyaspi314
Contributor

easyaspi314 commented Feb 25, 2023

Yes, and the reason it is favorable is that instead of requiring the uzp1/uzp2 setup, it can be done with rev64. The complicated shuffle is what makes NEON less efficient than SSE2.

@hzhuang1
Contributor Author

Yes, and the reason it is favorable is that instead of requiring the uzp1/uzp2 setup, it can be done with rev64. The complicated shuffle is what makes NEON less efficient than SSE2.

Sorry, I didn't get the point.

After the uzp1/uzp2 instructions, the data layout changes as below.

        ------------------------------
 0x00 | D1L   D2L   D3L   D4L
        ------------------------------
 0x10 | D1H   D2H   D3H   D4H
        ------------------------------

After the umullb instruction, the data layout changes as below. For convenience, I don't use umlalb here.

  -------------------------------
  D1L * D1H  |  D3L * D3H
  -------------------------------
  D2L * D2H  |  D4L * D4H
  -------------------------------

But we want D3L*D3H to stay at 0x10, not 0x08. So I need more instructions to put them in order, and rev64 can't help with that here.

@easyaspi314
Contributor

easyaspi314 commented Feb 26, 2023

Ah, you are confused because the uzp trick is for two vectors at once. This is for only one.

Come to think of it, this would actually have literally zero benefit over the two-vector approach, aside from a minor data dependency, for things that are going to be executed 4 at a time with identical timings (on the performance cores).

Specifically, instead of uzp1+uzp2 for 2 vectors, it is rev64 for one, and those both are 2 cycles, 4 throughput.

I take back what I said, SVE-128 is garbage, just use NEON 🤪

@hzhuang1
Contributor Author

I tried to mix NEON and SVE2. At first, the compiler reported an error when I converted a NEON data type to an SVE type. Then I switched to assembly and met a strange error: when declaring .arch armv8-a+sve2 in assembly, the compiler reported that the SVE architecture is required. I don't know what's wrong with it. It failed both in QEMU and on the SVE2 hardware provided by Ali cloud.

@hzhuang1
Contributor Author

I plan to simplify the patch set. In the dispatcher, it checks the CPU: if it's SVE512, it goes to the assembly routine; if not, it goes to the NEON routine. Would that be acceptable?

@hzhuang1
Contributor Author

I tried to mix NEON and SVE2. At first, the compiler reported an error when I converted a NEON data type to an SVE type. Then I switched to assembly and met a strange error: when declaring .arch armv8-a+sve2 in assembly, the compiler reported that the SVE architecture is required. I don't know what's wrong with it. It failed both in QEMU and on the SVE2 hardware provided by Ali cloud.

It's fixed. (screenshot attached)

Since it's assembly, I only mix NEON and SVE2. I'll mix Scalar, NEON & SVE2 later.

With the help of SVE2, it saves one instruction and gains performance (red vs. green).

@easyaspi314
Contributor

easyaspi314 commented Feb 28, 2023

That difference might solely be from it being handwritten assembly. However, even if it wasn't, I'd say that even if it is interleaved with scalar it clearly isn't going to do much to be worth an entirely different target from NEON.

I would say that the minimum to warrant something like that would be 20%, and it should be on a newer target that will be worth even more in the future and not an older one (hence why I wouldn't want an entire ARMv7-A NEON inline assembly implementation, even though it benefits performance)

@hzhuang1
Contributor Author

OK. How about the SVE512 assembly code? Could it be accepted?

@easyaspi314
Contributor

easyaspi314 commented Feb 28, 2023

I'd say yes, although I would recommend the following priority:

  1. C intrinsics if possible — The limitation to SVE512 or larger can probably improve performance due to fewer checks and loads
  2. Inline assembly hints/hacks (e.g. compiler guards, single instructions)
  3. Inline assembly blocks
  4. External assembly file

Also, dispatching on POSIX ELF would be trivial:

#include <sys/auxv.h>
#include <arm_sve.h>   /* for svcntd() */

__attribute__((target("+sve")))
static int XXH_isSVE512(void)
{
    return svcntd() >= 8;
}

__attribute__((constructor))
static void XXH_featureTest(void)
{
    if ((getauxval(AT_HWCAP) & HWCAP_SVE) != 0 && XXH_isSVE512()) {
        // SVE 512
    } else {
        // neon
    }
}

@Cyan4973
Owner

I would recommend the following priority:

  • C intrinsics if possible — The limitation to SVE512 or larger can probably improve performance due to fewer checks and loads

Agreed, this first stage looks likely to be uncontroversial.

@Cyan4973
Owner

We haven't looked at this PR in a while.
I'm trying to get back to it during this Summer,
in an attempt to determine which parts can be merged for the next release.

I believe it's fair to say that our current code base is not ready to add an assembly file at this point. Merging xxh_arm64sve.S is probably not going to happen this cycle.

Note that, in the future, we'll try to change the code base to support multiple files, which might open a door for assembly source code again, but that's for later.

As for now, I note that this PR was also introducing other interesting tools and capabilities.
xxh_arm64dispatch.c for example, which is an equivalent of xxh_x86dispatch.c but for aarch64 targets, looks interesting.
More refined arch detection in xsum_arch.h also looks interesting.
There might be more.

Cyan4973 added a commit that referenced this pull request Jul 2, 2023
display was mistakenly mentioning NEON instead.

Codemod inspired by @hzhuang1 's #762.
@Cyan4973
Owner

I was wondering which parts of this PR could be salvaged for the v0.8.2 release, and went for the simpler part, which is an SVE detection prompt for the CLI.

Another interesting technology presented in this PR is the arm64 automatic vector dispatch, which is a nice counterpoint to the x86/x64 one.

However, there is a non-negligible difference :
on x64 cpus, there is an obvious improvement path, from SSE2 to AVX2 and then AVX512. Each iteration is strictly better than the previous one.

For aarch64, it's not so clear. SVE might be better than NEON, but it may also not be.
If my understanding of previous discussions is correct, SVE is expected to beat NEON when it can operate, in hardware, with 512-bit vector lengths (or better). But at 128-bit, it's most likely less efficient than NEON. And 256-bit is not clear-cut either.

This means, the hardware detection is not limited to "just" detecting SVE support or not, but also what's the hardware width of vector units. My understanding is that the currently proposed dispatcher is not able to do that, and it's unclear if detecting this width at runtime is doable.

Anyway, this will require a bit more work in order to be merged.

At this point, I mostly wonder if this PR should remain open, for reference, with the idea that a future aarch64 dispatcher could be created from it, or if it should be closed.


3 participants