NEON ARMv8 SHA3_2x

Update

This package now supports the ARMv8.2-SHA3 instructions. Results improve significantly when the SHA-3 instructions are used.
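This README does not list which instructions the code uses. As background, the ARMv8.2-SHA3 extension provides four vector instructions (EOR3, RAX1, XAR, BCAX). The sketch below models what each one computes on a single 64-bit lane in portable C. The `model_*` names are made up for illustration and are not taken from this repository.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit lane left by n bits (0 <= n < 64). */
static uint64_t rotl64(uint64_t x, unsigned n)
{
    assert(n < 64);
    return n == 0 ? x : (x << n) | (x >> (64 - n));
}

/* EOR3: three-way XOR, the shape of the theta column parity. */
static uint64_t model_eor3(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ b ^ c;
}

/* RAX1: a ^ rotl(b, 1), the shape of the theta D value. */
static uint64_t model_rax1(uint64_t a, uint64_t b)
{
    return a ^ rotl64(b, 1);
}

/* XAR: XOR, then rotate right by imm (0 <= imm < 64).
 * This fuses the theta XOR with the rho rotation. */
static uint64_t model_xar(uint64_t a, uint64_t b, unsigned imm)
{
    assert(imm < 64);
    uint64_t t = a ^ b;
    return imm == 0 ? t : (t >> imm) | (t << (64 - imm));
}

/* BCAX: a ^ (b & ~c), the shape of the chi step. */
static uint64_t model_bcax(uint64_t a, uint64_t b, uint64_t c)
{
    return a ^ (b & ~c);
}
```

Without the extension, each of these takes two or more ordinary NEON instructions, which is one plausible reason the SHA-3 enabled numbers below are so much better.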

Apple M1

SHA-3 Enabled

2022-05-06T06:04:41-04:00
Running ./benchmark
Run on (8 X 24.1207 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 2.07, 1.91, 1.75
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_F1600x2        156 ns          156 ns      4135039
BM_F1600          218 ns          218 ns      3166704

SHA-3 Disabled

2022-05-06T06:09:10-04:00
Running ./benchmark
Run on (8 X 24.0697 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 2.40, 2.16, 1.90
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_F1600x2        355 ns          355 ns      1948260
BM_F1600          217 ns          217 ns      3220716

Even with the SHA-3 instructions disabled, one BM_F1600x2 call (355 ns) is still faster than two Keccak-F1600 calls (2 × 217 ns = 434 ns).
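To make the comparison concrete, here is a small sketch of the arithmetic. The helper name `speedup_vs_two_calls` is hypothetical; the inputs are the benchmark times printed above.

```c
#include <assert.h>

/* Ratio of two single-state calls to one two-state call.
 * A value above 1 means the 2x routine wins. */
static double speedup_vs_two_calls(double x1_ns, double x2_ns)
{
    assert(x1_ns > 0.0 && x2_ns > 0.0);
    return (2.0 * x1_ns) / x2_ns;
}
```

With the numbers above, `speedup_vs_two_calls(218, 156)` is about 2.79 with SHA-3 enabled, and `speedup_vs_two_calls(217, 355)` is about 1.22 with SHA-3 disabled.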

NEON ARMv8 Keccak2x Implementation.

Since there was no 128-bit SIMD (SIMD128) implementation for ARMv8, I decided to write one.

The result is not impressive, for two reasons:

SHA-3 uses simple bitwise operations such as AND, NOT, and XOR. Each of these takes about 1 cycle on the CPU, therefore:

No pipelining gain happens.
There is no significant improvement from a 128-bit SIMD width when the native ARMv8 register width is 64-bit. I suppose the frequency in NEON mode is lower than in scalar mode. (I don’t know the correct term for this; please let me know.)

This code could be faster than this benchmark shows if:

The SIMD register width is wider, e.g. 256 or 512 bits, …
The frequency in NEON mode is greater than 0.5 * (scalar frequency)
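For readers unfamiliar with the two-way layout discussed in this section, here is a minimal portable sketch. It assumes each 128-bit register holds one 64-bit lane from each of two independent Keccak states, and it shows only the chi step on one row. The `v2u64` type and the function names are made up for illustration; the real code uses NEON registers rather than a struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 128-bit NEON register: lane[0] = state A, lane[1] = state B. */
typedef struct { uint64_t lane[2]; } v2u64;

/* Scalar chi on one 5-lane row: row[x] ^= ~row[x+1] & row[x+2]. */
static void chi_row(uint64_t row[5])
{
    uint64_t in[5];
    for (size_t x = 0; x < 5; x++) in[x] = row[x];
    for (size_t x = 0; x < 5; x++)
        row[x] = in[x] ^ (~in[(x + 1) % 5] & in[(x + 2) % 5]);
}

/* The same chi row applied to two interleaved states at once.
 * The inner loop over l is what a single 128-bit instruction does. */
static void chi_row_x2(v2u64 row[5])
{
    v2u64 in[5];
    for (size_t x = 0; x < 5; x++) in[x] = row[x];
    for (size_t x = 0; x < 5; x++)
        for (size_t l = 0; l < 2; l++)
            row[x].lane[l] = in[x].lane[l]
                ^ (~in[(x + 1) % 5].lane[l] & in[(x + 2) % 5].lane[l]);
}

/* Returns 1 when the two-way result equals two separate scalar results. */
static int chi_x2_matches_scalar(const uint64_t a[5], const uint64_t b[5])
{
    assert(a != NULL && b != NULL);
    uint64_t sa[5], sb[5];
    v2u64 v[5];
    for (size_t x = 0; x < 5; x++) {
        sa[x] = a[x];
        sb[x] = b[x];
        v[x].lane[0] = a[x];
        v[x].lane[1] = b[x];
    }
    chi_row(sa);
    chi_row(sb);
    chi_row_x2(v);
    for (size_t x = 0; x < 5; x++)
        if (v[x].lane[0] != sa[x] || v[x].lane[1] != sb[x]) return 0;
    return 1;
}
```

The two-way version issues the same number of operations as one scalar pass, so any gain depends on how the core's 128-bit throughput compares with its 64-bit scalar throughput, which is the concern raised above.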

What is inside this package?

NEON (ASIMD) ARMv8 implementation of:

KeccakP-1600
SHAKE128 : Absorb, Squeeze
SHAKE256 : Absorb, Squeeze
SHA3_256
SHA3_512
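As a reminder of how these functions differ, the sketch below derives the sponge rate of each one from its capacity as defined in FIPS 202. The helper names are hypothetical and are not the names used in fips202.h or fips202x2.h.

```c
#include <assert.h>
#include <stddef.h>

/* Keccak-p[1600] state width in bits. */
#define KECCAK_STATE_BITS 1600u

/* Rate in bytes for a sponge with the given capacity in bits. */
static size_t keccak_rate_bytes(unsigned capacity_bits)
{
    assert(capacity_bits < KECCAK_STATE_BITS && capacity_bits % 8 == 0);
    return (KECCAK_STATE_BITS - capacity_bits) / 8;
}

/* Number of rate-sized blocks needed to cover out_len bytes of output. */
static size_t squeeze_blocks(size_t out_len, size_t rate_bytes)
{
    assert(rate_bytes > 0);
    return (out_len + rate_bytes - 1) / rate_bytes;
}
```

This gives 168 bytes for SHAKE128 (capacity 256), 136 bytes for SHAKE256 and SHA3_256 (capacity 512), and 72 bytes for SHA3_512 (capacity 1024). For example, squeezing 1008 bytes from SHAKE128 covers `squeeze_blocks(1008, 168)`, that is, 6 blocks.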

Result

System Information

Here is my benchmark system, an ARMv8 Raspberry Pi running 64-bit Manjaro:

OS

Distributor ID: Manjaro-ARM
Description:    Manjaro ARM Linux
Release:        20.10

CPU

Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
Vendor ID:                       ARM
Model:                           3
Model name:                      Cortex-A72
Stepping:                        r0p3
CPU max MHz:                     1900.0000
CPU min MHz:                     600.0000
Flags:                           fp asimd evtstrm crc32 cpuid

I overclocked the Raspberry Pi to 1900 MHz. The default CPU frequency is 1500 MHz.

Result

All benchmarks were run via this command:

make all
taskset 0x1 ./benchmark_SHAKE128_256_1000.bin

The taskset command pins the process to a single CPU, which avoids the cost of switching between CPUs.

Table 1. Result

Output Length	Input Length	FIPS202x2 NEON	FIPS202x2 C
42	672	487	514
294	336	390	413
1008	42	586	606
2772	1008	2228	2287
3318	504	2230	2286
4074	1008	3004	3099

The results above are from 1,000 iterations, as set by #define TESTS 1000.

You can view the full results, for 1,000 or 1,000,000 iterations, in data/

Graph

If the contents of data/ are confusing, here are some graphs:

The orange line is the difference between the C reference code and the NEON implementation (C_ref - NEON).
The green line is the average of 24 samples of C_ref - NEON.

Note that in some cases the C reference is faster than NEON. For small output lengths, NEON is faster than the C reference by about 5%.
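The "about 5%" figure can be checked against Table 1 with a small sketch like the one below. The helper name `neon_gain_percent` is hypothetical, and the inputs are in whatever unit the table uses.

```c
#include <assert.h>

/* Relative gain of NEON over the C reference, as a percentage of the
 * C figure. A positive value means NEON is faster. */
static double neon_gain_percent(double c_ref, double neon)
{
    assert(c_ref > 0.0);
    return 100.0 * (c_ref - neon) / c_ref;
}
```

For the first row of Table 1, `neon_gain_percent(514, 487)` is about 5.3. For the last row, `neon_gain_percent(3099, 3004)` is about 3.1, so the gain shrinks as the output length grows.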

Conclusion

One call to the Keccak2x NEON version is always faster than two calls to the Keccak C version. See the bench() function.

If you only call Keccak once, use the C version; it’s faster.
If you call Keccak multiple times, use the NEON version; it saves some time.


cothan/NEON-SHA3_2x
