This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Add Quasi-Fused Multiply-Add/Subtract instructions #79

Open · wants to merge 1 commit into base: main

Conversation

@Maratyszcza
Contributor

Maratyszcza commented May 29, 2019

Introduction

Most modern processors support the Fused Multiply-Add (FMA) family of instructions. These instructions compute a multiplication and an addition of floating-point numbers without intermediate rounding. Eliminating the intermediate rounding improves accuracy in most numerical algorithms, and FMA instructions also provide performance improvements on all processors that support them:

  • Intel Haswell processors achieve 32 FLOPs/cycle with FMA, but only 16 FLOPs/cycle with separate multiply+add instructions.[1]
  • AMD Piledriver and Zen processors deliver 16 FLOPs/cycle with FMA, but only 8 FLOPs/cycle with separate multiply+add instructions.[1]
  • ARM Cortex-A55 and Cortex-A75 cores achieve 8 FLOPs/cycle with FMA, but only 4 FLOPs/cycle with separate multiply+add instructions.[2,3]

However, on processors which do not support FMA instructions, it is expensive to emulate these operations exactly, without intermediate rounding.

This PR introduces the Quasi-Fused Multiply-Add (QFMA) instruction, which provides the performance and accuracy benefits of FMA where supported, while preserving compatibility with older processors. QFMA represents a + b * c with optional rounding after the multiplication. WebAssembly implementations are required to be consistent: within a module they must either always generate an FMA instruction or always generate a multiplication+addition pair for QFMA. QFMA is accompanied by the Quasi-Fused Multiply-Subtract (QFMS) instruction, which represents a - b * c with the same optional rounding after the multiplication.
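As an illustration, the two behaviors a conforming implementation may choose between look as follows in scalar C (a minimal sketch: the real instructions operate on 128-bit SIMD lanes, and the fused/unfused choice is fixed for all QFMA/QFMS instructions in a module rather than made per call site):

```c
#include <math.h>

/* Fused flavor: the product is not rounded before the addition
   (what a hardware FMA computes). */
static float qfma_fused(float a, float b, float c) {
    return fmaf(b, c, a);   /* a + b * c with a single rounding */
}

/* Unfused flavor: the product is rounded, then added (what a multiply + add
   pair computes); assumes the compiler does not itself contract the expression. */
static float qfma_unfused(float a, float b, float c) {
    return a + b * c;       /* two roundings */
}

/* QFMS represents a - b * c with the same optional rounding of the product. */
static float qfms_fused(float a, float b, float c) {
    return fmaf(-b, c, a);
}
```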

Performance Impact

Fused Multiply-Add instructions improve performance in two ways:

  • They double floating-point throughput on most processors, occupying the floating-point SIMD units for only one cycle, compared to two cycles for a separate multiplication + addition pair.
  • They reduce register pressure when the QFMA(a, b, c) result overwrites operand a. This situation is very typical in dense linear algebra and neural network computations; without FMA the implementation has to allocate a temporary register t for the result of the multiplication (t <- b * c ; a <- a + t), as sketched below.
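A scalar sketch of the second point (fmaf stands in for one lane of qfma; the real benefit shows up at the SIMD register-allocation level):

```c
#include <stddef.h>
#include <math.h>

/* Dot product with the accumulator updated in place, as QFMA allows. */
static float dot_fma(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc = fmaf(x[i], y[i], acc);   /* acc <- acc + x[i] * y[i], no temporary */
    }
    return acc;
}

/* The same loop without FMA: the product needs its own temporary t
   before it can be added to the accumulator (t <- b * c; a <- a + t). */
static float dot_mul_add(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float t = x[i] * y[i];
        acc = acc + t;
    }
    return acc;
}
```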

Evaluation on native ARM64 code

To estimate the speedup from FMA in practice, I replaced the NEON FMA intrinsics (vfmaq_f32 and vfmaq_lane_f32) in an ARM64 implementation of neural network inference with non-fused NEON intrinsics (vmlaq_f32 and vmlaq_lane_f32). Both versions were compiled with Clang to native (i.e. not WebAssembly) binaries for the ARM64 architecture and evaluated in single-threaded mode. Speedups of the FMA-based version over the version with separate multiplication + addition are presented in the table below:

Mobile phone        Xiaomi Mi A2 Lite   Pixel 2 XL   LG G8 ThinQ   Galaxy S8 (Exynos)
Processor core      Cortex-A53          Cortex-A73   Cortex-A76    Exynos-M2
MobileNet v2 [5]    2.05                2.11         2.14          1.79
Face Mesh [7]       1.83                1.67         1.75          1.63
Segmentation [8]    1.54                1.56         1.93          1.43

Across 3 neural network architectures and 4 mobile devices, the minimum speedup is 1.4X, and the speedup on the most compute-intensive neural network (MobileNet v2) exceeds 2X. I suggest that an improvement this big justifies extending the WebAssembly SIMD spec with 4 new instructions.
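For reference, the intrinsic substitution described above is of roughly this form (a minimal sketch; the real inference micro-kernels are considerably more involved):

```c
#include <arm_neon.h>

/* One accumulation step: acc += b * c on four float32 lanes.
   USE_FMA selects between the fused and the non-fused NEON intrinsic. */
static inline float32x4_t accumulate(float32x4_t acc, float32x4_t b, float32x4_t c) {
#if defined(USE_FMA)
    return vfmaq_f32(acc, b, c);   /* fused multiply-add, single rounding */
#else
    return vmlaq_f32(acc, b, c);   /* non-fused: lowered to a separate multiply and add */
#endif
}
```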

[October 3 update] Evaluation of QFMA prototype in V8

@tlively implemented experimental support for the QFMA instructions in LLVM & Clang (commit) and Binaryen (PR), and @ngzhian implemented QFMA lowering in V8 for the x86-64 (commit) and ARM64 (commit) architectures. Due to the experimental nature of the prototype toolchain, it is conservative in leveraging QFMA instructions and generates them only through an explicit intrinsic.
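In the ported source this looks roughly as follows (a sketch only: the builtin name and signature below are assumptions for illustration and may not match the prototype toolchain exactly):

```c
#include <wasm_simd128.h>

/* One accumulation step of a ported micro-kernel: acc += b * c.
   __builtin_wasm_qfma_f32x4 is an assumed spelling of the prototype's
   explicit QFMA intrinsic; without QFMA the code falls back to mul + add. */
static inline v128_t accumulate(v128_t acc, v128_t b, v128_t c) {
#if defined(USE_QFMA)
    return __builtin_wasm_qfma_f32x4(acc, b, c);
#else
    return wasm_f32x4_add(acc, wasm_f32x4_mul(b, c));
#endif
}
```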

I ported the most critical micro-kernels in the XNNPACK neural network operator library to use QFMA and evaluated its performance on 9 neural network models. The table below presents the results:

Host                 Xeon W-2135   Xiaomi Mi A2 Lite   Pixel 2 XL   LG G8 ThinQ   Galaxy S8   Galaxy S10
Processor core       Skylake       Cortex-A53          Cortex-A73   Cortex-A76    Exynos-M2   Exynos-M4
MobileNet v1 [4]     43%           17%                 36%          48%           27%         44%
MobileNet v2 [5]     38%           14%                 27%          36%           21%         35%
SSDLite [5]          36%           13%                 28%          36%           21%         45%
SqueezeNet [6]       50%           28%                 40%          41%           29%         50%
Face Mesh [7]        26%           15%                 21%          29%           18%         24%
Segmentation [8]     29%           16%                 21%          34%           18%         25%
BlazeFace [9]        27%           19%                 19%          18%           9%          15%
Hand Mesh [10]       30%           9%                  25%          38%           19%         34%
Hand Detector [10]   26%           10%                 29%          36%           16%         32%
Geomean              33%           15%                 26%          34%           19%         32%

While the speedup from QFMA in the prototype WebAssembly implementation is smaller than in native code, QFMA improved performance on all 6 evaluated devices, and on modern CPU microarchitectures (Intel Skylake, ARM Cortex-A76, Samsung Exynos-M4) it improves performance by one third on average.

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations are not required to follow the same code-generation patterns.

x86/x86-64 processors with FMA3 (but no FMA4) instruction set

These processors include Intel Haswell (and later) and AMD Zen (and later).

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f32x4.qfma(a, b, c) is lowered to one of six options:
      1. VMOVUPS xmm_d, xmm_a + VFMADD231PS xmm_d, xmm_b, xmm_c (a and c can be in-memory)
      2. VMOVUPS xmm_d, xmm_a + VFMADD231PS xmm_d, xmm_c, xmm_b (a and b can be in-memory)
      3. VMOVUPS xmm_d, xmm_b + VFMADD132PS xmm_d, xmm_a, xmm_c (b and c can be in-memory)
      4. VMOVUPS xmm_d, xmm_b + VFMADD213PS xmm_d, xmm_c, xmm_a (b and a can be in-memory)
      5. VMOVUPS xmm_d, xmm_c + VFMADD132PS xmm_d, xmm_a, xmm_b (c and b can be in-memory)
      6. VMOVUPS xmm_d, xmm_c + VFMADD213PS xmm_d, xmm_b, xmm_a (c and a can be in-memory)
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFNMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFNMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFNMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f32x4.qfms(a, b, c) is lowered to one of six options:
      1. VMOVUPS xmm_d, xmm_a + VFNMADD231PS xmm_d, xmm_b, xmm_c (a and c can be in-memory)
      2. VMOVUPS xmm_d, xmm_a + VFNMADD231PS xmm_d, xmm_c, xmm_b (a and b can be in-memory)
      3. VMOVUPS xmm_d, xmm_b + VFNMADD132PS xmm_d, xmm_a, xmm_c (b and c can be in-memory)
      4. VMOVUPS xmm_d, xmm_b + VFNMADD213PS xmm_d, xmm_c, xmm_a (b and a can be in-memory)
      5. VMOVUPS xmm_d, xmm_c + VFNMADD132PS xmm_d, xmm_a, xmm_b (c and b can be in-memory)
      6. VMOVUPS xmm_d, xmm_c + VFNMADD213PS xmm_d, xmm_b, xmm_a (c and a can be in-memory)
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f64x2.qfma(a, b, c) is lowered to one of six options:
      1. VMOVUPD xmm_d, xmm_a + VFMADD231PD xmm_d, xmm_b, xmm_c (a and c can be in-memory)
      2. VMOVUPD xmm_d, xmm_a + VFMADD231PD xmm_d, xmm_c, xmm_b (a and b can be in-memory)
      3. VMOVUPD xmm_d, xmm_b + VFMADD132PD xmm_d, xmm_a, xmm_c (b and c can be in-memory)
      4. VMOVUPD xmm_d, xmm_b + VFMADD213PD xmm_d, xmm_c, xmm_a (b and a can be in-memory)
      5. VMOVUPD xmm_d, xmm_c + VFMADD132PD xmm_d, xmm_a, xmm_b (c and b can be in-memory)
      6. VMOVUPD xmm_d, xmm_c + VFMADD213PD xmm_d, xmm_b, xmm_a (c and a can be in-memory)
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFNMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFNMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFNMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f64x2.qfms(a, b, c) is lowered to one of six options:
      1. VMOVUPD xmm_d, xmm_a + VFNMADD231PD xmm_d, xmm_b, xmm_c (a and c can be in-memory)
      2. VMOVUPD xmm_d, xmm_a + VFNMADD231PD xmm_d, xmm_c, xmm_b (a and b can be in-memory)
      3. VMOVUPD xmm_d, xmm_b + VFNMADD132PD xmm_d, xmm_a, xmm_c (b and c can be in-memory)
      4. VMOVUPD xmm_d, xmm_b + VFNMADD213PD xmm_d, xmm_c, xmm_a (b and a can be in-memory)
      5. VMOVUPD xmm_d, xmm_c + VFNMADD132PD xmm_d, xmm_a, xmm_b (c and b can be in-memory)
      6. VMOVUPD xmm_d, xmm_c + VFNMADD213PD xmm_d, xmm_b, xmm_a (c and a can be in-memory)

x86/x86-64 processors with FMA3 and FMA4 instruction sets

These processors include AMD Piledriver, AMD Steamroller, and AMD Excavator, but not AMD Zen.

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
      2. VFMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFNMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFNMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFNMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
      2. VFNMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
      2. VFMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFNMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFNMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFNMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
      2. VFNMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)

x86/x86-64 processors with FMA4 (but no FMA3) instruction set

AMD Bulldozer is the only family of such processors.

  • d = f32x4.qfma(a, b, c) is lowered to one of two options:
    1. VFMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
    2. VFMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • d = f32x4.qfms(a, b, c) is lowered to one of two options:
    1. VFNMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
    2. VFNMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • d = f64x2.qfma(a, b, c) is lowered to one of two options:
    1. VFMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
    2. VFMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • d = f64x2.qfms(a, b, c) is lowered to one of two options:
    1. VFNMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
    2. VFNMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)

ARM64 processors

All ARM64 application processors support SIMD with FMA.

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to FMLA Va.4S, Vb.4S, Vc.4S
    • d = f32x4.qfma(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLA Va.4S, Vb.4S, Vc.4S
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to FMLS Va.4S, Vb.4S, Vc.4S
    • d = f32x4.qfms(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLS Va.4S, Vb.4S, Vc.4S
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to FMLA Va.2D, Vb.2D, Vc.2D
    • d = f64x2.qfma(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLA Va.2D, Vb.2D, Vc.2D
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to FMLS Va.2D, Vb.2D, Vc.2D
    • d = f64x2.qfms(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLS Va.2D, Vb.2D, Vc.2D

ARMv7 processors with NEONv2 (NEON-FMA) instruction set

Most 32-bit ARM application processors support SIMD (NEON) with FMA instructions.

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to VFMA.F32 q_a, q_b, q_c
    • d = f32x4.qfma(a, b, c) is lowered to VMOV q_d, q_a + VFMA.F32 q_d, q_b, q_c
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to VFMS.F32 q_a, q_b, q_c
    • d = f32x4.qfms(a, b, c) is lowered to VMOV q_d, q_a + VFMS.F32 q_d, q_b, q_c
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to VFMA.F64 d_a0, d_b0, d_c0 + VFMA.F64 d_a1, d_b1, d_c1
    • d = f64x2.qfma(a, b, c) is lowered to VMOV q_d, q_a + VFMA.F64 d_d0, d_b0, d_c0 + VFMA.F64 d_d1, d_b1, d_c1
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to VFMS.F64 d_a0, d_b0, d_c0 + VFMS.F64 d_a1, d_b1, d_c1
    • d = f64x2.qfms(a, b, c) is lowered to VMOV q_d, q_a + VFMS.F64 d_d0, d_b0, d_c0 + VFMS.F64 d_d1, d_b1, d_c1

ARMv7 processors with NEON (but without FMA) instruction set

ARM Cortex-A8, ARM Cortex-A9, and Qualcomm Scorpion are the only significant cores that support SIMD (NEON) but not the FMA extension.

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to VMLA.F32 q_a, q_b, q_c (note: multiply-add with intermediate rounding)
    • d = f32x4.qfma(a, b, c) is lowered to VMUL.F32 q_d, q_b, q_c + VADD.F32 q_d, q_a, q_d
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to VMLS.F32 q_a, q_b, q_c (note: multiply-subtract with intermediate rounding)
    • d = f32x4.qfms(a, b, c) is lowered to VMUL.F32 q_d, q_b, q_c + VSUB.F32 q_d, q_a, q_d
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to VMLA.F64 d_a0, d_b0, d_c0 + VMLA.F64 d_a1, d_b1, d_c1 (note: multiply-add with intermediate rounding)
    • d = f64x2.qfma(a, b, c) is lowered to VMUL.F64 d_d0, d_b0, d_c0 + VMUL.F64 d_d1, d_b1, d_c1 + VADD.F64 d_d0, d_a0, d_d0 + VADD.F64 d_d1, d_a1, d_d1
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to VMLS.F64 d_a0, d_b0, d_c0 + VMLS.F64 d_a1, d_b1, d_c1 (note: multiply-subtract with intermediate rounding)
    • d = f64x2.qfms(a, b, c) is lowered to VMUL.F64 d_d0, d_b0, d_c0 + VMUL.F64 d_d1, d_b1, d_c1 + VSUB.F64 d_d0, d_a0, d_d0 + VSUB.F64 d_d1, d_a1, d_d1

POWER processors with VSX instruction set

IBM POWER processors starting with POWER7

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to XVMADDASP x_a, x_b, x_c
    • b = f32x4.qfma(a, b, c) is lowered to XVMADDMSP x_b, x_c, x_a
    • c = f32x4.qfma(a, b, c) is lowered to XVMADDMSP x_c, x_b, x_a
    • d = f32x4.qfma(a, b, c) is lowered to VMR x_d, x_a + XVMADDASP x_d, x_b, x_c
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to XVNMSUBASP x_a, x_b, x_c
    • b = f32x4.qfms(a, b, c) is lowered to XVNMSUBMSP x_b, x_c, x_a
    • c = f32x4.qfms(a, b, c) is lowered to XVNMSUBMSP x_c, x_b, x_a
    • d = f32x4.qfms(a, b, c) is lowered to VMR x_d, x_a + XVNMSUBASP x_d, x_b, x_c
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to XVMADDADP x_a, x_b, x_c
    • b = f64x2.qfma(a, b, c) is lowered to XVMADDMDP x_b, x_c, x_a
    • c = f64x2.qfma(a, b, c) is lowered to XVMADDMDP x_c, x_b, x_a
    • d = f64x2.qfma(a, b, c) is lowered to VMR x_d, x_a + XVMADDADP x_d, x_b, x_c
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to XVNMSUBADP x_a, x_b, x_c
    • b = f64x2.qfms(a, b, c) is lowered to XVNMSUBMDP x_b, x_c, x_a
    • c = f64x2.qfms(a, b, c) is lowered to XVNMSUBMDP x_c, x_b, x_a
    • d = f64x2.qfms(a, b, c) is lowered to VMR x_d, x_a + XVNMSUBADP x_d, x_b, x_c

Other processors and instruction sets

  • f32x4.qfma
    • d = f32x4.qfma(a, b, c) is lowered like d = f32x4.add(a, f32x4.mul(b, c))
  • f32x4.qfms
    • d = f32x4.qfms(a, b, c) is lowered like d = f32x4.sub(a, f32x4.mul(b, c))
  • f64x2.qfma
    • d = f64x2.qfma(a, b, c) is lowered like d = f64x2.add(a, f64x2.mul(b, c))
  • f64x2.qfms
    • d = f64x2.qfms(a, b, c) is lowered like d = f64x2.sub(a, f64x2.mul(b, c))

References

[1] Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, by Agner Fog
[2] ARM Cortex-A55 Software Optimization Guide
[3] ARM Cortex-A75 Software Optimization Guide
[4] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
[5] MobileNetV2: Inverted Residuals and Linear Bottlenecks
[6] SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
[7] Real-Time AR Self-Expression with Machine Learning
[8] Mobile Real-time Video Segmentation
[9] BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs
[10] On-Device, Real-Time Hand Tracking with MediaPipe

@gnzlbg
Contributor

gnzlbg commented May 29, 2019

Is this equivalent to LLVM's fmul_add? If not, could you expand on how they differ?

@sunfishcode
Member

Thanks for the thorough analysis! I agree that FMA is desirable, and it's always been about "how" and "when" rather than "if".

In light of the decision to remove the subnormal nondeterminism, effectively declaring 32-bit ARM NEON implementations non-conforming, another option here would be to just add a regular FMA instruction, without the nondeterminism. This has portability downsides, but on the other hand it would address some use cases not covered by qfma, because some algorithms depend on the fused behavior of FMA. And in the long term, if we believe new hardware platforms will generally support FMA, the platform would benefit from greater simplicity and determinism.

A variant would be to standardize simd128 without FMA, and then add FMA as a separate feature, which implementations could decide to implement, and toolchains could decide to depend on, separately from the base simd128. In the long term, perhaps all implementations would support FMA, while in the short term this might allow for a transition period.

Is this equivalent to LLVM's fmul_add? If not, could you expand on how they differ?

LLVM's fmul_add leaves it entirely nondeterministic whether the operation is fused or not. The description of qfma above says "WebAssembly implementations are required to be consistent, and either always generate FMA instruction, or always generate multiplication+addition pair for a QFMA instruction within a module." This addresses an important application constraint, but also raises some interesting questions.

  • It would no longer be semantics-preserving for a tool to naively split parts of a module into separate modules, because a module may be depending on the guarantee that FMA nondeterminism is consistent within a module.
  • Toolchains which support inlining across module boundaries might still hit discontinuity problems if a user function is inlined into multiple modules, and then at runtime there are differences in rounding between modules.

@gnzlbg
Contributor

gnzlbg commented May 29, 2019

LLVM's fmul_add leaves it entirely nondeterministic whether the operation is fused or not. The description of qfma above says "WebAssembly implementations are required to be consistent, and either always generate FMA instruction, or always generate multiplication+addition pair for a QFMA instruction within a module."

I thought that LLVM's fmul_add generates a fusedMultiplyAdd if the target has an instruction for this and if this instruction is faster than a multiply followed by an add. That is, all invocations of fmul_add for a particular target do exactly the same thing, as opposed to deciding what is more efficient on each invocation. I might have misunderstood this.

A variant would be to standardize simd128 without FMA, and then add FMA as a separate feature, which implementations could decide to implement, and toolchains could decide to depend on, separately from the base simd128. In the long term, perhaps all implementations would support FMA, while in the short term this might allow for a transition period.

Note that under this proposal a module that requires FMA would fail to validate on an implementation that does not support FMA, so such a transition period might result in users having to ship multiple modules (with and without FMA). Something like #80 would help here.

@penzn mentioned this pull request May 29, 2019
@Maratyszcza
Contributor Author

@gnzlbg qfma is similar to fmul_add. But IIUC, for fmul_add it would be permissible to lower to FMA in one situation (e.g. when the result overwrites the accumulator), but to multiplication+addition in another (e.g. when the result doesn't overwrite any inputs). For qfma both cases would have to lower to the same kind of instructions (either both to FMA, or both to multiply + add).

@Maratyszcza
Contributor Author

another option here would be to just add a regular FMA instruction, without the nondeterminism.

There are plenty of new x86 CPUs which don't support the FMA operation. First, the low-power series from Intel (Goldmont/Goldmont Plus microarchitectures) and AMD (Jaguar microarchitecture) lack FMA instructions. Secondly, Intel has a long history of stripping any post-SSE4 ISA extension from the low-cost versions (i.e. Celeron & Pentium) of its mainstream architectures. E.g. the Pentium Gold G5600 is based on the latest Coffee Lake microarchitecture, but doesn't support the AVX and FMA instruction sets.

LLVM's fmul_add leaves it entirely nondeterministic whether the operation is fused or not. The description of qfma above says "WebAssembly implementations are required to be consistent, and either always generate FMA instruction, or always generate multiplication+addition pair for a QFMA instruction within a module."

I expect that the decision whether to use FMA would be based only on hardware characteristics, i.e. whether the processor supports FMA and whether FMA is fast on that processor. Unlike LLVM's fmul_add, a Wasm code generator can't choose the FMA or mul+add form depending on, e.g., which input operand gets overwritten by the qfma result. Then it is valid both to split a Wasm module and to do inlining across modules.
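A sketch of that intended engine behavior (the helper and flag names are hypothetical; the point is only that the choice is made once from hardware capabilities and then applied uniformly to every qfma/qfms site):

```c
#include <stdbool.h>

/* Hypothetical capability query provided by the engine. */
extern bool cpu_has_fast_fma(void);

/* The fused/unfused decision is made once... */
static bool lower_qfma_fused;

void configure_qfma_lowering(void) {
    lower_qfma_fused = cpu_has_fast_fma();
}

/* ...and every qfma/qfms lowering consults the same flag. Per-site heuristics
   (e.g. which operand the result overwrites) are not allowed, so splitting a
   module or inlining across modules stays semantics-preserving. */
```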

@Maratyszcza
Contributor Author

Maratyszcza commented May 30, 2019

It seems that I misused the concept of "WebAssembly module". By "WebAssembly module" I meant everything in a ".wasm" or ".wast" file, but "WebAssembly module" seems to have a different meaning in the Wasm spec. What would be the right wording here?

@dtig
Member

dtig commented Jul 18, 2019

These are sometimes used interchangeably, but a WebAssembly module is the result of compiling Wasm bytecode. Depending on the context you may be looking for different words - for everything in a .wasm file, it could just be the Wasm binary, but in the cases where there are references to failing to validate/compile, module would still be the right wording IMO.

Thanks for the thorough analysis! Having thought about this a little more, I have a few concerns about the inclusion of this in the SIMD MVP. Firstly, the performance inconsistencies between FMA and non-FMA hardware for this set of operations would be hard to miss. We have usually tried to stick to instructions that are available across the board, setting SSE4.1 and NEON as thresholds to guarantee performance across a large cross-section of devices. Introducing FMA makes this predictable performance across devices somewhat nebulous.

Secondly, there is a compromise to be made here with the inclusion of QFMA: it's not strictly FMA, and including plain FMA without the non-determinism hurts portability. It is true that this is already the case due to the removal of subnormal determinism, but I'm more comfortable with that because it is consistent with the existing Wasm spec and the number of devices it would affect is small and shrinking, whereas the FMA non-determinism has a larger surface area.

I'm leaning towards having FMA without the non-determinism, but punting this to Post-MVP.

That said, we have an open V8 bug to prototype/experiment with variants of this and their polyfills on non-FMA hardware. I'm labeling this with "pending prototype data" to follow up after we have a prototype.

@Maratyszcza
Contributor Author

Maratyszcza commented Oct 4, 2019

Thanks to the great work of @tlively, who added QFMA to LLVM, Clang, and Binaryen, and @ngzhian, who implemented QFMA in V8, it is now possible to try these instructions on real workloads. I evaluated a prototype QFMA-enabled implementation of neural network inference and updated the top-level post with the new results.

@zeux
Contributor

zeux commented Dec 6, 2019

FWIW from my point of view, this is a good addition.

In the native code [I am used to], non-determinism between different compilers / architectures is a given - different optimizations / lowering / precision of different operations are just a fact of life. Within these constraints, automatic FMA substitution (-ffast-math -mfma or equivalent) usually gives appreciable performance gains in the 10-20% range on real floating-point-heavy code. In the native world, it's not always easy to exercise this because you don't know whether FMA is supported on the target platform, and compiling multiple versions usually doesn't work out of the box. But in WebAssembly, thanks to JIT compilation, something like qfma can work well.

So in practice:

  • Without -ffast-math, I wouldn't expect LLVM to use qfma outside of explicit intrinsic calls, because it doesn't have a consistently defined behavior. So we retain de-facto determinism.
  • With -ffast-math, there are going to be noticeable performance gains on a lot of real-world floating-point-heavy benchmarks. With -ffast-math the author of the code is explicitly saying "I am willing to tolerate non-determinism".
  • Even without -ffast-math, this is a new, separate intrinsic call - in areas where performance is vital, an application could use it consciously.

(I understand that this runs contrary to the desire to get identical results between different platforms... but the reality for SIMD is that for floating-point math, often performance trumps determinism in my experience. Of course in theory we could include instructions like qfma and estimated rsqrt in a separate proposal post-MVP, but this gets hard to deal with in practice when the actual support for different instruction sets between implementations varies)

@nfrechette

My experience runs contrary to yours, @zeux . In my tests, FMA instructions are always slower under x64. I did notice fast-math automatically generating them and it is one of the many reasons why I hate fast-math. FMA instructions are designed for throughput, not latency. With modern processors easily being able to dual issue mul/add pairs, FMA offers little benefit aside from reducing the code size and potentially using fewer registers (although in practice I haven't observed this either). Here are some benchmarks I ran with it in my own math lib here. I had high hopes for it and judging from the replies I got on twitter, FMA is underwhelming. It seems to me that it would perform unusually well in a setup like neural network evaluation because it is most likely throughput bound.

Fast-math also doesn't give a gain anywhere near what you claim in the workloads I have seen, not for a long time. This might have been true at some point in the past but not with VS2017 and later generations. I disabled fast-math in my animation compression code because determinism is more important than whatever gain might arise. I saw no impact on the performance after turning it off. 3dsmax 2019 also disabled fast-math when they switched to VS2017. While some workloads saw a minor impact, overall we only saw benefits from turning it off.

Determinism matters more than performance unless you really know what you are doing, and I would argue that if you do, then fast-math offers little benefit because you can do what the compiler does by hand. Fast-math means that from compiler version to version, behavior can change, sometimes drastically, as we've seen with VS2017. It introduced sin/cos pair calculation through SSE float32 arithmetic while prior compilers calculated both separately with float64 arithmetic (the VS stdlib does this for transcendental functions). This leads to a dramatic reduction in accuracy and can lead to visual artifacts and other undesired behavior. Visual Studio defaults to precise math and that is a sane default IMO. I hope V8 & co. use a similar default.

@zeux
Contributor

zeux commented Dec 6, 2019

In my tests, FMA instructions are always slower under x64.

This obviously depends on the workload. In (some) floating point heavy code that I see, FMA results in performance gains.

FMA instructions are designed for throughput, not latency.

On modern Intel processors the latency is the same as that of multiplication, no? So you're making the latency strictly shorter by (potentially) reducing the critical path, or reducing the port pressure, allowing for more co-issue opportunities. On Skylake, both multiplication and FMA have 0.5 cycles reciprocal throughput and 4 cycles latency. So on computations that have a lot of opportunity for fusing, you're strictly winning - you're not going to lose. Of course this can vary with architecture.

Fast-math also doesn't give a gain anywhere near what you claim in the workloads I have seen, not for a long time.

I'm specifically referring to floating-point-heavy, matrix-like code that's dominated by multiplies and adds, compiled with clang. Please refer to the numbers posted by @Maratyszcza for even larger gains; the gains I'm used to are more moderate. I'm not sure to what extent it's valid to compare this on Visual Studio.

Visual studio defaults to precise math and that is a sane default IMO. I hope V8 & cie use a similar default.

The default is always precise math, so this seems orthogonal? My point is precisely that if we do have qfma support, the developer of the code is in control - they can enable the use of the fused instruction for extra performance gains, or keep it off. If we don't have qfma support, we don't have this optimization opportunity.

@zeux
Contributor

zeux commented Dec 6, 2019

Here's a motivating example from my experiments (caveats apply, ymmv, etc etc.):

This is an implementation of a slerp optimization from https://zeux.io/2015/07/23/approximating-slerp/ + https://zeux.io/2016/05/05/optimizing-slerp/ for fitted nlerp (the middle ground between nlerp and the much more precise version, onlerp):

Baseline: https://gcc.godbolt.org/z/w-j8QW, 64.97 cycles per loop iteration
fast-math: https://gcc.godbolt.org/z/hVsk2B, 62.96 cycles per loop iteration
fast-math + avx2: https://gcc.godbolt.org/z/7mvWPe, 55.82 cycles per loop iteration
fast-math + fma: https://gcc.godbolt.org/z/XRKt4C, 40.79 cycles per loop iteration

I would expect that the results produced with llvm-mca closely match the actual timing results of a stream transform using this function, e.g. "given two streams of quaternions, interpolate between them".

@lemaitre

lemaitre commented Dec 6, 2019

Baseline: https://gcc.godbolt.org/z/w-j8QW, 64.97 cycles per loop iteration
fast-math: https://gcc.godbolt.org/z/hVsk2B, 62.96 cycles per loop iteration
fast-math + fma: https://gcc.godbolt.org/z/XRKt4C, 40.79 cycles per loop iteration

Beware that your baseline is in SSE while fast-math+fma is in AVX2.
Even if you write SSE intrinsics, the compiler will generate 128-bit AVX2 instructions if you specify that the target supports AVX2.
If you add -mavx2 to your baseline, you will see that no FMA is used, but all instructions use VEX encoding.
The effect here is to drop the time from 64.97 to 57.82 cycles per loop iteration (presumably because of the 3-operand instructions and the broadcast instruction).

This does not change the overall conclusion, though.

As a side note, since version 5, GCC automatically fuses MULs and ADDs into FMAs by default at -O2 without -ffast-math and even for intrinsics code.

@lemaitre

lemaitre commented Dec 6, 2019

@nfrechette Using FMAs cannot be slower on recent hardware, because the FMA instruction has the exact same latency and throughput as the multiplication (latency 4c, throughput 2/c on Skylake).
Basically, the addition is for free.

-ffast-math can sometimes be slower than without, but this is always a compiler bug (it should not do the transformation if it is actually slower).
But FMA is only a tiny part of -ffast-math.
I am pretty sure your slowdown was never about FMAs, but about some other "optimizations" from fast-math.

Also, you mention MSVC, but beware that MSVC lags behind all its competitors when it comes to speed.
If I take the simple code from @zeux and try to compile it with both clang and MSVC (https://gcc.godbolt.org/z/KarV3c), you will see that MSVC does not use any FMA at all (in this example) and is not able to keep all variables in registers (spill code at the beginning).

@zeux
Contributor

zeux commented Dec 6, 2019

Beware that your baseline is in SSE while fast-math+fma is in AVX2.

Thanks! Good catch, I forgot about this. I've updated the post to include fast-math avx2 (55.82 cycles). As you say it doesn't change the overall conclusion much, FMA vs AVX2 here is 55.82 / 40.79 = 1.36x speedup.

@dsmilkov

dsmilkov commented Dec 6, 2019

Thank you for this proposal! We just released an alpha version of the TensorFlow.js WASM backend, which was one of our most requested features. Benchmarks show 30-50% speedup with QFMA on various ML models, on top of existing SIMD. Adding these instructions would greatly benefit machine learning and numerical computation libraries in general.

@nfrechette

nfrechette commented Dec 7, 2019 via email

@zeux
Contributor

zeux commented Dec 7, 2019

With fmadd having a latency of 5 on older CPUs, you still save latency, because the mul+add pair has a combined latency of 6 cycles due to the dependency. You don't save as much as you do on Skylake, of course.

If I take the benchmark I posted earlier and ask llvm-mca to generate timings for Broadwell, I get 48.51 cycles for AVX2 fast-math version and 42.29 cycles for FMA fast-math version, which is still 15% faster.

The beauty of this proposal is that even if somehow there are CPUs that execute FMA slower than mul+add pairs (although I still don't see how / when this would be possible?), the code generation engine can disable it. Clearly there are demonstrable, noticeable wins on multiple workloads on modern Intel chips; it seems like we would not have much to discuss if the faster execution didn't come at the price of having different results on different CPUs.

FWIW curiously gcc fuses mul+add into fmadd by default at -O2 when targeting ISAs with FMA at least for x64, without requiring -ffast-math at all.

@munrocket

Am I right that QFMA cares only about performance and is not correctly rounded with respect to the true FMA operation in IEEE 754-2008? There is another way to polyfill FMA with respect to correctness. It is much harder than simply RN(a + RN(b * c)), but it is useful for applications with arbitrary-precision arithmetic.
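For context on why a correctly rounded software FMA is much harder than RN(a + RN(b * c)): it is built from error-free transformations such as Veltkamp splitting, Dekker's product and Knuth's two-sum. The sketch below shows only these building blocks (double precision, round-to-nearest, no overflow/underflow), not the full correctly rounded recombination from Boldo and Muller's work:

```c
/* Veltkamp splitting: a == hi + lo, with each part narrow enough
   that products of parts are exact. */
static void split(double a, double *hi, double *lo) {
    double c = 134217729.0 * a;   /* 2^27 + 1 */
    *hi = c - (c - a);
    *lo = a - *hi;
}

/* Dekker's product: x = RN(a*b) and y such that a*b == x + y exactly. */
static void two_product(double a, double b, double *x, double *y) {
    double ah, al, bh, bl;
    *x = a * b;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    *y = ((ah * bh - *x) + ah * bl + al * bh) + al * bl;
}

/* Knuth's two-sum: s = RN(a+b) and e such that a+b == s + e exactly. */
static void two_sum(double a, double b, double *s, double *e) {
    double z;
    *s = a + b;
    z = *s - a;
    *e = (a - (*s - z)) + (b - z);
}
```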

@nsthorat

I also want to pile onto this discussion and add support for this proposal. TensorFlow.js released a new WASM backend which greatly lifts the floor for CPU-accelerated machine learning. This QFMA proposal will give us even larger wins on top of SIMD, which starts reducing the divide between CPU and WebGL accelerators (WebGL has a lot of driver issues w.r.t. precision / correctness). This proposal would really be great for machine learning on the web!

@munrocket

With this Quasi-FMA proposal you are going to bake in a specification that does not guarantee the same results on different machines and will fail CI tests, not only in the spec tests but also in other software. You will probably get about the same results in an ML model, because you use matrix multiplication, but the error will grow after each operation, and other algorithms may not be robust at all; warn everybody about this in the manuals.

Why don't you want to allow feature detection and inlined functions? It is basically the same, but it gives everyone a raw, predictable API:

QFMA(a, b, c) = isFmaAvailable() ? fma(a, b, c) : a * b + c

Other software wants correctness and robustness of execution in general. You solve precision problems from the GPU in machine learning, but you create a huge problem for other software; you literally return FP arithmetic to the 1970s, before IEEE 754, when no one could guarantee anything. Also, in your proposal qfms(a, b, c) === qfma(a, -b, c), because FP operations are not associative in general but do support negation propagation.

Also check how it is implemented in Java and Julia: they have FMAC (accurate FMA), but sometimes it is very slow. That said, it is not as slow as a BigNumber implementation, because it is based on Sylvie Boldo's theorems, at least in Julia.

FMAC(a, b, c) = isFmaAvailable() ? fma(a, b, c) : sylvie_boldo_polyfill(a, b, c)

A third variant also exists, and it could be popular in Rust, AssemblyScript and Emscripten, because Wasm has no sin/pow/ln/tan operations and these toolchains want to implement such algorithms fast and correctly.

wasm_pow(x) = isFmaAvailable() ? algo_with_fma(x) : algo_without_fma(x)

@munrocket

OK, even if we get QFMA with feature detection (#80), we can still build all of these functions, because:

FMAC(a, b, c) = isFmaAvailable() ? QFMA(a, b, c) : sylvie_boldo_polyfill(a, b, c)
wasm_pow(x, e) = isFmaAvailable() ? algo_with_fma(x, e, QFMA) : algo_without_fma(x, e)

But only if the real FMA is used when it is available.

@dtig
Member

dtig commented Mar 17, 2020

The performance numbers from the QFMA operations have been quite compelling, and @Maratyszcza's data demonstrates a 30-50% speedup over the current SIMD proposal. In the interest of standardizing the MVP of the SIMD proposal, though, one of the goals is minimizing non-determinism, and different results on different platforms are unfortunately out of scope for the current proposal as it stands. That said, these operations will be available in a future SIMD proposal that is currently under discussion and biases towards performance over consistency. Until we have a repository for that proposal, please continue to use this issue for discussion. Marking this with a Post-MVP label.
