This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Add Quasi-Fused Multiply-Add/Subtract instructions #79

Open · wants to merge 1 commit into base: main

Conversation

@Maratyszcza
Contributor

Maratyszcza commented May 29, 2019

Introduction

Most modern processors support the Fused Multiply-Add (FMA) family of instructions. These instructions compute a multiplication and an addition of floating-point numbers without intermediate rounding. Eliminating the intermediate rounding improves accuracy in most numerical algorithms, and FMA instructions also provide performance improvements on all processors that support them:

  • Intel Haswell processors achieve 32 FLOPs/cycle with FMA, but only 16 FLOPs/cycle with separate multiply+add instructions.[1]
  • AMD Piledriver and Zen processors deliver 16 FLOPs/cycle with FMA, but only 8 FLOPs/cycle with separate multiply+add instructions.[1]
  • ARM Cortex-A55 and Cortex-A75 cores achieve 8 FLOPs/cycle with FMA, but only 4 FLOPs/cycle with separate multiply+add instructions.[2,3]

However, on processors which do not support FMA instructions, it is expensive to emulate these operations exactly, without intermediate rounding.

This PR introduces the Quasi-Fused Multiply-Add (QFMA) instruction, which provides the performance and accuracy benefits of FMA where supported, while preserving compatibility with older processors. QFMA represents a + b * c with optional rounding after the multiplication. WebAssembly implementations are required to be consistent: within a module they must either always generate an FMA instruction or always generate a multiplication+addition pair for QFMA. QFMA is accompanied by the Quasi-Fused Multiply-Subtract (QFMS) instruction, which represents a - b * c with the same optional rounding after the multiplication.
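As an illustration, the two behaviors a conforming implementation may choose between look as follows in scalar C (a minimal sketch: the real instructions operate on 128-bit SIMD lanes, and the fused/unfused choice is fixed for all QFMA/QFMS instructions in a module rather than made per call site):

```c
#include <math.h>

/* Fused flavor: the product is not rounded before the addition
   (what a hardware FMA computes). */
static float qfma_fused(float a, float b, float c) {
    return fmaf(b, c, a);   /* a + b * c with a single rounding */
}

/* Unfused flavor: the product is rounded, then added (what a multiply + add
   pair computes); assumes the compiler does not itself contract the expression. */
static float qfma_unfused(float a, float b, float c) {
    return a + b * c;       /* two roundings */
}

/* QFMS represents a - b * c with the same optional rounding of the product. */
static float qfms_fused(float a, float b, float c) {
    return fmaf(-b, c, a);
}
```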

Performance Impact

Fused Multiply-Add instructions improve performance in two ways:

  • They double floating-point throughput on most processors, occupying the floating-point SIMD units for only one cycle, compared to two cycles for a separate multiplication + addition pair.
  • They reduce register pressure when the QFMA(a, b, c) result overwrites operand a. This situation is very typical in dense linear algebra and neural network computations; without FMA the implementation has to allocate a temporary register t for the result of the multiplication (t <- b * c ; a <- a + t), as sketched below.
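A scalar sketch of the second point (fmaf stands in for one lane of qfma; the real benefit shows up at the SIMD register-allocation level):

```c
#include <stddef.h>
#include <math.h>

/* Dot product with the accumulator updated in place, as QFMA allows. */
static float dot_fma(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc = fmaf(x[i], y[i], acc);   /* acc <- acc + x[i] * y[i], no temporary */
    }
    return acc;
}

/* The same loop without FMA: the product needs its own temporary t
   before it can be added to the accumulator (t <- b * c; a <- a + t). */
static float dot_mul_add(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float t = x[i] * y[i];
        acc = acc + t;
    }
    return acc;
}
```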

Evaluation on native ARM64 code

To estimate the speedup from FMA in practice, I replaced the NEON FMA intrinsics (vfmaq_f32 and vfmaq_lane_f32) in an ARM64 implementation of neural network inference with non-fused NEON intrinsics (vmlaq_f32 and vmlaq_lane_f32). Both versions were compiled with Clang to native (i.e. not WebAssembly) binaries for the ARM64 architecture and evaluated in single-threaded mode. Speedups of the FMA-based version over the version with separate multiplication + addition are presented in the table below:

Mobile phone        Xiaomi Mi A2 Lite   Pixel 2 XL   LG G8 ThinQ   Galaxy S8 (Exynos)
Processor core      Cortex-A53          Cortex-A73   Cortex-A76    Exynos-M2
MobileNet v2 [5]    2.05                2.11         2.14          1.79
Face Mesh [7]       1.83                1.67         1.75          1.63
Segmentation [8]    1.54                1.56         1.93          1.43

Across 3 neural network architectures and 4 mobile devices, the minimum speedup is 1.4X, and the speedup on the most compute-intensive neural network (MobileNet v2) exceeds 2X. I suggest that an improvement this big justifies extending the WebAssembly SIMD spec with 4 new instructions.
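For reference, the intrinsic substitution described above is of roughly this form (a minimal sketch; the real inference micro-kernels are considerably more involved):

```c
#include <arm_neon.h>

/* One accumulation step: acc += b * c on four float32 lanes.
   USE_FMA selects between the fused and the non-fused NEON intrinsic. */
static inline float32x4_t accumulate(float32x4_t acc, float32x4_t b, float32x4_t c) {
#if defined(USE_FMA)
    return vfmaq_f32(acc, b, c);   /* fused multiply-add, single rounding */
#else
    return vmlaq_f32(acc, b, c);   /* non-fused: lowered to a separate multiply and add */
#endif
}
```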

[October 3 update] Evaluation of QFMA prototype in V8

@tlively implemented experimental support for the QFMA instructions in LLVM & Clang (commit) and Binaryen (PR), and @ngzhian implemented QFMA lowering in V8 for the x86-64 (commit) and ARM64 (commit) architectures. Due to the experimental nature of the prototype toolchain, it is conservative in leveraging QFMA instructions and generates them only through an explicit intrinsic.
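In the ported source this looks roughly as follows (a sketch only: the builtin name and signature below are assumptions for illustration and may not match the prototype toolchain exactly):

```c
#include <wasm_simd128.h>

/* One accumulation step of a ported micro-kernel: acc += b * c.
   __builtin_wasm_qfma_f32x4 is an assumed spelling of the prototype's
   explicit QFMA intrinsic; without QFMA the code falls back to mul + add. */
static inline v128_t accumulate(v128_t acc, v128_t b, v128_t c) {
#if defined(USE_QFMA)
    return __builtin_wasm_qfma_f32x4(acc, b, c);
#else
    return wasm_f32x4_add(acc, wasm_f32x4_mul(b, c));
#endif
}
```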

I ported the most critical micro-kernels in the XNNPACK neural network operator library to use QFMA and evaluated its performance on 9 neural network models. The table below presents the results:

Host                 Xeon W-2135   Xiaomi Mi A2 Lite   Pixel 2 XL   LG G8 ThinQ   Galaxy S8   Galaxy S10
Processor core       Skylake       Cortex-A53          Cortex-A73   Cortex-A76    Exynos-M2   Exynos-M4
MobileNet v1 [4]     43%           17%                 36%          48%           27%         44%
MobileNet v2 [5]     38%           14%                 27%          36%           21%         35%
SSDLite [5]          36%           13%                 28%          36%           21%         45%
SqueezeNet [6]       50%           28%                 40%          41%           29%         50%
Face Mesh [7]        26%           15%                 21%          29%           18%         24%
Segmentation [8]     29%           16%                 21%          34%           18%         25%
BlazeFace [9]        27%           19%                 19%          18%           9%          15%
Hand Mesh [10]       30%           9%                  25%          38%           19%         34%
Hand Detector [10]   26%           10%                 29%          36%           16%         32%
Geomean              33%           15%                 26%          34%           19%         32%

While the speedup from QFMA in the prototype WebAssembly implementation is smaller than in native code, QFMA improved performance on all 6 evaluated devices, and on modern CPU microarchitectures (Intel Skylake, ARM Cortex-A76, Samsung Exynos-M4) it improves performance by one third on average.

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations are not required to follow the same code-generation patterns.

x86/x86-64 processors with FMA3 (but no FMA4) instruction set

These processors include Intel Haswell (and later) and AMD Zen (and later).

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f32x4.qfma(a, b, c) is lowered to one of six options:
      1. VMOVUPS xmm_d, xmm_a + VFMADD231PS xmm_d, xmm_b, xmm_c (a and c can be in-memory)
      2. VMOVUPS xmm_d, xmm_a + VFMADD231PS xmm_d, xmm_c, xmm_b (a and b can be in-memory)
      3. VMOVUPS xmm_d, xmm_b + VFMADD132PS xmm_d, xmm_a, xmm_c (b and c can be in-memory)
      4. VMOVUPS xmm_d, xmm_b + VFMADD213PS xmm_d, xmm_c, xmm_a (b and a can be in-memory)
      5. VMOVUPS xmm_d, xmm_c + VFMADD132PS xmm_d, xmm_a, xmm_b (c and b can be in-memory)
      6. VMOVUPS xmm_d, xmm_c + VFMADD213PS xmm_d, xmm_b, xmm_a (c and a can be in-memory)
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFNMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFNMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFNMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f32x4.qfms(a, b, c) is lowered to one of six options:
      1. VMOVUPS xmm_d, xmm_a + VFNMADD231PS xmm_d, xmm_b, xmm_c (a and c can be in-memory)
      2. VMOVUPS xmm_d, xmm_a + VFNMADD231PS xmm_d, xmm_c, xmm_b (a and b can be in-memory)
      3. VMOVUPS xmm_d, xmm_b + VFNMADD132PS xmm_d, xmm_a, xmm_c (b and c can be in-memory)
      4. VMOVUPS xmm_d, xmm_b + VFNMADD213PS xmm_d, xmm_c, xmm_a (b and a can be in-memory)
      5. VMOVUPS xmm_d, xmm_c + VFNMADD132PS xmm_d, xmm_a, xmm_b (c and b can be in-memory)
      6. VMOVUPS xmm_d, xmm_c + VFNMADD213PS xmm_d, xmm_b, xmm_a (c and a can be in-memory)
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f64x2.qfma(a, b, c) is lowered to one of six options:
      1. VMOVUPD xmm_d, xmm_a + VFMADD231PD xmm_d, xmm_b, xmm_c (a and c can be in-memory)
      2. VMOVUPD xmm_d, xmm_a + VFMADD231PD xmm_d, xmm_c, xmm_b (a and b can be in-memory)
      3. VMOVUPD xmm_d, xmm_b + VFMADD132PD xmm_d, xmm_a, xmm_c (b and c can be in-memory)
      4. VMOVUPD xmm_d, xmm_b + VFMADD213PD xmm_d, xmm_c, xmm_a (b and a can be in-memory)
      5. VMOVUPD xmm_d, xmm_c + VFMADD132PD xmm_d, xmm_a, xmm_b (c and b can be in-memory)
      6. VMOVUPD xmm_d, xmm_c + VFMADD213PD xmm_d, xmm_b, xmm_a (c and a can be in-memory)
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFNMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFNMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFNMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f64x2.qfms(a, b, c) is lowered to one of six options:
      1. VMOVUPD xmm_d, xmm_a + VFNMADD231PD xmm_d, xmm_b, xmm_c (a and c can be in-memory)
      2. VMOVUPD xmm_d, xmm_a + VFNMADD231PD xmm_d, xmm_c, xmm_b (a and b can be in-memory)
      3. VMOVUPD xmm_d, xmm_b + VFNMADD132PD xmm_d, xmm_a, xmm_c (b and c can be in-memory)
      4. VMOVUPD xmm_d, xmm_b + VFNMADD213PD xmm_d, xmm_c, xmm_a (b and a can be in-memory)
      5. VMOVUPD xmm_d, xmm_c + VFNMADD132PD xmm_d, xmm_a, xmm_b (c and b can be in-memory)
      6. VMOVUPD xmm_d, xmm_c + VFNMADD213PD xmm_d, xmm_b, xmm_a (c and a can be in-memory)

x86/x86-64 processors with FMA3 and FMA4 instruction sets

These processors include AMD Piledriver, AMD Steamroller, and AMD Excavator, but not AMD Zen.

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f32x4.qfma(a, b, c) is lowered to one of two options:
      1. VFMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
      2. VFMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD231PS xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFNMADD231PS xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PS xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFNMADD213PS xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PS xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFNMADD213PS xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f32x4.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
      2. VFNMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f64x2.qfma(a, b, c) is lowered to one of two options:
      1. VFMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
      2. VFMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD231PD xmm_a, xmm_b, xmm_c (c can be in-memory)
      2. VFNMADD231PD xmm_a, xmm_c, xmm_b (b can be in-memory)
    • b = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PD xmm_b, xmm_a, xmm_c (c can be in-memory)
      2. VFNMADD213PD xmm_b, xmm_c, xmm_a (a can be in-memory)
    • c = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADD132PD xmm_c, xmm_a, xmm_b (b can be in-memory)
      2. VFNMADD213PD xmm_c, xmm_b, xmm_a (a can be in-memory)
    • d = f64x2.qfms(a, b, c) is lowered to one of two options:
      1. VFNMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
      2. VFNMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)

x86/x86-64 processors with FMA4 (but no FMA3) instruction set

AMD Bulldozer is the only family of such processors.

  • d = f32x4.qfma(a, b, c) is lowered to one of two options:
    1. VFMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
    2. VFMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • d = f32x4.qfms(a, b, c) is lowered to one of two options:
    1. VFNMADDPS xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
    2. VFNMADDPS xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • d = f64x2.qfma(a, b, c) is lowered to one of two options:
    1. VFMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
    2. VFMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)
  • d = f64x2.qfms(a, b, c) is lowered to one of two options:
    1. VFNMADDPD xmm_d, xmm_b, xmm_c, xmm_a (a or c can be in-memory)
    2. VFNMADDPD xmm_d, xmm_c, xmm_b, xmm_a (a or b can be in-memory)

ARM64 processors

All ARM64 application processors support SIMD with FMA.

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to FMLA Va.4S, Vb.4S, Vc.4S
    • d = f32x4.qfma(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLA Va.4S, Vb.4S, Vc.4S
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to FMLS Va.4S, Vb.4S, Vc.4S
    • d = f32x4.qfms(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLS Va.4S, Vb.4S, Vc.4S
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to FMLA Va.2D, Vb.2D, Vc.2D
    • d = f64x2.qfma(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLA Va.2D, Vb.2D, Vc.2D
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to FMLS Va.2D, Vb.2D, Vc.2D
    • d = f64x2.qfms(a, b, c) is lowered to MOV Vd.16B, Va.16B + FMLS Va.2D, Vb.2D, Vc.2D

ARMv7 processors with NEONv2 (NEON-FMA) instruction set

Most 32-bit ARM application processors support SIMD (NEON) with FMA instructions.

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to VFMA.F32 q_a, q_b, q_c
    • d = f32x4.qfma(a, b, c) is lowered to VMOV q_d, q_a + VFMA.F32 q_d, q_b, q_c
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to VFMS.F32 q_a, q_b, q_c
    • d = f32x4.qfms(a, b, c) is lowered to VMOV q_d, q_a + VFMS.F32 q_d, q_b, q_c
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to VFMA.F64 d_a0, d_b0, d_c0 + VFMA.F64 d_a1, d_b1, d_c1
    • d = f64x2.qfma(a, b, c) is lowered to VMOV q_d, q_a + VFMA.F64 d_d0, d_b0, d_c0 + VFMA.F64 d_d1, d_b1, d_c1
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to VFMS.F64 d_a0, d_b0, d_c0 + VFMS.F64 d_a1, d_b1, d_c1
    • d = f64x2.qfms(a, b, c) is lowered to VMOV q_d, q_a + VFMS.F64 d_d0, d_b0, d_c0 + VFMS.F64 d_d1, d_b1, d_c1

ARMv7 processors with NEON (but without FMA) instruction set

ARM Cortex-A8, ARM Cortex-A9, and Qualcomm Scorpion are the only significant cores that support SIMD (NEON) but not the FMA extension.

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to VMLA.F32 q_a, q_b, q_c (note: multiply-add with intermediate rounding)
    • d = f32x4.qfma(a, b, c) is lowered to VMUL.F32 q_d, q_b, q_c + VADD.F32 q_d, q_a, q_d
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to VMLS.F32 q_a, q_b, q_c (note: multiply-subtract with intermediate rounding)
    • d = f32x4.qfms(a, b, c) is lowered to VMUL.F32 q_d, q_b, q_c + VSUB.F32 q_d, q_a, q_d
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to VMLA.F64 d_a0, d_b0, d_c0 + VMLA.F64 d_a1, d_b1, d_c1 (note: multiply-add with intermediate rounding)
    • d = f64x2.qfma(a, b, c) is lowered to VMUL.F64 d_d0, d_b0, d_c0 + VMUL.F64 d_d1, d_b1, d_c1 + VADD.F64 d_d0, d_a0, d_d0 + VADD.F64 d_d1, d_a1, d_d1
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to VMLS.F64 d_a0, d_b0, d_c0 + VMLS.F64 d_a1, d_b1, d_c1 (note: multiply-subtract with intermediate rounding)
    • d = f64x2.qfms(a, b, c) is lowered to VMUL.F64 d_d0, d_b0, d_c0 + VMUL.F64 d_d1, d_b1, d_c1 + VSUB.F64 d_d0, d_a0, d_d0 + VSUB.F64 d_d1, d_a1, d_d1

POWER processors with VSX instruction set

IBM POWER processors starting with POWER7

  • f32x4.qfma
    • a = f32x4.qfma(a, b, c) is lowered to XVMADDASP x_a, x_b, x_c
    • b = f32x4.qfma(a, b, c) is lowered to XVMADDMSP x_b, x_c, x_a
    • c = f32x4.qfma(a, b, c) is lowered to XVMADDMSP x_c, x_b, x_a
    • d = f32x4.qfma(a, b, c) is lowered to VMR x_d, x_a + XVMADDASP x_d, x_b, x_c
  • f32x4.qfms
    • a = f32x4.qfms(a, b, c) is lowered to XVNMSUBASP x_a, x_b, x_c
    • b = f32x4.qfms(a, b, c) is lowered to XVNMSUBMSP x_b, x_c, x_a
    • c = f32x4.qfms(a, b, c) is lowered to XVNMSUBMSP x_c, x_b, x_a
    • d = f32x4.qfms(a, b, c) is lowered to VMR x_d, x_a + XVNMSUBASP x_d, x_b, x_c
  • f64x2.qfma
    • a = f64x2.qfma(a, b, c) is lowered to XVMADDADP x_a, x_b, x_c
    • b = f64x2.qfma(a, b, c) is lowered to XVMADDMDP x_b, x_c, x_a
    • c = f64x2.qfma(a, b, c) is lowered to XVMADDMDP x_c, x_b, x_a
    • d = f64x2.qfma(a, b, c) is lowered to VMR x_d, x_a + XVMADDADP x_d, x_b, x_c
  • f64x2.qfms
    • a = f64x2.qfms(a, b, c) is lowered to XVNMSUBADP x_a, x_b, x_c
    • b = f64x2.qfms(a, b, c) is lowered to XVNMSUBMDP x_b, x_c, x_a
    • c = f64x2.qfms(a, b, c) is lowered to XVNMSUBMDP x_c, x_b, x_a
    • d = f64x2.qfms(a, b, c) is lowered to VMR x_d, x_a + XVNMSUBADP x_d, x_b, x_c

Other processors and instruction sets

  • f32x4.qfma
    • d = f32x4.qfma(a, b, c) is lowered like d = f32x4.add(a, f32x4.mul(b, c))
  • f32x4.qfms
    • d = f32x4.qfms(a, b, c) is lowered like d = f32x4.sub(a, f32x4.mul(b, c))
  • f64x2.qfma
    • d = f64x2.qfma(a, b, c) is lowered like d = f64x2.add(a, f64x2.mul(b, c))
  • f64x2.qfms
    • d = f64x2.qfms(a, b, c) is lowered like d = f64x2.sub(a, f64x2.mul(b, c))

References

[1] Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs, by Agner Fog
[2] ARM Cortex-A55 Software Optimization Guide
[3] ARM Cortex-A75 Software Optimization Guide
[4] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
[5] MobileNetV2: Inverted Residuals and Linear Bottlenecks
[6] SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
[7] Real-Time AR Self-Expression with Machine Learning
[8] Mobile Real-time Video Segmentation
[9] BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs
[10] On-Device, Real-Time Hand Tracking with MediaPipe

@gnzlbg
Contributor

gnzlbg commented May 29, 2019

Is this equivalent to LLVM's fmul_add? If not, could you expand on how they differ?

@sunfishcode
Member

Thanks for the thorough analysis! I agree that FMA is desirable, and it's always been about "how" and "when" rather than "if".

In light of the decision to remove the subnormal nondeterminism, effectively declaring 32-bit ARM NEON implementations non-conforming, another option here would be to just add a regular FMA instruction, without the nondeterminism. This has portability downsides, but on the other hand it would address some use cases not covered by qfma, because some algorithms depend on the fused behavior of FMA. And in the long term, if we believe new hardware platforms will generally support FMA, the platform would benefit from greater simplicity and determinism.

A variant would be to standardize simd128 without FMA, and then add FMA as a separate feature, which implementations could decide to implement, and toolchains could decide to depend on, separately from the base simd128. In the long term, perhaps all implementations would support FMA, while in the short term this might allow for a transition period.

Is this equivalent to LLVM's fmul_add? If not, could you expand on how they differ?

LLVM's fmul_add leaves it entirely nondeterministic whether the operation is fused or not. The description of qfma above says "WebAssembly implementations are required to be consistent, and either always generate FMA instruction, or always generate multiplication+addition pair for a QFMA instruction within a module." This addresses an important application constraint, but also raises some interesting questions.

  • It would no longer be semantics-preserving for a tool to naively split parts of a module into separate modules, because a module may be depending on the guarantee that FMA nondeterminism is consistent within a module.
  • Toolchains which support inlining across module boundaries might still hit discontinuity problems if a user function is inlined into multiple modules, and then at runtime there are differences in rounding between modules.

@gnzlbg
Contributor

gnzlbg commented May 29, 2019

LLVM's fmul_add leaves it entirely nondeterministic whether the operation is fused or not. The description of qfma above says "WebAssembly implementations are required to be consistent, and either always generate FMA instruction, or always generate multiplication+addition pair for a QFMA instruction within a module."

I thought that LLVM's fmul_add generates a fusedMultiplyAdd if the target has an instruction for this and if this instruction is faster than a multiply followed by an add. That is, all invocations of fmul_add for a particular target do exactly the same thing, as opposed to deciding what is more efficient on each invocation. I might have misunderstood this.

A variant would be to standardize simd128 without FMA, and then add FMA as a separate feature, which implementations could decide to implement, and toolchains could decide to depend on, separately from the base simd128. In the long term, perhaps all implementations would support FMA, while in the short term this might allow for a transition period.

Note that under this proposal a module that requires FMA would fail to validate on an implementation that does not support FMA, so such a transition period might result in users having to ship multiple modules (with and without FMA). Something like #80 would help here.

@penzn mentioned this pull request May 29, 2019
@Maratyszcza
Contributor Author

@gnzlbg qfma is similar to fmul_add. But IIUC, for fmul_add it would be permissible to lower to FMA in one situation (e.g. when the result overwrites the accumulator), but to multiplication+addition in another (e.g. when the result doesn't overwrite any inputs). For qfma both cases would have to lower to the same kind of instructions (either both to FMA, or both to multiply + add).

@Maratyszcza
Contributor Author

another option here would be to just add a regular FMA instruction, without the nondeterminism.

There are plenty of new x86 CPUs which don't support the FMA operation. First, the low-power series from Intel (Goldmont/Goldmont Plus microarchitectures) and AMD (Jaguar microarchitecture) lack FMA instructions. Secondly, Intel has a long history of stripping any post-SSE4 ISA extension from the low-cost versions (i.e. Celeron & Pentium) of its mainstream architectures. E.g. the Pentium Gold G5600 is based on the latest Coffee Lake microarchitecture, but doesn't support the AVX and FMA instruction sets.

LLVM's fmul_add leaves it entirely nondeterministic whether the operation is fused or not. The description of qfma above says "WebAssembly implementations are required to be consistent, and either always generate FMA instruction, or always generate multiplication+addition pair for a QFMA instruction within a module."

I expect that the decision whether to use FMA would be based only on hardware characteristics, i.e. whether the processor supports FMA and whether FMA is fast on that processor. Unlike LLVM's fmul_add, a Wasm code generator can't choose the FMA or mul+add form depending on, e.g., which input operand gets overwritten by the qfma result. Then it is valid both to split a Wasm module and to do inlining across modules.
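A sketch of that intended engine behavior (the helper and flag names are hypothetical; the point is only that the choice is made once from hardware capabilities and then applied uniformly to every qfma/qfms site):

```c
#include <stdbool.h>

/* Hypothetical capability query provided by the engine. */
extern bool cpu_has_fast_fma(void);

/* The fused/unfused decision is made once... */
static bool lower_qfma_fused;

void configure_qfma_lowering(void) {
    lower_qfma_fused = cpu_has_fast_fma();
}

/* ...and every qfma/qfms lowering consults the same flag. Per-site heuristics
   (e.g. which operand the result overwrites) are not allowed, so splitting a
   module or inlining across modules stays semantics-preserving. */
```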

@Maratyszcza
Contributor Author

Maratyszcza commented May 30, 2019

It seems that I misused the concept of "WebAssembly module". By "WebAssembly module" I meant everything in a ".wasm" or ".wast" file, but "WebAssembly module" seems to have a different meaning in the Wasm spec. What would be the right wording here?

@dtig
Member

dtig commented Jul 18, 2019

These are sometimes used interchangeably, but a WebAssembly module is the result of compiling Wasm bytecode. Depending on the context you may be looking for different words - for everything in a .wasm file, it could just be the Wasm binary, but in the cases where there are references to failing to validate/compile, module would still be the right wording IMO.

Thanks for the thorough analysis! Having thought about this a little more, I have a few concerns about the inclusion of this in the SIMD MVP. Firstly, the performance inconsistencies between FMA and non-FMA hardware for this set of operations would be hard to miss. We have usually tried to stick to instructions that are available across the board, setting SSE4.1 and NEON as thresholds to guarantee performance across a large cross-section of devices. Introducing FMA makes this predictable performance across devices somewhat nebulous.

Secondly, there is a compromise to be made here with the inclusion of QFMA: it's not strictly FMA, and including plain FMA without the non-determinism hurts portability. It is true that this is already the case due to the removal of subnormal determinism, but I'm more comfortable with that because it is consistent with the existing Wasm spec and the number of devices it would affect is small and shrinking, whereas the FMA non-determinism has a larger surface area.

I'm leaning towards having FMA without the non-determinism, but punting this to Post-MVP.

That said, we have an open V8 bug to prototype/experiment with variants of this and their polyfills on non-FMA hardware. I'm labeling this with "pending prototype data" to follow up after we have a prototype.

@Maratyszcza
Contributor Author

Maratyszcza commented Oct 4, 2019

Thanks to the great work of @tlively, who added QFMA to LLVM, Clang, and Binaryen, and @ngzhian, who implemented QFMA in V8, it is now possible to try these instructions on real workloads. I evaluated a prototype QFMA-enabled implementation of neural network inference and updated the top-level post with the new results.

@zeux
Contributor

zeux commented Dec 6, 2019

FWIW from my point of view, this is a good addition.

In the native code [I am used to], non-determinism between different compilers / architectures is a given - different optimizations / lowering / precision of different operations are just a fact of life. Within these constraints, automatic FMA substitution (-ffast-math -mfma or equivalent) usually gives appreciable performance gains in the 10-20% range on real floating-point-heavy code. In the native world, it's not always easy to exercise this because you don't know whether FMA is supported on the target platform, and compiling multiple versions usually doesn't work out of the box. But in WebAssembly, thanks to JIT compilation, something like qfma can work well.

So in practice:

  • Without -ffast-math, I wouldn't expect LLVM to use qfma outside of explicit intrinsic calls, because it doesn't have a consistently defined behavior. So we retain de-facto determinism.
  • With -ffast-math, there are going to be noticeable performance gains on a lot of real-world floating-point-heavy benchmarks. With -ffast-math the author of the code is explicitly saying "I am willing to tolerate non-determinism".
  • Even without -ffast-math, this is a new, separate intrinsic call - in areas where performance is vital, an application could use it consciously.

(I understand that this runs contrary to the desire to get identical results between different platforms... but the reality for SIMD is that for floating-point math, often performance trumps determinism in my experience. Of course in theory we could include instructions like qfma and estimated rsqrt in a separate proposal post-MVP, but this gets hard to deal with in practice when the actual support for different instruction sets between implementations varies)

@nfrechette

My experience runs contrary to yours, @zeux . In my tests, FMA instructions are always slower under x64. I did notice fast-math automatically generating them and it is one of the many reasons why I hate fast-math. FMA instructions are designed for throughput, not latency. With modern processors easily being able to dual issue mul/add pairs, FMA offers little benefit aside from reducing the code size and potentially using fewer registers (although in practice I haven't observed this either). Here are some benchmarks I ran with it in my own math lib here. I had high hopes for it and judging from the replies I got on twitter, FMA is underwhelming. It seems to me that it would perform unusually well in a setup like neural network evaluation because it is most likely throughput bound.

Fast-math also doesn't give a gain anywhere near what you claim in the workloads I have seen, not for a long time. This might have been true at some point in the past but not with VS2017 and later generations. I disabled fast-math in my animation compression code because determinism is more important than whatever gain might arise. I saw no impact on the performance after turning it off. 3dsmax 2019 also disabled fast-math when they switched to VS2017. While some workloads saw a minor impact, overall we only saw benefits from turning it off.

Determinism matters more than performance unless you really know what you are doing, and I would argue that if you do, then fast-math offers little benefit because you can do what the compiler does by hand. Fast-math means that from compiler version to version, behavior can change, sometimes drastically, as we've seen with VS2017. It introduced sin/cos pair calculation through SSE float32 arithmetic while prior compilers calculated both separately with float64 arithmetic (the VS stdlib does this for transcendental functions). This leads to a dramatic reduction in accuracy and can lead to visual artifacts and other undesired behavior. Visual Studio defaults to precise math and that is a sane default IMO. I hope V8 & co. use a similar default.

@zeux
Contributor

zeux commented Dec 6, 2019

In my tests, FMA instructions are always slower under x64.

This obviously depends on the workload. In (some) floating point heavy code that I see, FMA results in performance gains.

FMA instructions are designed for throughput, not latency.

On modern Intel processors the latency is the same as that of multiplication, no? So you're making the latency strictly shorter by (potentially) reducing the critical path, or reducing the port pressure, allowing for more co-issue opportunities. On Skylake, both multiplication and FMA have 0.5 cycles reciprocal throughput and 4 cycles latency. So on computations that have a lot of opportunity for fusing, you're strictly winning - you're not going to lose. Of course this can vary with architecture.

Fast-math also doesn't give a gain anywhere near what you claim in the workloads I have seen, not for a long time.

I'm specifically referring to floating-point-heavy, matrix-like code that's dominated by multiplies and adds, compiled with clang. Please refer to the numbers posted by @Maratyszcza for even larger gains; the gains I'm used to are more moderate. I'm not sure to what extent it's valid to compare this on Visual Studio.

Visual studio defaults to precise math and that is a sane default IMO. I hope V8 & cie use a similar default.

The default is always precise math, so this seems orthogonal? My point is precisely that if we do have qfma support, the developer of the code is in control - they can enable the use of the fused instruction for extra performance gains, or keep it off. If we don't have qfma support, we don't have this optimization opportunity.

@zeux
Contributor

zeux commented Dec 6, 2019

Here's a motivating example from my experiments (caveats apply, ymmv, etc etc.):

This is an implementation of a slerp optimization from https://zeux.io/2015/07/23/approximating-slerp/ + https://zeux.io/2016/05/05/optimizing-slerp/ for fitted nlerp (the middle ground between nlerp and the much more precise version, onlerp):

Baseline: https://gcc.godbolt.org/z/w-j8QW, 64.97 cycles per loop iteration
fast-math: https://gcc.godbolt.org/z/hVsk2B, 62.96 cycles per loop iteration
fast-math + avx2: https://gcc.godbolt.org/z/7mvWPe, 55.82 cycles per loop iteration
fast-math + fma: https://gcc.godbolt.org/z/XRKt4C, 40.79 cycles per loop iteration

I would expect that the results produced with llvm-mca closely match the actual timing results of a stream transform using this function, e.g. "given two streams of quaternions, interpolate between them".

@lemaitre

lemaitre commented Dec 6, 2019

Baseline: https://gcc.godbolt.org/z/w-j8QW, 64.97 cycles per loop iteration
fast-math: https://gcc.godbolt.org/z/hVsk2B, 62.96 cycles per loop iteration
fast-math + fma: https://gcc.godbolt.org/z/XRKt4C, 40.79 cycles per loop iteration

Beware that your baseline is in SSE while fast-math+fma is in AVX2.
Even if you write SSE intrinsics, the compiler will generate 128-bit AVX2 instructions if you specify that the target supports AVX2.
If you add -mavx2 to your baseline, you will see that no FMA is used, but all instructions use VEX encoding.
The effect here is to drop the time from 64.97 to 57.82 cycles per loop iteration (presumably because of the 3-operand instructions and the broadcast instruction).

This does not change the overall conclusion, though.

As a side note, since version 5, GCC automatically fuses MULs and ADDs into FMAs by default at -O2 without -ffast-math and even for intrinsics code.

@lemaitre

lemaitre commented Dec 6, 2019

@nfrechette Using FMAs cannot be slower on recent hardware, because the FMA instruction has the exact same latency and throughput as the multiplication (latency 4c, throughput 2/c on Skylake).
Basically, the addition is for free.

-ffast-math can sometimes be slower than without, but this is always a compiler bug (it should not do the transformation if it is actually slower).
But FMA is only a tiny part of -ffast-math.
I am pretty sure your slowdown was never about FMAs, but about some other "optimizations" from fast-math.

Also, you mention MSVC, but beware that MSVC lags behind all its competitors when it comes to speed.
If I take the simple code from @zeux and try to compile it with both clang and MSVC (https://gcc.godbolt.org/z/KarV3c), you will see that MSVC does not use any FMA at all (in this example) and is not able to keep all variables in registers (spill code at the beginning).

@zeux
Contributor

zeux commented Dec 6, 2019

Beware that your baseline is in SSE while fast-math+fma is in AVX2.

Thanks! Good catch, I forgot about this. I've updated the post to include fast-math avx2 (55.82 cycles). As you say it doesn't change the overall conclusion much, FMA vs AVX2 here is 55.82 / 40.79 = 1.36x speedup.

@dsmilkov

dsmilkov commented Dec 6, 2019

Thank you for this proposal! We just released an alpha version of the TensorFlow.js WASM backend, which was one of our most requested features. Benchmarks show 30-50% speedup with QFMA on various ML models, on top of existing SIMD. Adding these instructions would greatly benefit machine learning and numerical computation libraries in general.

@nfrechette

nfrechette commented Dec 7, 2019 via email

@zeux
Contributor

zeux commented Dec 7, 2019

With fmadd having a latency of 5 on older CPUs, you still save latency, because the mul+add pair has a combined latency of 6 cycles due to the dependency. You don't save as much as you do on Skylake, of course.

If I take the benchmark I posted earlier and ask llvm-mca to generate timings for Broadwell, I get 48.51 cycles for AVX2 fast-math version and 42.29 cycles for FMA fast-math version, which is still 15% faster.

The beauty of this proposal is that even if somehow there are CPUs that execute FMA slower than mul+add pairs (although I still don't see how / when this would be possible?), the code generation engine can disable it. Clearly there are demonstrable, noticeable wins on multiple workloads on modern Intel chips; it seems like we would not have much to discuss if the faster execution didn't come at the price of having different results on different CPUs.

FWIW curiously gcc fuses mul+add into fmadd by default at -O2 when targeting ISAs with FMA at least for x64, without requiring -ffast-math at all.

@munrocket

Am I right that QFMA cares only about performance and is not correctly rounded with respect to the true FMA operation in IEEE 754-2008? There is another way to polyfill FMA with respect to correctness. It is much harder than simply RN(a + RN(b * c)), but it is useful for applications with arbitrary-precision arithmetic.
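For context on why a correctly rounded software FMA is much harder than RN(a + RN(b * c)): it is built from error-free transformations such as Veltkamp splitting, Dekker's product and Knuth's two-sum. The sketch below shows only these building blocks (double precision, round-to-nearest, no overflow/underflow), not the full correctly rounded recombination from Boldo and Muller's work:

```c
/* Veltkamp splitting: a == hi + lo, with each part narrow enough
   that products of parts are exact. */
static void split(double a, double *hi, double *lo) {
    double c = 134217729.0 * a;   /* 2^27 + 1 */
    *hi = c - (c - a);
    *lo = a - *hi;
}

/* Dekker's product: x = RN(a*b) and y such that a*b == x + y exactly. */
static void two_product(double a, double b, double *x, double *y) {
    double ah, al, bh, bl;
    *x = a * b;
    split(a, &ah, &al);
    split(b, &bh, &bl);
    *y = ((ah * bh - *x) + ah * bl + al * bh) + al * bl;
}

/* Knuth's two-sum: s = RN(a+b) and e such that a+b == s + e exactly. */
static void two_sum(double a, double b, double *s, double *e) {
    double z;
    *s = a + b;
    z = *s - a;
    *e = (a - (*s - z)) + (b - z);
}
```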

@nsthorat

I also want to pile onto this discussion and add support for this proposal. TensorFlow.js released a new WASM backend which greatly lifts the floor for CPU-accelerated machine learning. This QFMA proposal will give us even larger wins on top of SIMD, which starts reducing the divide between CPU and WebGL accelerators (WebGL has a lot of driver issues w.r.t. precision / correctness). This proposal would really be great for machine learning on the web!

@munrocket

With this Quasi-FMA proposal you are going to bake in a specification that does not guarantee the same results on different machines and will fail CI tests, not only in the spec tests but also in other software. You will probably get about the same results in an ML model, because you use matrix multiplication, but the error will grow after each operation, and other algorithms may not be robust at all; warn everybody about this in the manuals.

Why don't you want to allow feature detection and inlined functions? It is basically the same, but it gives everyone a raw, predictable API:

QFMA(a, b, c) = isFmaAvailable() ? fma(a, b, c) : a * b + c

Other software wants correctness and robustness of execution in general. You solve precision problems from the GPU in machine learning, but you create a huge problem for other software; you literally return FP arithmetic to the 1970s, before IEEE 754, when no one could guarantee anything. Also, in your proposal qfms(a, b, c) === qfma(a, -b, c), because FP operations are not associative in general but do support negation propagation.

Also check how it is implemented in Java and Julia: they have FMAC (accurate FMA), but sometimes it is very slow. That said, it is not as slow as a BigNumber implementation, because it is based on Sylvie Boldo's theorems, at least in Julia.

FMAC(a, b, c) = isFmaAvailable() ? fma(a, b, c) : sylvie_boldo_polyfill(a, b, c)

A third variant also exists, and it could be popular in Rust, AssemblyScript and Emscripten, because Wasm has no sin/pow/ln/tan operations and these toolchains want to implement such algorithms fast and correctly.

wasm_pow(x) = isFmaAvailable() ? algo_with_fma(x) : algo_without_fma(x)

@munrocket

OK, even if we get QFMA with feature detection (#80), we can still build all of these functions, because:

FMAC(a, b, c) = isFmaAvailable() ? QFMA(a, b, c) : sylvie_boldo_polyfill(a, b, c)
wasm_pow(x, e) = isFmaAvailable() ? algo_with_fma(x, e, QFMA) : algo_without_fma(x, e)

But only if the real FMA is used when it is available.

@dtig
Member

dtig commented Mar 17, 2020

The performance numbers from the QFMA operations have been quite compelling, and @Maratyszcza's data demonstrates a 30-50% speedup over the current SIMD proposal. In the interest of standardizing the MVP of the SIMD proposal, though, one of the goals is minimizing non-determinism, and different results on different platforms are unfortunately out of scope for the current proposal as it stands. That said, these operations will be available in a future SIMD proposal that is currently under discussion and biases towards performance over consistency. Until we have a repository for that proposal, please continue to use this issue for discussion. Marking this with a Post-MVP label.
