Folding integer additions with operands of mixed bit widths #228

bjacob · 2020-05-12T18:09:10Z

ARM NEON has pairwise-folding addition instructions where pairs of narrow (e.g. 8-bit) input lanes are added together and the sums are written (SADDLP) or accumulated (SADALP) into wider (e.g. 16-bit) integer lanes.
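As a rough illustration of the semantics, here is a scalar sketch of the 8-bit to 16-bit case on a 128-bit vector. The function names `saddlp_s8_model` and `sadalp_s8_model` are made up for this example and are not intrinsics; the accumulating variant is assumed to wrap rather than saturate.

```c
#include <stdint.h>

/* Scalar model of a widening pairwise add (SADDLP-like, 8-bit -> 16-bit):
   each output lane is the sum of two adjacent sign-extended input lanes. */
void saddlp_s8_model(const int8_t in[16], int16_t out[8]) {
    for (int i = 0; i < 8; ++i) {
        /* The sum of two int8 values always fits in int16. */
        out[i] = (int16_t)((int16_t)in[2 * i] + (int16_t)in[2 * i + 1]);
    }
}

/* Scalar model of the accumulating variant (SADALP-like): the same
   pairwise widening sum is added to an existing 16-bit accumulator.
   The addition goes through unsigned arithmetic so that it wraps
   modulo 2^16 on mainstream compilers instead of overflowing a
   signed type. */
void sadalp_s8_model(int16_t acc[8], const int8_t in[16]) {
    for (int i = 0; i < 8; ++i) {
        int32_t pair = (int32_t)in[2 * i] + (int32_t)in[2 * i + 1];
        acc[i] = (int16_t)(uint16_t)((uint16_t)acc[i] + (uint16_t)pair);
    }
}
```

With 16 input lanes and 8 output lanes, one instruction performs 8 additions in the plain form and 16 in the accumulating form.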

This is in addition to plain pairwise-folding additions with all operands of the same bit width, like SADDP.

An extreme case of such folding is the dot-product instructions (SDOT, see PR #127), where the folding addition is performed 4-fold. When one of the source operands has all lanes set to 1, this acts as a 4-fold addition of 8-bit values into 32-bit accumulators.
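The all-ones trick can be sketched in scalar C as follows. This models only the vector form on 128-bit registers; `sdot_model` and `fold4_add_model` are hypothetical names chosen for this example.

```c
#include <stdint.h>

/* Scalar model of a signed 4-way dot product (SDOT-like): each 32-bit
   accumulator lane receives the sum of four 8-bit x 8-bit products
   taken from the matching group of four input lanes. */
void sdot_model(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 4; ++i) {
        int32_t sum = 0;
        for (int j = 0; j < 4; ++j) {
            sum += (int32_t)a[4 * i + j] * (int32_t)b[4 * i + j];
        }
        /* Accumulate through unsigned arithmetic so the add wraps. */
        acc[i] = (int32_t)((uint32_t)acc[i] + (uint32_t)sum);
    }
}

/* With the second operand set to all ones, the dot product reduces to
   a 4-fold folding addition of 8-bit lanes into 32-bit accumulators. */
void fold4_add_model(int32_t acc[4], const int8_t a[16]) {
    int8_t ones[16];
    for (int k = 0; k < 16; ++k) {
        ones[k] = 1;
    }
    sdot_model(acc, a, ones);
}
```

In this model a single call folds 16 narrow lanes into 4 wide accumulators, compared with 16 lanes into 8 for the pairwise form.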

This combination of folding behavior and mixed bit widths makes it possible to maximize the number of scalar operations performed per instruction.

This is widely used in integer arithmetic applications. One example is matrix multiplication kernels using plain NEON without SDOT, which multiply 8-bit input values into 16-bit local products (see Issue #226), then pairwise-fold those 16-bit products into 32-bit accumulators:
https://github.com/google/ruy/blob/808ff748e0c7dc746a413fe45fa022d63e6253e8/ruy/kernel_arm64.cc#L1233
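The multiply-then-fold pattern described above can be sketched in scalar C. This is a simplified model of one step on 8 lanes, not a transcription of the linked kernel, and `mul_fold_model` is a hypothetical name.

```c
#include <stdint.h>

/* One step of the pattern: widen-multiply 8-bit inputs into 16-bit
   products, then pairwise-fold those products into 32-bit accumulators. */
void mul_fold_model(int32_t acc[4], const int8_t lhs[8], const int8_t rhs[8]) {
    int16_t prod[8];

    /* Widening multiply: an int8 x int8 product lies in [-16256, 16384],
       so it always fits in a 16-bit lane. */
    for (int i = 0; i < 8; ++i) {
        prod[i] = (int16_t)((int16_t)lhs[i] * (int16_t)rhs[i]);
    }

    /* Pairwise folding accumulate, 16-bit -> 32-bit: adjacent products
       are summed and added to the matching 32-bit accumulator lane.
       The add goes through unsigned arithmetic so that it wraps. */
    for (int i = 0; i < 4; ++i) {
        int32_t pair = (int32_t)prod[2 * i] + (int32_t)prod[2 * i + 1];
        acc[i] = (int32_t)((uint32_t)acc[i] + (uint32_t)pair);
    }
}
```

Because the fold widens to 32 bits, the sum of two 16-bit products cannot overflow at the folding step in this model, even for the extreme input `-128 * -128`.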

Maratyszcza · 2021-01-14T21:17:06Z

This is partially covered by the Extended Pairwise Addition instructions (#380).
