
v128.const and v128.shuffle blow up code size #255

Open
Maratyszcza opened this issue Jun 12, 2020 · 5 comments

Comments

@Maratyszcza
Contributor

v128.const and v128.shuffle instructions are always 18 bytes long, which is excessive in the majority of cases. For comparison, native shuffle instructions on x86-64 are at most five bytes long, even though they are more restricted in their shuffle patterns. As an example, here's an implementation of a 4x4 matrix transpose in WAsm SIMD and SSE intrinsics:

#include <wasm_simd128.h>

void transpose4x4(float matrix[16]) {
	const v128_t row0 = wasm_v128_load(matrix);
	const v128_t row1 = wasm_v128_load(matrix + 4);
	const v128_t row2 = wasm_v128_load(matrix + 8);
	const v128_t row3 = wasm_v128_load(matrix + 12);

	// interleave the low and high 32-bit lanes of each pair of rows
	const v128_t tmp0 = wasm_v32x4_shuffle(row0, row1, 0, 4, 1, 5);
	const v128_t tmp1 = wasm_v32x4_shuffle(row2, row3, 0, 4, 1, 5);
	const v128_t tmp2 = wasm_v32x4_shuffle(row0, row1, 2, 6, 3, 7);
	const v128_t tmp3 = wasm_v32x4_shuffle(row2, row3, 2, 6, 3, 7);

	// combine 64-bit halves to form the transposed columns
	const v128_t col0 = wasm_v32x4_shuffle(tmp0, tmp1, 0, 1, 4, 5);
	const v128_t col1 = wasm_v32x4_shuffle(tmp0, tmp1, 2, 3, 6, 7);
	const v128_t col2 = wasm_v32x4_shuffle(tmp2, tmp3, 0, 1, 4, 5);
	const v128_t col3 = wasm_v32x4_shuffle(tmp2, tmp3, 2, 3, 6, 7);

	wasm_v128_store(matrix, col0);
	wasm_v128_store(matrix + 4, col1);
	wasm_v128_store(matrix + 8, col2);
	wasm_v128_store(matrix + 12, col3);
}

#include <xmmintrin.h>

void transpose(float matrix[16]) {
	const __m128 row0 = _mm_loadu_ps(matrix);
	const __m128 row1 = _mm_loadu_ps(matrix + 4);
	const __m128 row2 = _mm_loadu_ps(matrix + 8);
	const __m128 row3 = _mm_loadu_ps(matrix + 12);

	const __m128 tmp0 = _mm_unpacklo_ps(row0, row1);
	const __m128 tmp2 = _mm_unpacklo_ps(row2, row3);
	const __m128 tmp1 = _mm_unpackhi_ps(row0, row1);
	const __m128 tmp3 = _mm_unpackhi_ps(row2, row3);
	
	const __m128 col0 = _mm_movelh_ps(tmp0, tmp2);
	const __m128 col1 = _mm_movehl_ps(tmp2, tmp0);
	const __m128 col2 = _mm_movelh_ps(tmp1, tmp3);
	const __m128 col3 = _mm_movehl_ps(tmp3, tmp1);

	_mm_storeu_ps(matrix, col0);
	_mm_storeu_ps(matrix + 4, col1);
	_mm_storeu_ps(matrix + 8, col2);
	_mm_storeu_ps(matrix + 12, col3);
}

The x86-64 SSE version compiles to 67 bytes of code, but the WAsm SIMD version produces 231 bytes of code, 3.5x larger.

I suggest that WAsm SIMD define more restricted variants of v128.shuffle and v128.const that could be encoded in a more concise representation.
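
For reference, here is where the 18 bytes go, as I read the proposal's binary encoding: a 0xFD SIMD prefix byte, an opcode byte, and a 16-byte lane-index immediate (v128.const is the same story, 0xFD 0x0C plus 16 literal bytes). The sketch below shows the first shuffle from the transpose; each 32-bit lane k of an operand expands to bytes 4k..4k+3, and the second operand's lanes start at byte index 16:

// wasm_v32x4_shuffle(row0, row1, 0, 4, 1, 5) as encoded in the binary format:
// 1 prefix byte + 1 opcode byte + 16 lane-index immediates = 18 bytes.
const unsigned char encoded_shuffle[18] = {
	0xFD, 0x0D,                   // SIMD prefix + v8x16.shuffle opcode
	0, 1, 2, 3,   16, 17, 18, 19, // source lanes 0 and 4
	4, 5, 6, 7,   20, 21, 22, 23, // source lanes 1 and 5
};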

@binji
Member

binji commented Jun 12, 2020

Agreed that these instructions are very large, but they're likely to be relatively rare in the module. I'd be curious to see how often they occur in real modules.
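
For a quick first-order estimate, something like the crude sketch below would do. It just scans raw bytes for the 0xFD 0x0D / 0xFD 0x0C prefix+opcode pairs, so anything that happens to contain those pairs inside immediates or data segments gets over-counted; a real count would decode the code section:

#include <stdio.h>

int main(int argc, char** argv) {
	if (argc != 2) {
		fprintf(stderr, "usage: %s module.wasm\n", argv[0]);
		return 1;
	}
	FILE* f = fopen(argv[1], "rb");
	if (!f) {
		perror("fopen");
		return 1;
	}
	unsigned long shuffles = 0, consts = 0;
	int prev = 0, c;
	while ((c = getc(f)) != EOF) {
		if (prev == 0xFD && c == 0x0D) shuffles++;  // v8x16.shuffle
		if (prev == 0xFD && c == 0x0C) consts++;    // v128.const
		prev = c;
	}
	fclose(f);
	printf("~%lu shuffles, ~%lu consts, ~%lu bytes of encoding\n",
	       shuffles, consts, (shuffles + consts) * 18);
	return 0;
}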

@nfrechette

Shuffles are very common in realtime 3D applications, from regular 3x3, 3x4, and 4x4 matrix math to quaternion math. Over a whole application they might not contribute all that much, but in a hot code path they could definitely add significant overhead. As long as the bytecode can be compiled into efficient machine code, the size should go down in the generated asm and not be a huge deal IMO.
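
For instance, a single 3-component cross product (a minimal sketch using the same intrinsics as the example above) already needs three of these 18-byte shuffles:

#include <wasm_simd128.h>

// cross(a, b) via the usual yzx-shuffle trick:
// yzx(a * yzx(b) - yzx(a) * b); each shuffle encodes to 18 bytes.
v128_t cross3(v128_t a, v128_t b) {
	const v128_t a_yzx = wasm_v32x4_shuffle(a, a, 1, 2, 0, 3);
	const v128_t b_yzx = wasm_v32x4_shuffle(b, b, 1, 2, 0, 3);
	const v128_t t = wasm_f32x4_sub(wasm_f32x4_mul(a, b_yzx),
	                                wasm_f32x4_mul(a_yzx, b));
	return wasm_v32x4_shuffle(t, t, 1, 2, 0, 3);
}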

@penzn
Contributor

penzn commented Jun 13, 2020

@binji just a 4x4 transpose ended up with 8 of them, and they would be present in most code that rearranges elements of an input, for example code dealing with matrices or bitmapped graphics. Free-form shuffles are relatively rare as actual real-world instructions, but the operations demonstrated in the example above are quite common in SIMD code.

Part of the problem is that we end up using v8x16.shuffle for things that won't be an actual shuffle instruction at runtime (and should not be, for performance reasons). A lot of the common algorithms won't use "true" native shuffles at all and would use more compact instructions instead, so the extra bits we use to encode v8x16.shuffle masks for those won't actually be used by the runtime (aside from detecting which variant it can emit). If we had at least some of the common operations (#199) covered, we wouldn't be using v8x16.shuffle as often.
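
To make that concrete (my sketch, not any particular engine's code): the interleaving shuffles in the transpose above lower to exact 16-byte lane masks that an engine has to recognize to emit a 3-byte unpcklps/unpckhps instead of a much longer generic shuffle sequence:

#include <string.h>

// wasm_v32x4_shuffle(a, b, 0, 4, 1, 5) lowers to this byte-lane mask,
// which is exactly unpcklps (0F 14 /r, 3 bytes of x86); miss the pattern
// and the fallback is a far longer shuffle/blend sequence.
static const unsigned char kUnpcklpsMask[16] = {
	0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23
};

static int is_unpcklps(const unsigned char lanes[16]) {
	return memcmp(lanes, kUnpcklpsMask, 16) == 0;
}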

@Maratyszcza thank you for the great illustration!

@tlively
Member

tlively commented Jun 13, 2020

Given all the related discussion we've had about shuffles, I would really want to see the code size impact on a whole application (even one constructed to heavily use shuffles) before reconsidering whether we should have more specific shuffle operations.

@zeux
Contributor

zeux commented Jun 13, 2020

Worth noting is that repeated shuffle masks should compress reasonably well. I've compiled both examples into a binary (extracting just the "machine" code) and used zstd to compress them (zstd over gzip since it has smaller headers; I'm unsure how to coerce gzip into emitting raw deflate without the cruft):

/mnt/c/work/shuffles $ zstd transpose-sse.o -9
transpose-sse.o      :118.31%   (    71 =>     84 bytes, transpose-sse.o.zst)
/mnt/c/work/shuffles $ zstd transpose-wasm.o -9
transpose-wasm.o     : 60.79%   (   227 =>    138 bytes, transpose-wasm.o.zst)

So yes, there's a large impact pre-compression, but it's less significant post-compression. On a larger program I feel like the efficiency of LZ matches is going to be even better: here the matches are likely mostly 4 bytes wide, and occasionally 8 bytes wide for the shuffle masks, whereas in a larger program I'd expect few unique shuffle masks and deduplication (through LZ) across different functions.
