This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Constant Pool Optimization on X64 and Related Benchmarks #498

Open
omnisip opened this issue Mar 20, 2021 · 4 comments

Comments

@omnisip

omnisip commented Mar 20, 2021

Hi everyone,

@ngzhian has encouraged me to share the discussion we're having about building an x64 constant pool and related optimizations in V8, since it might be helpful to other people implementing compilers. Specifically, we're putting together a design document showing how a constant pool would improve the performance of select operations, the first being shuffle. However, shuffle is unlikely to be the only operation that would benefit. There are probably a handful of other cases where a constant intermediate is reused more than once: for instance, calling extended multiply i32x4->i64x2 in a loop.

If you have specific benchmarks or code that might be well served by a constant pool and/or constant optimization, please post them here. I would love to include them, along with your input, in the design proposal.

@lemaitre

One function where I really needed constants is the fast reciprocal square root.

I basically use the "fast inverse square root" from Quake, but extended to double precision and with faster convergence.
This function requires 4 constants.

#include <immintrin.h>  // requires AVX2 + FMA (compile with e.g. -mavx2 -mfma)

__m256d rsqrt(__m256d x) {
  const __m256i vmagic = _mm256_set1_epi64x(0x5fe6eb50c7b537a9ull);
  const __m256d v35over16 = _mm256_set1_pd(35./16.);
  const __m256d v21over16 = _mm256_set1_pd(21./16.);
  const __m256d v5over16  = _mm256_set1_pd(5./16.);

  __m256d a, y;
  // Initial guess via the bit trick: y ~ 1/sqrt(x)
  y = _mm256_castsi256_pd(_mm256_sub_epi64(vmagic, _mm256_srli_epi64(_mm256_castpd_si256(x), 1)));

  // Householder's iterations (quartic convergence):
  // y *= 35/16 - a*(35/16 - a*(21/16 - a*5/16)), with a = x*y*y
  a = _mm256_mul_pd(_mm256_mul_pd(y, y), x);
  y = _mm256_mul_pd(_mm256_fnmadd_pd(_mm256_fnmadd_pd(_mm256_fnmadd_pd(a, v5over16, v21over16), a, v35over16), a, v35over16), y);
  a = _mm256_mul_pd(_mm256_mul_pd(y, y), x);
  y = _mm256_mul_pd(_mm256_fnmadd_pd(_mm256_fnmadd_pd(_mm256_fnmadd_pd(a, v5over16, v21over16), a, v35over16), a, v35over16), y);

  return y;
}

If you are interested in more details, you can have a look at my JSA paper.

@omnisip

omnisip commented Mar 23, 2021

@lemaitre This could be a really good use case. Do you have a pre-built benchmark we can use?

I'd love to see the wasm code generation and the native assembly generation.

@lemaitre

The benchmark I had at the time is very large, but this was part of a Cholesky factorization kernel on tiny matrices (like 3x3 or 5x5). In fact, to evaluate the performance of this code you don't even need actual data, since there are no branches and all the instructions used run in constant time.

If you are interested, I could generate a tiny benchmark for this code later. But for now, I'm a bit busy.

@omnisip

omnisip commented Mar 25, 2021

Possibly. The goal here is to find cases where there is a performance penalty for not being able to reuse constants. In my case, I have one where there is no efficient way to build the constant: the shuffle mask has to be manipulated and regenerated repeatedly. For something like extended multiply, it could be a loop multiplying by a constant value that has to be shuffled for architectural reasons.
