Skip to content

Performance patches and build fixes for Elbrus 2000 (e2k) architecture.

Notifications You must be signed in to change notification settings

ilyakurdyukov/e2k-ports

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

e2k-ports

Performance patches and build fixes for Elbrus (e2k) architecture.

This is my personal repository so that patches won't get lost.

Elbrus porting cheat sheet:

Elbrus 2000 (aka e2k) is a 64-bit little-endian architecture.
The compiler is mostly GCC compatible (defines __GNUC__), EDG frontend.

detection

  • shell: uname -m returns e2k
  • cmake: if({CMAKE_SYSTEM_PROCESSOR} STREQUAL "e2k")
  • C preprocessor: if defined(__e2k__)
  • compiler version: if __LCC__ = 125 and __LCC_MINOR__ = 9 then it's "LCC 1.25.09"
  • architecture version: defined in __iset__ (less than 3 is obsolete, 6 is the latest at the moment)

intrinsics

  • MMX, SSE2, SSSE3, SSE4.1* - native support
  • AVX, AVX2 - supported, but not recommended, uses too much CPU registers
  • SSE4.2 and _mm_dp_ps (from SSE4.1) - emulated, slow, do not use

The compiler enables MMX to AVX2 support by default, pass -mno-avx (-mno-sse4.2) if code depends on the presence of macros (e.g. #if defined(__AVX2__)).

builtins

  • __sync*, __atomic* - supported by the compiler
  • count leading/trailing zeros - supported (__builtin_clz, __builtin_ctz)
  • memory fence - supported (need to include x86intrin.h first)
    • __builtin_ia32_mfence, __builtin_ia32_lfence, __builtin_ia32_sfence

cpuid

Use compile time CPU detection, select the best SIMD up to SSE4.1.

rdtsc

#include <x86intrin.h>
uint64_t time = __rdtsc();
// same: unsigned aux; uint64_t time = __rdtscp(&aux);

useful pragmas

_Pragma("name") - to use from macros.

Use before the loop:

  • #pragma ivdep - ignore data dependencies inside the loop
  • #pragma unroll(n) - unroll cycle N times

restrict

Using the restrict keyword is good for performance, but note that it is ignored by the LCC if you're using vector load/store intrinsics such as _mm_load_si128(). For code with vector intrinsics use #pragma ivdep.

makecontext

Instead of makecontext(ctx, ...) use makecontext_e2k(ctx, ...), returns a negative integer on error. Allocates extra resources that need to be freed using freecontext_e2k(ctx).

nop

Use __asm__ __volatile__ ("nop") or _mm_pause() for a little delay.

clearing the instruction cache

The GNUC standard function __clear_cache(char *begin, char *end) works correctly since LCC 1.25.18, LCC 1.26.04. This function is available in previous versions, but does nothing.

inline

If it's crucial to performance, then use __attribute__((__always_inline__)) inline rather than just inline. Because when using large or complicated inline functions, the LCC compiler may decide not to inline them.

avoid if possible

The GNUC C extension Labels as Values is available in the LCC, but performance is worse than using a simple switch/case.

The GNUC Vector Extension is also available in LCC, but poorly implemented and its performance is very bad.

About

Performance patches and build fixes for Elbrus 2000 (e2k) architecture.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published