
Feature/CPU Detection for Apple M1 #40876

Closed
chriselrod opened this issue May 19, 2021 · 2 comments
Labels
system:apple silicon Affects Apple Silicon only (Darwin/ARM64) - e.g. M1 and other M-series chips

Comments

@chriselrod
Contributor

Originally posted here.

The Apple M1 supports ARMv8.4-A, but Julia/LLVM treats it like an A7/Cyclone CPU:

julia> versioninfo()
Julia Version 1.7.0-DEV.1107
Commit 5aca7a37be* (2021-05-15 16:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4

Cyclone is ARMv8-A, although the page on the A14 claims the Firestorm/Icestorm cores implement ARMv8.5-A.

As such, atomics are implemented using a load-linked/store-conditional loop:

julia> a = Threads.Atomic{Int}(1)
Base.Threads.Atomic{Int64}(1)

julia> @code_native Threads.atomic_add!(a, 2)
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:405 within `atomic_add!'
        mov     x8, x0
L4:
        ldaxr   x0, [x8]
        add     x9, x0, x1
        stlxr   w10, x9, [x8]
        cbnz    w10, L4
        ret
; └
julia> @code_native Threads.atomic_cas!(a, 5, 2)
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:373 within `atomic_cas!'
        mov     x8, x0
L4:
        ldaxr   x0, [x8]
        cmp     x0, x1
        b.ne    L28
        stlxr   w9, x2, [x8]
        cbnz    w9, L4
        ret
L28:
        clrex
        ret
; └

However, if I start Julia with -C'armv8.4-a':

julia> a = Threads.Atomic{Int}(1)
Base.Threads.Atomic{Int64}(1)

julia> @code_native Threads.atomic_add!(a, 2)
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:405 within `atomic_add!'
        ldaddal x1, x0, [x0]
        ret
; └
julia> @code_native Threads.atomic_cas!(a, 5, 2)
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:373 within `atomic_cas!'
        casal   x1, x2, [x0]
        mov     x0, x1
        ret
; └

Starting Julia without -C flags:

julia> using Octavian

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000087 seconds (2 allocations: 40.578 KiB)

julia> @benchmark matmul!($C0,$A,$B) # threaded matmul uses atomics
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.425 μs (0.00% GC)
  median time:      6.525 μs (0.00% GC)
  mean time:        6.530 μs (0.00% GC)
  maximum time:     14.592 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

With -C'armv8.4-a':

julia> using Octavian

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000100 seconds (2 allocations: 40.578 KiB)

julia> @benchmark matmul!($C0,$A,$B) # threaded matmul uses atomics
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.258 μs (0.00% GC)
  median time:      6.525 μs (0.00% GC)
  mean time:        6.532 μs (0.00% GC)
  maximum time:     13.475 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

I made non-x86 architectures (including the M1) ramp up thread use more slowly, because earlier performance tests suggested the M1 had higher threading overhead. Maybe that was partly because of atomics, and partly because of the lack of a shared L3 cache, and of course maybe for other reasons I don't know.

There is of course more than just atomics separating ARMv8.4-A/8.5-A from ARMv8-A.

@ViralBShah ViralBShah added the system:apple silicon Affects Apple Silicon only (Darwin/ARM64) - e.g. M1 and other M-series chips label May 19, 2021
@gbaraldi
Member

How hard would fixing this be? Would I be able to do it?
Go did it by hardcoding the CPU features: golang/go#42747

I imagine it's a matter of adding some hardcoded options here (a hedged sketch of what that might look like follows the quoted function below):

julia/src/processor_arm.cpp, lines 1184 to 1308 in a08a3ff:

static NOINLINE std::pair<uint32_t,FeatureList<feature_sz>> _get_host_cpu()
{
    FeatureList<feature_sz> features = {};
    // Here we assume that only the lower 32bit are used on aarch64
    // Change the cast here when that's not the case anymore (and when there's features in the
    // high bits that we want to detect).
    features[0] = (uint32_t)jl_getauxval(AT_HWCAP);
    features[1] = (uint32_t)jl_getauxval(AT_HWCAP2);
#ifdef _CPU_AARCH64_
    if (test_nbit(features, 31)) // HWCAP_PACG
        set_bit(features, Feature::pauth, true);
#endif
    auto cpuinfo = get_cpuinfo();
    auto arch = get_elf_arch();
#ifdef _CPU_ARM_
    if (arch.version >= 7) {
        if (arch.klass == 'M') {
            set_bit(features, Feature::mclass, true);
        }
        else if (arch.klass == 'R') {
            set_bit(features, Feature::rclass, true);
        }
        else if (arch.klass == 'A') {
            set_bit(features, Feature::aclass, true);
        }
    }
    switch (arch.version) {
    case 8:
        set_bit(features, Feature::v8, true);
        JL_FALLTHROUGH;
    case 7:
        set_bit(features, Feature::v7, true);
        break;
    default:
        break;
    }
#endif
    std::set<uint32_t> cpus;
    std::vector<std::pair<uint32_t,CPUID>> list;
    // Ideally the feature detection above should be enough.
    // However depending on the kernel version not all features are available
    // and it's also impossible to detect the ISA version which contains
    // some features not yet exposed by the kernel.
    // We therefore try to get a more complete feature list from the CPU name.
    // Since it is possible to pair cores that have different feature set
    // (Observed for exynos 9810 with exynos-m3 + cortex-a55) we'll compute
    // an intersection of the known features from each core.
    // If there's a core that we don't recognize, treat it as generic.
    bool extra_initialized = false;
    FeatureList<feature_sz> extra_features = {};
    for (auto info: cpuinfo) {
        auto name = (uint32_t)get_cpu_name(info);
        if (name == 0) {
            // no need to clear the feature set if it wasn't initialized
            if (extra_initialized)
                extra_features = FeatureList<feature_sz>{};
            extra_initialized = true;
            continue;
        }
        if (!check_cpu_arch_ver(name, arch))
            continue;
        if (cpus.insert(name).second) {
            if (extra_initialized) {
                extra_features = extra_features & find_cpu(name)->features;
            }
            else {
                extra_initialized = true;
                extra_features = find_cpu(name)->features;
            }
            list.emplace_back(name, info);
        }
    }
    features = features | extra_features;
    // Not all elements/pairs are valid
    static constexpr CPU v8order[] = {
        CPU::arm_cortex_a35,
        CPU::arm_cortex_a53,
        CPU::arm_cortex_a55,
        CPU::arm_cortex_a57,
        CPU::arm_cortex_a72,
        CPU::arm_cortex_a73,
        CPU::arm_cortex_a75,
        CPU::arm_cortex_a76,
        CPU::arm_neoverse_n1,
        CPU::arm_neoverse_n2,
        CPU::arm_neoverse_v1,
        CPU::nvidia_denver2,
        CPU::nvidia_carmel,
        CPU::samsung_exynos_m1,
        CPU::samsung_exynos_m2,
        CPU::samsung_exynos_m3,
        CPU::samsung_exynos_m4,
        CPU::samsung_exynos_m5,
    };
    shrink_big_little(list, v8order, sizeof(v8order) / sizeof(CPU));
#ifdef _CPU_ARM_
    // Not all elements/pairs are valid
    static constexpr CPU v7order[] = {
        CPU::arm_cortex_a5,
        CPU::arm_cortex_a7,
        CPU::arm_cortex_a8,
        CPU::arm_cortex_a9,
        CPU::arm_cortex_a12,
        CPU::arm_cortex_a15,
        CPU::arm_cortex_a17
    };
    shrink_big_little(list, v7order, sizeof(v7order) / sizeof(CPU));
#endif
    uint32_t cpu = 0;
    if (list.empty()) {
        cpu = (uint32_t)generic_for_arch(arch);
    }
    else {
        // This also covers `list.size() > 1` case which means there's a unknown combination
        // consists of CPU's we know. Unclear what else we could try so just randomly return
        // one...
        cpu = list[0].first;
    }
    // Ignore feature bits that we are not interested in.
    mask_features(feature_masks, &features[0]);
    return std::make_pair(cpu, features);
}
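
For illustration only, here is a minimal sketch of the "hardcode it" approach: on Darwin/AArch64 the Linux interfaces used above (getauxval, /proc/cpuinfo) don't exist, so host detection could simply return a fixed CPU name and feature list. The CPU::apple_m1 name and get_apple_m1_features() helper below are hypothetical placeholders, not identifiers from processor_arm.cpp.

#if defined(_OS_DARWIN_) && defined(_CPU_AARCH64_)
// Hypothetical sketch, not the actual Julia source: on Apple Silicon, skip the
// Linux-style detection entirely and return a hardcoded CPU + feature set,
// much like Go did in golang/go#42747.
static NOINLINE std::pair<uint32_t,FeatureList<feature_sz>> _get_host_cpu()
{
    // `CPU::apple_m1` and `get_apple_m1_features()` stand in for whatever
    // name/feature table the real implementation would define.
    FeatureList<feature_sz> features = get_apple_m1_features();
    mask_features(feature_masks, &features[0]);
    return std::make_pair((uint32_t)CPU::apple_m1, features);
}
#endif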

It might be possible to do it programmatically using sysctlbyname (developer.apple.com/documentation/kernel/1387446-sysctlbyname), but that would necessitate a refactor of the code, since I think it currently only expects the Linux code paths.
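
As a rough sketch of that sysctlbyname route (not Julia's actual detection code), a Darwin-only snippet like the following queries the CPU brand string and the LSE-atomics capability bit. The key names machdep.cpu.brand_string and hw.optional.arm.FEAT_LSE are what recent macOS versions appear to expose; treat them as assumptions to verify.

#include <sys/sysctl.h>
#include <cstdio>

int main()
{
    // Query the CPU brand string, e.g. "Apple M1".
    char brand[256];
    size_t len = sizeof(brand);
    if (sysctlbyname("machdep.cpu.brand_string", brand, &len, nullptr, 0) == 0)
        printf("CPU: %s\n", brand);

    // Query whether LSE atomics (ldaddal/casal, etc.) are available.
    int has_lse = 0;
    len = sizeof(has_lse);
    if (sysctlbyname("hw.optional.arm.FEAT_LSE", &has_lse, &len, nullptr, 0) == 0)
        printf("FEAT_LSE (ARMv8.1 atomics): %d\n", has_lse);
    return 0;
}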

@giordano
Contributor

giordano commented Apr 6, 2022

@chriselrod I presume this was fixed by #41924?

julia> versioninfo()
Julia Version 1.9.0-DEV.332
Commit 559244b383* (2022-04-06 16:01 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.4.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

julia> @code_native Threads.atomic_add!(Threads.Atomic{Int}(1), 2)
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	"_julia_atomic_add!_11581"      ; -- Begin function julia_atomic_add!_11581
	.p2align	2
"_julia_atomic_add!_11581":             ; @"julia_atomic_add!_11581"
; ┌ @ atomics.jl:405 within `atomic_add!`
	.cfi_startproc
; %bb.0:                                ; %top
	ldaddal	x1, x0, [x0]
	ret
	.cfi_endproc
; └
                                        ; -- End function
.subsections_via_symbols
