16 byte cmpxchg #524

Open
SchrodingerZhu opened this issue May 11, 2022 · 4 comments

@SchrodingerZhu
Contributor

I think there are two points to improve in the current cmpxchg support for 16-byte data structures:

  • consider Arm LSE
  • consider using __sync_bool_compare_and_swap to force GCC to emit inlined instructions.
#include <atomic>
#include <cstdint>
#include <cstring>

// Older compilers may not define __has_builtin; treat that as "no builtin".
#ifndef __has_builtin
#  define __has_builtin(x) 0
#endif

#if defined(__aarch64__) && defined(__clang__)
#  pragma clang attribute push(__attribute__((target("lse"))),apply_to=function)
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("clang attribute pop")
#elif defined(__aarch64__) && defined(__GNUC__)
#  pragma GCC push_options
#  pragma GCC target("arch=armv8-a+lse")
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("GCC pop_options")
#elif defined(__x86_64__) && defined(__clang__)
#  pragma clang attribute push(__attribute__((target("cx16"))),apply_to=function)
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("clang attribute pop")
#elif defined(__x86_64__) && defined(__GNUC__)
#  pragma GCC push_options
#  pragma GCC target("cx16")
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING _Pragma("GCC pop_options")
#else 
#  define PLATFORM_SPECIFIC_OPTIONS_ENDING
#endif

template<class T>
__attribute__((always_inline)) inline bool cas(std::atomic<T> &src,
                T const& __restrict cmp,
                T const& __restrict with)
{
    static_assert(sizeof(T) == sizeof(__int128), "cas requires a 16-byte type");
    // Copy the 16-byte payload via a builtin so the copy itself cannot turn
    // into an out-of-line memcpy call.
    auto inline_copy = [](__int128 * dst, const void * __restrict src) {
#if __has_builtin(__builtin_memcpy_inline)
    __builtin_memcpy_inline(dst, src, sizeof(__int128));
#elif __has_builtin(__builtin_memcpy)
    __builtin_memcpy(dst, src, sizeof(__int128));
#else
    ::memcpy(dst, src, sizeof(__int128));
#endif
    };
    __int128 cmp_value;
    __int128 with_value;
    inline_copy(&cmp_value, &cmp);
    inline_copy(&with_value, &with);
    // Unlike the 16-byte __atomic_* path, GCC expands __sync_bool_compare_and_swap
    // inline when the target ISA provides a 16-byte CAS (enabled by the pragmas above).
    return __sync_bool_compare_and_swap(reinterpret_cast<__int128 *>(&src), cmp_value, with_value);
}

struct A {
    int64_t a, b;
};

bool cas_test(std::atomic<__int128> &src,
                __int128 const& cmp,
                __int128 const& with)
{
    return cas(src, cmp, with);
}

bool cas_test(std::atomic<A> &src,
                A const& cmp,
                A const& with)
{
    return cas(src, cmp, with);
}

PLATFORM_SPECIFIC_OPTIONS_ENDING

// Baseline for comparison: plain std::atomic, compiled without the target
// pragmas above.
bool cas_test2(std::atomic<__int128> &src,
                __int128 & cmp,
                __int128 & with)
{
    return src.compare_exchange_weak(cmp, with);
}
@davidchisnall
Collaborator

I'm not sure why we'd need anything on Arm. Arm has load-exclusive/store-exclusive (LL/SC), so it doesn't need a second word for ABA.
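For reference, a minimal sketch (generic, not snmalloc's actual ABA type) of what the second word buys on a CAS-only architecture: pair the pointer with a version counter and update both with a single 16-byte CAS, so a pointer value that is freed and reused in between is still distinguished by its counter.

#include <atomic>
#include <cstdint>

struct Node { Node* next; };

// Pointer plus version counter, updated together with one 16-byte CAS.
struct alignas(16) VersionedPtr
{
    Node* ptr;
    std::uintptr_t version;
};

// Attempt to swing head from `expected` to `n`. The version bump means an
// interleaving that restores the same pointer value no longer compares equal,
// so the classic ABA case fails and the caller retries.
inline bool try_swing(std::atomic<VersionedPtr>& head, VersionedPtr& expected, Node* n)
{
    VersionedPtr desired{n, expected.version + 1};
    return head.compare_exchange_weak(expected, desired);
}

Whether that compare_exchange actually becomes a single instruction is the codegen question below.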

The only platforms where we have a problem have the following combination of things:

  • An architecture with a CAS instruction but not LL/SC, so no ABA-proof primitive (i.e. x86; this is not a problem on any other architecture).
  • A standard library that uses a compiler builtin for 128-bit CAS.
  • A compiler that doesn't expand that builtin to a single instruction.

This means that you need the combination of x86, libstdc++, and GCC to see the problem. GCC has an open bug report about it and so, presumably, will fix it eventually.
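A minimal reproducer of that combination (a sketch, not snmalloc code): built with g++ -O2 -mcx16 this currently lowers to a call to __atomic_compare_exchange_16 in libatomic, while clang++ -O2 -mcx16 inlines lock cmpxchg16b.

#include <atomic>

bool cas16(std::atomic<__int128>& a, __int128& expected, __int128 desired)
{
    // GCC + libstdc++ route this through the __atomic_* builtins, which GCC
    // currently emits as an out-of-line libatomic call rather than a single
    // instruction, even when cx16 is available.
    return a.compare_exchange_strong(expected, desired);
}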

That said, the ABA code isn't used on any hot paths in snmalloc, so the cost of a libcall is unlikely to be noticeable.

This combination of things (a bug in a single compiler, not affecting hot paths) means that I'm very hesitant to do something ad-hoc as a workaround for that compiler.

Looking at the code, we seem not to be doing LL/SC for non-x86 architectures. I thought @nwf added that a while ago?

@mjp41
Member

mjp41 commented May 11, 2022

I had a look at LL/SC on Arm, but couldn't work out how to do it. The ABA primitive is designed to be able to use it; I just didn't know a good way to actually achieve it without complex inline assembly across functions.
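For what it's worth, a single 128-bit LL/SC CAS is expressible in one extended-asm block (a relaxed-ordering sketch for AArch64, assuming GCC/Clang inline asm; LDAXP/STLXP would give acquire/release). The awkward part is exactly what is described above: snmalloc's ABA primitive wants the load and the conditional store split across separate calls, which a self-contained block like this does not provide.

#include <cstdint>

// 16-byte aligned pair; LDXP/STXP fault on unaligned 16-byte accesses.
struct alignas(16) Pair { std::uint64_t lo, hi; };

inline bool cas128_llsc(Pair* addr, Pair expected, Pair desired)
{
    std::uint64_t lo, hi;
    std::uint32_t fail;
    asm volatile(
        "0: ldxp  %0, %1, [%3]      \n" // load-exclusive the current pair
        "   cmp   %0, %4            \n"
        "   ccmp  %1, %5, #0, eq    \n" // both halves must match
        "   b.ne  1f                \n"
        "   stxp  %w2, %6, %7, [%3] \n" // store-exclusive the new pair
        "   cbnz  %w2, 0b           \n" // exclusive monitor lost: retry
        "1: clrex                   \n" // harmless on the success path
        : "=&r"(lo), "=&r"(hi), "=&r"(fail)
        : "r"(addr), "r"(expected.lo), "r"(expected.hi),
          "r"(desired.lo), "r"(desired.hi)
        : "cc", "memory");
    return lo == expected.lo && hi == expected.hi;
}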

The use of the ABA protection is only for allocating allocators, and potentially in the backend. So falling back to locks for ABA protection is actually fine in most cases.

@SchrodingerZhu
Contributor Author

SchrodingerZhu commented May 11, 2022

Hi,

I opened this issue because I found that LLVM and GCC behave quite differently on whether to inline atomic ops. This can add an extra link dependency on libatomic.so for Rust.

For example, on x86 we already require the target to be compiled with -mcx16, but GCC will still emit out-of-line atomic calls.
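One way to surface this at build time (a hypothetical guard, not something snmalloc does today) is to assert the property directly; with the GCC behaviour described above this fires even under -mcx16, while Clang passes. Requires C++17 for is_always_lock_free.

#include <atomic>

// Fail the build if 16-byte atomics would be lowered to out-of-line libatomic
// calls instead of a single inlined CAS.
static_assert(std::atomic<__int128>::is_always_lock_free,
              "16-byte atomics are not lock-free on this compiler/target; "
              "expect outline calls and a libatomic dependency");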

@SchrodingerZhu
Contributor Author

For double-width LL/SC, maybe https://github.com/taiki-e/portable-atomic/blob/main/src/imp/aarch64.rs could be used as a reference?
