Ibraheem Ahmed

There are a handful of concurrent algorithms, most famously solutions to the ABA problem, that require 128-bit atomics. Unfortunately, the story around 128-bit atomics hasn't been great, but recently both compilers and CPU manufacturers have been making strides toward full-fledged support.
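To make the ABA connection concrete, here is a minimal sketch of the usual trick: pair a 64-bit pointer with a 64-bit version counter so that the two fit in a single 128-bit word. The function names (pack, unpack, replace_ptr) are made up for illustration, and the pointer is modelled as a plain u64 rather than a real address.

```rust
/// Pack a 64-bit pointer and a 64-bit version counter into one 128-bit word.
/// Layout (an assumption of this sketch): version in the high half, pointer in the low half.
pub fn pack(ptr: u64, version: u64) -> u128 {
    ((version as u128) << 64) | (ptr as u128)
}

/// Split a 128-bit word back into (pointer, version).
pub fn unpack(word: u128) -> (u64, u64) {
    (word as u64, (word >> 64) as u64)
}

/// Build the word a writer would try to install: the new pointer plus a bumped version.
pub fn replace_ptr(word: u128, new_ptr: u64) -> u128 {
    let (_, version) = unpack(word);
    pack(new_ptr, version.wrapping_add(1))
}
```

If the pointer goes from A to B and back to A, the version has moved on, so a 128-bit compare-and-swap against the stale word fails instead of silently succeeding.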

x86

cmpxchg16b serves as the double-word compare-and-swap instruction on x86, and exists on just about every Intel and AMD processor excluding a few very early 64-bit models. The problem is that for the longest time cmpxchg16b was the only 128-bit atomic instruction, meaning that even trivial atomic loads and stores required a lock-prefixed instruction taking exclusive access to the relevant cache line, which is significantly more expensive than the plain mov that regular atomics use. This meant that while 128-bit atomics were possible, they weren't very practical.
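As a rough illustration of what "a load through cmpxchg16b" looks like, here is a sketch using the cmpxchg16b intrinsic from core::arch::x86_64. The Cell128 type and load_cmpxchg16b helper are hypothetical names for this example, and the cell is deliberately not Sync, so this only demonstrates the instruction, not a complete atomic type.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::Ordering;

/// Hypothetical 16-byte aligned cell; cmpxchg16b requires a 16-byte aligned operand.
#[repr(C, align(16))]
pub struct Cell128(UnsafeCell<u128>);

impl Cell128 {
    pub const fn new(value: u128) -> Self {
        Cell128(UnsafeCell::new(value))
    }

    /// Load the value through cmpxchg16b.
    /// Returns None if the instruction is unavailable (or this is not x86_64).
    pub fn load(&self) -> Option<u128> {
        #[cfg(target_arch = "x86_64")]
        {
            if std::arch::is_x86_feature_detected!("cmpxchg16b") {
                // SAFETY: the feature was detected at runtime, and the pointer is
                // valid, writable, and 16-byte aligned.
                return Some(unsafe { load_cmpxchg16b(self.0.get()) });
            }
        }
        None
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "cmpxchg16b")]
unsafe fn load_cmpxchg16b(dst: *mut u128) -> u128 {
    // Compare-and-swap 0 with 0: if the value is 0 it is rewritten as 0,
    // otherwise the comparison fails. Either way the current value is returned.
    unsafe { core::arch::x86_64::cmpxchg16b(dst, 0, 0, Ordering::SeqCst, Ordering::SeqCst) }
}
```

Note that even this "read" is a locked read-modify-write, so it needs writable memory and exclusive ownership of the cache line, which is exactly the cost described above.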

A few months ago, however, both AMD and Intel updated their specs to guarantee the atomicity of aligned 16-byte SSE loads and stores, such as movdqa, on all processors that support AVX. Although AVX isn't supported on lower-end models like Intel's Pentium line, this change makes using 128-bit atomics much more feasible.
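In practice this means picking a load/store strategy based on what the CPU reports. A minimal sketch of that decision using the standard library's runtime feature detection might look like the following; the LoadStrategy enum and pick_load_strategy function are hypothetical names, and a real implementation would cache the result rather than re-detect on every call.

```rust
/// How a 128-bit atomic load could be performed on the current CPU.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoadStrategy {
    /// Plain aligned 16-byte vector load (e.g. movdqa), atomic on AVX-capable CPUs.
    VectorMove,
    /// Emulate the load with a locked cmpxchg16b.
    Cmpxchg16b,
    /// No native 128-bit atomics available: fall back to a lock.
    Fallback,
}

pub fn pick_load_strategy() -> LoadStrategy {
    #[cfg(target_arch = "x86_64")]
    {
        // AVX support implies the aligned 16-byte load/store atomicity guarantee.
        if std::arch::is_x86_feature_detected!("avx") {
            return LoadStrategy::VectorMove;
        }
        if std::arch::is_x86_feature_detected!("cmpxchg16b") {
            return LoadStrategy::Cmpxchg16b;
        }
    }
    LoadStrategy::Fallback
}
```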

ARM

As usual, the story on ARM is a lot better, with double-word ll/sc (ldxp/stxp) being included in aarch64. As part of the Large System Extensions (LSE), ARM v8.1 added a dedicated casp instruction for double-word compare-and-swap, and ARM v8.4 (LSE2) later made aligned 16-byte loads and stores with ldp/stp atomic. More recently, ARM v9.4-a (LSE128) added a slew of double-word RMW instructions, including ldclrp, ldsetp, and swpp.
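The same kind of runtime check works on aarch64. Here is a small sketch that maps the detected features onto the tiers described above; the Aarch64Tier enum and detect_tier function are hypothetical names, and the newest double-word RMW instructions are left out of the sketch.

```rust
/// Which double-word atomic instructions this sketch would pick on aarch64.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Aarch64Tier {
    /// Baseline aarch64: ldxp/stxp loops.
    LlSc,
    /// LSE: casp for compare-and-swap.
    Casp,
    /// LSE2: aligned ldp/stp are atomic, so plain loads and stores suffice.
    AtomicLdpStp,
    /// Not running on aarch64 at all.
    NotAarch64,
}

pub fn detect_tier() -> Aarch64Tier {
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("lse2") {
            return Aarch64Tier::AtomicLdpStp;
        }
        if std::arch::is_aarch64_feature_detected!("lse") {
            return Aarch64Tier::Casp;
        }
    }
    if cfg!(target_arch = "aarch64") {
        Aarch64Tier::LlSc
    } else {
        Aarch64Tier::NotAarch64
    }
}
```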

Library Support

C++ has easy access to everything mentioned above because of the runtime feature detection provided by libatomic, which enables std::atomic<__int128> to, for the most part, just work. Recent versions of GCC and LLVM should take advantage of all the optimizations mentioned.

Rust, on the other hand, can't even expose AtomicU128 because the standard library atomics are explicitly lock-free, and thus can't depend on libatomic's runtime detection. This makes it impossible to expose on x86 because of those older processors without cmpxchg16b support, which can't be detected at compile-time. What you can do, though, is use the AtomicU128 type from the portable-atomic crate, which implements all the optimizations mentioned above by hand with optional runtime feature detection. Shoutout to @taiki-e for all the work they put into implementing this.