-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always use cmpxchg16b on x64 for atomic_ref<16 bytes>
#4751
Conversation
…e for 64-bit Win8.1 / Server 2012 R2. https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-isprocessorfeaturepresent > PF_COMPARE_EXCHANGE128 > The atomic compare and exchange 128-bit operation (cmpxchg16b) is available. https://support.microsoft.com/en-us/windows/system-requirements-2f327e5a-2bae-4011-8848-58180a4353a7 > To install a 64-bit OS on a 64-bit PC, your processor needs to support CMPXCHG16b, PrefetchW, and LAHF/SAHF As mentioned by https://en.wikipedia.org/wiki/Windows_8 and confirmed by https://learn.microsoft.com/en-us/lifecycle/products/windows-8 https://learn.microsoft.com/en-us/lifecycle/products/windows-embedded-8-standard support for Win8.0 client ended in Jan 2016, with the exception of Windows Embedded 8 Standard which ended in Jul 2023. The one we have to worry about is Server 2012 classic, which is receiving extended security updates through Oct 2026: https://learn.microsoft.com/en-us/lifecycle/products/windows-server-2012 But only C++20 `atomic_ref<16 bytes>` is physically affected, and only on potato chips, and only for x64. (32-bit servers would be unaffected.) The potential impact is nil. Server 2012 classic was released in Sep 2012 and was the latest version for only one year, before Server 2012 R2 was released in Oct 2013. cmpxchg16b support appeared even earlier: in Intel Nehalem released in Nov 2008, and AMD Bulldozer released in Oct 2011.
… should `abort` for 32-bit. We can do this because they were declared in `<atomic>` for 64-bit only; there were never any 32-bit callers. (They have to remain dllexported for 32-bit, but that's all.) Include `<cstdlib>` for `abort` (which we were already calling in this file). We can drop `__std_atomic_compare_exchange_128_fallback` without ABI impact because it was defined in an unnamed namespace.
Drop the declaration, mark the definition as preserved for bincompat. It's still dllexported by `stl/src/msvcp_atomic_wait.src`. Directly use the `_InterlockedCompareExchange128` intrinsic.
Now that `__std_atomic_has_cmpxchg16b()` always returns `true`, this collapses into the `_ATOMIC_HAS_DCAS` case.
…in "current ABI" `atomic<T>::is_lock_free()`.
We don't need to replicate the signature, nor does `constexpr bool _Result` improve clarity.
…lways_lock_free`). `atomic<T>::is_lock_free()` collapsed into the `_ATOMIC_HAS_DCAS` case, so this does too.
`__std_atomic_has_cmpxchg16b()` always returns `true` now.
For `!_ATOMIC_HAS_DCAS`: `is_always_lock_free` (power of two, at most 1 pointer) is a strict subset of `_Is_potentially_lock_free` (power of two, at most 2 pointers).
This describes what `atomic_ref::is_lock_free()` now returns.
Again, this all collapses into the `_ATOMIC_HAS_DCAS` case.
Drop the declaration, mark the definition as preserved for bincompat. It's still dllexported by `stl/src/msvcp_atomic_wait.src`.
It's now unused by product code, so we can drop its definition and consistency checks. Then, `P0019R8_atomic_ref` can assume a single mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the original author of all these polyfills,🔥 them, 🔥 them to the ground
🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fixes #4481.
📜 Overview
The cmpxchg16b instruction (aka the
_InterlockedCompareExchange128
intrinsic) was missing from the earliest x64 processors, but was quickly added and is now required by all modern OSes. This intrinsic (emitting an equivalent instruction sequence) has always been supported by ARM64.Our implementation of C++11
atomic
didn't attempt to take advantage of the then-fairly-new cmpxchg16b instruction. By the time our ABI was frozen in VS 2015, cmpxchg16b was more widespread, but we were still learning about concurrency and hadn't realized the potential for optimization here. Shortly after the ABI freeze, we added some (highly unusual for our conventions) preprocessor-disabled logic, indicating how we could take advantage of cmpxchg16b when we could break ABI, which still hasn't happened yet. So, while this disabled logic appears inatomic
, it currently never uses_InterlockedCompareExchange128
(for either x64 or ARM64), andatomic<16 bytes>
is always locking.Adding to the complexity, our codebase referred to "DCAS", Double Compare-And-Swap, i.e. double the size of a pointer. For 32-bit architectures, that's a 64-bit compare-and-swap, which is always available. So the codebase said that DCAS was always available on x86, ARM32, and ARM64, but only sometimes available on x64.
When we merged #843 implementing C++20
atomic_ref
on 2020-08-12 for VS 2019 16.8, cmpxchg16b was even more widespread. I now believe that we should have unconditionally used cmpxchg16b, giving the feature a processor requirement (similar to howshared_mutex
and C++20<chrono>
had OS requirements). At the time, it seemed tempting to implement fallback logic. So, on 64-bit systems,atomic_ref<16 bytes>
doesn't store any space for a lock, but if at runtime the STL discovers that it's running on x64 without cmpxchg16b, it uses a global lock.All of this machinery makes our sources very complicated, and it also degrades
atomic_ref<16 bytes>
performance on 64-bit systems. I believe it's finally time to rip out this machinery, simplifying our codebase and improving performance. I've talked to @MahmoudGSaleh and he agrees.🎯 Impact
I carefully performed these changes step-by-step, although they're easier to review as a whole. These changes preserve bincompat, in the sense that old and new object files can be freely mixed. (We always require that the final link be performed by the newest toolset and that the VCRedist is also the newest version, but this PR isn't exercising those requirements.)
As previously mentioned, while
atomic
's preprocessor-disabled logic is being cleaned up,atomic
is physically unaffected by this PR.atomic<16 bytes>
is always locking (with a per-object lock), and will remain so for the stable ABI era.On 64-bit systems, C++20
atomic_ref<16 bytes>
now uses_InterlockedCompareExchange128
header-only. This improves performance by eliminating a separately compiled function call (for both x64 and ARM64), and by eliminating a cached runtime support check (for x64).The old dllexports are preserved for bincompat, but they now unconditionally use cmpxchg16b. (This affects the VCRedist, which is unlocked for VS 2022 17.12, but we don't require close coordination between the header and DLL changes.)
Together, these changes theoretically break one scenario, but it is so vanishingly improbable that I don't believe we need to worry about it now (and Mahmoud agreed). Because this is using a processor instruction, not an OS API, the code has to actually be executed on an old processor to cause problems - merely loading the STL DLL won't be blocked.
🥔 New Code + Ancient OS + Potato Chip = Won't Happen
Win8.1 / Server 2012 R2 require cmpxchg16b on x64.
Win7 / Server 2008 R2 are no longer supported by the STL after #4742.
Win8.0 client is no longer supported by Microsoft. As Wikipedia mentions, support for Win8.0 ended in Jan 2016, with the exception of Windows Embedded 8 Standard which ended in Jul 2023.
That leaves Server 2012 classic, which is receiving (paid) extended security updates until Oct 2026.
Consider how many stars would have to align for this PR to be a problem:
/std:c++20
(not a common option yet),atomic_ref
(not a commonly used feature, even in C++20 codebases),atomic_ref<16 bytes>
specifically (no other size is affected).In this highly specific case, this PR will attempt to unconditionally use the cmpxchg16b instruction, which will fail.
I believe that this scenario is sufficiently implausible that we don't need to worry about it (i.e. I don't believe anyone is actually doing this), even with the caveat that deploying an updated VCRedist would be sufficient to cause it, because it would still require C++20
atomic_ref<16 bytes>
-using code to have been deployed to the server. (And remember, such code would have gotten terrible performance from the fallback due to the global mutex.)As previously mentioned, because this PR is using a new instruction and not a new OS API, its impact is limited to this specific scenario only. In general, Server 2012 classic will still be supported by the STL, for the 2 years that this OS has remaining.
⚙️ Source Changes
My individual commits have detailed changelogs, but as previously mentioned it's simpler to review the diff as a whole, and look for these notable things:
<yvals_core.h>
_STL_WIN32_WINNT_WINBLUE
(i.e. Win8.1) will be unused after these changes.<yvals.h>
_STD_ATOMIC_ALWAYS_USE_CMPXCHG16B
is going away - effectively, it will be1
iff we're 64-bit._STL_WIN32_WINNT_WINBLUE
.<atomic>
_STD_COMPARE_EXCHANGE_128
is going away - we're going to use_InterlockedCompareExchange128
directly in the header.__std_atomic_compare_exchange_128()
and__std_atomic_has_cmpxchg16b()
are being preserved for bincompat, so we don't need to declare them anymore.stl/src/msvcp_atomic_wait.src
.#ifdef _WIN64
, so 32-bit code never saw these declarations. This will be relevant below._STD_ATOMIC_ALWAYS_USE_CMPXCHG16B == 0
, so ARM64 never saw these declarations either._ATOMIC_HAS_DCAS
is going away - effectively, it will always be1
.atomic::is_always_lock_free
triggers forsizeof(void *)
or2 * sizeof(void *)
" was inaccurate - it affectedatomic
only in the preprocessor-disabled "break ABI" codepaths. It actively affectedatomic_ref
, though.-mcx16
compiler option to use the_InterlockedCompareExchange128
intrinsic. Fortunately, there's a way to make this work without requiring users to alter their command lines. We can apply thetarget
attribute to functions, making them effectively compiled with-mcx16
. We could manually mark each usage this way, but there's an even more convenient technique - Clang has a push/pop#pragma clang attribute
that will apply the attribute to all functions in a region. For hygiene, I'm inventing a_STD_ATOMIC_HEADER
"namespace" for this.atomic
are being changed, with DCAS always being used.is_lock_free()
report the same answer asis_always_lock_free
(which varies above), so this all collapses away. Again, this leaves the current ABI behavior unaffected (it'strue
for exactly 1, 2, 4, and 8).atomic_ref
, "potentially lock-free" becomes "always lock-free", greatly simplifying this code.#undef
s for removed macros.atomic_wait.cpp
abort()
in this file, but let's be good kitties and include<cstdlib>
for it.__std_atomic_compare_exchange_128()
and__std_atomic_has_cmpxchg16b()
as preserved for bincompat, and simplify their implementations.abort()
._M_ARM64
. However, I don't understand ARM64EC well enough to be absolutely confident that we could further restrict the implementations here to_M_X64
, and there would be little value in doing so.P0019R8_atomic_ref/env.lst
mutantsmodes.P0019R8_atomic_ref/test.cpp
atomic_ref<two_pointers_t>
is always lock-free now.