Our concurrency memory model (inherited from C++20) is incompatible with x86 #548

RalfJung · 2024-12-09T08:53:06Z

See https://cplusplus.github.io/LWG/lwg-active.html#3941 for details. Thanks @orilahav for pointing this out to me.

chorman0773 · 2024-12-09T11:08:29Z

Having read the issue, I'm not sure that it's incompatible. It's just that it's universally miscompiled. The issue is that compiling it properly will make it a stupid amount slower (and I'm not sure C++ users will appreciate the perf regresssiom).

Specifically, putting an mfence after each mov for a SeqCst load would be sufficient, as would using lock cmpxchg/cmpxchg8b on targets without SSE. Though this means that non-sse can't compile any SeqCst loads in a manner that is compatible with accessing read-only memory.

taiki-e · 2024-12-09T12:15:36Z

Though this means that non-sse can't compile any SeqCst loads in a manner that is compatible with accessing read-only memory.

This problem should not occur as LLVM can emit a lock or dword ptr [esp], 0, as it already does for SeqCst fences and 64-bit SeqCst stores in environments without SSE: https://godbolt.org/z/qPYqx9nda

RalfJung · 2024-12-09T12:39:42Z

Well, it is incompatible with the intended lowering. Changing the lowering is the wrong fix for this.

chorman0773 · 2024-12-09T15:39:39Z

Well without changing our memory model or bugging WG-21, I'm not sure what the "right fix" would be.
It's entirely possible that WG-21 considered this sort of change (Not the precise one or gcc/clang/msvc would all know) and accepted the memory model with that in mind, in which case, at least from the C++ side, changing the lowering would be theright fix.

RalfJung · 2024-12-09T15:40:35Z

I like this comment from the report

[ Meta-observation from a small off-line discussion that may be relevant to how we phrase the resolution here: The fact that nobody noticed this for a very long time, and implementers were not bothered by it, suggests that the audience for this part of the standard is nearly empty. We conjecture that implementers actually rely on atomics mappings generated by memory model experts, who are more interested in formal models than standardese. A more formal description is likely to increase the size of the audience, and would definitely ease verification and reduce the probability of mistakes like this.]

I definitely agree that just defining the relevant orders and consistency axioms would be much better than trying to circumscribe them in standardese. Even if the definition is carried out in English, as long as its translation into formal syntax is clear, that seems much better than the current approach the C++ standard is taking.

It's entirely possible that WG-21 considered this sort of change (Not the precise one or gcc/clang/msvc would all know) and accepted the memory model with that in mind, in which case, at least from the C++ side, changing the lowering would be theright fix.

It's very clear from the document I linked above that this is a mistake they made when translating the formal notation from the paper into plain English:

The problem here is an error during the attempt to simplify and translate plv.mpi-sws.org/scfix/paper.pdf to standardese. There is no known issue with the underlying paper.

arielb1 · 2024-12-17T17:57:17Z

It's very clear from the document I linked above that this is a mistake they made when translating the formal notation from the paper into plain English:

Do they know what the error is?

RalfJung · 2024-12-17T17:58:51Z

Yes, please read the document linked in the OP.

arielb1 · 2024-12-17T18:39:37Z

The OP doesn't seem to have a "Proposed resolution"

RalfJung · 2024-12-17T18:48:22Z

No, but it describes the error.

Currently, we require that the sequential consistency order S is consistent with the extended coherence order, which includes "reads before" and "reads from". This may require sc reads to be consistently ordered even if they see non-sc writes. This is not enforced by some "fence-after-store"/"trailing fence" sc implementations, notably on X86. It may also be violated by compiler transformations that eliminate a sequentially consistent load after a store into the same location, e.g. on sequentially consistent hardware.

The proposed resolution is to follow exactly what the paper suggests (minus "(po \cup rf) acyclic").

arielb1 · 2024-12-17T18:57:23Z

So "coherence-ordered" is supposed to be the same as the paper's "psc", but it's just written in standartese that has a completely different meaning?

RalfJung · 2024-12-17T19:01:14Z

I don't know what the intended correspondence between paper and standard is, sorry. Maybe @orilahav can help with that.

arielb1 · 2024-12-17T19:36:12Z

Reading the paper I completely understand why the person that translated the paper didn't like the "seqcst-order is inconsistent with happens before" concept.

RalfJung · 2024-12-17T19:43:51Z

The consequences of memory model choices can be quite counter-intuitive. That's why we need formalizations of these models. Clearly we want to be able to compile atomics to x86 in the obvious way, and therefore we must accept this concept even if we don't like it.

arielb1 · 2024-12-17T19:55:00Z

I'm actually not sure the "add barriers solution" would be that bad to Rust, which doesn't use SeqCst reads much. Tho "change the model" is probably better.

RalfJung · 2024-12-17T19:56:44Z

I don't think we should make SC reads in Rust surprisingly more expensive than in C++. Also we have to consider compatibility with C++ whenever we share memory.

arielb1 · 2024-12-17T22:03:24Z

If @orilahav is reading this, I can't figure out why your paper allows for removing dead (non-sc) reads.

If you have this Figure-4-style example:

k: a := x_acq // 2
l: b := y_sc  // 0

m: x :=_sc 1
z: unused := unrelated_rlx
n: x :=_sc 2

o: y :=_sc 1
p: c := x_sc   // 0

This should be compilable into the same as this, which is our target execution of WWmerge

k: a := x_acq // 2
l: b := y_sc  // 0

n: x :=_sc 2

o: y :=_sc 1
p: c := x_sc   // 0

But:

1. n sw k // (load-store pair)
2. m sb_!loc z // program order
3. z sb_!loc n // program order
4. k sb_!loc l // program order
---
5. z hb k (due to 3, 1)
6. m scb l // (due to 2, 5, 6)
7. m psc l // due 6

And these are the same as fig 4
8. l psc o [due to rb for pair of seq cst]
9. o psc p [due to sb]
10. p psc m [due to rb for pair of seq cst]

Which leads to a contradiction.

orilahav · 2024-12-18T11:41:13Z

Thank you for pointing this out!

I don't recall considering "dead load elimination" back in 2017 when we worked on the paper. I've thought about it now and arrived at some interesting conclusions, summarized in two parts below. I'm glad you looked into this!

PART A

I think that "dead load elimination" is unsound in any of the alternatives described in the paper, including the original C11 model.

My counterexample is this one:

y:=1 rlx
Fence sc
a:=x rlx
Fence acq
b:=z rlx // 0
||
z:=1 rlx
x:=1 sc
c:=y sc // 0

This behavior is disallowed.
If we remove a:=x rlx, it becomes allowed.

Is this something compilers do? If so, we need to report a cool new problem!
(A first attempt at a solution can be another hack in the psc relation using the fact that sb;[SC];sb must be mapped to something with a fence in the middle.)

@RalfJung This could be another example to test in Miri, make sure the example above is indeed forbidden.

PART B

@arielb1's example, however, shows a serious drawback. Since unused is only mentioned in one thread, it can be a non-atomic access, and I think one should support such optimization. This must be something that compiler actually do, right?

In fact, this reveals a mistake in our PLDI'17 paper! Although we did not study "dead load elimination", we claimed that the so-called "register promotion" (where a variable used solely in one thread is promoted to a "register") is sound. Upon revisiting the proof, the issue lies in "Lemma I.12" in the lengthy appendix. Unsurprisingly, I simply wrote that it "easily follows from our definitions". While the paper was never fully mechanized in a proof assistant, it has managed to survive more than seven years without anyone identifying a major mistake. Well done, Ariel! :D

The implications of this mistake seem actually less significant. The only reason we introduced the sb|!=loc hack was to support the elimination of repeated SC accesses (e.g., x:=1 sc; x:=2 sc ~> x:=2 sc or a:=x sc; b:=x sc ~> a:=x sc; b:=a), a transformation that, to the best of my knowledge, no compiler performs. The C/C++ committee wasn't fond of this hack and the standard uses sb rather than sb|!=loc in its constraints.

With sb instead of sb|!=loc, I believe that "Lemma I.12" holds and "register promotion" is sound. Additionally, I think "dead non-atomic load elimination" is sound under this model as well. That said, I'd like to see a full formal proof to gain more confidence. When I get the time I will try to write a quick addendum to the paper and prove this claim.

It seems to me that all this complexity arises from (super uncommon?) programs that mix SC and non-SC accesses to the same location. I believe things would be much more straightforward if such mixing weren't allowed: either a location is declared SC, meaning all its accesses are SC, or it is declared a standard atomic, allowing it to use rel/acq/rlx modes exclusively.

arielb1 · 2024-12-18T13:23:04Z

@arielb1's example, however, shows a serious drawback. Since unused is only mentioned in one thread, it can be a non-atomic access, and I think one should support such optimization. This must be something that compiler actually do, right?

I did a quick check on Godbolt, and it seems that current compilers don't remove dead relaxed-atomic loads in the most obvious case, but they do omit even seq-cst loads when they refer to "imaginary address" stack objects 0 (which can't cause this sort of problem since imaginary address objects can't be shared between threads).

Compilers however certainly remove non-atomic loads and that is very important to performance, so I agree that just saying that sc stores are not mergable and relaxed loads are not deletable is probably the smartest fix.

EDIT: the document at https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0062r1.html says that "For example, removing a (non-volatile) atomic load whose value is not used seems uncontroversial". Still didn't find a place where a compiler does it.

It seems to me that all this complexity arises from (super uncommon?) programs that mix SC and non-SC accesses to the same location. I believe things would be much more straightforward if such mixing weren't allowed: either a location is declared SC, meaning all its accesses are SC, or it is declared a standard atomic, allowing it to use rel/acq/rlx modes exclusively.

Makes sense, except this would be very annoying from an API surface perspective.

https://bugs.llvm.org/show_bug.cgi?id=37716#c6 has an example for why removing raced-on dead relaxed loads is unsound even without seqcst.

orilahav · 2024-12-18T14:26:25Z

https://bugs.llvm.org/show_bug.cgi?id=37716#c6 has an example for why removing raced-on dead relaxed loads is unsound even without seqcst.

Indeed, I saw it before and forgot about it. Thanks for recalling that example!
Here it is to save time for others (and future me):

z:=1 rlx
x:=1 rel
||
a:=x rlx // 1
y:=1 rel
||
b:=y acq // 1
_:=x rlx
Facq
c:=z rlx // 0

disallowed, but allowed if we remove _:=x rlx

This makes me wonder about an old idea about fixing the sw relation to have hb after release fences and before acquire fences, rather than sb. I think it makes some sense and I need to recall if there was something wrong with it. Maybe the mapping schemes to hardware... (At least in this case, it will forbid the behavior above also without _:=x rlx.)

arielb1 · 2024-12-18T14:30:07Z

(At least in this case, it will forbid the behavior above also with _:=x rlx.)

Isn't the behavior forbidden today with _ :=rlx x? [ED: fixed]

arielb1 · 2024-12-18T14:34:03Z

This makes me wonder about an old idea about fixing the sw relation to have hb after release fences and before acquire fences, rather than sb. I think it makes some sense and I need to recall if there was something wrong with it. Maybe the mapping schemes to hardware... (At least in this case, it will forbid the behavior above also without _:=x rlx.)

I mean, if it works it could be nice to have a model that allows for removing dead relaxed loads.

RalfJung · 2024-12-18T14:36:41Z

This could be another example to test in Miri, make sure the example above is indeed forbidden.

Is this meaningfully different from the examples we already have? I'm getting a bot worried about the execution time of our test suite, these consistency tests are quite slow since they run many times to ensure we never see that behavior. ;)

orilahav · 2024-12-18T14:40:01Z

I mean, if it works it could be nice to have a model that allows for removing dead relaxed loads.

will investigate this further. It makes the definition of hb more recursive, but why not...

Makes sense, except this would be very annoying from an API surface perspective.

I can imagine a language with three modes: non-atomics, strong atomics (with SC semantics), weak atomics (rel/acq/rlx). Non-experts use only non-atomics and strong atomics. Is this much worse than the current situation?

arielb1 · 2024-12-18T14:41:09Z

I can imagine a language with three modes: non-atomics, strong atomics (with SC semantics), weak atomics (rel/acq/rlx). Non-experts use only non-atomics and strong atomics. Is this much worse than the current situation?

The thing is that in Rust the default people use is weak atomics. Strong atomics (esp. SeqCst writes) are very rare.

RalfJung · 2024-12-18T14:44:27Z

I can imagine a language with three modes: non-atomics, strong atomics (with SC semantics), weak atomics (rel/acq/rlx). Non-experts use only non-atomics and strong atomics. Is this much worse than the current situation?

I generally recommend people should use rel/acq and not SC unless they have to. But my attempt at making it easier to program that way in Rust got rejected so 🤷

It's certainly not trivial to migrate Rust to a new set of atomic APIs, given that the current ones exist and will forever have to exist due to our backwards compatibility guarantees.

Regarding the example, I implemented it in Miri and it does feel different. Also Miri behaves correctly. :) However, does it really need 3 variables? I didn't really understand what happens here. Having those examples without comments explaining why the bad execution is bad is unfortunately not very useful to me...

orilahav · 2024-12-18T14:51:00Z

Strong atomics (esp. SeqCst writes) are very rare.

I agree. For that reason we stopped working on SC accesses in other projects (promising semantics, program logics...). But, if one does want to have them in the language, better find a good way to forbid mixing.

Regarding the example, I implemented it in Miri and it does feel different. Also Miri behaves correctly. :) However, does it really need 3 variables? I didn't really understand what happens here. Having those examples without comments explaining why the bad execution is bad is unfortunately not very useful to me...

I think it does need three variables. I will try to add more text later on :)

RalfJung · 2024-12-18T15:00:57Z

I'm pretty sure there's a lot of Rust code out there that uses SC. And obviously we can't remove them from std. So sadly work that doesn't cover SC accesses doesn't apply to Rust. :(

arielb1 · 2024-12-18T15:14:58Z

I'm pretty sure there's a lot of Rust code out there that uses SC. And obviously we can't remove them from std. So sadly work that doesn't cover SC accesses doesn't apply to Rust. :(

And what often enough happens is that there's an atomic that is used with weak access in the common case, but in some rare case that is not worth optimizing people get lazy and put SeqCst on all accesses, which is how you get the access modes mixed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Our concurrency memory model (inherited from C++20) is incompatible with x86 #548

Our concurrency memory model (inherited from C++20) is incompatible with x86 #548

RalfJung commented Dec 9, 2024

chorman0773 commented Dec 9, 2024 •

edited

Loading

taiki-e commented Dec 9, 2024 •

edited

Loading

RalfJung commented Dec 9, 2024 via email

chorman0773 commented Dec 9, 2024

RalfJung commented Dec 9, 2024 •

edited

Loading

arielb1 commented Dec 17, 2024

RalfJung commented Dec 17, 2024

arielb1 commented Dec 17, 2024

RalfJung commented Dec 17, 2024

arielb1 commented Dec 17, 2024 •

edited

Loading

RalfJung commented Dec 17, 2024

arielb1 commented Dec 17, 2024 •

edited

Loading

RalfJung commented Dec 17, 2024

arielb1 commented Dec 17, 2024 •

edited

Loading

RalfJung commented Dec 17, 2024

arielb1 commented Dec 17, 2024 •

edited

Loading

orilahav commented Dec 18, 2024 •

edited

Loading

arielb1 commented Dec 18, 2024 •

edited

Loading

orilahav commented Dec 18, 2024 •

edited

Loading

arielb1 commented Dec 18, 2024 •

edited

Loading

arielb1 commented Dec 18, 2024 •

edited

Loading

RalfJung commented Dec 18, 2024

orilahav commented Dec 18, 2024

arielb1 commented Dec 18, 2024 •

edited

Loading

RalfJung commented Dec 18, 2024 •

edited

Loading

orilahav commented Dec 18, 2024

RalfJung commented Dec 18, 2024

arielb1 commented Dec 18, 2024

Our concurrency memory model (inherited from C++20) is incompatible with x86 #548

Our concurrency memory model (inherited from C++20) is incompatible with x86 #548

Comments

RalfJung commented Dec 9, 2024

chorman0773 commented Dec 9, 2024 • edited Loading

taiki-e commented Dec 9, 2024 • edited Loading

RalfJung commented Dec 9, 2024 via email

chorman0773 commented Dec 9, 2024

RalfJung commented Dec 9, 2024 • edited Loading

arielb1 commented Dec 17, 2024

RalfJung commented Dec 17, 2024

arielb1 commented Dec 17, 2024

RalfJung commented Dec 17, 2024

arielb1 commented Dec 17, 2024 • edited Loading

RalfJung commented Dec 17, 2024

arielb1 commented Dec 17, 2024 • edited Loading

RalfJung commented Dec 17, 2024

arielb1 commented Dec 17, 2024 • edited Loading

RalfJung commented Dec 17, 2024

arielb1 commented Dec 17, 2024 • edited Loading

orilahav commented Dec 18, 2024 • edited Loading

PART A

PART B

arielb1 commented Dec 18, 2024 • edited Loading

orilahav commented Dec 18, 2024 • edited Loading

arielb1 commented Dec 18, 2024 • edited Loading

arielb1 commented Dec 18, 2024 • edited Loading

RalfJung commented Dec 18, 2024

orilahav commented Dec 18, 2024

arielb1 commented Dec 18, 2024 • edited Loading

RalfJung commented Dec 18, 2024 • edited Loading

orilahav commented Dec 18, 2024

RalfJung commented Dec 18, 2024

arielb1 commented Dec 18, 2024

chorman0773 commented Dec 9, 2024 •

edited

Loading

taiki-e commented Dec 9, 2024 •

edited

Loading

RalfJung commented Dec 9, 2024 •

edited

Loading

arielb1 commented Dec 17, 2024 •

edited

Loading

arielb1 commented Dec 17, 2024 •

edited

Loading

arielb1 commented Dec 17, 2024 •

edited

Loading

arielb1 commented Dec 17, 2024 •

edited

Loading

orilahav commented Dec 18, 2024 •

edited

Loading

arielb1 commented Dec 18, 2024 •

edited

Loading

orilahav commented Dec 18, 2024 •

edited

Loading

arielb1 commented Dec 18, 2024 •

edited

Loading

arielb1 commented Dec 18, 2024 •

edited

Loading

arielb1 commented Dec 18, 2024 •

edited

Loading

RalfJung commented Dec 18, 2024 •

edited

Loading