Segfault in std::sync::mpsc::list #110001
Here are the results of some initial investigation I have been doing. I used …

cc @ibraheemdev
I wouldn't be so quick to write off an ordering issue; the race condition could require stronger synchronization to fix (e.g. …).
The segfault did occur (after about 5 minutes of runtime) on my MacBook Pro with an Intel Core i7 processor running macOS 13.3 and rustc 1.68.2. So whatever is causing this, it's not exclusive to Linux.
I have a feeling the issue has to do with a CAS failure when updating the tail not synchronizing with the update to head. If the thread that actually makes the update is preempted, the second thread might continue with the assumption that head has been updated, which might matter if it quickly writes to the entire block and the channel is disconnected before the head is ever touched. I haven't verified if this is possible, but it would be extremely rare if it is, which explains why this issue hasn't been hit before.
sync::mpsc: synchronize receiver disconnect with initialization

Receiver disconnection relies on the incorrect assumption that `head.index != tail.index` implies that the channel is initialized (i.e. `head.block` and `tail.block` point to allocated blocks). However, it can happen that `head.index != tail.index` and `head.block == null` at the same time, which leads to a segfault when a channel is dropped in that state.

This can happen because initialization is performed in two steps. First, the tail block is allocated and `tail.block` is set. If that is successful, `head.block` is set to the same pointer. Importantly, initialization is skipped if `tail.block` is not null. Therefore we can have the following situation:

1. Thread A starts to send the first value of the channel, observes that `tail.block` is null, and begins initialization. It sets `tail.block` to point to a newly allocated block and then gets preempted. `head.block` is still null at this point.
2. Thread B starts to send the second value of the channel, observes that `tail.block` *is not* null, and proceeds with writing its value into the allocated tail block, setting `tail.index` to 1.
3. Thread B drops the receiver of the channel, which observes that `head.index != tail.index` (0 and 1, respectively) and concludes that there must be messages to drop. It starts traversing the linked list from `head.block`, which is still a null pointer, leading to a segfault.

This PR fixes the problem by waiting for initialization to complete when `head.index != tail.index` but `head.block` is still null. A similar check already exists in `start_recv` for similar reasons.

Fixes rust-lang#110001
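To make the interleaving above concrete, here is a simplified sketch of the two-step initialization and the guarded disconnect. This is not the actual `std::sync::mpsc::list` source; `Channel`, `Position`, `Block`, and the method names are illustrative stand-ins, and the memory orderings are assumptions chosen to mirror the description above.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

struct Block; // stand-in for the real block of message slots

struct Position {
    block: AtomicPtr<Block>,
    index: AtomicUsize,
}

struct Channel {
    head: Position,
    tail: Position,
}

impl Channel {
    fn start_send(&self) {
        let mut tail_block = self.tail.block.load(Ordering::Acquire);
        if tail_block.is_null() {
            // Step 1: allocate and install the tail block.
            let new = Box::into_raw(Box::new(Block));
            match self.tail.block.compare_exchange(
                ptr::null_mut(), new, Ordering::Release, Ordering::Acquire,
            ) {
                Ok(_) => {
                    // If this thread is preempted HERE, tail.block is set
                    // but head.block is still null.
                    // Step 2: publish the same block as the head.
                    self.head.block.store(new, Ordering::Release);
                    tail_block = new;
                }
                Err(current) => {
                    // Another thread won the race; it is responsible for
                    // setting head.block, but may not have done so yet.
                    drop(unsafe { Box::from_raw(new) });
                    tail_block = current;
                }
            }
        }
        // A second sender sees a non-null tail.block, writes its message,
        // and bumps tail.index -- so head.index != tail.index can hold
        // while head.block is still null.
        self.tail.index.fetch_add(1, Ordering::AcqRel);
        let _ = tail_block;
    }

    fn disconnect_receiver(&self) {
        let head_index = self.head.index.load(Ordering::Acquire);
        let tail_index = self.tail.index.load(Ordering::Acquire);
        if head_index != tail_index {
            // The fix: head.block may still be null even though the
            // indices differ, so wait for initialization to complete
            // instead of dereferencing a null pointer.
            let mut head_block = self.head.block.load(Ordering::Acquire);
            while head_block.is_null() {
                std::hint::spin_loop();
                head_block = self.head.block.load(Ordering::Acquire);
            }
            // ...now it is safe to traverse the list from head_block...
            let _ = head_block;
        }
    }
}
```

The key point is the window between the two stores: once `tail.block` is published, other senders can advance `tail.index` before `head.block` is ever written, so the disconnect path must treat a null `head.block` as "initialization in progress" rather than "channel empty".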
The unbounded channel implementation from `std::sync::mpsc` can segfault due to what looks like a race condition when dropping channel senders and receivers. You can find a repro at https://github.com/teskje/timely-dataflow/tree/channel-segfault. Unfortunately it is not minimal at all, since it depends on the timely-dataflow crate, which is the actual channel user. I hope it is still self-contained enough that people familiar with the channel implementation can run it and use it as a starting point for debugging.
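Since the repro above is not minimal, here is a hypothetical, much-reduced loop exercising the pattern the issue describes: two senders racing on a channel's first use while the receiver is dropped. This is an assumption about the failure mode rather than a confirmed standalone reproducer; the race window is tiny, so this loop may well never crash in practice.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    loop {
        let (tx, rx) = mpsc::channel::<u64>();
        let tx2 = tx.clone();
        let a = thread::spawn(move || {
            let _ = tx.send(1); // may be the sender that initializes the channel
        });
        let b = thread::spawn(move || {
            let _ = tx2.send(2); // may see tail.block set before head.block is
        });
        // Drop the receiver while the first sends may still be racing on
        // channel initialization; per the analysis above, the disconnect
        // path can then see head.index != tail.index with head.block null.
        drop(rx);
        a.join().unwrap();
        b.join().unwrap();
    }
}
```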
Run the `channel_segfault` example. On some machines* it will segfault after some time.
The segfault happens inside `std::sync::mpsc::list`, as visible in the gdb backtrace below. The same segfault also happens with `crossbeam_channel`; you can reproduce this by reverting the second-to-last commit on the repro branch. I understand that `std::sync::mpsc` is a port of `crossbeam_channel`, so that's not surprising.

Meta
`rustc --version --verbose`:

I can reproduce the segfault on nightly as well.
*The segfault is not reproducible on all machines! We were able to reproduce it on:
It doesn't seem to reproduce on:
Also note that the repro is using more and more memory over time. That is because the different worker threads create and drop dataflows at different speeds, and timely-dataflow keeps per-dataflow communication state (including channels) around until the last worker has dropped the dataflow. It is possible to avoid that by adding a `thread::sleep` that lets late workers catch up, but in my experience that also stops the segfault from happening.

Backtrace