Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zeroize performance on u8 arrays #743

Open
blckngm opened this issue Mar 1, 2022 · 23 comments
Open

Zeroize performance on u8 arrays #743

blckngm opened this issue Mar 1, 2022 · 23 comments

Comments

@blckngm
Copy link

blckngm commented Mar 1, 2022

I inspected the generated assembly code and benchmarked zeroize for [u8; 32] on x86_64 and found it quite inefficient, storing one byte at a time:

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=3f44f4b90e6af0eac0dcb1f649390329

On my Ryzen CPU, it takes ~7.8324 ns, or ~1cpb. Binary code size is also quite large.

Using inline assembly (just stabilized in 1.59) and SSE2, zeroing a [u8; 32] takes just 3 instructions and ~492.87 ps (~16 bytes per cycle):

let mut buf: [u8; 32];
core::arch::asm!(
    "xorps {zero}, {zero}",
    "movups {zero}, ({ptr})",
    "movups {zero}, 16({ptr})",
    zero = out(xmm_reg) _,
    ptr = in(reg) &mut buf,
    options(att_syntax, nostack, preserves_flags),
);

So it might be something worth optimizing/documenting.

If you do not want to use inline assembly, maybe you should encourage using larger types or SIMD types, e.g., [u64; 4] or [__m128; 2] instead of [u8; 32]. Using write_volatile on *mut __m128 generates equally compact and efficient code as the assembly code above.

@tarcieri
Copy link
Member

tarcieri commented Mar 1, 2022

@sopium we've done practically no optimization work, so I'm glad you're thinking about it.

Inline assembly is fine, but would need to be feature gated due to the higher MSRV (we can eventually consider bumping it, but our last bump was already painful)

For something like [T; N] though, it would be good to ensure this doesn't monomorphize to N distinct blobs of ASM.

@blckngm
Copy link
Author

blckngm commented Mar 1, 2022

It boils down to LLVM not able to optimize write_volatile for u8 pointers. The bytes have to be written individually.

I think monomorphize for 16/32/64-byte arrays should be fine. These would also be commonly used key sizes in cryptography so the optimization can be justified.

For a u8 slice with unknown but largish size you can also use align_to_mut:

pub fn zeroize_slice(x: &mut [u8]) {
    unsafe {
        // Use __m128 on x86/x86_64.
        let (p, m, s) = x.align_to_mut::<u64>();
        for b in p {
            core::ptr::write_volatile(b, 0);
        }
        for m in m {
            core::ptr::write_volatile(m, 0);
        }
        for b in s {
            core::ptr::write_volatile(b, 0);
        }
    }
}

@tarcieri
Copy link
Member

tarcieri commented Mar 1, 2022

The use of write_volatile was something of a hack to ensure that the writes would not be elided by the compiler.

But inline assembly should also ensure that as well, and permit a highly performant implementation.

@blckngm
Copy link
Author

blckngm commented Mar 1, 2022

The use of write_volatile was something of a hack to ensure that the writes would not be elided by the compiler.

But inline assembly should also ensure that as well, and permit a highly performant implementation.

Yes, I understand this. If it's “normal” write in a loop LLVM would be able recognize it and generate efficient instructions or call memset.

@tarcieri
Copy link
Member

tarcieri commented Mar 1, 2022

Yeah the typical way to implement a zeroization primitive efficiently is some sort of volatile memset typically realized by using some kind of compiler fence, often implemented as invoking memset in conjunction with an assembly-based dataflow optimization barrier.

While Rust has bindings to LLVM's volatile memset intrinsics, they are perma-unstable and no work has been done on a stable API AFAIK:

https://doc.rust-lang.org/std/intrinsics/fn.volatile_set_memory.html

But perhaps inline assembly could provide the next best thing, albeit on a target architecture-by-architecture basis.

@newpavlov
Copy link
Member

We can not use write_volatile on *mut __m128 which we got from [u8; 32], since the pointer could be insufficiently aligned. AFAIK we do not have unaligned version of the volatile write. As a hack we could split [u8; 32] into aligned and unaligned parts. In theory, compiler should be able to remove the unaligned code parts if the array is sufficiently aligned.

@blckngm
Copy link
Author

blckngm commented Mar 1, 2022

We can not use write_volatile on *mut __m128 which we got from [u8; 32], since the pointer could be insufficiently aligned. AFAIK we do not have unaligned version of the volatile write. As a hack we could split [u8; 32] into aligned and unaligned parts. In theory, compiler should be able to remove the unaligned code parts if the array is sufficiently aligned.

I was suggesting using [__m128; N] from the start, and transmuting to &[u8; N * 16] when necessary.

As for splitting to aligned and unaligned parts, that's exactly what align_to_mut does.

@blckngm
Copy link
Author

blckngm commented Mar 1, 2022

Yeah the typical way to implement a zeroization primitive efficiently is some sort of volatile memset typically realized by using some kind of compiler fence, often implemented as invoking memset in conjunction with an assembly-based dataflow optimization barrier.

Oh this reminds me, now that we have stable inline assembly, this can be done too, and it's not architecture dependent:

pub fn zeroize(x: &mut [u8]) {
    for b in x.iter_mut() {
        *b = 0;
    }
    unsafe {
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) x.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
}

pub fn f() {
    let mut x = [0u8; 32];
    zeroize(&mut x);
}

Playground: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=32fc07b8173bd2ce233eaa9bf0d8d81a

zeroize in f is optimized to two movaps but is not optimized out.

@cbeck88
Copy link

cbeck88 commented Jan 6, 2023

@sopium I have a question about the use of readonly in the assembly options. (This is mostly for my own understanding.)

I watched a talk by Chandler Carruth (major contributor to LLVM) a long time ago, where he described creating similar functions in order to defeat the optimizer for benchmarking purposes: https://www.youtube.com/watch?v=nXaxk27zwlk&t=2476s

static void escape(void * p) {
    asm volatile("" : : "g"(p) : "memory")
}

I realize that this is a C++ talk and not rust, but they are both going through LLVM in the end so there's a lot of overlap.

In his talk he seems to say that it's important for the optimizer to believe that
(1) This asm has side-effects that the optimizer cannot know about. In C++ that's done with "volatile" asm, and in rust I think that's done by not marking your asm pure
(2) This asm potentially clobbers all memory. So, any reads from memory after this point have to actually go back to memory and any values cached in registers are dirty.

In your block you use the readonly flag and I'm trying to understand what this does and why this is good.

  1. We are telling the optimizer, this assembly is not pure, it has some "crazy unknowable side-effect" in chandler's words
  2. However, we are also telling the optimizer, this assembly is readonly and doesn't write to memory
  3. It also preserves flags.

So, the optimizer has to believe that

  1. This assembly can read from the memory which we are supposed to have zeroes
  2. This assembly has unknowable side-effects, not affecting memory, but something else (flags? some CPU state?) that it has to care about, and whatever that something is might become wrong if we don't actually write the zeroes beforehand
  3. Afterwards, memory has not been clobbered, and flags have not been changed, so we do not have to re-read any cached values in registers or modify any flags.

So 3 is good because it will help the efficiency of code using this block.

But it's hard for me to understand, if we're promising the optimizer that flags are not changed, registers are not changed (only the in(reg) for ptr is read from), and memory is not changed, what is the side-effect that the compiler still has to be worried about that prevents it from dropping this block? Or the model is just that, it still has to assume that there might be something beyond this that it doesn't understand?

@newpavlov
Copy link
Member

newpavlov commented Jan 6, 2023

@cbeck88
AFAIK readonly applies only to memory of the current process, the asm block from compiler's PoV may in theory perform a syscall and it will be well in bounds of the asm! options contract.

@cbeck88
Copy link

cbeck88 commented Jan 6, 2023

I see, thank you

@cbeck88
Copy link

cbeck88 commented Feb 25, 2023

It looks to me that one barrier to actually creating a patch that does this is the need for some form of the specialization language feature.

We currently have these impl's:

/// Impl [`Zeroize`] on arrays of types that impl [`Zeroize`].
impl<Z, const N: usize> Zeroize for [Z; N]
where
    Z: Zeroize,
{
    fn zeroize(&mut self) {
        self.iter_mut().zeroize();
    }
}

/// Impl [`Zeroize`] on arrays of types that impl [`Zeroize`].

/// Impl [`Zeroize`] on arrays of types that impl [`Zeroize`].
impl<Z, const N: usize> Zeroize for [Z; N]
where
    Z: Zeroize,
{
    fn zeroize(&mut self) {
        self.iter_mut().zeroize();
    }
}

impl<Z> Zeroize for [Z]

If we wanted to also add impls for the special case that Z = u8, then these will be overlapping impls, and we obviously cannot remove the existing impls.


I had this same problem in a hashing library: https://github.com/mobilecoinfoundation/mobilecoin/blob/b07c17fd2e1946c15b3a38c30b0fb1dc4ab516e3/crypto/digestible/src/lib.rs#L218

What I decided to do there was, simply not support my trait on the primitive type u8, so that I could have my nice impl for [T] and a nice impl for [u8]. Since it's extremely rare that someone actually needs to put a single u8 in a struct, I never actually had a problem, and I was able to do this kind of optimization without waiting for specialization to land.

In your case you already shipped a stable API which supports u8 so you probably don't want to do this approach.

@tarcieri
Copy link
Member

u8 is pretty commonly used as a carrier for masked boolean values when writing constant-time code

@cbeck88
Copy link

cbeck88 commented Feb 25, 2023

Good point

So do you see a path forward for this? Or just wait for specialization?

It seems to me we could make an optimized "zeroize_bytes" function based on the approach here, and people could either implement Zeroize manually and call out to it, or we could make like a #[zeroize(with = ...)] attribute for the zeroize-derive proc macro? Similar to what they do in serde? Is that something you would support?

@tarcieri
Copy link
Member

tarcieri commented Feb 25, 2023

A free function works, although is somewhat inelegant.

It might be possible to optimize the impl on [T] generically, packing as many copies of <T as Default>::default() as can fit into a quadword, then writing quadwords with movups until the remaining space in the slice is less-than-quadword sized, then falling back to writing them one-at-a-time in order to fill the remaining space.

Something like this:

const QUADWORD_SIZE: usize = 8;
const SIZE_OF_T: usize = mem::size_of::<T>();

if (SIZE_OF_T < QUADWORD_SIZE) && (QUADWORD_SIZE % SIZE_OF_T == 0) {
    const COPIES_IN_QUADWORD: usize = QUADWORD_SIZE / SIZE_OF_T;

    let quadword = [T::default(); COPIES_IN_QUADWORD]; 
    let mut iter = slice.chunks_exact_mut(COPIES_IN_QUADWORD);

    for chunk in iter {
        movups(chunk, &quadword as *const T as *const u64);
    }

    slow_path(iter.remainder());
} else {
    slow_path(slice);
}

@cbeck88
Copy link

cbeck88 commented Feb 25, 2023

Interesting idea, maybe this is a good use-case for https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut

Looking more at the code, it looks to me that we already have this impl -- maybe not directly relevant to OP:

unsafe { volatile_set(self.as_mut_ptr(), Z::default(), self.len()) };

/// Impl [`Zeroize`] on slices of types that can be zeroized with [`Default`].
///
/// This impl can eventually be optimized using an memset intrinsic,
/// such as [`core::intrinsics::volatile_set_memory`]. For that reason the
/// blanket impl on slices is bounded by [`DefaultIsZeroes`].
///
/// To zeroize a mut slice of `Z: Zeroize` which does not impl
/// [`DefaultIsZeroes`], call `iter_mut().zeroize()`.
impl<Z> Zeroize for [Z]
where
    Z: DefaultIsZeroes,
{
    fn zeroize(&mut self) {
        assert!(self.len() <= isize::MAX as usize);

        // Safety:
        //
        // This is safe, because the slice is well aligned and is backed by a single allocated
        // object for at least `self.len()` elements of type `Z`.
        // `self.len()` is also not larger than an `isize`, because of the assertion above.
        // The memory of the slice should not wrap around the address space.
        unsafe { volatile_set(self.as_mut_ptr(), Z::default(), self.len()) };
        atomic_fence();
    }
}

It's too bad that Vec<T> and Box<[T]> and [T; n] don't seem to call to this impl, I guess that they can't because they don't have the DefaultIsZeroes bound? And that's where we'd need negative trait bounds or specialization or something in order to make them able to do it in the cases when that bound does hold.

But maybe this is the place to do this kind of optimization work, and people can call foo.as_mut_slice().zeroize() or whatever when the perf matters, for now.

@tarcieri
Copy link
Member

The impls on slice types call volatile_set, so that seems like the proper place to provide an optimized implementation for filling slices

@cbeck88
Copy link

cbeck88 commented Feb 25, 2023

Here's the thing -- I bet that volatile_set is already going to be well understood by the compiler and that splitting this up for alignment is going to be done by llvm, so I feel like writing out the align_to_mut stuff is probably going to be wasted effort. (Could look at code gen and see). I know I've seen llvm generate all this before for memset calls.

Here's an idea: Maybe we can refactor things so that we don't have special-casing based on DefaultIsZeroes trait?

What if we made Zeroize like this:

pub trait Zeroize {
    /// The bit pattern that we will set objects of this type to when zeroizing them
    fn zero_pattern() -> Self;

    /// Write the zero pattern over an object, can't be optimized away
    fn zeroize(&mut self) {
        volatile_write(self, Self::zero_pattern());
        atomic_fence();
    }
}

Then the idea is we could have impl <Z> Zeroize for [Z] without any additional bounds using the same code as above, and it would use Z::zero_pattern() rather than Z::default(). And we would similarly be able to implement Zeroize for Vec<Z>, Box<[Z]>, etc. in a way that calls to the optimized implementation for [Z].

I think we might not need to wait for specialization if we did that, or something along these lines.

@tarcieri
Copy link
Member

tarcieri commented Feb 25, 2023

First, you're proposing breaking changes to a post-1.0 crate which exposes a trait-oriented API. Those would only be accepted as a last resort.

But also, zeroize is designed to work on structured types that don't necessarily implement Copy. That's why zeroize takes &mut self: so it can mutate data in-place.

DefaultIsZeroes exists to provide a strategy for zeroizing Copy types by replacing them with the Default value, but that approach doesn't work for non-Copy types, and zeroize is very much designed to support both.

@cbeck88
Copy link

cbeck88 commented Feb 25, 2023

Yeah -- I see that, it's going to get more complicated when you want the implementation for [Z] not to assume that Z is copy anymore, so that it can be parametric.

Intuitively, what I'd like to say is, what if we try to implement the general case where, Z might have destructors, put calls to drop_in_place etc. as appropriate, which will become no-ops when Z is copy. We could try to be parametric over all types. At least it's not obvious to me that "that approach doesn't work for non-copy types". Maybe it's obvious to you, you've thought about it more.

I understand it's a breaking change, just trying to think about directions that might make things simpler or better.

@tarcieri
Copy link
Member

The big problem with any breaking change is like serde, zeroize contains traits which compose across many crates.

It has ~500 downstream dependencies, and any kind of upgrade needs to be coordinated across all of them.

So really breaking changes need to be reserved for things that absolutely must make them. And I don't see any reason that mandates such a drastic change.

it's going to get more complicated when you want the implementation for [Z] not to assume that Z is copy anymore

That's not an issue. The impl of Zeroize for [Z] already bounds on DefaultIsZeroes, which in turn bounds on Copy, Default, and Sized, so we can write an optimized implementation which does a simple copy. It was designed that way from the start.

The impl on IterMut can be used to zeroize [Z] where Z is non-Copy, i.e. non_copy_slice.iter_mut().zeroize().

@cbeck88
Copy link

cbeck88 commented Feb 26, 2023

I'm just trying to understand, even if we did optimize the impl Zeroize for [Z]: Z: DefaultIsZeroes, what as a user would I have to do to actually get it.

Because if I have code like OP:

let mut x = [7u8; 32];
x.zeroize();

it's not going to call the zeroize function for slices, it's going to call the [T; n] one.

Suppose I have code like

#[derive(Zeroize)]
struct A {
   a: Vec<u8>
}

This is not going to call the impl Zeroize for [Z]: Z: DefaultIsZeroes impl that we're proposing to optimize, which is kind of unfortunate. So that means that optimizing this stuff feels kind of like wasted effort until this is somehow improved.

In the meantime, from experiments it looks that

let mut x = [7u8; 32];
(&mut x as &mut [u8]).zeroize();

will convince the compiler to pick the faster implementation, so maybe we could do the optimization we're talking about and document that that's what you have to do as a user if zeroizing one byte at a time isn't good enough?

And for a struct like A you probably have to implement Zeroize by hand, and you probably only should do that if you're sure that your vec is not getting resized ever? Or you would have to marry the impl for Vec<Z> with the optimizations we have for slices with DefaultIsZeroes

It would be nice if we could make this easier for the user of the crate somehow, but I guess I don't see a way if we're not interested in discussing zeroize 2.0 here. Thanks.

@tarcieri
Copy link
Member

tarcieri commented Feb 26, 2023

what as a user would I have to do to actually get it.

Borrow the value as a mutable slice, then call zeroize.

It's a bit of a pain, but the best we can do without specialization.

guess I don't see a way if we're not interested in discussing zeroize 2.0 here

I still don't see anything worth making breaking changes over until specialization lands, and even then it would probably still make sense to have a specialization feature which adds the specialized impls.

cbeck88 added a commit to cbeck88/utils that referenced this issue Feb 28, 2023
The purpose of the change is to make calls to `x.as_mut_slice().zeroize()`
considerably faster, particularly for types like `[u8; n]`.

The reason it becomes faster is that the call to `volatile_set` before
this change appears not to be easily optimizable, and (for example) leads
to setting bytes one at a time, instead of the compiler consolidating them
into SIMD instructions.

In the modified code, we don't use `volatile_set`, we instead loop over
the slice setting the elements to `Default::default()`, and to ensure
that the writes are not optimized out, we use an empty asm block.
There is discussion of the correct asm options to use here in the issue.

Because the asm block potentially reads from the pointer and makes a
syscall of some kind, the compiler cannot optimize out the zeroizing,
or it could cause observable side-effects. In the improved code, we only
create such an optimization barrier once, rather than after each byte that
it is written.

The call to `atomic_fence()` is not changed.

---

This change may help give users a way to improve performance, if they
have to zeroize very large objects, or, frequently have to zeroize many
small objects. We tested code-gen here in godbolt (in addition to the
tests posted in the github issue) and found that this change is
typically enough for llvm to start adding in SIMD optimizations that
zero many bytes at once.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants