Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize reverse for 32-bit trivial types (optimize for all Intel and recent AMD, pessimize for Excavator/Zen/Zen+) #2383

Merged
merged 4 commits into from
Mar 19, 2022

Conversation

AlexGuteniev
Copy link
Contributor

change vpermq+vpshufd to vpermd

(see https://uops.info/table.html)

Benchmark:

#include <cstdint>
#include <chrono>
#include <iostream>

#include <intrin.h>

constexpr std::size_t N = 2048;
constexpr std::size_t R = 100'000'000;

std::uint32_t   a32[N / 4];

inline size_t byte_length(const void* first, const void* last) noexcept {
    return static_cast<const unsigned char*>(last) - static_cast<const unsigned char*>(first);
}

inline void advance_bytes(void*& target, ptrdiff_t offset) noexcept {
    target = static_cast<unsigned char*>(target) + offset;
}

__declspec(noalias) void std_reverse_trivially_swappable_4_old(void* first, void* last) noexcept {
    void* stop_at = first;
    advance_bytes(stop_at, byte_length(first, last) >> 6 << 5);
    do {
        advance_bytes(last, -32);
        const __m256i left = _mm256_loadu_si256(static_cast<__m256i*>(first));
        const __m256i right = _mm256_loadu_si256(static_cast<__m256i*>(last));
        const __m256i left_perm = _mm256_permute4x64_epi64(left, _MM_SHUFFLE(1, 0, 3, 2));
        const __m256i right_perm = _mm256_permute4x64_epi64(right, _MM_SHUFFLE(1, 0, 3, 2));
        const __m256i left_reversed = _mm256_shuffle_epi32(left_perm, _MM_SHUFFLE(0, 1, 2, 3));
        const __m256i right_reversed = _mm256_shuffle_epi32(right_perm, _MM_SHUFFLE(0, 1, 2, 3));
        _mm256_storeu_si256(static_cast<__m256i*>(first), left_reversed);
        _mm256_storeu_si256(static_cast<__m256i*>(last), right_reversed);
        advance_bytes(first, 32);
    } while (first != stop_at);
}

__declspec(noalias) void std_reverse_trivially_swappable_4_new(void* first, void* last) noexcept {
    void* stop_at = first;
    advance_bytes(stop_at, byte_length(first, last) >> 6 << 5);
    const __m256i shuf = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    do {
        advance_bytes(last, -32);
        const __m256i left = _mm256_loadu_si256(static_cast<__m256i*>(first));
        const __m256i right = _mm256_loadu_si256(static_cast<__m256i*>(last));
        const __m256i left_reversed = _mm256_permutevar8x32_epi32(left, shuf);
        const __m256i right_reversed = _mm256_permutevar8x32_epi32(right, shuf);
        _mm256_storeu_si256(static_cast<__m256i*>(first), left_reversed);
        _mm256_storeu_si256(static_cast<__m256i*>(last), right_reversed);
        advance_bytes(first, 32);
    } while (first != stop_at);
}

void rev(void r(void* first, void* last), const char* name) {
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < R; i++) {
        r(&*std::begin(a32), &*std::end(a32));
    }
    auto t2 = std::chrono::steady_clock::now();
    std::cout << name << ":\t" << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() << "s\n";
}

int main() {
    rev(std_reverse_trivially_swappable_4_old, "old");
    rev(std_reverse_trivially_swappable_4_new, "new");
}

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner December 9, 2021 07:46
@AlexGuteniev AlexGuteniev changed the title Optimize reverse for 32-bit thrivial types Optimize reverse for 32-bit trivial types Dec 9, 2021
@AlexGuteniev
Copy link
Contributor Author

AlexGuteniev commented Dec 9, 2021

Since the optimization is certainly dependent on CPU microarchitecture, I've looked into https://uops.info data to see the results for different CPUs. I was primarily looking into Throughput (TP).

This appeared to be a significant win for Intel Skylake and earlier, a smaller win for Intel Ice Lake and later.

Out of AMD, upos.info only has Zen+, Zen 2, Zen 3. It appears to be some loss for Zen+ while still a small win for Zen 2 and Zen 3.

@CaseyCarter confirmed a small win for Zen 2 (3950X) and @StephanTLavavej confirmed a small win for Zen 3 (5950X)

@AlexGuteniev
Copy link
Contributor Author

I'm now also suspecting that original code might be not specifically optimized for some CPU that for which it is more efficient. Rather it may be a simple upscaling of SSE code to AVX. In this case the existing 32-bit implementation makes sense.

@StephanTLavavej StephanTLavavej added the performance Must go faster label Dec 9, 2021
@AlexGuteniev AlexGuteniev changed the title Optimize reverse for 32-bit trivial types Optimize reverse for 32-bit trivial types (optimize for all Intel and recent AMD, pessimize for Excavator/Zen/Zen+) Dec 11, 2021
@StephanTLavavej
Copy link
Member

Looks good to me, thanks. The resulting code is both shorter and simpler to understand (as the intrinsic directly does what we want). I confirmed that this intrinsic is AVX2 so it's correctly guarded. The performance results you've collected are compelling - while of course it would be nice to have a pure win across the board, having wins for all modern processors (2019+) and slight losses for older processors is acceptable. (Especially considering that not too long ago, we shipped completely unvectorized reverse() and reverse_copy().)

@barcharcraz
Copy link
Member

whew, sorry it took so long. I got tangled up in the AVX2 documentation several times, this looks equivalent to the old code. The mask parameter could be "constexpr" but I think it's merely const intentionally, as the intent is to use an indirect version of the instruction (without an immediate), at least that's my understanding

@AlexGuteniev
Copy link
Contributor Author

The mask parameter could be "constexpr" but I think it's merely const intentionally, as the intent is to use an indirect version of the instruction (without an immediate), at least that's my understanding

No, just an oversight. Could be constexpr.
Made just const by looking at other cases here, but those other cases are *_set_* intrinsics, so they can't be constexpr, and this one can.

Please push the change if you think ith worth resetting testing.

@AlexGuteniev
Copy link
Contributor Author

AlexGuteniev commented Mar 17, 2022

The mask parameter could be "constexpr" but I think it's merely const intentionally, as the intent is to use an indirect version of the instruction (without an immediate), at least that's my understanding

No, just an oversight. Could be constexpr. Made just const by looking at other cases here, but those other cases are *_set_* intrinsics, so they can't be constexpr, and this one can.

Please push the change if you think ith worth resetting testing.

Actually disregard that. This one is also *_set_* inttinsic, so can't be constexpr.

@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit cfc02a1 into microsoft:main Mar 19, 2022
@StephanTLavavej
Copy link
Member

😻 🎉 😹 !uoy knahT

@AlexGuteniev AlexGuteniev deleted the reverse_4 branch March 19, 2022 09:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants