Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto vectorize replace_copy, replace_copy_if #4431

Merged
merged 15 commits into from
Mar 21, 2024

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Mar 2, 2024

Work around DevCom-10606350

Benchmark result before:

----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
rc<uint8_t>           1458 ns         1465 ns       448000
rc<uint16_t>          1647 ns         1650 ns       407273
rc<uint32_t>          1680 ns         1650 ns       407273
rc<uint64_t>          1666 ns         1674 ns       448000
rc_if<uint8_t>        1644 ns         1650 ns       407273
rc_if<uint16_t>       1686 ns         1688 ns       407273
rc_if<uint32_t>       1648 ns         1674 ns       448000
rc_if<uint64_t>       1750 ns         1726 ns       407273

After:

----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
rc<uint8_t>            159 ns          160 ns      4977778
rc<uint16_t>           272 ns          270 ns      2488889
rc<uint32_t>           505 ns          502 ns      1120000
rc<uint64_t>          1008 ns         1004 ns       746667
rc_if<uint8_t>         166 ns          165 ns      4072727
rc_if<uint16_t>        348 ns          339 ns      2488889
rc_if<uint32_t>        566 ns          572 ns      1120000
rc_if<uint64_t>       1474 ns         1465 ns       448000

After with /arch:AVX2:

----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
rc<uint8_t>           58.5 ns         59.4 ns     10000000
rc<uint16_t>           102 ns          103 ns      6400000
rc<uint32_t>           187 ns          188 ns      3733333
rc<uint64_t>           559 ns          558 ns      1120000
rc_if<uint8_t>        76.8 ns         76.7 ns     11200000
rc_if<uint16_t>        131 ns          132 ns      4977778
rc_if<uint32_t>        230 ns          230 ns      2986667
rc_if<uint64_t>       1691 ns         1688 ns       407273

The anomality for rc_if<uint64_t> AVX2 case it that compiler generates AVX512 code and checks for AVX512 in __isa_available and I don't have that, so it is scalar, still a bit faster due to branchless. Applies specifically to unsigned 64 bit only.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner March 2, 2024 15:23
@StephanTLavavej StephanTLavavej added the performance Must go faster label Mar 2, 2024
@StephanTLavavej StephanTLavavej self-assigned this Mar 2, 2024
stl/inc/algorithm Outdated Show resolved Hide resolved
@StephanTLavavej StephanTLavavej removed their assignment Mar 5, 2024
@StephanTLavavej StephanTLavavej added the decision needed We need to choose something before working on this label Mar 5, 2024
@StephanTLavavej StephanTLavavej removed the decision needed We need to choose something before working on this label Mar 6, 2024
@StephanTLavavej StephanTLavavej self-assigned this Mar 7, 2024
@AlexGuteniev
Copy link
Contributor Author

I don't think we need to require the output type to be the same as the source types.

Curiously, the AVX2 auto vectorization does indeed work even for different source and destination element size, this is achieved using vpmovsx* or vpmovzx* to widen and vpshufb to compress.

Surprising the compiles goes that much far, and misses the final piece to vectorize without this PR

@AlexGuteniev AlexGuteniev changed the title Auto vectorize replace_copy Auto vectorize replace_copy, replace_copy_if Mar 8, 2024
benchmarks/src/replace.cpp Show resolved Hide resolved
benchmarks/src/replace.cpp Outdated Show resolved Hide resolved
benchmarks/src/replace.cpp Show resolved Hide resolved
benchmarks/src/replace.cpp Outdated Show resolved Hide resolved
benchmarks/src/replace.cpp Outdated Show resolved Hide resolved
benchmarks/src/replace.cpp Outdated Show resolved Hide resolved
stl/inc/algorithm Outdated Show resolved Hide resolved
stl/inc/algorithm Outdated Show resolved Hide resolved
stl/inc/algorithm Outdated Show resolved Hide resolved
@StephanTLavavej StephanTLavavej removed their assignment Mar 20, 2024
@AlexGuteniev

This comment was marked as resolved.

@StephanTLavavej

This comment was marked as resolved.

@AlexGuteniev

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej self-assigned this Mar 21, 2024
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 1e7d7f8 into microsoft:main Mar 21, 2024
35 checks passed
@StephanTLavavej
Copy link
Member

Thanks for helping the compiler, said the developer who once gave a talk titled Don't Help The Compiler! 😹 🤪 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants