Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help the compiler vectorize std::iota #4627

Merged
merged 12 commits into from
Apr 27, 2024

Conversation

AlexGuteniev
Copy link
Contributor

The compiler vectorization of iota algorithm exists, but it is very fragile; one of issues reported as DevCom-10593477

Can do ranges::iota as well, if this is considered acceptable approach.

Benchmark results are with default benchmark options, I think it is SSE2 and not AVX2.

Before:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<std::uint32_t>/7          3.60 ns         1.36 ns    448000000
bm<std::uint32_t>/18         7.26 ns         3.61 ns    264084211
bm<std::uint32_t>/43         17.4 ns         10.3 ns     74666667
bm<std::uint32_t>/131        50.8 ns         17.1 ns     50176000
bm<std::uint32_t>/315         105 ns         50.0 ns     10000000
bm<std::uint32_t>/1212        393 ns          155 ns      3733333
bm<std::uint64_t>/7          2.88 ns         1.32 ns    640000000
bm<std::uint64_t>/18         4.98 ns         2.59 ns    320000000
bm<std::uint64_t>/43         16.5 ns         8.12 ns    100000000
bm<std::uint64_t>/131        49.0 ns         24.3 ns     37333333
bm<std::uint64_t>/315        91.9 ns         31.5 ns     20363636
bm<std::uint64_t>/1212        308 ns          147 ns     10000000

After:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<std::uint32_t>/7          3.69 ns         2.96 ns    560000000
bm<std::uint32_t>/18         3.65 ns         1.75 ns    454623280
bm<std::uint32_t>/43         7.19 ns         3.40 ns    179200000
bm<std::uint32_t>/131        18.6 ns         10.5 ns     89600000
bm<std::uint32_t>/315        44.2 ns         20.1 ns     28000000
bm<std::uint32_t>/1212        176 ns         82.3 ns     11200000
bm<std::uint64_t>/7          2.60 ns         1.28 ns   1000000000
bm<std::uint64_t>/18         4.53 ns         2.43 ns    560000000
bm<std::uint64_t>/43         8.84 ns         4.81 ns    194782609
bm<std::uint64_t>/131        26.6 ns         11.4 ns     66901333
bm<std::uint64_t>/315        66.3 ns         35.6 ns     26352941
bm<std::uint64_t>/1212        247 ns          107 ns      4977778

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner April 24, 2024 17:58
@StephanTLavavej StephanTLavavej added the performance Must go faster label Apr 24, 2024
stl/inc/numeric Outdated Show resolved Hide resolved
@AlexGuteniev
Copy link
Contributor Author

AlexGuteniev commented Apr 25, 2024

I admit I can still vectorize this better that the compiler.
Specifically handle 8 and 16 bit cases, and handle 32 and 64 bit cases better. I've created DevCom-10646935 for that.

benchmarks/src/iota.cpp Outdated Show resolved Hide resolved
benchmarks/src/iota.cpp Outdated Show resolved Hide resolved
stl/inc/numeric Outdated Show resolved Hide resolved
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@AlexGuteniev
Copy link
Contributor Author

I pushed changes because I thought the following case was broken.

_Ty = int
size_t = unsigned int
_Val = -1000
(_Last - First) = 0x8000'0001

Probably we need coverage

…'t need to be C++20 `consteval`.

They're always stored in `constexpr` variables.
…ons.

The `_Ugly` functions aren't marked `_EXPORT_STD`.
This matches how we implement `_Is_standard_unsigned_integer` in `<__msvc_bit_utils.hpp>`.
This answers the question of whether `static_cast<_Ty>(_Size)` is value-preserving.
@StephanTLavavej
Copy link
Member

Good catch! After talking on Discord, I've replaced the increasingly complicated helper function with C++20 in_range, suitably uglified to _In_range for downlevel usage. There have been other places in the STL where I've wanted to use this but it wasn't available, so I think it'll be very worthwhile to extract it. (in_range is very tricky to implement, and we managed to do so both correctly and very elegantly in terms of performance.)

Allocating 2 gigs on a 32-bit system would be problematic, so I think that the current level of test coverage should be fine.

I reran the benchmarks to verify that the optimization is still effective:

Benchmark (ns) main Now
bm<std::uint32_t>/7 2.75 4.45
bm<std::uint32_t>/18 5.35 3.39
bm<std::uint32_t>/43 10.9 6.70
bm<std::uint32_t>/131 33.0 15.0
bm<std::uint32_t>/315 71.5 39.1
bm<std::uint32_t>/1212 259 142
bm<std::uint64_t>/7 2.75 2.75
bm<std::uint64_t>/18 5.41 3.84
bm<std::uint64_t>/43 10.9 7.22
bm<std::uint64_t>/131 33.1 22.6
bm<std::uint64_t>/315 71.7 53.6
bm<std::uint64_t>/1212 259 203

@StephanTLavavej

This comment was marked as resolved.

…rt` in `_Min_limit`/`_Max_limit`.

These are internal helpers, and the "public" `_In_range` validates the user-provided types.
@StephanTLavavej StephanTLavavej merged commit 290a95b into microsoft:main Apr 27, 2024
39 checks passed
@StephanTLavavej
Copy link
Member

🔢 🔢 🔢

@AlexGuteniev AlexGuteniev deleted the one_iota branch April 27, 2024 04:40
AlexGuteniev added a commit to AlexGuteniev/STL that referenced this pull request Apr 27, 2024
Based on microsoft#4627 fixup that extracts `_In_range`
This also makes some place dirctly using `_In_range`,
but mostly `_Max_limit` is used
AlexGuteniev added a commit to AlexGuteniev/STL that referenced this pull request May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants