`find_first_of` vectorized: generalize fast approach for 4 and 8 bytes elements #4623

AlexGuteniev · 2024-04-23T20:15:04Z

Before:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bm<uint32_t>/2/3           3.11 ns         2.12 ns    640000000
bm<uint32_t>/7/4           13.8 ns         10.3 ns    154482759
bm<uint32_t>/9/3           4.29 ns         3.07 ns    448000000
bm<uint32_t>/22/5          8.13 ns         5.90 ns    235789474
bm<uint32_t>/58/2          5.70 ns         4.01 ns    358400000
bm<uint32_t>/102/4         12.3 ns         8.57 ns    165925926
bm<uint32_t>/325/1         26.8 ns         19.5 ns     68923077
bm<uint32_t>/1011/11       1035 ns          643 ns      2357895
bm<uint32_t>/3056/7         542 ns          337 ns      3895652
bm<uint64_t>/2/3           3.02 ns         1.90 ns    814545455
bm<uint64_t>/7/4           14.0 ns         9.00 ns    149333333
bm<uint64_t>/9/3           4.60 ns         2.76 ns    497777778
bm<uint64_t>/22/5          26.8 ns         16.0 ns     99555556
bm<uint64_t>/58/2          10.6 ns         6.89 ns    235789474
bm<uint64_t>/102/4         20.5 ns         12.8 ns    112000000
bm<uint64_t>/325/1         35.8 ns         22.6 ns     74666667
bm<uint64_t>/1011/11       1371 ns         1029 ns      1518644
bm<uint64_t>/3056/7        3183 ns         2066 ns       597333

After:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bm<uint32_t>/2/3           3.22 ns         2.34 ns    640000000
bm<uint32_t>/7/4           13.8 ns         10.5 ns    112000000
bm<uint32_t>/9/3           5.11 ns         3.93 ns    373333333
bm<uint32_t>/22/5          10.3 ns         7.67 ns    179200000
bm<uint32_t>/58/2          8.33 ns         4.02 ns    248888889
bm<uint32_t>/102/4         13.4 ns         4.44 ns    235789474
bm<uint32_t>/325/1         16.1 ns         7.91 ns    213333333
bm<uint32_t>/1011/11        329 ns          195 ns      8960000
bm<uint32_t>/3056/7         552 ns          310 ns      7466667
bm<uint64_t>/2/3           3.04 ns        0.734 ns   1000000000
bm<uint64_t>/7/4           14.1 ns         4.68 ns    497777778
bm<uint64_t>/9/3           5.07 ns         1.46 ns    746666667
bm<uint64_t>/22/5          10.1 ns         2.22 ns    689230769
bm<uint64_t>/58/2          12.2 ns         2.76 ns    441107692
bm<uint64_t>/102/4         31.1 ns         19.6 ns     81454545
bm<uint64_t>/325/1         36.0 ns         20.4 ns     89600000
bm<uint64_t>/1011/11        636 ns          289 ns      3733333
bm<uint64_t>/3056/7        1217 ns          330 ns      8960000

AlexGuteniev · 2024-04-23T20:26:52Z

Some other rows changed unexpectedly (presumably due to different loop alignment), but the targets of this optimization are the following cases, where the needle is too big to fit in a single AVX2 register:

Before:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bm<uint32_t>/1011/11       1035 ns          643 ns      2357895
bm<uint64_t>/1011/11       1371 ns         1029 ns      1518644
bm<uint64_t>/3056/7        3183 ns         2066 ns       597333

After:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bm<uint32_t>/1011/11        329 ns          195 ns      8960000
bm<uint64_t>/1011/11        636 ns          289 ns      3733333
bm<uint64_t>/3056/7        1217 ns          330 ns      8960000
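
To make "too big to fit in a single AVX2 register" concrete, here is a minimal sketch of the lane arithmetic. It assumes the second benchmark argument is the needle length in elements, which this thread does not state explicitly. The helper names are hypothetical and are not taken from `stl/src/vector_algorithms.cpp`; the sketch shows only the size reasoning, not the vectorized search itself.

```cpp
#include <cstddef>
#include <cstdint>

// Width of one AVX2 register in bytes (256 bits).
constexpr std::size_t avx2_register_bytes = 32;

// Hypothetical helper (not from the PR): elements of type T per AVX2 register.
template <class T>
constexpr std::size_t avx2_lanes() {
    return avx2_register_bytes / sizeof(T);
}

// Hypothetical helper: true when the whole needle fits in one register.
template <class T>
constexpr bool needle_fits_one_register(std::size_t needle_length) {
    return needle_length <= avx2_lanes<T>();
}

// Hypothetical helper: register-sized blocks needed to cover the needle
// (ceiling division of the needle length by the lane count).
template <class T>
constexpr std::size_t needle_register_blocks(std::size_t needle_length) {
    return (needle_length + avx2_lanes<T>() - 1) / avx2_lanes<T>();
}

// The three targeted rows, assuming the last number is the needle length.
static_assert(!needle_fits_one_register<std::uint32_t>(11), "11 > 8 lanes");
static_assert(!needle_fits_one_register<std::uint64_t>(11), "11 > 4 lanes");
static_assert(!needle_fits_one_register<std::uint64_t>(7), "7 > 4 lanes");
```

Under that assumption, an 11-element `uint32_t` needle spans two registers, and 7-element and 11-element `uint64_t` needles span two and three registers respectively, so none of the three targeted rows can be handled by a single-register needle.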

stl/src/vector_algorithms.cpp

StephanTLavavej · 2024-04-26T01:07:50Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-04-26T23:46:39Z

Faster! Faster! FASTER! 😸 🚀 🚀

find_first_of vectorized: generalize fast approach for 4 and 8 byte…

1523b90

… elements

AlexGuteniev requested a review from a team as a code owner April 23, 2024 20:15

StephanTLavavej added the performance Must go faster label Apr 23, 2024

StephanTLavavej self-assigned this Apr 23, 2024

AlexGuteniev and others added 4 commits April 23, 2024 23:29

unnecessary branch

8efb7e0

More concise unreachable

eaa5f7f

Add const.

6f3a3d2

We've extracted _Needle_length_large.

ef4e2a5

StephanTLavavej reviewed Apr 24, 2024

StephanTLavavej approved these changes Apr 24, 2024

StephanTLavavej assigned StephanTLavavej and unassigned StephanTLavavej Apr 24, 2024

StephanTLavavej merged commit fce83bf into microsoft:main Apr 26, 2024
39 checks passed

AlexGuteniev deleted the big_types_big_needles branch April 27, 2024 04:40
