find_first_of vectorized: generalize fast approach for 4 and 8 bytes elements #4623

Merged
merged 5 commits into microsoft:main from big_types_big_needles
Apr 26, 2024

Conversation

AlexGuteniev
Contributor

Before:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bm<uint32_t>/2/3           3.11 ns         2.12 ns    640000000
bm<uint32_t>/7/4           13.8 ns         10.3 ns    154482759
bm<uint32_t>/9/3           4.29 ns         3.07 ns    448000000
bm<uint32_t>/22/5          8.13 ns         5.90 ns    235789474
bm<uint32_t>/58/2          5.70 ns         4.01 ns    358400000
bm<uint32_t>/102/4         12.3 ns         8.57 ns    165925926
bm<uint32_t>/325/1         26.8 ns         19.5 ns     68923077
bm<uint32_t>/1011/11       1035 ns          643 ns      2357895
bm<uint32_t>/3056/7         542 ns          337 ns      3895652
bm<uint64_t>/2/3           3.02 ns         1.90 ns    814545455
bm<uint64_t>/7/4           14.0 ns         9.00 ns    149333333
bm<uint64_t>/9/3           4.60 ns         2.76 ns    497777778
bm<uint64_t>/22/5          26.8 ns         16.0 ns     99555556
bm<uint64_t>/58/2          10.6 ns         6.89 ns    235789474
bm<uint64_t>/102/4         20.5 ns         12.8 ns    112000000
bm<uint64_t>/325/1         35.8 ns         22.6 ns     74666667
bm<uint64_t>/1011/11       1371 ns         1029 ns      1518644
bm<uint64_t>/3056/7        3183 ns         2066 ns       597333

After:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bm<uint32_t>/2/3           3.22 ns         2.34 ns    640000000
bm<uint32_t>/7/4           13.8 ns         10.5 ns    112000000
bm<uint32_t>/9/3           5.11 ns         3.93 ns    373333333
bm<uint32_t>/22/5          10.3 ns         7.67 ns    179200000
bm<uint32_t>/58/2          8.33 ns         4.02 ns    248888889
bm<uint32_t>/102/4         13.4 ns         4.44 ns    235789474
bm<uint32_t>/325/1         16.1 ns         7.91 ns    213333333
bm<uint32_t>/1011/11        329 ns          195 ns      8960000
bm<uint32_t>/3056/7         552 ns          310 ns      7466667
bm<uint64_t>/2/3           3.04 ns        0.734 ns   1000000000
bm<uint64_t>/7/4           14.1 ns         4.68 ns    497777778
bm<uint64_t>/9/3           5.07 ns         1.46 ns    746666667
bm<uint64_t>/22/5          10.1 ns         2.22 ns    689230769
bm<uint64_t>/58/2          12.2 ns         2.76 ns    441107692
bm<uint64_t>/102/4         31.1 ns         19.6 ns     81454545
bm<uint64_t>/325/1         36.0 ns         20.4 ns     89600000
bm<uint64_t>/1011/11        636 ns          289 ns      3733333
bm<uint64_t>/3056/7        1217 ns          330 ns      8960000

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner April 23, 2024 20:15
@StephanTLavavej StephanTLavavej added the performance Must go faster label Apr 23, 2024
@StephanTLavavej StephanTLavavej self-assigned this Apr 23, 2024
@AlexGuteniev
Contributor Author

There are unexpected changes in the benchmark results on other lines (presumably due to different loop alignment), but the targets of this optimization are the following cases, where the needle is too big to fit in an AVX2 register:

Before:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bm<uint32_t>/1011/11       1035 ns          643 ns      2357895
bm<uint64_t>/1011/11       1371 ns         1029 ns      1518644
bm<uint64_t>/3056/7        3183 ns         2066 ns       597333

After:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bm<uint32_t>/1011/11        329 ns          195 ns      8960000
bm<uint64_t>/1011/11        636 ns          289 ns      3733333
bm<uint64_t>/3056/7        1217 ns          330 ns      8960000
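
To make "too big to fit in an AVX2 register" concrete, here is a minimal sketch rather than the code from this PR. It assumes the second benchmark argument is the needle length, which the PR text does not state explicitly. An AVX2 register is 256 bits wide, so it holds 8 `uint32_t` or 4 `uint64_t` elements. The helper names `avx2_lanes`, `needle_exceeds_avx2_register`, and `first_match_index` are hypothetical and exist only for illustration; the search itself uses plain `std::find_first_of`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of elements of type T that fit in one 256-bit (32-byte) AVX2 register.
template <class T>
constexpr std::size_t avx2_lanes = 32 / sizeof(T);

static_assert(avx2_lanes<std::uint32_t> == 8, "8 x 32-bit lanes");
static_assert(avx2_lanes<std::uint64_t> == 4, "4 x 64-bit lanes");

// Hypothetical helper: true when a needle of needle_size elements
// cannot be held in a single AVX2 register.
template <class T>
constexpr bool needle_exceeds_avx2_register(std::size_t needle_size) {
    return needle_size > avx2_lanes<T>;
}

// Hypothetical helper: index of the first haystack element that equals any
// needle element, or haystack.size() when there is no such element.
template <class T>
std::size_t first_match_index(const std::vector<T>& haystack, const std::vector<T>& needle) {
    const auto it = std::find_first_of(haystack.begin(), haystack.end(), needle.begin(), needle.end());
    return static_cast<std::size_t>(it - haystack.begin());
}
```

Under that assumption, a needle of 11 `uint32_t` elements exceeds the 8 available lanes, and needles of 11 or 7 `uint64_t` elements exceed the 4 available lanes, which matches the three rows listed above.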

stl/src/vector_algorithms.cpp: 3 review threads (outdated, resolved)
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit fce83bf into microsoft:main Apr 26, 2024
39 checks passed
@StephanTLavavej
Member

Faster! Faster! FASTER! 😸 🚀 🚀

@AlexGuteniev AlexGuteniev deleted the big_types_big_needles branch April 27, 2024 04:40