Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize find_first_of for 8 and 16 bit elements with SSE4.2 pcmpestri #4466

Merged
merged 18 commits into from
Mar 21, 2024

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Mar 10, 2024

Suggested by @Alcaro in #4415 (comment)

For element set that fits SSE register, that is length of "needle" is up to 16 for 8-bit element, up to 8 for 16-bit element

Possible future work:

  • Use find instead for 1-element cases as @Alcaro suggested
  • basic_string::find_first_of. Certainly not the general case with user-provided char traits, but maybe for standard cases;
  • Expand to more that 16 bytes "needle". Will add inner loop, make it square algorithmic complexity, and more implementation complexity, but may be useful.

Benchmark result

The first non-type template parameter in the benchmark results is the position where the value is found, "haystack" length is twice that.
The second non-type template parameter in the benchmark results is the "needle" size.

Before:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<uint8_t, 2, 3>            6.33 ns         6.42 ns    112000000
bm<uint16_t, 2, 3>           6.47 ns         6.45 ns     89600000
bm<uint8_t, 7, 4>            22.8 ns         22.9 ns     32000000
bm<uint16_t, 7, 4>           15.1 ns         15.3 ns     44800000
bm<uint8_t, 9, 3>            17.7 ns         17.6 ns     37333333
bm<uint16_t, 9, 3>           16.6 ns         16.7 ns     44800000
bm<uint8_t, 22, 5>           73.7 ns         73.2 ns      8960000
bm<uint16_t, 22, 5>          69.3 ns         69.8 ns     11200000
bm<uint8_t, 3056, 7>         7991 ns         8022 ns        89600
bm<uint16_t, 3056, 7>        7314 ns         7324 ns        89600
bm<uint8_t, 1011, 11>        3709 ns         3767 ns       186667
bm<uint16_t, 1011, 11>       3662 ns         3599 ns       186667

After:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<uint8_t, 2, 3>            6.22 ns         6.14 ns    112000000
bm<uint16_t, 2, 3>           6.95 ns         6.98 ns     89600000
bm<uint8_t, 7, 4>            23.9 ns         24.1 ns     29866667
bm<uint16_t, 7, 4>           23.1 ns         22.9 ns     32000000
bm<uint8_t, 9, 3>            13.1 ns         13.2 ns     49777778
bm<uint16_t, 9, 3>           14.1 ns         14.1 ns     49777778
bm<uint8_t, 22, 5>           14.3 ns         14.3 ns     44800000
bm<uint16_t, 22, 5>          15.0 ns         15.0 ns     44800000
bm<uint8_t, 3056, 7>          314 ns          314 ns      2240000
bm<uint16_t, 3056, 7>         611 ns          614 ns      1120000
bm<uint8_t, 1011, 11>         107 ns          107 ns      6400000
bm<uint16_t, 1011, 11>       3478 ns         3516 ns       213333

Explanation:

  • Starting with bm<uint8_t, 9, 3> row, the vectorization is engaged;
  • bm<uint8_t, 22, 5> already shows significant improvement;
  • The exact amount of the improvement is hard to determine, as it is comparison of O(N**2) vs O(N) algorithm, the longer "needle" is, the more improvement is observed;
  • bm<uint16_t, 1011, 11> falls back to the scalar algorithm due to "needle" length not fitting SSE register.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner March 10, 2024 14:38
@StephanTLavavej StephanTLavavej added the performance Must go faster label Mar 10, 2024
@StephanTLavavej StephanTLavavej self-assigned this Mar 10, 2024
@Alcaro
Copy link
Contributor

Alcaro commented Mar 19, 2024

As we discovered on Discord a while ago, some people call std::find_first_of with a single-element needle. This should be forwarded to std::find (or memchr or something), even if vectorization is otherwise disabled.

@AlexGuteniev
Copy link
Contributor Author

AlexGuteniev commented Mar 19, 2024

As we discovered on Discord a while ago, some people call std::find_first_of with a single-element needle. This should be forwarded to std::find (or memchr or something), even if vectorization is otherwise disabled.

I remember this, but it is so much unrelated, that I think it better fits a separate PR

stl/inc/algorithm Outdated Show resolved Hide resolved
tests/std/tests/VSO_0000000_vector_algorithms/test.cpp Outdated Show resolved Hide resolved
stl/inc/algorithm Outdated Show resolved Hide resolved
stl/inc/algorithm Outdated Show resolved Hide resolved
benchmarks/src/find_first_of.cpp Outdated Show resolved Hide resolved
stl/inc/algorithm Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/inc/algorithm Outdated Show resolved Hide resolved
@StephanTLavavej StephanTLavavej removed their assignment Mar 20, 2024
stl/inc/algorithm Outdated Show resolved Hide resolved
stl/inc/algorithm Outdated Show resolved Hide resolved
stl/inc/algorithm Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 9d761bd into microsoft:main Mar 21, 2024
35 checks passed
@StephanTLavavej
Copy link
Member

Thanks for vectorizing more algorithms! 🚀 ⏩ 🎉

@AlexGuteniev AlexGuteniev deleted the first_of branch March 21, 2024 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants