SSE2 vectorization for bitset::to_string #3960

Merged
merged 21 commits into from
Jan 30, 2024

AlexGuteniev

Resolves #3858

@Alcaro suggested the original approach: forming a mask with _mm_and_si128(_Vec4, _mm_set1_epi64x(0x0102040810204080)) and populating the low and high 8 bytes of an SSE vector with 2 source bytes via repeated _mm_unpacklo_epi8.
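
The approach described above can be sketched for a single 16-bit chunk as follows. This is an illustrative sketch only, not the code from this PR: the function names `bits16_to_string_sse2`, `bits16_to_string_scalar`, and `self_check` are hypothetical, and the byte swap plus the compare-and-select step are assumptions about how the masked vector could be turned into characters.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics
#include <string>

// Hypothetical helper (not the STL's internal function): converts 16 bits to
// 16 characters, most significant bit first, like bitset<16>::to_string.
std::string bits16_to_string_sse2(std::uint16_t bits, char zero = '0', char one = '1') {
    // Assumption: swap the two bytes so the high byte lands in lane 0,
    // because the string starts with the most significant bit.
    const int swapped = ((bits & 0xFF) << 8) | (bits >> 8);
    __m128i v = _mm_cvtsi32_si128(swapped);

    // Interleaving a vector with itself doubles every byte; three rounds
    // fill lanes 0..7 with the high byte and lanes 8..15 with the low byte.
    v = _mm_unpacklo_epi8(v, v);
    v = _mm_unpacklo_epi8(v, v);
    v = _mm_unpacklo_epi8(v, v);

    // Lane k of each 8-byte half keeps only bit (7 - k) of its source byte.
    const __m128i bit_mask = _mm_set1_epi64x(0x0102040810204080);
    const __m128i masked   = _mm_and_si128(v, bit_mask);

    // 0xFF in lanes whose bit is clear, 0x00 where it is set.
    const __m128i is_zero = _mm_cmpeq_epi8(masked, _mm_setzero_si128());

    // SSE2 has no byte blend, so select with and / andnot / or.
    const __m128i chars = _mm_or_si128(_mm_and_si128(is_zero, _mm_set1_epi8(zero)),
                                       _mm_andnot_si128(is_zero, _mm_set1_epi8(one)));

    char buf[16];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(buf), chars);
    return std::string(buf, 16);
}

// Scalar reference used only to cross-check the sketch.
std::string bits16_to_string_scalar(std::uint16_t bits, char zero = '0', char one = '1') {
    std::string s(16, zero);
    for (int i = 0; i < 16; ++i) {
        if (bits & (1u << (15 - i))) {
            s[i] = one;
        }
    }
    return s;
}

// Exhaustively compares both versions over all 16-bit inputs.
void self_check() {
    for (unsigned v = 0; v <= 0xFFFFu; ++v) {
        const auto bits = static_cast<std::uint16_t>(v);
        assert(bits16_to_string_sse2(bits) == bits16_to_string_scalar(bits));
    }
}
```

A real implementation would loop over the whole bitset and handle the tail that is not a multiple of 16 bits; the sketch leaves that out.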

Results

Without vectorization:

------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    281 ns          283 ns      2488889
BM_bitset_to_string<64, char>                   9675 ns         9835 ns        74667
BM_bitset_to_string_large_single<char>          9366 ns         9417 ns        74667
BM_bitset_to_string<7, wchar_t>                  210 ns          209 ns      3446154
BM_bitset_to_string<64, wchar_t>                5583 ns         5625 ns       100000
BM_bitset_to_string_large_single<wchar_t>       2689 ns         2679 ns       280000

With vectorization:

------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    378 ns          374 ns      2133333
BM_bitset_to_string<64, char>                   3593 ns         3641 ns       154483
BM_bitset_to_string_large_single<char>           339 ns          345 ns      2036364
BM_bitset_to_string<7, wchar_t>                  192 ns          192 ns      4072727
BM_bitset_to_string<64, wchar_t>                3845 ns         3767 ns       165926
BM_bitset_to_string_large_single<wchar_t>        613 ns          614 ns      1120000

(The differences in the 15, char and 7, wchar_t results are variation between runs, but the other results strongly indicate that vectorization is faster.)

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner August 13, 2023 06:29
@StephanTLavavej StephanTLavavej added the performance Must go faster label Aug 13, 2023
Stop pretending these are meaningful
@StephanTLavavej StephanTLavavej self-assigned this Aug 14, 2023
@StephanTLavavej StephanTLavavej removed their assignment Jan 27, 2024
@StephanTLavavej

Thanks! 😻 (And sorry for taking so long to review this. 🐌)

I pushed a merge with main, fixed a stealth merge conflict, and pushed some minor improvements, nothing affecting your core vectorized algorithms. I enhanced the test coverage by testing non-default characters 'o', 'x' and adding randomized testing of a bitset<2048>.
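
A randomized check of the kind mentioned above could look roughly like this. It is a sketch using only the standard library, not the test added to the PR; the names `naive_to_string`, `random_to_string_matches`, and `run_checks`, the seed, and the choice of `std::mt19937` are assumptions made for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Hypothetical reference implementation: one character per bit,
// most significant bit first.
template <std::size_t N>
std::string naive_to_string(const std::bitset<N>& b, char zero, char one) {
    std::string s(N, zero);
    for (std::size_t i = 0; i < N; ++i) {
        if (b[i]) {
            s[N - 1 - i] = one;
        }
    }
    return s;
}

// Fills a bitset<2048> with random bits and compares to_string against the
// naive reference, for both default and non-default characters.
bool random_to_string_matches(unsigned seed) {
    constexpr std::size_t N = 2048;
    std::mt19937 gen(seed);
    std::bernoulli_distribution coin(0.5);

    std::bitset<N> b;
    for (std::size_t i = 0; i < N; ++i) {
        b[i] = coin(gen);
    }

    return b.to_string() == naive_to_string(b, '0', '1')
        && b.to_string('o', 'x') == naive_to_string(b, 'o', 'x');
}

// Runs the comparison for a few arbitrary seeds.
void run_checks() {
    for (unsigned seed = 1; seed <= 8; ++seed) {
        assert(random_to_string_matches(seed));
    }
}
```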

@StephanTLavavej StephanTLavavej self-assigned this Jan 30, 2024
@StephanTLavavej

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 63e4c2f into microsoft:main Jan 30, 2024
35 checks passed
@StephanTLavavej

Thanks for optimizing this function! 🚀 🚀 🎉

@AlexGuteniev AlexGuteniev deleted the bitvector branch January 30, 2024 20:12
Development

Successfully merging this pull request may close these issues.

<bitset>: Investigate further performance improvements
3 participants