Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize bitset from string construction #4839

Merged
merged 50 commits into from
Oct 12, 2024

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Jul 13, 2024

📜 The approach

We need to:

  • Compare against '0' and '1' (or other characters) and report if there's neither of them
  • Reverse the order of characters
  • Convert byte mask or wchar_t to bit mask
  • Handle bitset and string odd length tails
  • Handle too short and too long string properly

It is implemented as follows:

  • We compare against '1' vector to get byte or wchar_t mask. We xor the input with '0' vector, and check if anything not masked in the resulting byte mask is zero. This way we both validate, and have the mask.
  • If we have longer string than bitset, we don't do further steps than byte or wchar_t mask, as we only need to validate,
  • To both reverse the order, and convert wchar_t mask to byte mask we use (v)pshufb (..._shuffle_epi8). The AVX2 version is not cross-lane, and additional cross-lane instruction is somewhat expensive, so we fix up that with bits, rather than now.
  • (v)pmovmskb is the instruction that does byte mask to bit mask conversion, it is emitted by ..._movemask_epi8 intrinsic.
  • Now we do rotations to fix up the missing AVX2 cross-lane reordering
  • Then put that to bitset. We assume padding, even if bitset length is smaller.
  • Repeat until string ends
  • Write zeros to remaining portion of bitset, if there's something left

⚠️ Padding assumption

To avoid unnecessary complication, and have a bit better performance, an assumption was made that bitset has at least a multiple of 2 padding on SSE4.2 code path, and at lest a multiple of 4 padding on AVX2 code path.

For large bitset we have units of uint64_t, and it is not reasonable to have unit less than native integer size, and AVX2 code path has the threshold of 256, since it is not efficient on smaller sizes.

For small bitset we have units of uint32_t, but there's a vNext comment to consider smaller units. But we have a threshold on the whole vectorization as 16 bits, so 8 or less bits bitset that could potentially be less that 2 bytes, would not vectorize.

⚖️ Threshold selection

According to my benchmarks, the vectorized approach benefits for 16 bits, but doesn't for 8 bits. I tried 12, and it seems in favor of non-vectorization. Let it be round up to 16 then.

📏 String length

There are overloads with implicit and explicit length.

For implicit length the non-vectorized implementation calls strlen/wcslen (via character traits).

Hypothetically, vectorized implementation could determine length during conversion. We don't do that here, because:

Unfortunately, this causes implicit length overload to benefit much less from vectorization.

⏱️ Benchmark results

Benchmark main this
explicit length, 1-byte chars
BM_bitset_from_string<length_type::char_count, 15, char> 1813 ns 1778 ns
BM_bitset_from_string<length_type::char_count, 16, char> 1794 ns 921 ns
BM_bitset_from_string<length_type::char_count, 36, char> 2230 ns 651 ns
BM_bitset_from_string<length_type::char_count, 64, char> 2075 ns 274 ns
BM_bitset_from_string<length_type::char_count, 512, char> 2026 ns 78.1 ns
BM_bitset_from_string<length_type::char_count, 2048, char> 2072 ns 65.7 ns
explicit length, 2-byte chars
BM_bitset_from_string<length_type::char_count, 15, wchar_t> 1777 ns 1774 ns
BM_bitset_from_string<length_type::char_count, 16, wchar_t> 1786 ns 979 ns
BM_bitset_from_string<length_type::char_count, 36, wchar_t> 2230 ns 741 ns
BM_bitset_from_string<length_type::char_count, 64, wchar_t> 1988 ns 349 ns
BM_bitset_from_string<length_type::char_count, 512, wchar_t> 1970 ns 189 ns
BM_bitset_from_string<length_type::char_count, 2048, wchar_t> 1912 ns 143 ns
implicit length, 1-byte chars
BM_bitset_from_string<length_type::null_term, 15, char> 2825 ns 2777 ns
BM_bitset_from_string<length_type::null_term, 16, char> 2118 ns 1439 ns
BM_bitset_from_string<length_type::null_term, 36, char> 2571 ns 1173 ns
BM_bitset_from_string<length_type::null_term, 64, char> 2710 ns 1018 ns
BM_bitset_from_string<length_type::null_term, 512, char> 2482 ns 606 ns
BM_bitset_from_string<length_type::null_term, 2048, char> 2407 ns 566 ns
implicit length, 2-byte chars
BM_bitset_from_string<length_type::null_term, 15, wchar_t> 2098 ns 2102 ns
BM_bitset_from_string<length_type::null_term, 16, wchar_t> 2515 ns 1613 ns
BM_bitset_from_string<length_type::null_term, 36, wchar_t> 2128 ns 1246 ns
BM_bitset_from_string<length_type::null_term, 64, wchar_t> 2739 ns 1055 ns
BM_bitset_from_string<length_type::null_term, 512, wchar_t> 2491 ns 719 ns
BM_bitset_from_string<length_type::null_term, 2048, wchar_t> 2407 ns 639 ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner July 13, 2024 13:59
@StephanTLavavej StephanTLavavej added the performance Must go faster label Jul 17, 2024
@StephanTLavavej StephanTLavavej self-assigned this Jul 17, 2024
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
benchmarks/src/bitset_from_string.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
@StephanTLavavej

This comment was marked as resolved.

@StephanTLavavej

This comment was marked as resolved.

stl/inc/bitset Outdated Show resolved Hide resolved
tests/std/tests/VSO_0000000_vector_algorithms/test.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/inc/bitset Outdated Show resolved Hide resolved
tests/std/tests/VSO_0000000_vector_algorithms/test.cpp Outdated Show resolved Hide resolved
@StephanTLavavej StephanTLavavej removed their assignment Oct 7, 2024
@StephanTLavavej
Copy link
Member

Final results on my 5950X:

Benchmark Before After Speedup
BM_bitset_from_string<length_type::char_count, 15, char> 1587 ns 1518 ns 1.05
BM_bitset_from_string<length_type::char_count, 16, char> 1592 ns 1046 ns 1.52
BM_bitset_from_string<length_type::char_count, 36, char> 1396 ns 675 ns 2.07
BM_bitset_from_string<length_type::char_count, 64, char> 1492 ns 301 ns 4.96
BM_bitset_from_string<length_type::char_count, 512, char> 1457 ns 69.9 ns 20.84
BM_bitset_from_string<length_type::char_count, 2048, char> 1438 ns 50.1 ns 28.70
BM_bitset_from_string<length_type::char_count, 15, wchar_t> 1493 ns 1635 ns 0.91
BM_bitset_from_string<length_type::char_count, 16, wchar_t> 1563 ns 1122 ns 1.39
BM_bitset_from_string<length_type::char_count, 36, wchar_t> 1433 ns 738 ns 1.94
BM_bitset_from_string<length_type::char_count, 64, wchar_t> 1495 ns 393 ns 3.80
BM_bitset_from_string<length_type::char_count, 512, wchar_t> 1436 ns 141 ns 10.18
BM_bitset_from_string<length_type::char_count, 2048, wchar_t> 1420 ns 124 ns 11.45
BM_bitset_from_string<length_type::null_term, 15, char> 2070 ns 1993 ns 1.04
BM_bitset_from_string<length_type::null_term, 16, char> 1974 ns 1512 ns 1.31
BM_bitset_from_string<length_type::null_term, 36, char> 1864 ns 1156 ns 1.61
BM_bitset_from_string<length_type::null_term, 64, char> 1981 ns 755 ns 2.62
BM_bitset_from_string<length_type::null_term, 512, char> 1917 ns 531 ns 3.61
BM_bitset_from_string<length_type::null_term, 2048, char> 1895 ns 483 ns 3.92
BM_bitset_from_string<length_type::null_term, 15, wchar_t> 2071 ns 2078 ns 1.00
BM_bitset_from_string<length_type::null_term, 16, wchar_t> 2041 ns 1589 ns 1.28
BM_bitset_from_string<length_type::null_term, 36, wchar_t> 1927 ns 1222 ns 1.58
BM_bitset_from_string<length_type::null_term, 64, wchar_t> 1958 ns 864 ns 2.27
BM_bitset_from_string<length_type::null_term, 512, wchar_t> 1932 ns 594 ns 3.25
BM_bitset_from_string<length_type::null_term, 2048, wchar_t> 1907 ns 555 ns 3.44

This is great and a single 0.91 scenario is nothing to worry about. 😻

@AlexGuteniev
Copy link
Contributor Author

and a single 0.91 scenario is nothing to worry about.

This is one of the four 15 cases which are just below vectorization threshold. Just one of them has exactly 1.00 as expected

@StephanTLavavej StephanTLavavej self-assigned this Oct 11, 2024
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit ab555ad into microsoft:main Oct 12, 2024
39 checks passed
@StephanTLavavej
Copy link
Member

Thanks a bit a lot!

😹 0️⃣ 1️⃣

@AlexGuteniev AlexGuteniev deleted the bitstring branch October 12, 2024 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants