Vectorize `bitset` from string construction #4839

AlexGuteniev · 2024-07-13T13:59:51Z

📜 The approach

We need to:

Compare against '0' and '1' (or other characters) and report if there's neither of them
Reverse the order of characters
Convert byte mask or wchar_t to bit mask
Handle bitset and string odd length tails
Handle too short and too long string properly

It is implemented as follows:

We compare against '1' vector to get byte or wchar_t mask. We xor the input with '0' vector, and check if anything not masked in the resulting byte mask is zero. This way we both validate, and have the mask.
If we have longer string than bitset, we don't do further steps than byte or wchar_t mask, as we only need to validate,
To both reverse the order, and convert wchar_t mask to byte mask we use (v)pshufb (..._shuffle_epi8). The AVX2 version is not cross-lane, and additional cross-lane instruction is somewhat expensive, so we fix up that with bits, rather than now.
(v)pmovmskb is the instruction that does byte mask to bit mask conversion, it is emitted by ..._movemask_epi8 intrinsic.
Now we do rotations to fix up the missing AVX2 cross-lane reordering
Then put that to bitset. We assume padding, even if bitset length is smaller.
Repeat until string ends
Write zeros to remaining portion of bitset, if there's something left

⚠️ Padding assumption

To avoid unnecessary complication, and have a bit better performance, an assumption was made that bitset has at least a multiple of 2 padding on SSE4.2 code path, and at lest a multiple of 4 padding on AVX2 code path.

For large bitset we have units of uint64_t, and it is not reasonable to have unit less than native integer size, and AVX2 code path has the threshold of 256, since it is not efficient on smaller sizes.

For small bitset we have units of uint32_t, but there's a vNext comment to consider smaller units. But we have a threshold on the whole vectorization as 16 bits, so 8 or less bits bitset that could potentially be less that 2 bytes, would not vectorize.

⚖️ Threshold selection

According to my benchmarks, the vectorized approach benefits for 16 bits, but doesn't for 8 bits. I tried 12, and it seems in favor of non-vectorization. Let it be round up to 16 then.

📏 String length

There are overloads with implicit and explicit length.

For implicit length the non-vectorized implementation calls strlen/wcslen (via character traits).

Hypothetically, vectorized implementation could determine length during conversion. We don't do that here, because:

This cannot be done ASAN-friendly, see Remove attempt on unsized find vectorization... #4486
Even if it was possible, this adds complication to already not very simple algorithm

Unfortunately, this causes implicit length overload to benefit much less from vectorization.

⏱️ Benchmark results

Benchmark	main	this
explicit length, 1-byte chars
BM_bitset_from_string<length_type::char_count, 15, char>	1813 ns	1778 ns
BM_bitset_from_string<length_type::char_count, 16, char>	1794 ns	921 ns
BM_bitset_from_string<length_type::char_count, 36, char>	2230 ns	651 ns
BM_bitset_from_string<length_type::char_count, 64, char>	2075 ns	274 ns
BM_bitset_from_string<length_type::char_count, 512, char>	2026 ns	78.1 ns
BM_bitset_from_string<length_type::char_count, 2048, char>	2072 ns	65.7 ns
explicit length, 2-byte chars
BM_bitset_from_string<length_type::char_count, 15, wchar_t>	1777 ns	1774 ns
BM_bitset_from_string<length_type::char_count, 16, wchar_t>	1786 ns	979 ns
BM_bitset_from_string<length_type::char_count, 36, wchar_t>	2230 ns	741 ns
BM_bitset_from_string<length_type::char_count, 64, wchar_t>	1988 ns	349 ns
BM_bitset_from_string<length_type::char_count, 512, wchar_t>	1970 ns	189 ns
BM_bitset_from_string<length_type::char_count, 2048, wchar_t>	1912 ns	143 ns
implicit length, 1-byte chars
BM_bitset_from_string<length_type::null_term, 15, char>	2825 ns	2777 ns
BM_bitset_from_string<length_type::null_term, 16, char>	2118 ns	1439 ns
BM_bitset_from_string<length_type::null_term, 36, char>	2571 ns	1173 ns
BM_bitset_from_string<length_type::null_term, 64, char>	2710 ns	1018 ns
BM_bitset_from_string<length_type::null_term, 512, char>	2482 ns	606 ns
BM_bitset_from_string<length_type::null_term, 2048, char>	2407 ns	566 ns
implicit length, 2-byte chars
BM_bitset_from_string<length_type::null_term, 15, wchar_t>	2098 ns	2102 ns
BM_bitset_from_string<length_type::null_term, 16, wchar_t>	2515 ns	1613 ns
BM_bitset_from_string<length_type::null_term, 36, wchar_t>	2128 ns	1246 ns
BM_bitset_from_string<length_type::null_term, 64, wchar_t>	2739 ns	1055 ns
BM_bitset_from_string<length_type::null_term, 512, wchar_t>	2491 ns	719 ns
BM_bitset_from_string<length_type::null_term, 2048, wchar_t>	2407 ns	639 ns

This reverts commit 70374d9.

benchmarks/src/bitset_from_string.cpp

stl/src/vector_algorithms.cpp

stl/inc/bitset

tests/std/tests/VSO_0000000_vector_algorithms/test.cpp

stl/src/vector_algorithms.cpp

stl/inc/bitset

tests/std/tests/VSO_0000000_vector_algorithms/test.cpp

StephanTLavavej · 2024-10-08T13:36:16Z

Final results on my 5950X:

Benchmark	Before	After	Speedup
`BM_bitset_from_string<length_type::char_count, 15, char>`	1587 ns	1518 ns	1.05
`BM_bitset_from_string<length_type::char_count, 16, char>`	1592 ns	1046 ns	1.52
`BM_bitset_from_string<length_type::char_count, 36, char>`	1396 ns	675 ns	2.07
`BM_bitset_from_string<length_type::char_count, 64, char>`	1492 ns	301 ns	4.96
`BM_bitset_from_string<length_type::char_count, 512, char>`	1457 ns	69.9 ns	20.84
`BM_bitset_from_string<length_type::char_count, 2048, char>`	1438 ns	50.1 ns	28.70
`BM_bitset_from_string<length_type::char_count, 15, wchar_t>`	1493 ns	1635 ns	0.91
`BM_bitset_from_string<length_type::char_count, 16, wchar_t>`	1563 ns	1122 ns	1.39
`BM_bitset_from_string<length_type::char_count, 36, wchar_t>`	1433 ns	738 ns	1.94
`BM_bitset_from_string<length_type::char_count, 64, wchar_t>`	1495 ns	393 ns	3.80
`BM_bitset_from_string<length_type::char_count, 512, wchar_t>`	1436 ns	141 ns	10.18
`BM_bitset_from_string<length_type::char_count, 2048, wchar_t>`	1420 ns	124 ns	11.45
`BM_bitset_from_string<length_type::null_term, 15, char>`	2070 ns	1993 ns	1.04
`BM_bitset_from_string<length_type::null_term, 16, char>`	1974 ns	1512 ns	1.31
`BM_bitset_from_string<length_type::null_term, 36, char>`	1864 ns	1156 ns	1.61
`BM_bitset_from_string<length_type::null_term, 64, char>`	1981 ns	755 ns	2.62
`BM_bitset_from_string<length_type::null_term, 512, char>`	1917 ns	531 ns	3.61
`BM_bitset_from_string<length_type::null_term, 2048, char>`	1895 ns	483 ns	3.92
`BM_bitset_from_string<length_type::null_term, 15, wchar_t>`	2071 ns	2078 ns	1.00
`BM_bitset_from_string<length_type::null_term, 16, wchar_t>`	2041 ns	1589 ns	1.28
`BM_bitset_from_string<length_type::null_term, 36, wchar_t>`	1927 ns	1222 ns	1.58
`BM_bitset_from_string<length_type::null_term, 64, wchar_t>`	1958 ns	864 ns	2.27
`BM_bitset_from_string<length_type::null_term, 512, wchar_t>`	1932 ns	594 ns	3.25
`BM_bitset_from_string<length_type::null_term, 2048, wchar_t>`	1907 ns	555 ns	3.44

This is great and a single 0.91 scenario is nothing to worry about. 😻

AlexGuteniev · 2024-10-08T13:42:05Z

and a single 0.91 scenario is nothing to worry about.

This is one of the four 15 cases which are just below vectorization threshold. Just one of them has exactly 1.00 as expected

StephanTLavavej · 2024-10-11T10:17:06Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-10-12T03:48:01Z

Thanks ~~a bit~~ a lot!

😹 0️⃣ 1️⃣

AlexGuteniev added 12 commits July 12, 2024 20:15

test and benchmark

9f574ff

null hypothesis

4318dc5

use threshold

70374d9

assume bitset word size

551eef7

Revert "use threshold"

a38fe93

This reverts commit 70374d9.

SSE4.2 algorithm

2b64c6e

more cases

73097bc

use threshold, now separate one

afaa90d

AVX2 algorithm

2e3a816

mighty ARM

22e5bae

assumption-shmassumption

36108d1

more useful row

adfde23

AlexGuteniev requested a review from a team as a code owner July 13, 2024 13:59

AlexGuteniev added 15 commits July 13, 2024 17:23

fewer instructions for AVX2 wchar_t case

4eeb157

constexpr

c360e88

early return

9fe1973

pick correct range of too long string

89f9806

this check is unnecessary

bc993cc

TMP!

fce8222

clarify flow

692dd76

more TMP!

8c04d1a

mighty ARM again

db18ee2

fallback fixup

dc191d0

fallback move

93abc9e

array

b48b553

padding

3529d5c

common word increment

3d075c6

TMP loop

e63ab56

StephanTLavavej added the performance Must go faster label Jul 17, 2024

StephanTLavavej self-assigned this Jul 17, 2024

StephanTLavavej added 2 commits September 9, 2024 20:36

Consistent param order for random_digits_init().

467e01a

Drop unnecessary static.

1f48693

StephanTLavavej reviewed Sep 10, 2024

View reviewed changes

This comment was marked as resolved.

Sign in to view

StephanTLavavej and others added 2 commits September 11, 2024 20:41

Merge branch 'main' into bitstring

03b1fdd

FizzBuzz!

1bb6faf

This comment was marked as resolved.

Sign in to view

AlexGuteniev commented Sep 13, 2024

View reviewed changes

stl/inc/bitset Outdated Show resolved Hide resolved

AlexGuteniev and others added 11 commits September 20, 2024 07:40

Merge remote-tracking branch 'upstream/main' into bitstring

327e258

unicorns

dac65c3

Merge remote-tracking branch 'upstream/main' into bitstring

c9d19c0

Add const, drop std::.

238c8dd

Add const.

e738b17

Drop unnecessary _Mx.

e30ce6b

Mark _Dispatch as noexcept.

1578287

Improve comment grammar, add citation.

0e1fe07

Use string::npos and wstring::npos.

ac71a7e

Include <stdexcept> for invalid_argument.

7b54b0c

Add more test coverage for invalid_argument.

ec1e458

StephanTLavavej reviewed Oct 7, 2024

View reviewed changes

StephanTLavavej approved these changes Oct 7, 2024

View reviewed changes

StephanTLavavej removed their assignment Oct 7, 2024

StephanTLavavej mentioned this pull request Oct 7, 2024

Maintainer priorities #4700

Open

StephanTLavavej self-assigned this Oct 11, 2024

StephanTLavavej merged commit ab555ad into microsoft:main Oct 12, 2024
39 checks passed

AlexGuteniev deleted the bitstring branch October 12, 2024 04:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorize `bitset` from string construction #4839

Vectorize `bitset` from string construction #4839

AlexGuteniev commented Jul 13, 2024 •

edited

Loading

This comment was marked as resolved.

This comment was marked as resolved.

StephanTLavavej commented Oct 8, 2024

AlexGuteniev commented Oct 8, 2024

StephanTLavavej commented Oct 11, 2024

StephanTLavavej commented Oct 12, 2024

Vectorize bitset from string construction #4839

Vectorize bitset from string construction #4839

Conversation

AlexGuteniev commented Jul 13, 2024 • edited Loading

📜 The approach

⚠️ Padding assumption

⚖️ Threshold selection

📏 String length

⏱️ Benchmark results

This comment was marked as resolved.

This comment was marked as resolved.

StephanTLavavej commented Oct 8, 2024

AlexGuteniev commented Oct 8, 2024

StephanTLavavej commented Oct 11, 2024

StephanTLavavej commented Oct 12, 2024

😹 0️⃣ 1️⃣

Vectorize `bitset` from string construction #4839

Vectorize `bitset` from string construction #4839

AlexGuteniev commented Jul 13, 2024 •

edited

Loading