`remove` vectorization #4987

AlexGuteniev · 2024-09-28T14:36:24Z

📜 The algorithm

An in-place remove algorithm that uses the bit mask of vector comparison results as an index into a lookup table of shuffle controls, so that a single shuffle removes the matching elements.
The destination is advanced by a size that is also taken from a lookup table, although popcnt could have been used instead.

The details vary depending on element size:

The tables are limited to 256 entries, so that up to 8 elements are processed at once. Bigger tables don't fit the top-level caches well. Some tables are smaller, when fewer than 8 elements fit in the vector.
The 8-bit and 16-bit variants use SSE only, as 8 elements of 8 or 16 bits fit in an SSE register, and more elements would need a bigger table. Also, there is no cross-lane shuffle in SSE4.2 / AVX2 that works with elements of these sizes. They use the famous pshufb / _mm_shuffle_epi8 to remove elements.
The 32-bit and 64-bit variants use AVX2; the 64-bit variant needs a table of only 16 entries, as there are at most 4 elements. They use vpermd / _mm256_permutevar8x32_epi32, which is cross-lane, to remove elements. SSE fallbacks are used with smaller tables, and are still surprisingly more efficient than scalar code.
A contiguous mask of bits is needed, with a single bit per element rather than several bits for the same comparison. The 8-bit variant uses pmovmskb. The 32-bit and 64-bit variants use vmovmskps / vmovmskpd; although these are meant for floating-point types, they fit well and avoid the need for cross-lane swizzling to compress the mask. For 16-bit, packsswb is used, although pshufb could have been used as well.
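To make the table-driven idea concrete, here is a portable scalar model of it for 8-bit elements. It is only a sketch: the names `RemoveTables`, `make_tables`, and `remove_model` are hypothetical, the shuffle and the comparison mask are emulated with plain loops instead of pshufb and pmovmskb, and the choice to fill unused shuffle slots with the removed indices is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of the lookup tables, not the STL's actual layout.
struct RemoveTables {
    std::uint8_t shuf[256][8]; // per mask: source indices, kept elements packed first
    std::uint8_t size[256];    // per mask: number of kept elements
};

inline RemoveTables make_tables() {
    RemoveTables t{};
    for (unsigned mask = 0; mask != 256; ++mask) {
        unsigned out = 0;
        for (unsigned i = 0; i != 8; ++i) {
            if ((mask & (1u << i)) == 0) { // bit clear: element i is kept
                t.shuf[mask][out++] = static_cast<std::uint8_t>(i);
            }
        }
        t.size[mask] = static_cast<std::uint8_t>(out);
        // Assumption: unused slots hold the removed indices, making each entry a permutation.
        for (unsigned i = 0; i != 8; ++i) {
            if ((mask & (1u << i)) != 0) {
                t.shuf[mask][out++] = static_cast<std::uint8_t>(i);
            }
        }
    }
    return t;
}

inline std::uint8_t* remove_model(std::uint8_t* first, std::uint8_t* last, std::uint8_t val) {
    assert(first <= last);
    static const RemoveTables tables = make_tables();
    std::uint8_t* dest = first;
    while (last - first >= 8) {
        std::uint8_t block[8];
        unsigned mask = 0;
        for (unsigned i = 0; i != 8; ++i) { // stands in for a vector compare plus mask extraction
            block[i] = first[i];
            if (first[i] == val) {
                mask |= 1u << i;
            }
        }
        for (unsigned i = 0; i != 8; ++i) { // stands in for the shuffle plus a whole-vector store
            dest[i] = block[tables.shuf[mask][i]];
        }
        dest += tables.size[mask]; // advance only by the number of kept elements
        first += 8;
    }
    for (; first != last; ++first) { // scalar tail
        if (*first != val) {
            *dest++ = *first;
        }
    }
    return dest;
}
```

Because `dest` never gets ahead of `first`, the 8-element store always lands on data that has already been read. Note that the model writes all 8 slots even when fewer elements are kept, which is the same whole-vector store behavior the PR description discusses under superfluous writes.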

🔍 Find first!

Before even starting, find is performed to locate the first element to be removed. This is done for correctness, and there are also performance reasons why it is good:

Correctness. [algorithms.requirements]/3 states: "For purposes of determining the existence of data races, algorithms shall not modify objects referenced through an iterator argument unless the specification requires such modification." Whereas [alg.remove] is vague on how the algorithm should work, I think we should only write to elements that have to be written to.
Vectorization. We can always use the full AVX2 vector size as the step, not only for 32 and 64 bit elements.
Memory bandwidth. The vectorized algorithm might be memory bound, so saving writes may make it faster.
Number of operations. Fewer operations are needed to just test the content.

The existing find implementation is called. Hypothetically I could implement it inline and save some instructions in some cases, but such an optimization has a negligible effect on performance while noticeably increasing complexity. This might be revisited for a future remove_copy, if it and remove would share the implementation.
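The structure described in this section can be sketched in scalar form as follows. `remove_find_first` is a hypothetical helper written for illustration, not the code in this PR; it only shows that the read-only find happens first and that nothing is written when there is nothing to remove.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: models the "find first, then compact" structure only.
template <class T>
typename std::vector<T>::iterator remove_find_first(std::vector<T>& v, const T& val) {
    auto first = std::find(v.begin(), v.end(), val); // read-only scan
    if (first == v.end()) {
        return first; // nothing to remove, so no element is written
    }
    auto dest = first; // first position that actually has to change
    for (++first; first != v.end(); ++first) {
        if (!(*first == val)) {
            *dest = std::move(*first);
            ++dest;
        }
    }
    assert(dest != v.end()); // at least one element was removed
    return dest;
}
```

With this shape, elements before the first match are never written, which is what the data race wording quoted above asks for.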

⚠️ Correctness doubt - superfluous writes

The algorithm removes elements from the source vector (of 8 or fewer elements) by a shuffle operation, so that the non-removed elements are placed contiguously in that vector. Then it writes the whole vector to the destination, and advances the destination pointer by the number of non-removed elements.

As a result:

In the remaining range, some elements are overwritten with intermediate values before they are overwritten with the expected values.
In the removed range, some elements (up to 8 of them) are overwritten with values of other elements, and never restored.

I have no doubts that overwriting elements in the resulting range with some intermediate values before setting them to the expected values is correct. The write and the data race (in abstract machine terms) exist anyway, so the extra write is not observable.

I have concerns regarding damaging the removed range. Changing these values is observable.

I'd appeal to the fact that elements in the removed range are left in a valid-but-unspecified state, although I understand that the purpose of the standard saying so is to enable moving non-trivially-copyable elements, not to permit what I did.

Note that:

It is possible to avoid any superfluous writes, but it will have some cost.
The cost of avoiding superfluous writes is small for 32 and 64 bit elements, and larger for 8 and 16 bit elements.
When/if remove_copy is vectorized in a similar way, it will have to avoid superfluous writes anyway.
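From the caller's side, the concern above only matters to code that reads the removed range. A minimal sketch of the usual pattern, which relies only on the prefix up to the returned iterator (`remove_and_erase` is a hypothetical helper name):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: the classic erase-remove idiom.
// Only [begin, new_end) is relied upon; the tail is erased without being read.
inline std::vector<int> remove_and_erase(std::vector<int> v, int val) {
    auto new_end = std::remove(v.begin(), v.end(), val);
    v.erase(new_end, v.end()); // discard the valid-but-unspecified tail
    return v;
}
```

Code written this way cannot observe what values the algorithm leaves in the removed range.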

🗄️ Memory usage

Unlike most other vectorization algorithms, this one uses large lookup tables. 8 and 32 bit variants use 2 KiB table, 16 bit variant uses 4 KiB table.

This has different performance characteristics compared to purely computational optimizations. In particular, it tends to behave worse in programs that don't fit the cache well on their critical path. This doesn't apply to benchmarks, but unfortunately it often applies to realistic programs, especially ones that were not written with performance in mind.

I believe that the optimization is still good or at least not bad most of the time where it is needed.
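As a quick sanity check of the sizes quoted above: dividing the stated table sizes by the 256 entries mentioned earlier gives 8 and 16 bytes per entry. The sketch below spells out that arithmetic; `table_bytes` is a hypothetical helper, and reading the per-entry size as one byte-sized index per shuffle slot is an assumption, not something stated in this PR description.

```cpp
#include <cstddef>

// Hypothetical helper: size of a shuffle table with a fixed number of bytes per entry.
constexpr std::size_t table_bytes(std::size_t entries, std::size_t bytes_per_entry) {
    return entries * bytes_per_entry;
}

// Assumption: 8 one-byte indices per entry for the 8 and 32 bit variants.
static_assert(table_bytes(256, 8) == 2 * 1024, "2 KiB");
// Assumption: 16 byte indices per entry for the 16 bit variant (8 elements x 2 bytes).
static_assert(table_bytes(256, 16) == 4 * 1024, "4 KiB");
```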

⏱️ Benchmark results

| Benchmark | main | this |
|---|---|---|
| `r<alg_type::std_fn, std::uint8_t>` | 944 ns | 294 ns |
| `r<alg_type::std_fn, std::uint16_t>` | 1470 ns | 297 ns |
| `r<alg_type::std_fn, std::uint32_t>` | 1059 ns | 403 ns |
| `r<alg_type::std_fn, std::uint64_t>` | 1498 ns | 884 ns |
| `r<alg_type::rng, std::uint8_t>` | 1208 ns | 307 ns |
| `r<alg_type::rng, std::uint16_t>` | 1386 ns | 288 ns |
| `r<alg_type::rng, std::uint32_t>` | 1218 ns | 397 ns |
| `r<alg_type::rng, std::uint64_t>` | 1411 ns | 842 ns |

CaseyCarter · 2024-09-28T17:16:26Z

Irony: a PR that adds vectorization entitled "remove vectorization".

AlexGuteniev · 2024-10-04T12:13:24Z

Modern AMD data would be interesting.
To avoid tricky swizzling, I added mixing of integers and floats in 39974d1. The overall results are better for me, but this mixing seems to carry more of a penalty on AMD than on Intel.

stl/src/vector_algorithms.cpp

AlexGuteniev · 2024-10-06T10:04:51Z

remove_copy, if it succeeds, is going to be very similar, but it would always use the AVX2 mask, and so would be available only for 32 and 64 bit elements. I can try this within the same PR, although it seems big already.

I'm now seeing multiple ways to do remove_copy for 8 and 16 bit elements.

stl/src/vector_algorithms.cpp

stl/inc/algorithm

stl/inc/xmemory

tests/std/tests/VSO_0000000_vector_algorithms/test.cpp

StephanTLavavej · 2024-10-23T14:49:09Z

Thanks for the detailed writeup! I have no correctness concerns here - the Standard has a note that the garbage values are valid-but-unspecified, so we're fully within our rights to leave totally random values in there, even if the range originally contained (for example) only 10 and 20.

"If you want partition, you know where to find it."

search.cpp can construct `std::string` directly from `lorem_ipsum`. sv_equal.cpp can use `lorem_ipsum.substr(0, 2048)`.

…und a value to remove. Move the vectorized codepath below the existing call to `_RANGES _Find_unchecked`. Drop `_Could_compare_equal_to_value_type`, as `_Find_unchecked` has handled that and found a value. Finally, `__std_remove_N` doesn't need to start with `__std_find_trivial_N`.

benchmarks/src/sv_equal.cpp

benchmarks/src/remove.cpp

stl/inc/xmemory

stl/inc/algorithm

stl/inc/xmemory

tests/std/tests/VSO_0000000_vector_algorithms/test.cpp

stl/src/vector_algorithms.cpp

StephanTLavavej · 2024-10-23T17:21:20Z

Thanks! 😻 Pushed changes as usual, the most significant being to make the <algorithm>/<xmemory> layer responsible for performing the initial find. Good results on my 5950X:

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `r<alg_type::std_fn, std::uint8_t>` | 1291 ns | 360 ns | 3.59 |
| `r<alg_type::std_fn, std::uint16_t>` | 1291 ns | 338 ns | 3.82 |
| `r<alg_type::std_fn, std::uint32_t>` | 1285 ns | 491 ns | 2.62 |
| `r<alg_type::std_fn, std::uint64_t>` | 1504 ns | 1151 ns | 1.31 |
| `r<alg_type::rng, std::uint8_t>` | 1317 ns | 355 ns | 3.71 |
| `r<alg_type::rng, std::uint16_t>` | 1250 ns | 330 ns | 3.79 |
| `r<alg_type::rng, std::uint32_t>` | 1330 ns | 489 ns | 2.72 |
| `r<alg_type::rng, std::uint64_t>` | 2090 ns | 1082 ns | 1.93 |

StephanTLavavej · 2024-10-23T19:09:30Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-10-24T15:00:31Z

I've pushed a merge with main to resolve merge conflicts in search.cpp (structured as one commit to replicate the stupid thing I did in MSVC, followed by a proper fix, so I can mirror the latter).

Now, this directly constructs std::string from the std::string_views src_haystack and src_needle, and constructs std::vector from their .begin() and .end(). These are stylistic improvements to reduce verbosity.
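The stylistic change described above might look roughly like this. The parameter name `src_haystack` comes from the comment; the wrapper functions `make_string` and `make_vector` are hypothetical and exist only to keep the sketch self-contained.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical wrappers illustrating direct construction from a std::string_view.
inline std::string make_string(std::string_view src_haystack) {
    return std::string(src_haystack); // explicit string_view constructor, no data()/size() pair
}

inline std::vector<char> make_vector(std::string_view src_haystack) {
    return std::vector<char>(src_haystack.begin(), src_haystack.end()); // iterator-range constructor
}
```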

StephanTLavavej · 2024-10-24T15:38:21Z

Thanks for removing time when users call remove! 😹 🤪 ⏱️

remove vectorization

c2c21e1

AlexGuteniev requested a review from a team as a code owner September 28, 2024 14:36

ADL

d0f2e68

CaseyCarter added the performance Must go faster label Sep 28, 2024

AlexGuteniev added 7 commits September 28, 2024 20:57

compress 1-byte data

d6b27af

compact also 4 and 8 tables

44a9278

reduce copypasta

18c0b7c

-leftovers

2b068f6

wrong comparison!

d99ff0c

reduce copypasta even more

d0e5938

bingo consistency

7972780

StephanTLavavej self-assigned this Sep 28, 2024

This comment was marked as outdated.

AlexGuteniev added 3 commits October 2, 2024 19:32

vzeroupper

4a7d60b

mask like floats

39974d1

also remove shuffle from here

3e9109a

AlexGuteniev added 3 commits October 5, 2024 09:47

elaborate comments on the complex part of obtaining the tables

c8ec13b

doesn't matter, but it is unsigned

63cb681

what did I say

61a97c6

AlexGuteniev commented Oct 5, 2024

stl/src/vector_algorithms.cpp

AlexGuteniev added 4 commits October 6, 2024 18:21

no strict

167c276

Still something on SSE4.2 for 32 and 64 bit eleemnts

e6dc56d

Not patterns just tables

c48d887

Merge branch 'microsoft:main' into remove

85d69d6

StephanTLavavej requested changes Oct 22, 2024

View reviewed changes

StephanTLavavej removed their assignment Oct 22, 2024

AlexGuteniev requested a review from StephanTLavavej October 23, 2024 05:11

StephanTLavavej self-assigned this Oct 23, 2024

StephanTLavavej added 10 commits October 23, 2024 08:01

Use string_view for lorem_ipsum.

55e632f

search.cpp can construct `std::string` directly from `lorem_ipsum`. sv_equal.cpp can use `lorem_ipsum.substr(0, 2048)`.

Add const.

dd093d0

Add noexcept to _Meow_vectorized.

2e704e3

In C++20, use to_address.

400d778

Drop unnecessary static_casts.

93b9222

Fix argument order.

1722ec4

Comment cleanups. Drop the bit about "surprising behavior".

0993532

Give _Remove_tables a name.

afc9543

Scope _Size_bytes within if in __std_remove_4.

2dd0b41

StephanTLavavej reviewed Oct 23, 2024

View reviewed changes

StephanTLavavej approved these changes Oct 23, 2024

View reviewed changes

StephanTLavavej removed their assignment Oct 23, 2024

StephanTLavavej mentioned this pull request Oct 23, 2024

Maintainer priorities #4700

Open

StephanTLavavej self-assigned this Oct 23, 2024

StephanTLavavej added 2 commits October 24, 2024 07:49

Merge branch 'main' into remove (STL's bad MSVC-PR)

4fc12f2

Fix the merge conflicts properly.

4a68a28

StephanTLavavej approved these changes Oct 24, 2024

View reviewed changes

StephanTLavavej merged commit 742c328 into microsoft:main Oct 24, 2024
39 checks passed

AlexGuteniev deleted the remove branch October 24, 2024 15:38

StephanTLavavej mentioned this pull request Oct 26, 2024

Initially use [[msvc::no_unique_address]] for some C++23 components #4960

Draft

AlexGuteniev mentioned this pull request Nov 1, 2024

Vectorize remove_copy for 4 and 8 byte elements #5062

Closed

AlexGuteniev mentioned this pull request Nov 16, 2024

Vectorize unique #5092

Open
