remove vectorization #4987
Conversation
Irony: a PR that adds vectorization entitled "remove vectorization".
Modern AMD data would be interesting.
I'm now seeing multiple solutions for how to do …
Thanks for the detailed writeup! I have no correctness concerns here - the Standard has a note that the garbage values are valid-but-unspecified, so we're fully within our rights to leave totally random values in there, even if the range originally contained (for example) only …
search.cpp can construct `std::string` directly from `lorem_ipsum`. sv_equal.cpp can use `lorem_ipsum.substr(0, 2048)`.
…und a value to remove. Move the vectorized codepath below the existing call to `_RANGES _Find_unchecked`. Drop `_Could_compare_equal_to_value_type`, as `_Find_unchecked` has handled that and found a value. Finally, `__std_remove_N` doesn't need to start with `__std_find_trivial_N`.
Thanks! 😻 Pushed changes as usual, the most significant making the
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
I've pushed a merge with … Now, this directly constructs …
Thanks for removing … time when users call …
📜 The algorithm
In-place `remove` algorithm that uses the bit mask of vector comparisons as a shuffle index to remove certain elements. The destination is advanced by a size taken from a lookup table too, although `popcnt` could have been used. The details vary depending on element size:
- 8-bit and 16-bit elements use `pshufb`/`_mm_shuffle_epi8` to remove elements.
- 32-bit and 64-bit elements use `vpermd`/`_mm256_permutevar8x32_epi32` to remove elements, which is cross-lane. SSE fallbacks are used with smaller tables, still surprisingly more efficient than scalar.
- To compress the comparison mask, 8-bit elements use `pmovmskb`. 32-bit and 64-bit use `vmovmskps`/`vmovmskpd`; though they are for floating types, they fit well and avoid the need for cross-lane swizzling to compress the mask. For 16-bit, `packsswb` is used, although `pshufb` could have been used as well.

🔍 Find first!
Before even starting, a find is performed to locate the first element to remove. This is done for correctness, and there are also performance reasons why it is good.
The existing `find` implementation is called. Hypothetically I could implement it inline and save some instructions in some cases, but such an optimization has a negligible effect on performance while increasing complexity noticeably. Though this might be revised for a future `remove_copy`, if that and this would share the implementation.

The algorithm removes elements from the source vector (of 8 or fewer elements) by a shuffle operation, so that non-removed elements are placed contiguously in that vector. Then it writes the whole vector to the destination, and advances the destination pointer by the number of non-removed elements.
As a result:
I have no doubts that overwriting elements in the resulting range with some intermediate values before setting them to the expected values is correct. The write and the data race (in abstract machine terms) exist anyway, so the extra write is not observable.
I have concerns regarding damaging the removed range. Changing these values is observable.
I'd appeal to the fact that elements in the removed range stay in a valid-but-unspecified state, although I understand that the purpose of the Standard saying that is to enable moving of non-trivially-copyable types, not to do what I did.
Note that `remove_copy` implemented in a similar way would have to avoid superfluous writes anyway.

🗄️ Memory usage
Unlike most other vectorized algorithms, this one uses large lookup tables: the 8-bit and 32-bit variants use a 2 KiB table, the 16-bit variant uses a 4 KiB table.
This has different performance characteristics compared to pure-computational optimizations. In particular, it tends to behave worse in programs whose critical path doesn't fit the cache well. This doesn't apply to benchmarks, but unfortunately it often applies to realistic programs, especially ones not written with performance in mind.
I believe that the optimization is still good or at least not bad most of the time where it is needed.
⏱️ Benchmark results