Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use string::resize_and_overwrite in bitset to avoid buffer initialization #3904

Merged
merged 3 commits into from
Aug 11, 2023

Conversation

AlexGuteniev
Copy link
Contributor

Towards #3858.

  • @AlexGuteniev suggested using basic_string::resize_and_overwrite, creating an _Ugly version for unconditional use internally. (We generally avoid adding totally novel secret machinery to Standard classes as it can become a maintenance burden, but simply having _Ugly names to access Future Technology is easy and common.)

This is what was addressed

  • @AlexGuteniev also suggested investigating whether branchless codegen for assigning _Elem0 vs. _Elem1 would be superior. (Note: we cannot assume that their values are consecutive, they could be 'M' and 'E'.)

This is confirmed to be non applicable. Before the change it was cmovne to set '1'. After it became setne al to set to 0 or 1, then add al, '0'. Both are branchless. I think, the compiler will be able to make efficient branchless code for non-consecutive values by going back to cmovcc

Benchmark is already there - thanks @achabense!

Benchmark results before:

------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    922 ns          924 ns       896000
BM_bitset_to_string<64, char>                 125791 ns       125558 ns         5600
BM_bitset_to_string_large_single<char>          2449 ns         2431 ns       263529
BM_bitset_to_string<7, wchar_t>                  539 ns          547 ns      1000000
BM_bitset_to_string<64, wchar_t>                4399 ns         4325 ns       144516
BM_bitset_to_string_large_single<wchar_t>       4792 ns         4708 ns       149333

after

------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    235 ns          234 ns      3200000
BM_bitset_to_string<64, char>                 119085 ns       119978 ns         5600
BM_bitset_to_string_large_single<char>          2111 ns         2100 ns       320000
BM_bitset_to_string<7, wchar_t>                  153 ns          153 ns      4480000
BM_bitset_to_string<64, wchar_t>                4106 ns         4081 ns       172308
BM_bitset_to_string_large_single<wchar_t>       2137 ns         2148 ns       320000

The anomality of BM_bitset_to_string<64, char> is caused by too angry loop unrolling, and, IMO, should be fixed in compiler, if needs fixing.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner July 25, 2023 12:06
@StephanTLavavej StephanTLavavej added the performance Must go faster label Jul 25, 2023
stl/inc/xstring Outdated Show resolved Hide resolved
if (_Subscript(_Bits - 1 - _Pos)) {
_Str[_Pos] = _Elem1;
basic_string<_Elem, _Tr, _Alloc> _Str;
_Str._Resize_and_overwrite(_Bits, [this, _Elem0, _Elem1](_Elem* _Buf, size_t _Len) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change requested: I observed that we could ignore _Len and take advantage of our knowledge that it will always be the compile-time constant _Bits, which is available without capturing. I checked the codegen, and this indeed allows the compiler to bake in the constant. However, while benchmarking showed some improvement for BM_bitset_to_string<64, char>, it showed significant pessimization for BM_bitset_to_string<15, char> .

BEFORE USING _Bits
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    140 ns          144 ns      4977778
BM_bitset_to_string<64, char>                   3336 ns         3348 ns       224000
BM_bitset_to_string_large_single<char>          2008 ns         2040 ns       344615
BM_bitset_to_string<7, wchar_t>                 88.3 ns         87.9 ns      7466667
BM_bitset_to_string<64, wchar_t>                2211 ns         2197 ns       298667
BM_bitset_to_string_large_single<wchar_t>       1451 ns         1430 ns       448000

AFTER USING _Bits
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    640 ns          642 ns      1120000
BM_bitset_to_string<64, char>                   2646 ns         2609 ns       263529
BM_bitset_to_string_large_single<char>          2010 ns         2040 ns       344615
BM_bitset_to_string<7, wchar_t>                 82.3 ns         82.0 ns      8960000
BM_bitset_to_string<64, wchar_t>                2407 ns         2441 ns       320000
BM_bitset_to_string_large_single<wchar_t>       1372 ns         1360 ns       448000

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think despite some pessimization of the certain case, it is still the right thing to do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, actually the pessimization is too much. I think, angry loop unrolling strikes here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it as a parameter for now then

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As _Bits is the natural decision but we have to use _Len for performance reasons, I think we need to add a comment for it.

Copy link
Contributor

@achabense achabense Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how to compare benchmark results before and after change in the same run in a convenient way? I'm doing this in a very awkward way.

Benchmark
#include <array>
#include <benchmark/benchmark.h>
#include <bit>
#include <bitset>

#include <climits>
#include <cstddef>
#include <cstdint>
#include <random>
#include <iostream>

using namespace std;

struct impl_old {
    template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>>
    static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs,
        const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) {
        basic_string<_Elem, _Tr, _Alloc> _Str(_Bits, _Elem0);
        for (size_t _Pos = 0; _Pos < _Bits; ++_Pos) {
            if (bs[_Bits - 1 - _Pos]) {
                _Str[_Pos] = _Elem1;
            }
        }
        return _Str;
    }
};

struct impl_new_Bits {
    template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>>
    static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs,
        const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) {
        basic_string<_Elem, _Tr, _Alloc> _Str;
        _Str.resize_and_overwrite(_Bits, [&bs, _Elem0, _Elem1](_Elem* _Buf, size_t) {
            for (size_t _Pos = 0; _Pos < _Bits; ++_Pos) {
                _Buf[_Pos] = bs[_Bits - 1 - _Pos] ? _Elem1 : _Elem0;
            }
            return _Bits;
            });
        return _Str;
    }
};

struct impl_new_Len {
    template <class _Elem = char, size_t _Bits, class _Tr = char_traits<_Elem>, class _Alloc = allocator<_Elem>>
    static basic_string<_Elem, _Tr, _Alloc> to_string(const std::bitset<_Bits>& bs,
        const _Elem _Elem0 = static_cast<_Elem>('0'), const _Elem _Elem1 = static_cast<_Elem>('1')) {
        basic_string<_Elem, _Tr, _Alloc> _Str;
        _Str.resize_and_overwrite(_Bits, [&bs, _Elem0, _Elem1](_Elem* _Buf, size_t _Len) {
            for (size_t _Pos = 0; _Pos < _Len; ++_Pos) {
                _Buf[_Pos] = bs[_Len - 1 - _Pos] ? _Elem1 : _Elem0;
            }
            return _Len;
            });
        return _Str;
    }
};

namespace {
    const auto random_bits = [] {
        mt19937_64 rnd{};
        array<uint64_t, 32> arr;
        for (auto& d : arr) {
            d = rnd();
        }
        return arr;
        }();

        template <class op, size_t N, class charT>
        void BM_bitset_to_string(benchmark::State& state) {
            for (auto _ : state) {
                for (const auto& bits : random_bits) {
                    bitset<N> bs{ bits };
                    benchmark::DoNotOptimize(op::template to_string<charT>(bs));
                }
            }
        }

        template <class op, class charT>
        void BM_bitset_to_string_large_single(benchmark::State& state) {
            const auto large_bitset = bit_cast<bitset<CHAR_BIT * sizeof(random_bits)>>(random_bits);
            for (auto _ : state) {
                benchmark::DoNotOptimize(op::template to_string<charT>(large_bitset));
            }
        }
} // namespace

#define MAKE_BENCHMARKS(impl)\
BENCHMARK(BM_bitset_to_string<impl, 15, char>)->Name("\nBM_bitset_to_string<" #impl ", 15, char>");\
BENCHMARK(BM_bitset_to_string<impl, 64, char>);\
BENCHMARK(BM_bitset_to_string_large_single<impl, char>);\
BENCHMARK(BM_bitset_to_string<impl, 7, wchar_t>);\
BENCHMARK(BM_bitset_to_string<impl, 64, wchar_t>);\
BENCHMARK(BM_bitset_to_string_large_single<impl, wchar_t>);

MAKE_BENCHMARKS(impl_old);
MAKE_BENCHMARKS(impl_new_Bits);
MAKE_BENCHMARKS(impl_new_Len);

BENCHMARK_MAIN();

(And I'm feeling nervious that the effect is very complex and can even vary between different runs)

Results

image

---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------

BM_bitset_to_string<impl_old, 15, char>                       1562 ns         1535 ns       448000
BM_bitset_to_string<impl_old, 64, char>                       11188 ns        11230 ns        64000
BM_bitset_to_string_large_single<impl_old, char>               9014 ns         9050 ns       127758
BM_bitset_to_string<impl_old, 7, wchar_t>                       915 ns          907 ns       896000
BM_bitset_to_string<impl_old, 64, wchar_t>                     6158 ns         6166 ns        83627
BM_bitset_to_string_large_single<impl_old, wchar_t>            5732 ns         5755 ns       149333

BM_bitset_to_string<impl_new_Bits, 15, char>                   898 ns          914 ns       768981
BM_bitset_to_string<impl_new_Bits, 64, char>                   8399 ns         8438 ns       100000
BM_bitset_to_string_large_single<impl_new_Bits, char>          3976 ns         3962 ns       228761
BM_bitset_to_string<impl_new_Bits, 7, wchar_t>                  198 ns          197 ns      3733333
BM_bitset_to_string<impl_new_Bits, 64, wchar_t>               11847 ns        11858 ns        44800
BM_bitset_to_string_large_single<impl_new_Bits, wchar_t>       7289 ns         7357 ns       148670

BM_bitset_to_string<impl_new_Len, 15, char>                    550 ns          562 ns      1000000
BM_bitset_to_string<impl_new_Len, 64, char>                    6000 ns         6099 ns       128085
BM_bitset_to_string_large_single<impl_new_Len, char>           3976 ns         4004 ns       320000
BM_bitset_to_string<impl_new_Len, 7, wchar_t>                   135 ns          137 ns      5120000
BM_bitset_to_string<impl_new_Len, 64, wchar_t>                12374 ns        12556 ns        44800
BM_bitset_to_string_large_single<impl_new_Len, wchar_t>       15409 ns        15346 ns        56000

Another run

---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------

BM_bitset_to_string<impl_old, 15, char>                       1731 ns         1726 ns       298667
BM_bitset_to_string<impl_old, 64, char>                        7147 ns         7150 ns        89600
BM_bitset_to_string_large_single<impl_old, char>               4201 ns         4148 ns       139378
BM_bitset_to_string<impl_old, 7, wchar_t>                       734 ns          732 ns       746667
BM_bitset_to_string<impl_old, 64, wchar_t>                    10492 ns        10463 ns        74667
BM_bitset_to_string_large_single<impl_old, wchar_t>            5644 ns         5620 ns       122334

BM_bitset_to_string<impl_new_Bits, 15, char>                  1435 ns         1416 ns       640000
BM_bitset_to_string<impl_new_Bits, 64, char>                   5507 ns         5418 ns       115348
BM_bitset_to_string_large_single<impl_new_Bits, char>          3620 ns         3594 ns       247782
BM_bitset_to_string<impl_new_Bits, 7, wchar_t>                  188 ns          190 ns      2635294
BM_bitset_to_string<impl_new_Bits, 64, wchar_t>                8363 ns         8231 ns       112000
BM_bitset_to_string_large_single<impl_new_Bits, wchar_t>       5042 ns         4933 ns       104533

BM_bitset_to_string<impl_new_Len, 15, char>                    232 ns          235 ns      2986667
BM_bitset_to_string<impl_new_Len, 64, char>                    6508 ns         6562 ns       100000
BM_bitset_to_string_large_single<impl_new_Len, char>           3964 ns         3941 ns       281493
BM_bitset_to_string<impl_new_Len, 7, wchar_t>                   174 ns          174 ns      4480000
BM_bitset_to_string<impl_new_Len, 64, wchar_t>                15721 ns        15625 ns        56000
BM_bitset_to_string_large_single<impl_new_Len, wchar_t>        7529 ns         7533 ns       112000

Copy link
Contributor

@achabense achabense Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there problems with my comparing benchmark? I'm finding resize_and_overwrite&_Len approach stably slow for large wchar_t cases in my testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be CPU-dependent.
For _Bits the compiler is more likely to advantage of knowing _Bits and unroll loops or vectorize something.
For _Len the compiler may be restricted from doing that.

Whether the optimizations like loop unrolling are useful or harmful -- dependent on CPU type.

Also non-unrolled loop (_Len) is subject to JCC errata pessimization on some CPUs, /QIntel-jcc-erratum might help in this case https://learn.microsoft.com/ru-ru/cpp/build/reference/qintel-jcc-erratum?view=msvc-170

I'm neutral with regards to _Bits vs _Len if none of them are absolutely advantage.

Copy link
Contributor

@achabense achabense Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to test for _Bits vs _Len in more environments to do the decision. I tend to believe that by using _Bits we can get general improvements over previous impl, though not significant in some cases like small bitset.to_string<char>; by using _Len we risk introducing some regressions.

Here is some results after turning on that switch:
.......I find my benchmark result highly unstable between separate runs...

Result
---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------

BM_bitset_to_string<impl_old, 15, char>                       1718 ns         1726 ns       298667
BM_bitset_to_string<impl_old, 64, char>                        6191 ns         6147 ns       119467
BM_bitset_to_string_large_single<impl_old, char>               5596 ns         5580 ns        89600
BM_bitset_to_string<impl_old, 7, wchar_t>                       842 ns          825 ns      1003520
BM_bitset_to_string<impl_old, 64, wchar_t>                     8230 ns         8196 ns        89600
BM_bitset_to_string_large_single<impl_old, wchar_t>            5406 ns         5312 ns       100000

BM_bitset_to_string<impl_new_Bits, 15, char>                   993 ns          993 ns       802816
BM_bitset_to_string<impl_new_Bits, 64, char>                   7076 ns         7063 ns       139378
BM_bitset_to_string_large_single<impl_new_Bits, char>          4549 ns         4552 ns       281493
BM_bitset_to_string<impl_new_Bits, 7, wchar_t>                  132 ns          133 ns      4480000
BM_bitset_to_string<impl_new_Bits, 64, wchar_t>                5765 ns         5755 ns        89600
BM_bitset_to_string_large_single<impl_new_Bits, wchar_t>       7747 ns         7782 ns       136533

BM_bitset_to_string<impl_new_Len, 15, char>                    446 ns          445 ns      2986667
BM_bitset_to_string<impl_new_Len, 64, char>                    5924 ns         5929 ns        89600
BM_bitset_to_string_large_single<impl_new_Len, char>           2957 ns         2979 ns       320000
BM_bitset_to_string<impl_new_Len, 7, wchar_t>                   242 ns          239 ns      3200000
BM_bitset_to_string<impl_new_Len, 64, wchar_t>                11633 ns        11509 ns        44800
BM_bitset_to_string_large_single<impl_new_Len, wchar_t>        5270 ns         5336 ns       149333

Another run

---------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations
---------------------------------------------------------------------------------------------------

BM_bitset_to_string<impl_old, 15, char>                       1415 ns         1445 ns       432552
BM_bitset_to_string<impl_old, 64, char>                        6740 ns         6627 ns        89600
BM_bitset_to_string_large_single<impl_old, char>               5079 ns         5000 ns       100000
BM_bitset_to_string<impl_old, 7, wchar_t>                       632 ns          628 ns      1120000
BM_bitset_to_string<impl_old, 64, wchar_t>                     8644 ns         8545 ns        89600
BM_bitset_to_string_large_single<impl_old, wchar_t>            7400 ns         7324 ns        74667

BM_bitset_to_string<impl_new_Bits, 15, char>                  1096 ns         1108 ns       719369
BM_bitset_to_string<impl_new_Bits, 64, char>                   6058 ns         6104 ns       143360
BM_bitset_to_string_large_single<impl_new_Bits, char>          4695 ns         4708 ns       298667
BM_bitset_to_string<impl_new_Bits, 7, wchar_t>                  247 ns          248 ns      5600000
BM_bitset_to_string<impl_new_Bits, 64, wchar_t>                5696 ns         5755 ns        89600
BM_bitset_to_string_large_single<impl_new_Bits, wchar_t>       6840 ns         6875 ns       100000

BM_bitset_to_string<impl_new_Len, 15, char>                    231 ns          234 ns      2800000
BM_bitset_to_string<impl_new_Len, 64, char>                    5700 ns         5622 ns       144516
BM_bitset_to_string_large_single<impl_new_Len, char>           2904 ns         2916 ns       203636
BM_bitset_to_string<impl_new_Len, 7, wchar_t>                   133 ns          132 ns      5439529
BM_bitset_to_string<impl_new_Len, 64, wchar_t>                13153 ns        13253 ns        44800
BM_bitset_to_string_large_single<impl_new_Len, wchar_t>        5030 ns         5017 ns       102784

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest don't think too much on _Bits vs _Len. The next step would be vectorization, it would be obviously faster than this, and obviously will use _Bits.

@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 5df7834 into microsoft:main Aug 11, 2023
@StephanTLavavej
Copy link
Member

Thanks for further optimizing this function! 🐇 🐇 🐇

@AlexGuteniev AlexGuteniev deleted the bits branch August 11, 2023 04:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants