Avoid unconditional make_heap for priority_queue::push_range #4025

Merged on Sep 21, 2023 (15 commits)

@achabense achabense commented Sep 13, 2023

Fixes #2826.

  1. Calling make_heap unconditionally is dangerous: when elements are pushed in small batches, every push_range call rebuilds the entire heap, which causes a severe performance penalty.
  2. Even when the entire heap has to be rebuilt, a push_heap loop is actually faster for some types (like string), so for those types we could use the push_heap loop unconditionally. For some trivial types (especially scalars), however, make_heap really is faster.
  3. Based on 1 and 2, and to be conservative, I'm choosing new_size/2>old_size as the boundary above which make_heap is preferred to the push_heap loop. This completely avoids the problem in 1.
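
To make point 1 concrete, here is a minimal sketch (not part of this PR) that counts comparator calls when elements arrive one at a time. The names `counting_less` and `comparisons_for_single_pushes` are hypothetical and exist only for this illustration. The sketch uses a plain vector plus the standard heap algorithms rather than priority_queue itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical comparator that counts how often it is invoked.
struct counting_less {
    std::size_t* count;
    bool operator()(std::uint32_t a, std::uint32_t b) const {
        ++*count;
        return a < b;
    }
};

// Appends `total` pseudo-random values one at a time (batch size 1).
// After each append the heap property is restored either by rebuilding
// the whole heap (remake == true) or by sifting up only the new element.
// Returns the total number of comparisons performed.
std::size_t comparisons_for_single_pushes(std::size_t total, bool remake) {
    std::size_t count = 0;
    counting_less comp{&count};
    std::vector<std::uint32_t> heap;
    std::uint32_t value = 12345; // seed of a simple linear congruential generator
    for (std::size_t i = 0; i < total; ++i) {
        value = value * 1664525u + 1013904223u;
        heap.push_back(value);
        if (remake) {
            std::make_heap(heap.begin(), heap.end(), comp); // linear in heap.size()
        } else {
            std::push_heap(heap.begin(), heap.end(), comp); // logarithmic in heap.size()
        }
    }
    return count;
}
```

Rebuilding after every single push makes the total work grow quadratically with the final size, while the sift-up variant stays at O(n log n) comparisons. That matches the gap between the `/1` rows of the "Previous" and "Now" tables below.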

About new_size/2>old_size: make_heap on a range of size 2*o performs o calls to _Pop_heap_hole_by_index, and each _Pop_heap_hole_by_index calls _Push_heap_by_index once, so I think it's safe to assume that a loop of o push_heap calls can be faster up to and around this boundary.
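
A minimal sketch of the selection rule described above, written as a free function over a vector. The name `push_range_sketch` and its signature are assumptions made for illustration; this is not the actual implementation in the library headers, and the exact form of the condition in the merged code may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for priority_queue::push_range.
// Precondition: `heap` already satisfies the heap property for `comp`.
template <class T, class Compare = std::less<T>>
void push_range_sketch(std::vector<T>& heap, const std::vector<T>& batch, Compare comp = Compare{}) {
    const std::size_t old_size = heap.size();
    heap.insert(heap.end(), batch.begin(), batch.end());
    const std::size_t new_size = heap.size();

    if (new_size / 2 > old_size) {
        // The appended elements outnumber the existing heap: rebuild everything.
        std::make_heap(heap.begin(), heap.end(), comp);
    } else {
        // The existing heap is at least as large as the batch: sift up each new element.
        for (std::size_t i = old_size; i < new_size; ++i) {
            std::push_heap(heap.begin(), heap.begin() + static_cast<std::ptrdiff_t>(i + 1), comp);
        }
    }
}
```

For example, pushing 4 elements into an empty heap gives new_size/2 == 2 > 0, so make_heap is used. Pushing 2 more gives 6/2 == 3, which is not greater than 4, so the push_heap loop runs instead.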

The following benchmark shows generally better results, including at the boundary (Arg(vec_size / 2 + 1); 5001). However, I'm not sure what is going on in the Arg(vec_size / 2 + 1) case for u32/u64/float, where remaking the heap is actually faster. (This does not necessarily mean we have to set a smaller ratio boundary, as the effect looks size/type/sequence-sensitive. See the update part below; for more extensive benchmarks, see the next comment.)

Benchmark result (vec_size==10000):

Previous
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_push_range<uint8_t, vec_u8>/1             132314020 ns    131250000 ns            5
BM_push_range<uint8_t, vec_u8>/100             1378146 ns      1380522 ns          498
BM_push_range<uint8_t, vec_u8>/10000             23991 ns        24065 ns        29867
BM_push_range<uint8_t, vec_u8>/5001              44356 ns        43493 ns        15448

BM_push_range<uint16_t, vec_u16>/1           129688300 ns    130208333 ns            6
BM_push_range<uint16_t, vec_u16>/100           1332057 ns      1339286 ns          560
BM_push_range<uint16_t, vec_u16>/10000           21871 ns        21973 ns        32000
BM_push_range<uint16_t, vec_u16>/5001            42268 ns        42375 ns        16593

BM_push_range<uint32_t, vec_u32>/1            49569970 ns     50000000 ns           10
BM_push_range<uint32_t, vec_u32>/100            534551 ns       531250 ns         1000
BM_push_range<uint32_t, vec_u32>/10000            8961 ns         8789 ns        74667
BM_push_range<uint32_t, vec_u32>/5001            23805 ns        24065 ns        29867

BM_push_range<uint64_t, vec_u64>/1            52815690 ns     53125000 ns           10
BM_push_range<uint64_t, vec_u64>/100            568480 ns       578125 ns         1000
BM_push_range<uint64_t, vec_u64>/10000           10845 ns        10882 ns        74667
BM_push_range<uint64_t, vec_u64>/5001            26797 ns        27274 ns        26353

BM_push_range<float, vec_float>/1             85530500 ns     85069444 ns            9
BM_push_range<float, vec_float>/100             902526 ns       899431 ns          747
BM_push_range<float, vec_float>/10000            16287 ns        16113 ns        40727
BM_push_range<float, vec_float>/5001             35487 ns        35296 ns        19478

BM_push_range<double, vec_double>/1           91232514 ns     91517857 ns            7
BM_push_range<double, vec_double>/100           931463 ns       927734 ns          640
BM_push_range<double, vec_double>/10000          18252 ns        17997 ns        37333
BM_push_range<double, vec_double>/5001           38361 ns        38365 ns        17920

BM_push_range<string_view, vec_str>/1        485819750 ns    484375000 ns            2
BM_push_range<string_view, vec_str>/100        5771718 ns      5781250 ns          100
BM_push_range<string_view, vec_str>/10000       128556 ns       128348 ns         5600
BM_push_range<string_view, vec_str>/5001        197701 ns       194972 ns         3446

BM_push_range<string, vec_str>/1             562048300 ns    562500000 ns            1
BM_push_range<string, vec_str>/100             6463913 ns      6556920 ns          112
BM_push_range<string, vec_str>/10000            160829 ns       160435 ns         4480
BM_push_range<string, vec_str>/5001             241398 ns       239955 ns         2800

BM_push_range<wstring_view, vec_wstr>/1      448094050 ns    445312500 ns            2
BM_push_range<wstring_view, vec_wstr>/100      6137002 ns      6250000 ns          100
BM_push_range<wstring_view, vec_wstr>/10000     125909 ns       125558 ns         5600
BM_push_range<wstring_view, vec_wstr>/5001      209610 ns       208575 ns         3446

BM_push_range<wstring, vec_wstr>/1           618522000 ns    625000000 ns            1
BM_push_range<wstring, vec_wstr>/100           8002004 ns      7986111 ns           90
BM_push_range<wstring, vec_wstr>/10000          452893 ns       449219 ns         1600
BM_push_range<wstring, vec_wstr>/5001           569804 ns       571987 ns         1120
Now
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_push_range<uint8_t, vec_u8>/1                129866 ns       128691 ns         4978
BM_push_range<uint8_t, vec_u8>/100               58166 ns        57812 ns        10000
BM_push_range<uint8_t, vec_u8>/10000             21505 ns        21484 ns        32000
BM_push_range<uint8_t, vec_u8>/5001              29584 ns        29820 ns        23579

BM_push_range<uint16_t, vec_u16>/1              134689 ns       136719 ns         5600
BM_push_range<uint16_t, vec_u16>/100             61244 ns        61384 ns        11200
BM_push_range<uint16_t, vec_u16>/10000           21685 ns        21484 ns        32000
BM_push_range<uint16_t, vec_u16>/5001            31650 ns        31390 ns        22400

BM_push_range<uint32_t, vec_u32>/1              133096 ns       131138 ns         5600
BM_push_range<uint32_t, vec_u32>/100             66086 ns        66964 ns        11200
BM_push_range<uint32_t, vec_u32>/10000           11868 ns        11440 ns        56000
BM_push_range<uint32_t, vec_u32>/5001            31735 ns        32087 ns        22400

BM_push_range<uint64_t, vec_u64>/1              137059 ns       134969 ns         4978
BM_push_range<uint64_t, vec_u64>/100             64009 ns        64174 ns        11200
BM_push_range<uint64_t, vec_u64>/10000           13127 ns        13393 ns        56000
BM_push_range<uint64_t, vec_u64>/5001            29801 ns        29820 ns        23579

BM_push_range<float, vec_float>/1               152839 ns       153460 ns         4480
BM_push_range<float, vec_float>/100              74416 ns        74986 ns         8960
BM_push_range<float, vec_float>/10000            15920 ns        16044 ns        44800
BM_push_range<float, vec_float>/5001             46340 ns        46038 ns        11200

BM_push_range<double, vec_double>/1             155410 ns       153460 ns         4480
BM_push_range<double, vec_double>/100            74945 ns        74986 ns         8960
BM_push_range<double, vec_double>/10000          21200 ns        21310 ns        34462
BM_push_range<double, vec_double>/5001           30883 ns        31145 ns        23579

BM_push_range<string_view, vec_str>/1           249675 ns       245536 ns         2800
BM_push_range<string_view, vec_str>/100         156717 ns       156948 ns         4480
BM_push_range<string_view, vec_str>/10000       131120 ns       131138 ns         5600
BM_push_range<string_view, vec_str>/5001        141407 ns       141246 ns         4978

BM_push_range<string, vec_str>/1                366666 ns       362739 ns         1723
BM_push_range<string, vec_str>/100              275281 ns       276215 ns         2489
BM_push_range<string, vec_str>/10000            162060 ns       160435 ns         4480
BM_push_range<string, vec_str>/5001             192471 ns       192540 ns         3733

BM_push_range<wstring_view, vec_wstr>/1         256913 ns       260911 ns         2635
BM_push_range<wstring_view, vec_wstr>/100       149590 ns       150663 ns         4978
BM_push_range<wstring_view, vec_wstr>/10000     125749 ns       125558 ns         5600
BM_push_range<wstring_view, vec_wstr>/5001      129565 ns       128691 ns         4978

BM_push_range<wstring, vec_wstr>/1              849736 ns       836680 ns          747
BM_push_range<wstring, vec_wstr>/100            716124 ns       714983 ns          896
BM_push_range<wstring, vec_wstr>/10000          488530 ns       497405 ns         1445
BM_push_range<wstring, vec_wstr>/5001           521849 ns       531250 ns         1000

UPDATE: The new approach is actually faster around the boundary (Arg(vec_size / 2 + 1); 4001,6001) when vec_size==8000 or 12000:

When vec_size==8000:

Previous
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
BM_push_range<uint8_t, vec_u8>/1             86345400 ns     84821429 ns            7
BM_push_range<uint8_t, vec_u8>/100             896486 ns       899431 ns          747
BM_push_range<uint8_t, vec_u8>/8000             17466 ns        17648 ns        40727
BM_push_range<uint8_t, vec_u8>/4001             32089 ns        32227 ns        21333

BM_push_range<uint16_t, vec_u16>/1           82768829 ns     82589286 ns            7
BM_push_range<uint16_t, vec_u16>/100           880055 ns       878514 ns          747
BM_push_range<uint16_t, vec_u16>/8000           17378 ns        17648 ns        40727
BM_push_range<uint16_t, vec_u16>/4001           29446 ns        29157 ns        23579

BM_push_range<uint32_t, vec_u32>/1           66910836 ns     66964286 ns           14
BM_push_range<uint32_t, vec_u32>/100           713974 ns       711496 ns         1120
BM_push_range<uint32_t, vec_u32>/8000           19359 ns        19671 ns        37333
BM_push_range<uint32_t, vec_u32>/4001           33408 ns        32993 ns        20364

BM_push_range<uint64_t, vec_u64>/1           73805889 ns     74652778 ns            9
BM_push_range<uint64_t, vec_u64>/100           781004 ns       773929 ns          747
BM_push_range<uint64_t, vec_u64>/8000           22574 ns        22496 ns        29867
BM_push_range<uint64_t, vec_u64>/4001           38794 ns        39237 ns        17920

BM_push_range<float, vec_float>/1            79784767 ns     79861111 ns            9
BM_push_range<float, vec_float>/100            840182 ns       836680 ns          747
BM_push_range<float, vec_float>/8000            22378 ns        22321 ns        28000
BM_push_range<float, vec_float>/4001            39774 ns        39550 ns        16593

BM_push_range<double, vec_double>/1          85299433 ns     86805556 ns            9
BM_push_range<double, vec_double>/100          947088 ns       962182 ns          747
BM_push_range<double, vec_double>/8000          24678 ns        24554 ns        28000
BM_push_range<double, vec_double>/4001          44897 ns        44643 ns        11200

BM_push_range<string_view, vec_str>/1       535751600 ns    531250000 ns            1
BM_push_range<string_view, vec_str>/100       6278931 ns      6277902 ns          112
BM_push_range<string_view, vec_str>/8000       161811 ns       163923 ns         4480
BM_push_range<string_view, vec_str>/4001       267711 ns       263660 ns         2489

BM_push_range<string, vec_str>/1            587909800 ns    593750000 ns            1
BM_push_range<string, vec_str>/100            7078906 ns      6975446 ns          112
BM_push_range<string, vec_str>/8000            202050 ns       204041 ns         3446
BM_push_range<string, vec_str>/4001            318114 ns       320871 ns         2240

BM_push_range<wstring_view, vec_wstr>/1     579887500 ns    578125000 ns            1
BM_push_range<wstring_view, vec_wstr>/100     7370850 ns      7393973 ns          112
BM_push_range<wstring_view, vec_wstr>/8000     163061 ns       163923 ns         4480
BM_push_range<wstring_view, vec_wstr>/4001     278648 ns       276215 ns         2489

BM_push_range<wstring, vec_wstr>/1          786779400 ns    781250000 ns            1
BM_push_range<wstring, vec_wstr>/100         10046687 ns     10000000 ns           75
BM_push_range<wstring, vec_wstr>/8000          651075 ns       655692 ns         1120
BM_push_range<wstring, vec_wstr>/4001          468535 ns       468532 ns         1434
Now
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
BM_push_range<uint8_t, vec_u8>/1               108965 ns       109863 ns         6400
BM_push_range<uint8_t, vec_u8>/100              44074 ns        43316 ns        16593
BM_push_range<uint8_t, vec_u8>/8000             17109 ns        17264 ns        40727
BM_push_range<uint8_t, vec_u8>/4001             17287 ns        17264 ns        40727

BM_push_range<uint16_t, vec_u16>/1             112337 ns       112305 ns         6400
BM_push_range<uint16_t, vec_u16>/100            45185 ns        45409 ns        14452
BM_push_range<uint16_t, vec_u16>/8000           17384 ns        17648 ns        40727
BM_push_range<uint16_t, vec_u16>/4001           31069 ns        31250 ns        32000

BM_push_range<uint32_t, vec_u32>/1             157909 ns       160435 ns         4480
BM_push_range<uint32_t, vec_u32>/100            58340 ns        58594 ns        11200
BM_push_range<uint32_t, vec_u32>/8000           18087 ns        17997 ns        37333
BM_push_range<uint32_t, vec_u32>/4001           32847 ns        32087 ns        22400

BM_push_range<uint64_t, vec_u64>/1             160725 ns       160435 ns         4480
BM_push_range<uint64_t, vec_u64>/100            64445 ns        64174 ns        11200
BM_push_range<uint64_t, vec_u64>/8000           21812 ns        21973 ns        32000
BM_push_range<uint64_t, vec_u64>/4001           36642 ns        36901 ns        19478

BM_push_range<float, vec_float>/1              143998 ns       145777 ns         4073
BM_push_range<float, vec_float>/100             55574 ns        55804 ns        11200
BM_push_range<float, vec_float>/8000            11642 ns        11719 ns        64000
BM_push_range<float, vec_float>/4001            33503 ns        33692 ns        21333

BM_push_range<double, vec_double>/1            128889 ns       128348 ns         5600
BM_push_range<double, vec_double>/100           58495 ns        57199 ns        11200
BM_push_range<double, vec_double>/8000          12518 ns        12277 ns        56000
BM_push_range<double, vec_double>/4001          18326 ns        17997 ns        37333

BM_push_range<string_view, vec_str>/1          205341 ns       204041 ns         3446
BM_push_range<string_view, vec_str>/100        129671 ns       128691 ns         4978
BM_push_range<string_view, vec_str>/8000       103065 ns       104627 ns         7467
BM_push_range<string_view, vec_str>/4001       111280 ns       111607 ns         5600

BM_push_range<string, vec_str>/1               251210 ns       251116 ns         2800
BM_push_range<string, vec_str>/100             244371 ns       238550 ns         2358
BM_push_range<string, vec_str>/8000            135798 ns       134969 ns         4978
BM_push_range<string, vec_str>/4001            159636 ns       160435 ns         4480

BM_push_range<wstring_view, vec_wstr>/1        201277 ns       199507 ns         3446
BM_push_range<wstring_view, vec_wstr>/100      115028 ns       114397 ns         5600
BM_push_range<wstring_view, vec_wstr>/8000      96846 ns        96257 ns         7467
BM_push_range<wstring_view, vec_wstr>/4001      98729 ns        97656 ns         6400

BM_push_range<wstring, vec_wstr>/1             632818 ns       627790 ns         1120
BM_push_range<wstring, vec_wstr>/100           644742 ns       641741 ns         1120
BM_push_range<wstring, vec_wstr>/8000          381329 ns       383650 ns         1792
BM_push_range<wstring, vec_wstr>/4001          411618 ns       399013 ns         1723

When vec_size==12000:

Previous
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_push_range<uint8_t, vec_u8>/1             194486750 ns    195312500 ns            4
BM_push_range<uint8_t, vec_u8>/100             1976526 ns      1992754 ns          345
BM_push_range<uint8_t, vec_u8>/10000             69261 ns        66267 ns         8960
BM_push_range<uint8_t, vec_u8>/12000             35290 ns        35296 ns        19478
BM_push_range<uint8_t, vec_u8>/6001              59185 ns        58594 ns        11200

BM_push_range<uint16_t, vec_u16>/1           186077975 ns    187500000 ns            4
BM_push_range<uint16_t, vec_u16>/100           1906289 ns      1902174 ns          345
BM_push_range<uint16_t, vec_u16>/10000           65333 ns        65569 ns        11200
BM_push_range<uint16_t, vec_u16>/12000           34557 ns        34528 ns        20364
BM_push_range<uint16_t, vec_u16>/6001            61532 ns        61384 ns        11200

BM_push_range<uint32_t, vec_u32>/1           158355750 ns    156250000 ns            4
BM_push_range<uint32_t, vec_u32>/100           1635760 ns      1639230 ns          448
BM_push_range<uint32_t, vec_u32>/10000           68773 ns        69754 ns        11200
BM_push_range<uint32_t, vec_u32>/12000           38369 ns        38504 ns        18667
BM_push_range<uint32_t, vec_u32>/6001            61885 ns        61384 ns        11200

BM_push_range<uint64_t, vec_u64>/1           178653575 ns    179687500 ns            4
BM_push_range<uint64_t, vec_u64>/100           1854461 ns      1885054 ns          373
BM_push_range<uint64_t, vec_u64>/10000           79417 ns        79517 ns         7467
BM_push_range<uint64_t, vec_u64>/12000           43515 ns        43493 ns        15448
BM_push_range<uint64_t, vec_u64>/6001            68684 ns        68011 ns         8960

BM_push_range<float, vec_float>/1            189661800 ns    187500000 ns            4
BM_push_range<float, vec_float>/100            1948918 ns      1947464 ns          345
BM_push_range<float, vec_float>/10000            79855 ns        78474 ns         8960
BM_push_range<float, vec_float>/12000            43277 ns        43316 ns        16593
BM_push_range<float, vec_float>/6001             73937 ns        73242 ns         8960

BM_push_range<double, vec_double>/1          207170200 ns    203125000 ns            3
BM_push_range<double, vec_double>/100          2159209 ns      2148438 ns          320
BM_push_range<double, vec_double>/10000          89253 ns        90681 ns         8960
BM_push_range<double, vec_double>/12000          49647 ns        50000 ns        10000
BM_push_range<double, vec_double>/6001           77352 ns        76730 ns         8960

BM_push_range<string_view, vec_str>/1        745629400 ns    750000000 ns            1
BM_push_range<string_view, vec_str>/100        8897124 ns      8750000 ns           75
BM_push_range<string_view, vec_str>/10000       312743 ns       304813 ns         2358
BM_push_range<string_view, vec_str>/12000       167952 ns       170898 ns         4480
BM_push_range<string_view, vec_str>/6001        261305 ns       262277 ns         2800

BM_push_range<string, vec_str>/1             895937500 ns    890625000 ns            1
BM_push_range<string, vec_str>/100            10564659 ns     10602679 ns           56
BM_push_range<string, vec_str>/10000            384051 ns       383650 ns         1792
BM_push_range<string, vec_str>/12000            204698 ns       208575 ns         3446
BM_push_range<string, vec_str>/6001             317548 ns       313895 ns         2240

BM_push_range<wstring_view, vec_wstr>/1      711221700 ns    718750000 ns            1
BM_push_range<wstring_view, vec_wstr>/100      9292516 ns      9166667 ns           75
BM_push_range<wstring_view, vec_wstr>/10000     328228 ns       329641 ns         2133
BM_push_range<wstring_view, vec_wstr>/12000     160418 ns       160435 ns         4480
BM_push_range<wstring_view, vec_wstr>/6001      262532 ns       260911 ns         2635

BM_push_range<wstring, vec_wstr>/1           985596800 ns    968750000 ns            1
BM_push_range<wstring, vec_wstr>/100          12637839 ns     12555804 ns           56
BM_push_range<wstring, vec_wstr>/10000          831965 ns       837054 ns          896
BM_push_range<wstring, vec_wstr>/12000          587392 ns       585938 ns         1120
BM_push_range<wstring, vec_wstr>/6001           717141 ns       714983 ns          896
Now
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_push_range<uint8_t, vec_u8>/1                170014 ns       170898 ns         4480
BM_push_range<uint8_t, vec_u8>/100               79727 ns        80218 ns         8960
BM_push_range<uint8_t, vec_u8>/10000             35387 ns        35295 ns        20364
BM_push_range<uint8_t, vec_u8>/12000             30244 ns        30483 ns        23579
BM_push_range<uint8_t, vec_u8>/6001              51225 ns        50223 ns        11200

BM_push_range<uint16_t, vec_u16>/1              175976 ns       176467 ns         4073
BM_push_range<uint16_t, vec_u16>/100             79007 ns        77424 ns         7467
BM_push_range<uint16_t, vec_u16>/10000           25113 ns        25670 ns        28000
BM_push_range<uint16_t, vec_u16>/12000           30247 ns        29820 ns        23579
BM_push_range<uint16_t, vec_u16>/6001            41706 ns        41433 ns        16593

BM_push_range<uint32_t, vec_u32>/1              174957 ns       171611 ns         3733
BM_push_range<uint32_t, vec_u32>/100             84230 ns        85449 ns         8960
BM_push_range<uint32_t, vec_u32>/10000           19107 ns        19252 ns        37333
BM_push_range<uint32_t, vec_u32>/12000           21303 ns        20856 ns        34462
BM_push_range<uint32_t, vec_u32>/6001            43138 ns        43316 ns        16593

BM_push_range<uint64_t, vec_u64>/1              174656 ns       171611 ns         3733
BM_push_range<uint64_t, vec_u64>/100             82811 ns        81609 ns         7467
BM_push_range<uint64_t, vec_u64>/10000           18511 ns        18415 ns        40727
BM_push_range<uint64_t, vec_u64>/12000           23564 ns        23542 ns        29867
BM_push_range<uint64_t, vec_u64>/6001            38358 ns        37493 ns        17920

BM_push_range<float, vec_float>/1               192319 ns       192540 ns         3733
BM_push_range<float, vec_float>/100              95799 ns        96257 ns         7467
BM_push_range<float, vec_float>/10000            42719 ns        41992 ns        16000
BM_push_range<float, vec_float>/12000            32692 ns        32227 ns        21333
BM_push_range<float, vec_float>/6001             61729 ns        61384 ns        11200

BM_push_range<double, vec_double>/1             191856 ns       192540 ns         3733
BM_push_range<double, vec_double>/100            96711 ns        96257 ns         7467
BM_push_range<double, vec_double>/10000          31391 ns        31808 ns        23579
BM_push_range<double, vec_double>/12000          35254 ns        35296 ns        19478
BM_push_range<double, vec_double>/6001           54175 ns        54408 ns        11200

BM_push_range<string_view, vec_str>/1           311485 ns       313895 ns         2240
BM_push_range<string_view, vec_str>/100         200384 ns       199507 ns         3446
BM_push_range<string_view, vec_str>/10000       158855 ns       157286 ns         4073
BM_push_range<string_view, vec_str>/12000       150926 ns       149972 ns         4480
BM_push_range<string_view, vec_str>/6001        168579 ns       167411 ns         4480

BM_push_range<string, vec_str>/1                469091 ns       474330 ns         1120
BM_push_range<string, vec_str>/100              364140 ns       360947 ns         1948
BM_push_range<string, vec_str>/10000            244597 ns       239955 ns         2800
BM_push_range<string, vec_str>/12000            205429 ns       208575 ns         3446
BM_push_range<string, vec_str>/6001             246112 ns       245536 ns         2800

BM_push_range<wstring_view, vec_wstr>/1         320187 ns       320871 ns         2240
BM_push_range<wstring_view, vec_wstr>/100       207174 ns       208575 ns         3446
BM_push_range<wstring_view, vec_wstr>/10000     171600 ns       172631 ns         4073
BM_push_range<wstring_view, vec_wstr>/12000     160788 ns       163923 ns         4480
BM_push_range<wstring_view, vec_wstr>/6001      171280 ns       172631 ns         4073

BM_push_range<wstring, vec_wstr>/1             1046840 ns      1049805 ns          640
BM_push_range<wstring, vec_wstr>/100            921364 ns       927734 ns          640
BM_push_range<wstring, vec_wstr>/10000          625379 ns       627790 ns         1120
BM_push_range<wstring, vec_wstr>/12000          596649 ns       609375 ns         1000
BM_push_range<wstring, vec_wstr>/6001           622325 ns       627790 ns         1120

@achabense achabense requested a review from a team as a code owner September 13, 2023 05:05

achabense commented Sep 13, 2023

The following benchmark compares make_heap with a push_heap loop after appending as many elements as the heap already holds (ratio==2), over a much wider range of data sizes (up to millions). It shows that the push_heap loop is generally better, but can be slower for scalar types when the data size falls into certain ranges.

Additional benchmark
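#include <benchmark/benchmark.h>

#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <random>
#include <span>
#include <string>
#include <string_view>
#include <vector>
using namespace std;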

namespace {
    constexpr size_t vec_size = 10000000;

    template<class T>
    auto create_vec(size_t vsize, function<T(uint64_t)> transform) {
        vector<T> vec(vsize);
        for (mt19937_64 rnd(1); auto & e:vec) {
            e = transform(rnd());
        }
        return vec;
    }

    template<class T>
    T cast_to(uint64_t val) {
        return static_cast<T>(val);
    }

    const auto vec_u8 = create_vec<uint8_t>(vec_size, cast_to<uint8_t>);
    const auto vec_u16 = create_vec<uint16_t>(vec_size, cast_to<uint16_t>);
    const auto vec_u32 = create_vec<uint32_t>(vec_size, cast_to<uint32_t>);
    const auto vec_u64 = create_vec<uint64_t>(vec_size, cast_to<uint64_t>);
    const auto vec_float = create_vec<float>(vec_size, cast_to<float>);
    const auto vec_double = create_vec<double>(vec_size, cast_to<double>);

    const auto vec_str = create_vec<string>(vec_size, [](uint64_t v) {
        return to_string(static_cast<uint32_t>(v));
        });
    const auto vec_wstr = create_vec<wstring>(vec_size, [](uint64_t v) {
        return to_wstring(static_cast<uint32_t>(v));
        });

    using keypair = array<uint64_t, 2>;

    const auto vec_keypair = create_vec<keypair>(vec_size, [](uint64_t v) {
        keypair pr{};
        pr[0] = v & 0xffffffff;
        pr[1] = v >> 32;
        return pr;
        });

    struct keyobj {
        uint64_t key;
        uint64_t dat;
        bool operator<(const keyobj& r)const {
            return key < r.key;
        }
    };
    const auto vec_keyobj = create_vec<keyobj>(vec_size, [](uint64_t v) {
        keyobj obj;
        obj.key = v;
        obj.dat = v;
        return obj;
        });

    template<size_t L>
    void putvs(const benchmark::State&) {
        static bool b = [] {
            puts("↑vs↓");
            return true;
            }();
    }

    template<size_t L>
    void putln(const benchmark::State&) {
        static bool b = [] {
            puts(string(94, '-').c_str());
            return true;
            }();
    }

    template <class T, const auto& Data, bool Remake>
    void BM_test(benchmark::State& state) {
        const size_t size_a = static_cast<size_t>(state.range(0));
        const size_t rate = static_cast<size_t>(state.range(1)); // new / old
        const size_t size_b = size_a * (rate - 1);

        for (auto _ : state) {
            state.PauseTiming();
            span spn(Data);
            assert(spn.size() >= size_a + size_b);

            vector<T> c;
            c.append_range(spn.subspan(0, size_a));
            std::make_heap(c.begin(), c.end());

            spn = spn.subspan(size_a);
            c.append_range(spn.subspan(0, size_b));
            state.ResumeTiming();

            if constexpr (Remake) {
                std::make_heap(c.begin(), c.end());
            }
            else {
                const auto _Begin = _Get_unwrapped(c.begin());
                auto _Heap_end = _Begin + size_a;
                const auto _End = _Get_unwrapped(c.end());
                while (_Heap_end != _End) {
                    std::push_heap(_Begin, ++_Heap_end);
                }
            }

            benchmark::DoNotOptimize(c);
        }
    }
}

enum :bool { Remake = true, PushEach = false };

const int ratio = 2; // new_size / old_size, >=2
#define ADD_BENCHMARK(T,source) \
BENCHMARK(BM_test<T, source, Remake>)->ArgsProduct({ benchmark::CreateRange(10, vec_size/ratio, 10),{ratio} })->Setup(putln<__LINE__>);\
BENCHMARK(BM_test<T, source, PushEach>)->ArgsProduct({ benchmark::CreateRange(10, vec_size/ratio, 10),{ratio} })->Setup(putvs<__LINE__>);

ADD_BENCHMARK(uint8_t, vec_u8);
ADD_BENCHMARK(uint16_t, vec_u16);
ADD_BENCHMARK(uint32_t, vec_u32);
ADD_BENCHMARK(uint64_t, vec_u64);
ADD_BENCHMARK(float, vec_float);
ADD_BENCHMARK(double, vec_double);

ADD_BENCHMARK(string_view, vec_str);
ADD_BENCHMARK(string, vec_str);
ADD_BENCHMARK(wstring_view, vec_wstr);
ADD_BENCHMARK(wstring, vec_wstr);
ADD_BENCHMARK(keypair, vec_keypair);
ADD_BENCHMARK(keyobj, vec_keyobj);

BENCHMARK_MAIN();
Result
----------------------------------------------------------------------------------------------
Benchmark                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_test<uint8_t, vec_u8, Remake>/10/2                      197 ns          195 ns      3200000
BM_test<uint8_t, vec_u8, Remake>/100/2                     548 ns          459 ns      1294865
BM_test<uint8_t, vec_u8, Remake>/1000/2                   4592 ns         3990 ns       160563
BM_test<uint8_t, vec_u8, Remake>/10000/2                 66221 ns        62500 ns        10000
BM_test<uint8_t, vec_u8, Remake>/100000/2               723090 ns       669643 ns         1120
BM_test<uint8_t, vec_u8, Remake>/1000000/2             7340321 ns      7672991 ns          112
BM_test<uint8_t, vec_u8, Remake>/5000000/2            36779800 ns     37006579 ns           19
↑vs↓
BM_test<uint8_t, vec_u8, PushEach>/10/2                    188 ns          225 ns      2635294
BM_test<uint8_t, vec_u8, PushEach>/100/2                   293 ns          280 ns      3733333
BM_test<uint8_t, vec_u8, PushEach>/1000/2                 1834 ns         1709 ns       448000
BM_test<uint8_t, vec_u8, PushEach>/10000/2               67748 ns        71150 ns        11200
BM_test<uint8_t, vec_u8, PushEach>/100000/2             753025 ns       710945 ns          989
BM_test<uint8_t, vec_u8, PushEach>/1000000/2           7640121 ns      7812500 ns          112
BM_test<uint8_t, vec_u8, PushEach>/5000000/2          38420782 ns     35845588 ns           17
----------------------------------------------------------------------------------------------
BM_test<uint16_t, vec_u16, Remake>/10/2                    139 ns          144 ns      4977778
BM_test<uint16_t, vec_u16, Remake>/100/2                   551 ns          575 ns       896000
BM_test<uint16_t, vec_u16, Remake>/1000/2                 4803 ns         5312 ns       100000
BM_test<uint16_t, vec_u16, Remake>/10000/2               67403 ns        66267 ns         8960
BM_test<uint16_t, vec_u16, Remake>/100000/2             785687 ns       830078 ns          640
BM_test<uint16_t, vec_u16, Remake>/1000000/2           7979522 ns      8091518 ns          112
BM_test<uint16_t, vec_u16, Remake>/5000000/2          42615024 ns     42279412 ns           17
↑vs↓
BM_test<uint16_t, vec_u16, PushEach>/10/2                  123 ns          111 ns     10000000
BM_test<uint16_t, vec_u16, PushEach>/100/2                 251 ns          256 ns      2986667
BM_test<uint16_t, vec_u16, PushEach>/1000/2               1540 ns         1507 ns       560000
BM_test<uint16_t, vec_u16, PushEach>/10000/2             72404 ns        78125 ns        11200
BM_test<uint16_t, vec_u16, PushEach>/100000/2           780217 ns       781250 ns          640
BM_test<uint16_t, vec_u16, PushEach>/1000000/2         7969297 ns      7812500 ns           90
BM_test<uint16_t, vec_u16, PushEach>/5000000/2        40114094 ns     40798611 ns           18
----------------------------------------------------------------------------------------------
BM_test<uint32_t, vec_u32, Remake>/10/2                    148 ns          131 ns      5600000
BM_test<uint32_t, vec_u32, Remake>/100/2                   330 ns          270 ns      2488889
BM_test<uint32_t, vec_u32, Remake>/1000/2                 1976 ns         1953 ns       640000
BM_test<uint32_t, vec_u32, Remake>/10000/2               37352 ns        36098 ns        19478
BM_test<uint32_t, vec_u32, Remake>/100000/2             567012 ns       500000 ns         1000
BM_test<uint32_t, vec_u32, Remake>/1000000/2           6456496 ns      5555556 ns           90
BM_test<uint32_t, vec_u32, Remake>/5000000/2          39963841 ns     38602941 ns           17
↑vs↓
BM_test<uint32_t, vec_u32, PushEach>/10/2                  120 ns         83.7 ns     11200000
BM_test<uint32_t, vec_u32, PushEach>/100/2                 312 ns          293 ns      5600000
BM_test<uint32_t, vec_u32, PushEach>/1000/2               2258 ns         2267 ns       344615
BM_test<uint32_t, vec_u32, PushEach>/10000/2             77546 ns        86975 ns        10240
BM_test<uint32_t, vec_u32, PushEach>/100000/2           939340 ns      1066767 ns          498
BM_test<uint32_t, vec_u32, PushEach>/1000000/2        10507542 ns     10986328 ns           64
BM_test<uint32_t, vec_u32, PushEach>/5000000/2        53250500 ns     51562500 ns           10
----------------------------------------------------------------------------------------------
BM_test<uint64_t, vec_u64, Remake>/10/2                    229 ns          204 ns      2986667
BM_test<uint64_t, vec_u64, Remake>/100/2                   560 ns          562 ns      1000000
BM_test<uint64_t, vec_u64, Remake>/1000/2                 3382 ns         3599 ns       186667
BM_test<uint64_t, vec_u64, Remake>/10000/2               52984 ns        59989 ns        11200
BM_test<uint64_t, vec_u64, Remake>/100000/2             673568 ns       662667 ns          896
BM_test<uint64_t, vec_u64, Remake>/1000000/2           8662277 ns      8544922 ns           64
BM_test<uint64_t, vec_u64, Remake>/5000000/2          57779018 ns     58238636 ns           11
↑vs↓
BM_test<uint64_t, vec_u64, PushEach>/10/2                  123 ns         85.4 ns      8960000
BM_test<uint64_t, vec_u64, PushEach>/100/2                 239 ns          215 ns      4072727
BM_test<uint64_t, vec_u64, PushEach>/1000/2               1497 ns         1416 ns       640000
BM_test<uint64_t, vec_u64, PushEach>/10000/2             68167 ns        68750 ns        10000
BM_test<uint64_t, vec_u64, PushEach>/100000/2           846652 ns       773929 ns          747
BM_test<uint64_t, vec_u64, PushEach>/1000000/2         8862061 ns      8861940 ns           67
BM_test<uint64_t, vec_u64, PushEach>/5000000/2        44602307 ns     42708333 ns           15
----------------------------------------------------------------------------------------------
BM_test<float, vec_float, Remake>/10/2                     155 ns          159 ns      3733333
BM_test<float, vec_float, Remake>/100/2                    459 ns          460 ns      1120000
BM_test<float, vec_float, Remake>/1000/2                  3229 ns         3557 ns       224000
BM_test<float, vec_float, Remake>/10000/2                53471 ns        51618 ns        11200
BM_test<float, vec_float, Remake>/100000/2              756483 ns       669643 ns         1120
BM_test<float, vec_float, Remake>/1000000/2            8275301 ns      8246528 ns           72
BM_test<float, vec_float, Remake>/5000000/2           48572753 ns     48958333 ns           15
↑vs↓
BM_test<float, vec_float, PushEach>/10/2                   131 ns          154 ns      4977778
BM_test<float, vec_float, PushEach>/100/2                  317 ns          345 ns      2036364
BM_test<float, vec_float, PushEach>/1000/2                2359 ns         2051 ns       320000
BM_test<float, vec_float, PushEach>/10000/2              86495 ns        83705 ns         8960
BM_test<float, vec_float, PushEach>/100000/2            930952 ns       924247 ns          896
BM_test<float, vec_float, PushEach>/1000000/2          9772667 ns     12152778 ns           81
BM_test<float, vec_float, PushEach>/5000000/2         49184933 ns     47916667 ns           15
----------------------------------------------------------------------------------------------
BM_test<double, vec_double, Remake>/10/2                   156 ns          195 ns      3200000
BM_test<double, vec_double, Remake>/100/2                  455 ns          589 ns      1723077
BM_test<double, vec_double, Remake>/1000/2                3324 ns         3516 ns       186667
BM_test<double, vec_double, Remake>/10000/2              57314 ns        60938 ns        10000
BM_test<double, vec_double, Remake>/100000/2            853829 ns       868984 ns          935
BM_test<double, vec_double, Remake>/1000000/2         10014946 ns     10044643 ns           56
BM_test<double, vec_double, Remake>/5000000/2         66871067 ns     65972222 ns            9
↑vs↓
BM_test<double, vec_double, PushEach>/10/2                 131 ns          174 ns      4480000
BM_test<double, vec_double, PushEach>/100/2                319 ns          328 ns      2240000
BM_test<double, vec_double, PushEach>/1000/2              2374 ns         2511 ns       280000
BM_test<double, vec_double, PushEach>/10000/2            78100 ns        83705 ns         8960
BM_test<double, vec_double, PushEach>/100000/2          979385 ns      1093750 ns         1000
BM_test<double, vec_double, PushEach>/1000000/2       10231650 ns      9765625 ns           64
BM_test<double, vec_double, PushEach>/5000000/2       51510170 ns     53125000 ns           10
----------------------------------------------------------------------------------------------
BM_test<string_view, vec_str, Remake>/10/2                 212 ns          264 ns      2488889
BM_test<string_view, vec_str, Remake>/100/2               1302 ns         1328 ns      1000000
BM_test<string_view, vec_str, Remake>/1000/2             18160 ns        14997 ns        44800
BM_test<string_view, vec_str, Remake>/10000/2           268737 ns       231923 ns         2358
BM_test<string_view, vec_str, Remake>/100000/2         3288700 ns      3447770 ns          213
BM_test<string_view, vec_str, Remake>/1000000/2       65113091 ns     65340909 ns           11
BM_test<string_view, vec_str, Remake>/5000000/2      338344550 ns    343750000 ns            2
↑vs↓
BM_test<string_view, vec_str, PushEach>/10/2               159 ns          126 ns      5600000
BM_test<string_view, vec_str, PushEach>/100/2              832 ns          781 ns       640000
BM_test<string_view, vec_str, PushEach>/1000/2           12781 ns        13951 ns        44800
BM_test<string_view, vec_str, PushEach>/10000/2         160970 ns       171611 ns         3733
BM_test<string_view, vec_str, PushEach>/100000/2       1762045 ns      1855469 ns          320
BM_test<string_view, vec_str, PushEach>/1000000/2     19845579 ns     19301471 ns           34
BM_test<string_view, vec_str, PushEach>/5000000/2     95836357 ns     95982143 ns            7
----------------------------------------------------------------------------------------------
BM_test<string, vec_str, Remake>/10/2                      272 ns          254 ns      3200000
BM_test<string, vec_str, Remake>/100/2                    2006 ns         2058 ns       448000
BM_test<string, vec_str, Remake>/1000/2                  27989 ns        29053 ns        26353
BM_test<string, vec_str, Remake>/10000/2                310039 ns       324322 ns         2987
BM_test<string, vec_str, Remake>/100000/2              3654548 ns      3815407 ns          172
BM_test<string, vec_str, Remake>/1000000/2            61587627 ns     58238636 ns           11
BM_test<string, vec_str, Remake>/5000000/2           332438800 ns    328125000 ns            2
↑vs↓
BM_test<string, vec_str, PushEach>/10/2                    201 ns          206 ns      4919215
BM_test<string, vec_str, PushEach>/100/2                  1274 ns         1325 ns       448000
BM_test<string, vec_str, PushEach>/1000/2                17436 ns        13044 ns        40727
BM_test<string, vec_str, PushEach>/10000/2              189617 ns       190438 ns         3446
BM_test<string, vec_str, PushEach>/100000/2            2284274 ns      2265049 ns          407
BM_test<string, vec_str, PushEach>/1000000/2          27453029 ns     27994792 ns           24
BM_test<string, vec_str, PushEach>/5000000/2         137308600 ns    137500000 ns            5
----------------------------------------------------------------------------------------------
BM_test<wstring_view, vec_wstr, Remake>/10/2               191 ns          232 ns      2488889
BM_test<wstring_view, vec_wstr, Remake>/100/2              820 ns          820 ns       896000
BM_test<wstring_view, vec_wstr, Remake>/1000/2            9371 ns        10045 ns        56000
BM_test<wstring_view, vec_wstr, Remake>/10000/2         301015 ns       353021 ns         2036
BM_test<wstring_view, vec_wstr, Remake>/100000/2       4446935 ns      6250000 ns          100
BM_test<wstring_view, vec_wstr, Remake>/1000000/2     97118814 ns     93750000 ns            7
BM_test<wstring_view, vec_wstr, Remake>/5000000/2    499256900 ns    500000000 ns            1
↑vs↓
BM_test<wstring_view, vec_wstr, PushEach>/10/2             138 ns          164 ns      4480000
BM_test<wstring_view, vec_wstr, PushEach>/100/2            474 ns          496 ns      1544828
BM_test<wstring_view, vec_wstr, PushEach>/1000/2          4018 ns         3809 ns       172308
BM_test<wstring_view, vec_wstr, PushEach>/10000/2       157040 ns       138105 ns         4073
BM_test<wstring_view, vec_wstr, PushEach>/100000/2     2118473 ns      2287946 ns          280
BM_test<wstring_view, vec_wstr, PushEach>/1000000/2   23200818 ns     23995536 ns           28
BM_test<wstring_view, vec_wstr, PushEach>/5000000/2  116072467 ns    117187500 ns            6
----------------------------------------------------------------------------------------------
BM_test<wstring, vec_wstr, Remake>/10/2                    442 ns          469 ns      1600000
BM_test<wstring, vec_wstr, Remake>/100/2                  3514 ns         3770 ns       194783
BM_test<wstring, vec_wstr, Remake>/1000/2                41629 ns        39193 ns        23123
BM_test<wstring, vec_wstr, Remake>/10000/2              595905 ns       665509 ns         1080
BM_test<wstring, vec_wstr, Remake>/100000/2            9049275 ns      9375000 ns           75
BM_test<wstring, vec_wstr, Remake>/1000000/2         192242650 ns    191406250 ns            4
BM_test<wstring, vec_wstr, Remake>/5000000/2         920316300 ns    921875000 ns            1
↑vs↓
BM_test<wstring, vec_wstr, PushEach>/10/2                  403 ns          349 ns      1792000
BM_test<wstring, vec_wstr, PushEach>/100/2                2950 ns         3320 ns       263529
BM_test<wstring, vec_wstr, PushEach>/1000/2              32644 ns        36830 ns        20364
BM_test<wstring, vec_wstr, PushEach>/10000/2            429655 ns       352926 ns         1948
BM_test<wstring, vec_wstr, PushEach>/100000/2          5552056 ns      5781250 ns          100
BM_test<wstring, vec_wstr, PushEach>/1000000/2        72471273 ns     71022727 ns           11
BM_test<wstring, vec_wstr, PushEach>/5000000/2       399656450 ns    406250000 ns            2
----------------------------------------------------------------------------------------------
BM_test<keypair, vec_keypair, Remake>/10/2                 171 ns          172 ns      3446154
BM_test<keypair, vec_keypair, Remake>/100/2                621 ns          627 ns      1294865
BM_test<keypair, vec_keypair, Remake>/1000/2              5607 ns         5625 ns       100000
BM_test<keypair, vec_keypair, Remake>/10000/2            98423 ns        98438 ns        10000
BM_test<keypair, vec_keypair, Remake>/100000/2         1212722 ns      1098633 ns          640
BM_test<keypair, vec_keypair, Remake>/1000000/2       18098857 ns     17314189 ns           37
BM_test<keypair, vec_keypair, Remake>/5000000/2      115323817 ns    117187500 ns            6
↑vs↓
BM_test<keypair, vec_keypair, PushEach>/10/2               143 ns          172 ns      3733333
BM_test<keypair, vec_keypair, PushEach>/100/2              456 ns          519 ns      1445161
BM_test<keypair, vec_keypair, PushEach>/1000/2            9398 ns         8881 ns        80929
BM_test<keypair, vec_keypair, PushEach>/10000/2         106448 ns       117182 ns         7467
BM_test<keypair, vec_keypair, PushEach>/100000/2       1216059 ns      1311384 ns          560
BM_test<keypair, vec_keypair, PushEach>/1000000/2     13099080 ns     13437500 ns           50
BM_test<keypair, vec_keypair, PushEach>/5000000/2     65799400 ns     63920455 ns           11
----------------------------------------------------------------------------------------------
BM_test<keyobj, vec_keyobj, Remake>/10/2                   152 ns          146 ns      3733333
BM_test<keyobj, vec_keyobj, Remake>/100/2                  386 ns          374 ns      2133333
BM_test<keyobj, vec_keyobj, Remake>/1000/2                2653 ns         2407 ns       298667
BM_test<keyobj, vec_keyobj, Remake>/10000/2              62565 ns        67188 ns        10000
BM_test<keyobj, vec_keyobj, Remake>/100000/2            944318 ns       871931 ns          896
BM_test<keyobj, vec_keyobj, Remake>/1000000/2         15422817 ns     16006098 ns           41
BM_test<keyobj, vec_keyobj, Remake>/5000000/2        101054657 ns    100446429 ns            7
↑vs↓
BM_test<keyobj, vec_keyobj, PushEach>/10/2                 127 ns          135 ns      4977778
BM_test<keyobj, vec_keyobj, PushEach>/100/2                274 ns          322 ns      2133333
BM_test<keyobj, vec_keyobj, PushEach>/1000/2              1987 ns         2100 ns       320000
BM_test<keyobj, vec_keyobj, PushEach>/10000/2            75382 ns        82310 ns        11200
BM_test<keyobj, vec_keyobj, PushEach>/100000/2          998249 ns      1032366 ns         1120
BM_test<keyobj, vec_keyobj, PushEach>/1000000/2       10785954 ns     11160714 ns           56
BM_test<keyobj, vec_keyobj, PushEach>/5000000/2       53694118 ns     51136364 ns           11
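
For readers who want to try the two strategies outside the benchmark harness, here is a minimal standalone sketch using only standard library calls. The names `extend_by_remake`, `extend_by_push_each` and `demo_both_strategies` are hypothetical and are not part of the PR or of the STL; the sketch only checks that both strategies produce a valid heap and says nothing about their relative speed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper ("Remake"): heapify the whole container,
// ignoring the fact that a prefix of it is already a heap.
template <class T>
void extend_by_remake(std::vector<T>& c) {
    std::make_heap(c.begin(), c.end());
}

// Hypothetical helper ("PushEach"): the first old_size elements are
// assumed to already form a heap; sift up each appended element in turn.
template <class T>
void extend_by_push_each(std::vector<T>& c, std::size_t old_size) {
    assert(old_size <= c.size());
    auto heap_end = c.begin() + static_cast<std::ptrdiff_t>(old_size);
    while (heap_end != c.end()) {
        std::push_heap(c.begin(), ++heap_end);
    }
}

// Hypothetical demo: append as many elements as the heap already holds
// (new_size / old_size == 2) and extend the heap with both strategies.
bool demo_both_strategies() {
    std::vector<int> old_heap{9, 4, 7, 1, 3};
    std::make_heap(old_heap.begin(), old_heap.end());
    const std::size_t old_size = old_heap.size();

    const std::vector<int> incoming{8, 2, 10, 6, 5};
    std::vector<int> a = old_heap;
    a.insert(a.end(), incoming.begin(), incoming.end());
    std::vector<int> b = a;

    extend_by_remake(a);
    extend_by_push_each(b, old_size);

    return std::is_heap(a.begin(), a.end()) && std::is_heap(b.begin(), b.end())
        && a.front() == 10 && b.front() == 10;
}
```

The two resulting vectors may order their elements differently, since a heap is not unique for a given set of values; only the heap property and the top element are guaranteed to agree.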

If we replace `BM_test` and `ADD_BENCHMARK` with the following, so that the whole heap is built from scratch instead of extending an existing heap:
    template <class T, const auto& Data, bool Remake>
    void BM_test(benchmark::State& state) {
        const size_t vsize = static_cast<size_t>(state.range(0));

        for (auto _ : state) {
            span spn(Data);
            assert(spn.size() >= vsize);

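            // Untimed: copy the first vsize elements so each iteration starts from the same input.
            state.PauseTiming();
            vector<T> c(from_range, spn.subspan(0, vsize));
            state.ResumeTiming();

            if constexpr (Remake) {
                // Heapify the whole vector in one call.
                std::make_heap(c.begin(), c.end());
            }
            else {
                // Grow the heap one element at a time, starting from an empty heap.
                // _Get_unwrapped is an MSVC STL internal, not a standard function.
                const auto _Begin = _Get_unwrapped(c.begin());
                auto _Heap_end = _Begin;
                const auto _End = _Get_unwrapped(c.end());
                while (_Heap_end != _End) {
                    std::push_heap(_Begin, ++_Heap_end);
                }
            }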

            benchmark::DoNotOptimize(c);
        }
    }
}

enum : bool { Remake = true, PushEach = false };

#define ADD_BENCHMARK(T, source) \
    BENCHMARK(BM_test<T, source, Remake>) \
        ->ArgsProduct({benchmark::CreateRange(10, vec_size, 10)}) \
        ->Setup(putln<__LINE__>); \
    BENCHMARK(BM_test<T, source, PushEach>) \
        ->ArgsProduct({benchmark::CreateRange(10, vec_size, 10)}) \
        ->Setup(putvs<__LINE__>);
Result
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
BM_test<uint8_t, vec_u8, Remake>/10                       122 ns         94.9 ns      5600000
BM_test<uint8_t, vec_u8, Remake>/100                      317 ns          268 ns      1866667
BM_test<uint8_t, vec_u8, Remake>/1000                    2229 ns         2246 ns       320000
BM_test<uint8_t, vec_u8, Remake>/10000                  22517 ns        22496 ns        29867
BM_test<uint8_t, vec_u8, Remake>/100000                383407 ns       384976 ns         1867
BM_test<uint8_t, vec_u8, Remake>/1000000              3985601 ns      3928073 ns          179
BM_test<uint8_t, vec_u8, Remake>/10000000            40590724 ns     40441176 ns           17
↑vs↓
BM_test<uint8_t, vec_u8, PushEach>/10                     119 ns          119 ns      4977778
BM_test<uint8_t, vec_u8, PushEach>/100                    252 ns          246 ns      2986667
BM_test<uint8_t, vec_u8, PushEach>/1000                  1612 ns         1639 ns       448000
BM_test<uint8_t, vec_u8, PushEach>/10000                58932 ns        58594 ns        11200
BM_test<uint8_t, vec_u8, PushEach>/100000              743059 ns       739397 ns         1120
BM_test<uint8_t, vec_u8, PushEach>/1000000            7524824 ns      7465278 ns           90
BM_test<uint8_t, vec_u8, PushEach>/10000000          77603811 ns     78125000 ns            9
----------------------------------------------------------------------------------------------
BM_test<uint16_t, vec_u16, Remake>/10                     126 ns          126 ns     11200000
BM_test<uint16_t, vec_u16, Remake>/100                    321 ns          308 ns      2133333
BM_test<uint16_t, vec_u16, Remake>/1000                  2257 ns         2246 ns       320000
BM_test<uint16_t, vec_u16, Remake>/10000                25124 ns        25112 ns        28000
BM_test<uint16_t, vec_u16, Remake>/100000              408577 ns       404988 ns         1659
BM_test<uint16_t, vec_u16, Remake>/1000000            4268498 ns      4723837 ns          172
BM_test<uint16_t, vec_u16, Remake>/10000000          46360233 ns     46875000 ns           15
↑vs↓
BM_test<uint16_t, vec_u16, PushEach>/10                   124 ns          138 ns      7466667
BM_test<uint16_t, vec_u16, PushEach>/100                  252 ns          268 ns      2800000
BM_test<uint16_t, vec_u16, PushEach>/1000                1534 ns         1569 ns       497778
BM_test<uint16_t, vec_u16, PushEach>/10000              63547 ns        61384 ns        11200
BM_test<uint16_t, vec_u16, PushEach>/100000            758937 ns       767299 ns         1120
BM_test<uint16_t, vec_u16, PushEach>/1000000          7834772 ns      7986111 ns           90
BM_test<uint16_t, vec_u16, PushEach>/10000000        79212100 ns     78125000 ns            9
----------------------------------------------------------------------------------------------
BM_test<uint32_t, vec_u32, Remake>/10                     122 ns          119 ns      4977778
BM_test<uint32_t, vec_u32, Remake>/100                    209 ns          191 ns      4977778
BM_test<uint32_t, vec_u32, Remake>/1000                  1025 ns          903 ns       640000
BM_test<uint32_t, vec_u32, Remake>/10000                13632 ns        14230 ns        56000
BM_test<uint32_t, vec_u32, Remake>/100000              346827 ns       337672 ns         2036
BM_test<uint32_t, vec_u32, Remake>/1000000            3792709 ns      3676471 ns          204
BM_test<uint32_t, vec_u32, Remake>/10000000          45967253 ns     47916667 ns           15
↑vs↓
BM_test<uint32_t, vec_u32, PushEach>/10                   122 ns          127 ns      8960000
BM_test<uint32_t, vec_u32, PushEach>/100                  232 ns          241 ns      2986667
BM_test<uint32_t, vec_u32, PushEach>/1000                1454 ns         1500 ns       448000
BM_test<uint32_t, vec_u32, PushEach>/10000              65348 ns        65569 ns        11200
BM_test<uint32_t, vec_u32, PushEach>/100000            755426 ns       749860 ns          896
BM_test<uint32_t, vec_u32, PushEach>/1000000          7909446 ns      7638889 ns           90
BM_test<uint32_t, vec_u32, PushEach>/10000000        80133722 ns     79861111 ns            9
----------------------------------------------------------------------------------------------
BM_test<uint64_t, vec_u64, Remake>/10                     118 ns          128 ns      5600000
BM_test<uint64_t, vec_u64, Remake>/100                    200 ns          230 ns      4072727
BM_test<uint64_t, vec_u64, Remake>/1000                   948 ns          952 ns       640000
BM_test<uint64_t, vec_u64, Remake>/10000                15418 ns        15695 ns        44800
BM_test<uint64_t, vec_u64, Remake>/100000              352627 ns       344905 ns         1948
BM_test<uint64_t, vec_u64, Remake>/1000000            4441137 ns      4727564 ns          195
BM_test<uint64_t, vec_u64, Remake>/10000000          64993700 ns     63920455 ns           11
↑vs↓
BM_test<uint64_t, vec_u64, PushEach>/10                   121 ns          119 ns      4977778
BM_test<uint64_t, vec_u64, PushEach>/100                  248 ns          278 ns      2357895
BM_test<uint64_t, vec_u64, PushEach>/1000                1561 ns         1430 ns       448000
BM_test<uint64_t, vec_u64, PushEach>/10000              62403 ns        62500 ns        10000
BM_test<uint64_t, vec_u64, PushEach>/100000            769181 ns       732422 ns          896
BM_test<uint64_t, vec_u64, PushEach>/1000000          8240882 ns      7986111 ns           90
BM_test<uint64_t, vec_u64, PushEach>/10000000        84632929 ns     84821429 ns            7
----------------------------------------------------------------------------------------------
BM_test<float, vec_float, Remake>/10                      122 ns          126 ns      5600000
BM_test<float, vec_float, Remake>/100                     237 ns          213 ns      3733333
BM_test<float, vec_float, Remake>/1000                   1373 ns         1392 ns       640000
BM_test<float, vec_float, Remake>/10000                 16285 ns        15904 ns        37333
BM_test<float, vec_float, Remake>/100000               441348 ns       449219 ns         1600
BM_test<float, vec_float, Remake>/1000000             4818440 ns      5082831 ns          166
BM_test<float, vec_float, Remake>/10000000           55255373 ns     58238636 ns           11
↑vs↓
BM_test<float, vec_float, PushEach>/10                    127 ns          116 ns      4977778
BM_test<float, vec_float, PushEach>/100                   336 ns          377 ns      1866667
BM_test<float, vec_float, PushEach>/1000                 2434 ns         2490 ns       263529
BM_test<float, vec_float, PushEach>/10000               74388 ns        73242 ns         8960
BM_test<float, vec_float, PushEach>/100000             904533 ns       899431 ns          747
BM_test<float, vec_float, PushEach>/1000000           9409705 ns      9277344 ns           64
BM_test<float, vec_float, PushEach>/10000000         95706157 ns     95982143 ns            7
----------------------------------------------------------------------------------------------
BM_test<double, vec_double, Remake>/10                    122 ns          112 ns      4480000
BM_test<double, vec_double, Remake>/100                   237 ns          206 ns      4480000
BM_test<double, vec_double, Remake>/1000                 1367 ns         1395 ns       896000
BM_test<double, vec_double, Remake>/10000               18628 ns        19252 ns        37333
BM_test<double, vec_double, Remake>/100000             451185 ns       394417 ns         1545
BM_test<double, vec_double, Remake>/1000000           5406529 ns      5440848 ns          112
BM_test<double, vec_double, Remake>/10000000         72387822 ns     72916667 ns            9
↑vs↓
BM_test<double, vec_double, PushEach>/10                  129 ns          142 ns      4072727
BM_test<double, vec_double, PushEach>/100                 333 ns          293 ns      1866667
BM_test<double, vec_double, PushEach>/1000               2487 ns         2550 ns       263529
BM_test<double, vec_double, PushEach>/10000             84926 ns        83705 ns         8960
BM_test<double, vec_double, PushEach>/100000          1179283 ns      1199777 ns          560
BM_test<double, vec_double, PushEach>/1000000        13005580 ns     12152778 ns           45
BM_test<double, vec_double, PushEach>/10000000      130754380 ns    131250000 ns            5
----------------------------------------------------------------------------------------------
BM_test<string_view, vec_str, Remake>/10                  250 ns          186 ns      3446154
BM_test<string_view, vec_str, Remake>/100                 993 ns          942 ns       746667
BM_test<string_view, vec_str, Remake>/1000              12041 ns        12695 ns        64000
BM_test<string_view, vec_str, Remake>/10000            200345 ns       199507 ns         3446
BM_test<string_view, vec_str, Remake>/100000          2520021 ns      2582097 ns          236
BM_test<string_view, vec_str, Remake>/1000000        49534310 ns     50000000 ns           10
BM_test<string_view, vec_str, Remake>/10000000      556311900 ns    546875000 ns            1
↑vs↓
BM_test<string_view, vec_str, PushEach>/10                247 ns          225 ns      3200000
BM_test<string_view, vec_str, PushEach>/100              1466 ns         1569 ns       497778
BM_test<string_view, vec_str, PushEach>/1000            18451 ns        17997 ns        37333
BM_test<string_view, vec_str, PushEach>/10000          217737 ns       229492 ns         3200
BM_test<string_view, vec_str, PushEach>/100000        2361740 ns      2249053 ns          264
BM_test<string_view, vec_str, PushEach>/1000000      25398857 ns     26785714 ns           28
BM_test<string_view, vec_str, PushEach>/10000000    257395500 ns    260416667 ns            3
----------------------------------------------------------------------------------------------
BM_test<string, vec_str, Remake>/10                       307 ns          308 ns      2488889
BM_test<string, vec_str, Remake>/100                     1641 ns         1726 ns       407273
BM_test<string, vec_str, Remake>/1000                   18749 ns        17090 ns        32000
BM_test<string, vec_str, Remake>/10000                 227688 ns       244141 ns         3200
BM_test<string, vec_str, Remake>/100000               2657768 ns      2508361 ns          299
BM_test<string, vec_str, Remake>/1000000             42054665 ns     43198529 ns           17
BM_test<string, vec_str, Remake>/10000000           515694300 ns    515625000 ns            1
↑vs↓
BM_test<string, vec_str, PushEach>/10                     304 ns          320 ns      1659259
BM_test<string, vec_str, PushEach>/100                   2412 ns         2197 ns       298667
BM_test<string, vec_str, PushEach>/1000                 26279 ns        26367 ns        24889
BM_test<string, vec_str, PushEach>/10000               276699 ns       278308 ns         2358
BM_test<string, vec_str, PushEach>/100000             3050523 ns      3111758 ns          236
BM_test<string, vec_str, PushEach>/1000000           34666641 ns     31250000 ns           22
BM_test<string, vec_str, PushEach>/10000000         352408200 ns    351562500 ns            2
----------------------------------------------------------------------------------------------
BM_test<wstring_view, vec_wstr, Remake>/10                228 ns          194 ns      2986667
BM_test<wstring_view, vec_wstr, Remake>/100               754 ns          732 ns       896000
BM_test<wstring_view, vec_wstr, Remake>/1000             9207 ns         8906 ns       100000
BM_test<wstring_view, vec_wstr, Remake>/10000          199420 ns       195312 ns         3200
BM_test<wstring_view, vec_wstr, Remake>/100000        3092188 ns      3263052 ns          249
BM_test<wstring_view, vec_wstr, Remake>/1000000      75516755 ns     75284091 ns           11
BM_test<wstring_view, vec_wstr, Remake>/10000000    816481900 ns    828125000 ns            1
↑vs↓
BM_test<wstring_view, vec_wstr, PushEach>/10              219 ns          222 ns      3733333
BM_test<wstring_view, vec_wstr, PushEach>/100             857 ns          879 ns       640000
BM_test<wstring_view, vec_wstr, PushEach>/1000           8134 ns         7847 ns        89600
BM_test<wstring_view, vec_wstr, PushEach>/10000        215813 ns       214844 ns         3200
BM_test<wstring_view, vec_wstr, PushEach>/100000      3046046 ns      3074799 ns          249
BM_test<wstring_view, vec_wstr, PushEach>/1000000    45603000 ns     46875000 ns           15
BM_test<wstring_view, vec_wstr, PushEach>/10000000  470674400 ns    468750000 ns            2
----------------------------------------------------------------------------------------------
BM_test<wstring, vec_wstr, Remake>/10                     453 ns          516 ns      1605632
BM_test<wstring, vec_wstr, Remake>/100                   2938 ns         3439 ns       263529
BM_test<wstring, vec_wstr, Remake>/1000                 33090 ns        32087 ns        22400
BM_test<wstring, vec_wstr, Remake>/10000               452019 ns       460379 ns         1120
BM_test<wstring, vec_wstr, Remake>/100000             6785253 ns      6417411 ns          112
BM_test<wstring, vec_wstr, Remake>/1000000          172489025 ns    171875000 ns            4
BM_test<wstring, vec_wstr, Remake>/10000000        1825856200 ns   1828125000 ns            1
↑vs↓
BM_test<wstring, vec_wstr, PushEach>/10                   592 ns          565 ns      1493333
BM_test<wstring, vec_wstr, PushEach>/100                 3264 ns         5156 ns       100000
BM_test<wstring, vec_wstr, PushEach>/1000               33063 ns        30762 ns        24889
BM_test<wstring, vec_wstr, PushEach>/10000             472695 ns       470948 ns         1493
BM_test<wstring, vec_wstr, PushEach>/100000           6972019 ns      6423611 ns           90
BM_test<wstring, vec_wstr, PushEach>/1000000        131726560 ns    125000000 ns            5
BM_test<wstring, vec_wstr, PushEach>/10000000      1377621400 ns   1390625000 ns            1
----------------------------------------------------------------------------------------------
BM_test<keypair, vec_keypair, Remake>/10                  214 ns          195 ns      3200000
BM_test<keypair, vec_keypair, Remake>/100                 639 ns          584 ns      1338027
BM_test<keypair, vec_keypair, Remake>/1000               5284 ns         5441 ns       112000
BM_test<keypair, vec_keypair, Remake>/10000             82015 ns        85449 ns         8960
BM_test<keypair, vec_keypair, Remake>/100000          1045168 ns       890625 ns         1000
BM_test<keypair, vec_keypair, Remake>/1000000         8308346 ns      8007812 ns           80
BM_test<keypair, vec_keypair, Remake>/10000000      114482767 ns    114583333 ns            6
↑vs↓
BM_test<keypair, vec_keypair, PushEach>/10                135 ns          122 ns      4480000
BM_test<keypair, vec_keypair, PushEach>/100               435 ns          481 ns      1723077
BM_test<keypair, vec_keypair, PushEach>/1000             8159 ns         7952 ns        74667
BM_test<keypair, vec_keypair, PushEach>/10000          104220 ns       104980 ns         6400
BM_test<keypair, vec_keypair, PushEach>/100000        1168347 ns      1223645 ns          498
BM_test<keypair, vec_keypair, PushEach>/1000000      11983237 ns     11230469 ns           64
BM_test<keypair, vec_keypair, PushEach>/10000000    121805917 ns    119791667 ns            6
----------------------------------------------------------------------------------------------
BM_test<keyobj, vec_keyobj, Remake>/10                    122 ns          117 ns      5600000
BM_test<keyobj, vec_keyobj, Remake>/100                   214 ns          218 ns      3446154
BM_test<keyobj, vec_keyobj, Remake>/1000                 1182 ns         1172 ns       640000
BM_test<keyobj, vec_keyobj, Remake>/10000               14112 ns        14439 ns        49778
BM_test<keyobj, vec_keyobj, Remake>/100000             605147 ns       593750 ns         1000
BM_test<keyobj, vec_keyobj, Remake>/1000000          10233219 ns     10881696 ns          112
BM_test<keyobj, vec_keyobj, Remake>/10000000        128376640 ns    128125000 ns            5
↑vs↓
BM_test<keyobj, vec_keyobj, PushEach>/10                  122 ns          123 ns      4072727
BM_test<keyobj, vec_keyobj, PushEach>/100                 258 ns          264 ns      2488889
BM_test<keyobj, vec_keyobj, PushEach>/1000               1692 ns         1695 ns       497778
BM_test<keyobj, vec_keyobj, PushEach>/10000             63880 ns        64523 ns         8960
BM_test<keyobj, vec_keyobj, PushEach>/100000           915264 ns       864955 ns         1120
BM_test<keyobj, vec_keyobj, PushEach>/1000000         9778583 ns     10009766 ns           64
BM_test<keyobj, vec_keyobj, PushEach>/10000000       97173957 ns     98214286 ns            7

This benchmark compares rebuilding the whole heap with a single make_heap call against a push_heap loop. The results are mixed, but they show that make_heap is not always faster than a push_heap loop.
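The two strategies being compared can be sketched roughly as follows. This is only an illustration of the idea, not the benchmark source: the function names `remake` and `push_each` are made up here to mirror the `Remake` and `PushEach` labels in the output, and both assume `heap` already satisfies the heap property on entry.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical "Remake" strategy: append everything, then rebuild the heap once.
template <class T, class Compare = std::less<T>>
void remake(std::vector<T>& heap, const std::vector<T>& incoming, Compare comp = {}) {
    heap.insert(heap.end(), incoming.begin(), incoming.end());
    std::make_heap(heap.begin(), heap.end(), comp);
}

// Hypothetical "PushEach" strategy: append one element at a time and sift it up.
template <class T, class Compare = std::less<T>>
void push_each(std::vector<T>& heap, const std::vector<T>& incoming, Compare comp = {}) {
    for (const T& value : incoming) {
        heap.push_back(value);
        std::push_heap(heap.begin(), heap.end(), comp);
    }
}
```

Both functions leave a valid heap holding the same elements; they differ only in how much work is done per appended element, which is what the numbers above measure.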

->Setup(putln<__LINE__>) \
->RangeMultiplier(100) \
->Range(1, vec_size) \
->Arg(vec_size / 2 + 1);
Contributor Author

@achabense achabense Sep 13, 2023

In the old version, Arg(vec_size / 2 + 1) makes the function call make_heap on the first half of the data, then call make_heap again over the whole container after the remaining half is appended:
image
In the new version, Arg(vec_size / 2 + 1) makes the function call make_heap on the first half of the data, then call push_heap for each element of the remaining half:
image
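A minimal sketch of that decision, assuming the `new_size / 2 > old_size` boundary described in the PR summary. The helper name `push_range_sketch` and its `bool` return value are invented for illustration; this is not the code in `stl/inc/queue`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: append `incoming` to `c`, which must already be a heap.
// Returns true if the whole heap was rebuilt, false if elements were pushed one by one.
template <class T, class Compare = std::less<T>>
bool push_range_sketch(std::vector<T>& c, const std::vector<T>& incoming, Compare comp = {}) {
    const std::size_t old_size = c.size();
    c.insert(c.end(), incoming.begin(), incoming.end());
    const std::size_t new_size = c.size();

    if (new_size / 2 > old_size) {
        // The appended part dominates: rebuild everything in one pass.
        std::make_heap(c.begin(), c.end(), comp);
        return true;
    }

    // The existing heap dominates: sift up each appended element in turn.
    // Before each call, [begin, begin + i - 1) is a heap and element i - 1 is the new one.
    for (std::size_t i = old_size + 1; i <= new_size; ++i) {
        std::push_heap(c.begin(), c.begin() + static_cast<std::ptrdiff_t>(i), comp);
    }
    return false;
}
```

With vec_size == 10000 and a first batch of 5001 elements, the second batch of 4999 gives new_size / 2 == 5000, which is not greater than 5001, so this sketch takes the push_heap branch, matching the behavior described for the new version.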

stl/inc/xutility Outdated
@@ -541,7 +541,8 @@ struct less_equal<void> {
 template <class _Fx>
 struct _Ref_fn { // pass function object by value as a reference
     template <class... _Args>
-    constexpr decltype(auto) operator()(_Args&&... _Vals) { // forward function call operator
+    constexpr decltype(auto) operator()(_Args&&... _Vals) noexcept(
+        is_nothrow_invocable_v<_Fx&, _Args&&...>) { // forward function call operator
Contributor

We can use _Select_invoke_traits<_Fx&, _Args...>::_Is_nothrow_invocable::value (&& can be omitted as it is added by default).

inline constexpr bool is_nothrow_invocable_v = _Select_invoke_traits<_Callable, _Args...>::_Is_nothrow_invocable::value;

Comment on lines -544 to +545
-    constexpr decltype(auto) operator()(_Args&&... _Vals) { // forward function call operator
+    constexpr decltype(auto) operator()(_Args&&... _Vals) noexcept(
+        _Select_invoke_traits<_Fx&, _Args...>::_Is_nothrow_invocable::value) { // forward function call operator
Member

No change required - I don't have data to back up the claim - but there's a lot of expensive SFINAE in _Select_invoke_traits that _Ref_fn doesn't need. We may be better off with:

template <class _Fx, bool = is_member_pointer_v<_Fx>>
struct _Ref_fn { // pass function object by value as a reference
    template <class... _Args>
    constexpr decltype(auto) operator()(_Args&&... _Vals) noexcept(
        noexcept(_STD invoke(_Fn, _STD forward<_Args>(_Vals)...))) {
        return _STD invoke(_Fn, _STD forward<_Args>(_Vals)...);
    }

    _Fx& _Fn;
};

template <class _Fx>
struct _Ref_fn<_Fx, false> {
    template <class... _Args>
    constexpr decltype(auto) operator()(_Args&&... _Vals) noexcept(noexcept(_Fn(_STD forward<_Args>(_Vals)...))) {
        return _Fn(_STD forward<_Args>(_Vals)...);
    }

    _Fx& _Fn;
};
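To illustrate what the conditional noexcept buys, here is a small standalone sketch using standard names only. `ref_fn`, `less_nothrow`, and `less_throwing` are hypothetical stand-ins, not STL internals, and the sketch covers only the plain function-object case (no member pointers).

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a by-reference function object wrapper.
// The call operator is noexcept exactly when calling the wrapped object is.
template <class Fx>
struct ref_fn {
    template <class... Args>
    constexpr decltype(auto) operator()(Args&&... vals) noexcept(
        noexcept(std::declval<Fx&>()(std::forward<Args>(vals)...))) {
        return fn(std::forward<Args>(vals)...);
    }

    Fx& fn;
};

// A comparator that promises not to throw.
struct less_nothrow {
    bool operator()(int a, int b) const noexcept { return a < b; }
};

// A comparator with no such promise.
struct less_throwing {
    bool operator()(int a, int b) const { return a < b; }
};
```

Wrapping `less_nothrow` yields a call expression for which `noexcept(...)` is true, while wrapping `less_throwing` yields one for which it is false, so callers that branch on the exception specification see the same answer as they would for the unwrapped object.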

@StephanTLavavej StephanTLavavej removed their assignment Sep 20, 2023
@StephanTLavavej
Member

I believe that this change is sufficiently safe, and the new benchmark provides justification for making the change, that we can merge this without a second maintainer approval.

@StephanTLavavej StephanTLavavej self-assigned this Sep 20, 2023
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 61cc2a5 into microsoft:main Sep 21, 2023
@StephanTLavavej
Member

Thanks for investigating and improving this performance with an added benchmark! 🎉 🚀 ⏱️

@achabense achabense deleted the _Priority_queue branch September 22, 2023 03:55
Labels
performance Must go faster
Development

Successfully merging this pull request may close these issues.

<queue>: priority_queue::push_range could conditionally push_heap
4 participants