Implement more efficient pack and unpack uint5 #1138

xuzijian629 · 2024-10-22T16:09:28Z

Summary:
Implemented a more efficient 5-bit packing and unpacking algorithm for groups of 8, 64, and 128 values.
The algorithm is documented in code comments; see T204077841 for further discussion.
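
For readers without access to the code comments, here is a minimal scalar sketch of what packing 8 five-bit values into 5 bytes (8 × 5 = 40 bits) can look like. It is an illustration only: the helper names `pack_8_uint5` and `unpack_8_uint5` and the little-endian bit layout are assumptions made for this example, not the functions or layout used in this PR, which may arrange bits differently to suit vectorized kernels.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helpers for illustration; not the functions added in this PR.

// Packs 8 values, each expected to be in [0, 31], into 5 bytes.
// Assumed layout: value i occupies bits [5*i, 5*i + 5) of a 40-bit word,
// which is then stored least-significant byte first.
inline std::array<uint8_t, 5> pack_8_uint5(const std::array<uint8_t, 8>& values) {
  uint64_t bits = 0;
  for (int i = 0; i < 8; ++i) {
    // Mask to 5 bits so out-of-range inputs cannot spill into neighbors.
    bits |= static_cast<uint64_t>(values[i] & 0x1F) << (5 * i);
  }
  std::array<uint8_t, 5> packed{};
  for (int b = 0; b < 5; ++b) {
    packed[b] = static_cast<uint8_t>((bits >> (8 * b)) & 0xFF);
  }
  return packed;
}

// Inverse of pack_8_uint5 under the same assumed layout.
inline std::array<uint8_t, 8> unpack_8_uint5(const std::array<uint8_t, 5>& packed) {
  uint64_t bits = 0;
  for (int b = 0; b < 5; ++b) {
    bits |= static_cast<uint64_t>(packed[b]) << (8 * b);
  }
  std::array<uint8_t, 8> values{};
  for (int i = 0; i < 8; ++i) {
    values[i] = static_cast<uint8_t>((bits >> (5 * i)) & 0x1F);
  }
  return values;
}
```

Under this layout, unpacking the result of packing returns the original 8 values as long as each input fits in 5 bits. The 64- and 128-value variants benchmarked below would process more values per call; this sketch does not attempt to reproduce them.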

Before

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
benchmark_pack_uint_values<1>/128/8           17.3 ns         17.2 ns     40530134
...
benchmark_pack_uint_values<5>/128/8           36.8 ns         36.5 ns     18974458
benchmark_pack_uint_values<5>/128/64          5.47 ns         5.43 ns    128341462
benchmark_pack_uint_values<5>/128/128         2.91 ns         2.70 ns    261633340
benchmark_unpack_uint_values<5>/128/8         28.8 ns         28.6 ns     24475696
benchmark_unpack_uint_values<5>/128/64        6.14 ns         5.65 ns    124953143
benchmark_unpack_uint_values<5>/128/128       2.90 ns         2.88 ns    242818639

After

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
benchmark_pack_uint_values<1>/128/8           17.9 ns         17.5 ns     40221794
...
benchmark_pack_uint_values<5>/128/8           24.9 ns         24.8 ns     28330676
benchmark_pack_uint_values<5>/128/64          2.63 ns         2.61 ns    267856460
benchmark_pack_uint_values<5>/128/128         2.04 ns         2.03 ns    344166380
benchmark_unpack_uint_values<5>/128/8         22.1 ns         22.0 ns     31850032
benchmark_unpack_uint_values<5>/128/64        2.92 ns         2.89 ns    242508228
benchmark_unpack_uint_values<5>/128/128       2.33 ns         2.25 ns    310688575

Reviewed By: metascroy

Differential Revision: D64703548

pytorch-bot · 2024-10-22T16:09:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1138

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4071cc4 with merge base d84191c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-10-22T16:09:35Z

This pull request was exported from Phabricator. Differential Revision: D64703548

facebook-github-bot added the CLA Signed label Oct 22, 2024

facebook-github-bot added the fb-exported label Oct 22, 2024

metascroy self-requested a review October 22, 2024 16:31

metascroy approved these changes Oct 22, 2024

facebook-github-bot merged commit f1b4c8e into pytorch:main Oct 22, 2024
19 checks passed

xuzijian629 deleted the export-D64703548 branch October 22, 2024 18:30

yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024

Android artifact update (pytorch#1138)

04ea309

Update AAR to 0919
