
Implement more efficient pack and unpack uint5 #1138

Merged (1 commit) on Oct 22, 2024

Conversation

xuzijian629
Contributor

Summary:
Implemented a more efficient 5-bit packing and unpacking algorithm for batches of 8, 64, and 128 values.
The algorithm is documented in code comments; see also T204077841 for discussion.
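
For readers unfamiliar with sub-byte packing, here is a minimal scalar sketch of what packing 8 uint5 values means: eight 5-bit values occupy 40 bits, so they fit exactly in 5 bytes. This is not the algorithm from this PR (that one is described in the code comments of the change itself); the function names `pack_8_uint5_sketch` and `unpack_8_uint5_sketch` and the little-endian bit layout are assumptions chosen only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical reference routine, NOT the PR's implementation.
// Packs eight values in [0, 31] into 5 bytes. Assumed layout: value i
// occupies bits [5*i, 5*i + 5) of a 40-bit little-endian integer.
inline std::array<std::uint8_t, 5> pack_8_uint5_sketch(
    const std::array<std::uint8_t, 8>& values) {
  std::uint64_t bits = 0;
  for (std::size_t i = 0; i < 8; ++i) {
    assert(values[i] < 32 && "each value must fit in 5 bits");
    bits |= static_cast<std::uint64_t>(values[i] & 0x1F) << (5 * i);
  }
  std::array<std::uint8_t, 5> packed{};
  for (std::size_t i = 0; i < 5; ++i) {
    packed[i] = static_cast<std::uint8_t>((bits >> (8 * i)) & 0xFF);
  }
  return packed;
}

// Inverse of the sketch above: rebuilds the 40-bit integer from 5 bytes,
// then extracts eight 5-bit fields.
inline std::array<std::uint8_t, 8> unpack_8_uint5_sketch(
    const std::array<std::uint8_t, 5>& packed) {
  std::uint64_t bits = 0;
  for (std::size_t i = 0; i < 5; ++i) {
    bits |= static_cast<std::uint64_t>(packed[i]) << (8 * i);
  }
  std::array<std::uint8_t, 8> values{};
  for (std::size_t i = 0; i < 8; ++i) {
    values[i] = static_cast<std::uint8_t>((bits >> (5 * i)) & 0x1F);
  }
  return values;
}
```

A round trip through `pack_8_uint5_sketch` and `unpack_8_uint5_sketch` returns the original eight values. A scalar loop like this is only a correctness reference; it says nothing about how the optimized kernels in this PR arrange bits or use vector instructions.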

Before

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
benchmark_pack_uint_values<1>/128/8           17.3 ns         17.2 ns     40530134
...
benchmark_pack_uint_values<5>/128/8           36.8 ns         36.5 ns     18974458
benchmark_pack_uint_values<5>/128/64          5.47 ns         5.43 ns    128341462
benchmark_pack_uint_values<5>/128/128         2.91 ns         2.70 ns    261633340
benchmark_unpack_uint_values<5>/128/8         28.8 ns         28.6 ns     24475696
benchmark_unpack_uint_values<5>/128/64        6.14 ns         5.65 ns    124953143
benchmark_unpack_uint_values<5>/128/128       2.90 ns         2.88 ns    242818639

After

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
benchmark_pack_uint_values<1>/128/8           17.9 ns         17.5 ns     40221794
...
benchmark_pack_uint_values<5>/128/8           24.9 ns         24.8 ns     28330676
benchmark_pack_uint_values<5>/128/64          2.63 ns         2.61 ns    267856460
benchmark_pack_uint_values<5>/128/128         2.04 ns         2.03 ns    344166380
benchmark_unpack_uint_values<5>/128/8         22.1 ns         22.0 ns     31850032
benchmark_unpack_uint_values<5>/128/64        2.92 ns         2.89 ns    242508228
benchmark_unpack_uint_values<5>/128/128       2.33 ns         2.25 ns    310688575
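
To make the two tables easier to compare, the sketch below computes before/after ratios from the Time column for the `<5>` rows. The `Row` struct, the `kUint5Rows` table, and the `speedup` helper are hypothetical names introduced here; the numbers are copied from the tables above. The ratios come out to roughly 1.48x, 2.08x, and 1.43x for packing 8, 64, and 128 values, and roughly 1.30x, 2.10x, and 1.24x for unpacking. The `<1>` row is roughly flat (17.3 ns versus 17.9 ns).

```cpp
#include <cassert>

// Hypothetical helper types for comparing the benchmark tables.
struct Row {
  const char* name;
  double before_ns;  // Time column, "Before" table
  double after_ns;   // Time column, "After" table
};

// Values copied from the uint5 rows of the two tables.
static const Row kUint5Rows[] = {
    {"pack<5>/128/8", 36.8, 24.9},
    {"pack<5>/128/64", 5.47, 2.63},
    {"pack<5>/128/128", 2.91, 2.04},
    {"unpack<5>/128/8", 28.8, 22.1},
    {"unpack<5>/128/64", 6.14, 2.92},
    {"unpack<5>/128/128", 2.90, 2.33},
};

// Ratio of times; a result above 1.0 means the "after" run is faster.
inline double speedup(double before_ns, double after_ns) {
  assert(after_ns > 0.0);
  return before_ns / after_ns;
}
```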

Reviewed By: metascroy

Differential Revision: D64703548


pytorch-bot bot commented Oct 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1138

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4071cc4 with merge base d84191c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Oct 22, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D64703548

@facebook-github-bot merged commit f1b4c8e into pytorch:main Oct 22, 2024
19 checks passed
@xuzijian629 deleted the export-D64703548 branch October 22, 2024 18:30
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
Labels: CLA Signed, fb-exported