Fix out-of-bounds memory access in Galore dequant kernel #1125

Merged (1 commit) on Oct 21, 2024

gau-nernst
Collaborator

This should fix the occasional illegal memory access (IMA) errors we observe, e.g. https://github.com/pytorch/ao/actions/runs/11437686813/job/31817703726

  • For some reason, PyTorch 2.5 in CI consistently produces this error (see "Add PyTorch 2.5 to regression test", #1100), instead of intermittently like the other CI jobs.
  • Running the Galore tests alone will not reproduce this error because of the CUDA caching allocator: the out-of-bounds addresses might still be valid addresses.
  • To reproduce the IMA error locally (without running the whole test suite), I allocate random tensors at the start of the test, e.g. tensor = [torch.randn(torch.randint(100, 2000, ()).item(), torch.randint(100, 2000, ()).item(), device="cuda") for _ in range(100)]
  • tl.load(qmap_ptr + q_idx.to(tl.int32)) will never go out of bounds, since 0 <= q_idx < 256 (q_idx is uint8) and qmap always has size 256 (if it is used correctly).
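
The bounds argument in the last bullet can be checked with a small pure-Python sketch. It does not run the Triton kernel, and the diff of this PR is not reproduced here, so the second half (offsets in the last block of a block-wise kernel) is only a generic illustration of where an unmasked load can step past a buffer, not a description of the actual change. The helper names and the example sizes (1000 elements, block size 256) are hypothetical.

```python
QMAP_SIZE = 256  # size the description says qmap always has when used correctly


def lookup_is_in_bounds(qmap_size: int = QMAP_SIZE) -> bool:
    """Check that every value a uint8 can hold is a valid index into qmap."""
    uint8_values = range(2**8)  # 0..255
    return all(0 <= q < qmap_size for q in uint8_values)


def block_offsets(pid: int, block_size: int) -> list[int]:
    """Offsets one program instance would touch in a block-wise kernel."""
    start = pid * block_size
    return [start + i for i in range(block_size)]


def out_of_bounds_offsets(n_elements: int, block_size: int) -> list[int]:
    """Offsets in the last block that lie past the buffer if no mask is applied.

    Assumes n_elements > 0 (hypothetical helper, for illustration only).
    """
    num_blocks = -(-n_elements // block_size)  # ceiling division
    last_block = block_offsets(num_blocks - 1, block_size)
    return [off for off in last_block if off >= n_elements]


# The uint8 lookup needs no mask: all 256 possible codes are valid indices.
print(lookup_is_in_bounds())

# A block-wise access over 1000 elements with block size 256 overshoots
# in its last block unless the offsets are masked against n_elements.
print(len(out_of_bounds_offsets(1000, 256)))
```

In a real Triton kernel, the usual guard for the second case is a `mask` argument on `tl.load` / `tl.store` built from a comparison such as `offsets < n_elements`.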

Initially I made this change in #1100, but since the IMA resurfaced in another CI job, I decided to open a separate PR so this fix can be merged before all the issues in #1100 are addressed.


pytorch-bot bot commented Oct 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1125

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8d89304 with merge base a2faafe (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 21, 2024
@gau-nernst gau-nernst requested a review from msaroufim October 21, 2024 13:15
@msaroufim msaroufim merged commit f33cff7 into pytorch:main Oct 21, 2024
17 checks passed
@gau-nernst gau-nernst deleted the galore_ima branch October 21, 2024 15:53