Set gloo process group for FSDP with CPU offload #2108
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2108

Note: Links to docs will display an error until the docs builds have been completed.

✅ No failures as of commit e4f00c4 with merge base 32e265d. This comment was automatically generated by Dr. CI and updates every 15 minutes.
recipes/full_finetune_distributed.py (outdated diff)

```python
if cfg.get("fsdp_cpu_offload", False):
    # Utilize all available CPU cores for intra-op parallelism. This provides ~2x
    # speedup when benchmarking fused AdamW on CPU.
    training.set_torch_num_threads()
    process_group = "cuda:nccl,cpu:gloo"
```
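For reference, a backend string in this `device:backend,device:backend` form is what `torch.distributed.init_process_group` accepts directly; the sketch below is illustrative plumbing, not the exact torchtune call site:

```python
import torch.distributed as dist

# Illustrative only: map each device type to its backend so CUDA collectives use
# NCCL while collectives on CPU tensors (e.g. offloaded grads) fall back to gloo.
process_group = "cuda:nccl,cpu:gloo"
dist.init_process_group(backend=process_group)
```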
n00b question: why not just make this the default every time, instead of `"gloo" if cfg.device == "cpu" else "nccl"`?
I actually wasn't sure myself why we do this. Well, now I know. So yes, I think we can take your suggestion here.
This seems low risk, since it was already tested in the DCP PR. Approving to unblock.
Addresses #1977.
As discussed in the issue (see this comment), FSDP's implementation of gradient clipping uses `_NormPartial`, which requires comms primitives (specifically `all_reduce`). This means that when running with CPU offloading, we need to initialize the gloo process group to compute the grad norm for DTensors on CPU. For simplicity, this PR enables gloo whenever CPU offloading is enabled, regardless of whether gradient clipping is used.
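To make the failure mode concrete, here is a hedged sketch of the gradient-clipping path that needs the CPU backend. The placeholder model, the FSDP2-style `fully_shard` import path (which varies across PyTorch versions), and the `torchrun` launch are assumptions for illustration, not part of this PR:

```python
import torch
import torch.distributed as dist
from torch.distributed._composable.fsdp import CPUOffloadPolicy, fully_shard

# Assumes launch via torchrun; the combined backend string gives CPU collectives
# a gloo backend alongside NCCL for CUDA.
dist.init_process_group(backend="cuda:nccl,cpu:gloo")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(16, 16)                        # placeholder model
fully_shard(model, offload_policy=CPUOffloadPolicy())  # grads become DTensors on CPU

loss = model(torch.randn(4, 16, device="cuda")).sum()
loss.backward()

# clip_grad_norm_ all-reduces the per-shard norms; with CPU offload the gradients
# live on CPU, so that all_reduce requires the gloo process group set up above.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```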
Test plan: Added a test case for gradient clipping + CPU offload to `test_full_finetune_distributed.py`.