
Fix gradient scaling to account for world_size normalization #2172

Merged

Conversation

mirceamironenco
Contributor

@mirceamironenco mirceamironenco commented Dec 18, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

  • FSDP/FSDP2/DDP normalize the gradients by world_size when performing the all_reduce. For a sequence-processing task where the desired loss is normalized by the total number of non-padded, non-ignored tokens, this world_size normalization has to be undone. For example, if world_size = 2 and we have two sets A, B of gradient-producing tokens, the total loss we want is (loss(A) + loss(B)) / (|A| + |B|), where the normalization factor 1 / (|A| + |B|) is currently handled by scale_grads:
    training.scale_grads(self._model, 1 / num_tokens)

If A and B are processed on separate data-parallel workers, the current gradients correspond to loss(A) / 2 + loss(B) / 2, so with the normalization applied as before the effective loss becomes (loss(A) + loss(B)) / (2 * (|A| + |B|)). This PR multiplies the scaling factor by world_size so that the extra 1 / world_size cancels out (a minimal sketch follows below).
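A minimal sketch of the corrected scaling, assuming the existing training.scale_grads utility and an initialized process group (the recipe may already track world_size; dist.get_world_size() is used here only to keep the snippet self-contained):

import torch.distributed as dist

# FSDP/DDP has already divided the summed gradients by world_size, so folding
# world_size into the token normalization leaves the grads scaled by
# 1 / num_tokens overall.
world_size = dist.get_world_size()
training.scale_grads(self._model, world_size / num_tokens)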

I haven't seen very large differences wrt loss curves in my preliminary experiments after this change:

[Screenshots: W&B loss curves for the two runs described below]

In these plots, world_size denotes the run where the gradient scaling factor is world_size / num_tokens; the other run uses 1 / num_tokens. The commands to replicate the plots are:

tune run --nproc_per_node 2 full_finetune_distributed --config llama3_2/3B_full metric_logger=torchtune.training.metric_logging.WandBLogger metric_logger.project=llama3.23b_fix metric_logger.name=world_size dataset.packed=True tokenizer.max_seq_len=512 compile=True

tune run --nproc_per_node 2 full_finetune_distributed.py --config configs/llama3_2/3B_full metric_logger=torchtune.training.metric_logging.WandBLogger metric_logger.project=llama3.23b_fix_noprompt metric_logger.name=world_size dataset.packed=True dataset.train_on_input=False tokenizer.max_seq_len=512 compile=True

Someone with more compute budget can probably get a better idea of the effect for larger models.

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings


pytorch-bot bot commented Dec 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2172

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a6dc03a with merge base 27fd3a1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 18, 2024
@ebsmothers
Contributor

Thanks @mirceamironenco for finding this bug and for making the fix! Apologies for the delay in getting back to it; I wanted to put together a minimal repro to validate this myself (I trust the code pointers, but I like seeing numerical parity), so I wrote the following script(s) to convince myself. Can confirm that on identical toy models with identical data we see (grad on N devices) == (grad on single device) / N. Let me run some more experiments to see to what extent this will affect loss curves at larger world sizes. If there is an impact, we should give people an FYI before landing. Will get back to you soon once I run the experiments!
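A hypothetical stand-in for the parity check described here (not the author's script), using plain DDP on CPU rather than FSDP2; launch with torchrun --nproc_per_node 2 repro.py:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("gloo")  # CPU-friendly backend for a toy check
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # Same seed on every rank -> identical weights and identical data everywhere.
    torch.manual_seed(0)
    model = torch.nn.Linear(4, 1, bias=False)
    ref = torch.nn.Linear(4, 1, bias=False)
    ref.load_state_dict(model.state_dict())
    ddp_model = DDP(model)

    # Full batch for the single-process reference; each rank takes a 2-row shard.
    x = torch.randn(2 * world_size, 4)
    y = torch.randn(2 * world_size, 1)
    shard = slice(rank * 2, (rank + 1) * 2)

    # Sum-reduction losses, so any averaging comes from the data-parallel wrapper.
    ((ddp_model(x[shard]) - y[shard]) ** 2).sum().backward()
    ((ref(x) - y) ** 2).sum().backward()

    if rank == 0:
        # Expect: (grad on N devices) == (grad on single device) / N
        print("ddp grad * world_size:", (ddp_model.module.weight.grad * world_size).tolist())
        print("single-process grad:  ", ref.weight.grad.tolist())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()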

@EugenHotaj
Contributor

@ebsmothers any updates on this? We've also seen this in our multi-node runs -- our grad norms are significantly smaller than what we see from other frameworks (e.g. NeMo).

@ebsmothers
Contributor

Hey @EugenHotaj thanks for the bump -- yes, we plan to land this soon. Actually the main reason for being slow on this PR (besides the holidays and PSC) is that we wanna be careful about breaking people who have e.g. their LR tuned to this setting. Ultimately I think we need to just rip the bandaid off and make the fix, then put comms here and in our Discord. Let me try to review and land later today

@mirceamironenco mirceamironenco force-pushed the fix-world-size-normalization branch from 34906b2 to a6dc03a Compare January 7, 2025 21:56
@ebsmothers
Contributor

Thanks for your patience @mirceamironenco. Just ran some quick experiments on my end on a single node with 8 GPUs. Attaching some plots below, WandB project is here. There are three runs: one on main, one on this PR, and one on this PR with learning rate scaled by 1/8.

[Screenshots: loss and grad norm curves for the three runs (main, this PR, this PR with LR scaled by 1/8)]

Unsurprisingly, it's similar to what @EugenHotaj mentioned -- the grad norm is off, almost exactly by a factor of 8. At least in my case the loss curves are pretty much identical too; not sure if there's a noticeable difference multi-node.

@ebsmothers ebsmothers self-requested a review January 7, 2025 23:25
@EugenHotaj
Contributor

At least for my case the loss curves are pretty much identical too

@ebsmothers I've noticed this as well on my runs and found it a bit surprising. Is it expected that the losses would be identical? The gradients point in the same direction but I would have thought we'd see some divergence after taking a few hundred gradient steps. I guess gradient clipping / LR accounts for a lot of this?

@codecov-commenter

codecov-commenter commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.

Project coverage is 23.95%. Comparing base (213f386) to head (a6dc03a).
Report is 1 commit behind head on main.

Files with missing lines                              Patch %   Missing lines
recipes/full_finetune_distributed.py                    0.00%   2
recipes/qat_distributed.py                              0.00%   2
recipes/knowledge_distillation_distributed.py           0.00%   1
recipes/lora_finetune_distributed.py                    0.00%   1
recipes/lora_finetune_distributed_multi_dataset.py      0.00%   1
recipes/qat_lora_finetune_distributed.py                0.00%   1

❗ There is a different number of reports uploaded between BASE (213f386) and HEAD (a6dc03a): HEAD has 6 fewer uploads than BASE (3 vs. 9).
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2172       +/-   ##
===========================================
- Coverage   65.41%   23.95%   -41.47%     
===========================================
  Files         344      352        +8     
  Lines       20658    20847      +189     
===========================================
- Hits        13514     4993     -8521     
- Misses       7144    15854     +8710     

☔ View full report in Codecov by Sentry.

@ebsmothers
Contributor

Is it expected that the losses would be identical? The gradients point in the same direction but I would have thought we'd see some divergence after taking a few hundred gradient steps. I guess gradient clipping / LR accounts for a lot of this?

@EugenHotaj yeah this gave me a bit of a scare, especially considering that we don't enable gradient clipping by default. Actually I believe that the behavior has to do with the optimizer: I used SGD instead of AdamW and manually hacked in a really high value for the grad scaler just to make sure nothing was broken. In that case the difference is very noticeable (see below). I didn't think that momentum would result in consistent loss curves when scaling grads up and down, but maybe I just need to refresh my memory on Adam a bit.

[Screenshot: loss curves for SGD with the gradient scaler hacked to a large value]
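As a hypothetical toy check of the same effect (not from this thread, using the stock torch.optim.SGD and AdamW): hand each optimizer the same constant gradient, rescale it by a factor c, and measure how far the parameter moves.

import torch

def total_movement(opt_cls, grad_scale, **kwargs):
    p = torch.nn.Parameter(torch.ones(1))
    opt = opt_cls([p], **kwargs)
    for _ in range(10):
        opt.zero_grad()
        p.grad = torch.full_like(p, 0.5) * grad_scale  # hand-set, rescaled gradient
        opt.step()
    return 1.0 - p.item()  # how far p moved from its initial value of 1.0

for c in (1.0, 8.0):
    sgd = total_movement(torch.optim.SGD, c, lr=0.1)
    adamw = total_movement(torch.optim.AdamW, c, lr=0.1, weight_decay=0.0)
    print(f"grad scale {c}: SGD moved {sgd:.3f}, AdamW moved {adamw:.3f}")

The SGD movement scales linearly with the gradient scale, while the AdamW movement is essentially unchanged, consistent with the loss curves above.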

@EugenHotaj
Contributor

I didn't think that momentum would result in consistent loss curves when scaling grads up and down, but maybe I just need to refresh my memory on Adam a bit.

@ebsmothers any chance we also need to do the same to adam momentum params when using FSDP? Pretty surprising to me as well that Adam would lead to identical learning curves

@mirceamironenco
Contributor Author

mirceamironenco commented Jan 8, 2025

@EugenHotaj yeah this gave me a bit of a scare, especially considering that we don't enable gradient clipping by default. Actually I believe that the behavior has to do with the optimizer: I used SGD instead of AdamW and manually hacked in a really high value for the grad scaler just to make sure nothing was broken. In that case the difference is very noticeable (see below). I didn't think that momentum would result in consistent loss curves when scaling grads up and down, but maybe I just need to refresh my memory on Adam a bit.
[Screenshot: loss curves for SGD with the gradient scaler hacked to a large value]

Just to make sure I understand, if you only hack the grad scaler to be very large but keep AdamW, the loss curves are still basically identical? (IIUC you did both in this comparison?)

Maybe the loss curves being very similar is not so strange, since the denominator will have a very large number of tokens compared to world_size, but here are some other ideas to battle-test this more (I can implement these in a separate branch just for a comparison if you want):

  1. Hardcoding the reduce_op when wrapping with fully_shard as mentioned in an earlier version of the PR:
# Import paths below are assumptions based on the FSDP2 APIs at the time of this PR.
from torch.distributed import ReduceOp
from torch.distributed._composable.fsdp import fully_shard

# Must be done for each sharded module.
module = fully_shard(
    module,
    mesh=mesh,
    reshard_after_forward=reshard_after_forward,
    shard_placement_fn=shard_placement_fn,
    mp_policy=mp_policy,
    offload_policy=offload_policy,
)
# Change the reduce op manually (private attribute) so gradients are
# summed instead of averaged across ranks.
fsdp_param_group = fully_shard.state(module)._fsdp_param_group
fsdp_param_group.reduce_scatter_reduce_op = ReduceOp.SUM

This happens before the optimizer is initialized, in case anything relevant is happening there.

  2. Implementing a DDP variant (you could also try it with reshard_after_fwd=False, but this would only be equivalent to ZeRO-2).

  3. Same experiment you did but with very high/very low learning rate.

Potentially getting some feedback from the FSDP2 authors just as a sanity check could be useful.

@ebsmothers
Contributor

Just to make sure I understand, if you only hack the grad scaler to be very large but keep AdamW, the loss curves are still basically identical? (IIUC you did both in this comparison?)

@mirceamironenco Yeah this is correct. Re your suggestions, (3) was the first one that came to my mind (also conveniently the easiest 😃) so I gave that a try on our distributed LoRA recipe. The below plot is the result of running AdamW with a much higher LR of 0.01, you can see that the two loss curves diverge (also unsurprisingly the loss blows up):

[Screenshot: loss curves with AdamW at LR 0.01, diverging between the two gradient scalings]

But the point is that scaling the gradients can result in different loss curves with AdamW; it just doesn't really show up under our baseline configs (which I suppose is a good thing with respect to the impact of this whole world-size-scaling bug).

Can also tag in our resident optimizer expert @janeyx99 in case she has any thoughts. TLDR for Jane is that we manually scale grads using this utility just before optimizer step, but surprisingly even scaling by a pretty large amount doesn't really mess with our loss curves when using AdamW (while for SGD there is a noticeable impact).

@janeyx99
Contributor

janeyx99 commented Jan 8, 2025

Not sure how helpful this is, but yes, I'm not surprised the gradient changes didn't affect AdamW as much as it did SGD. The SGD update is very gradient-dependent:
[image: SGD update rule]

whereas the Adam(W) update is scaled by momentum over sqrt(variance), which is like scaling by g/sqrt(g^2) with all the minutiae stripped away:
[image: Adam(W) update rule]
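For reference, the standard forms of the two updates being contrasted, with weight decay omitted:

$$\theta_{t+1} = \theta_t - \gamma \, g_t \qquad \text{(SGD, no momentum)}$$

$$\theta_{t+1} = \theta_t - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \qquad \text{(Adam/AdamW)}$$

Rescaling $g_t$ by a constant $c$ scales the SGD step by $c$, whereas in Adam(W) it scales $\hat{m}_t$ by $c$ and $\sqrt{\hat{v}_t}$ by $c$, so the step is essentially unchanged.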

@ebsmothers
Contributor

ebsmothers commented Jan 9, 2025

Thanks @janeyx99! This is very helpful. Also I clearly should've just dug up the Adam paper. Direct quote:

Assuming $\epsilon = 0$, the effective step taken in parameter space at timestep $t$ is $\Delta_t = \alpha \cdot \hat{m}_t / \sqrt{\hat{v}_t}$.
...
The effective stepsize $\Delta_t$ is also invariant to the scale of the gradients; rescaling the gradients with factor $c$ will scale $\hat{m}_t$ with a factor $c$ and $\hat{v}_t$ with a factor $c^2$, which will cancel out: $(c \cdot \hat{m}_t) / (\sqrt{c^2 \cdot \hat{v}_t}) = \hat{m}_t / \sqrt{\hat{v}_t}$.

(Please excuse my sloppy LaTeX, I swear I was good at this once..) Also thanks @mirceamironenco for mentioning this over chat and forcing me to dig it up. So actually I think we are good here -- in fact I'm no longer even worried about breaking BC with this change after actually having done my homework.

Contributor

@ebsmothers ebsmothers left a comment


Thanks for finding and fixing the bug @mirceamironenco! And thanks for your patience while we sorted out the whole Adam grad scaling thing in review. Based on our discussion, I think this is good to go.

@ebsmothers ebsmothers merged commit e420bc0 into pytorch:main Jan 9, 2025
17 checks passed
@mirceamironenco mirceamironenco deleted the fix-world-size-normalization branch January 10, 2025 08:11