
Refactoring class hierarchy for FSDP wrapping #34

Merged: 1 commit merged into main from fsdp_refactor on Oct 24, 2023
Conversation

@tgale96 (Contributor) commented on Oct 24, 2023

No description provided.

@mvpatel2000 merged commit 6a71b18 into main on Oct 24, 2023
@mvpatel2000 deleted the fsdp_refactor branch on Oct 24, 2023 at 22:40
@fabianlim commented on Mar 25, 2024

@tgale96 I wanted to ask more about this PR. I'm interested in using FSDP with megablocks. I haven't actually tried it yet, but I was wondering if I could clarify a couple of things about ParallelMLP:

  1. So self.mlp.w1 for different ranks will contain different weights, corresponding to the sharded experts. However, in the typical FSDP use case, all ranks' similarly named parameters should hold the same weights (they don't while flattened, but after the all-gather they should). So in this case, rank 0 and rank 1 have a similarly named parameter w1, but the weights are actually different. To my knowledge this is non-standard FSDP behavior, or am I misunderstanding something?
  2. As a follow-up to 1: do you require special FSDP settings so that w1 can hold different weights across ranks, and if so, which settings (e.g., sharding strategy, low_mem, etc.) did you use? For example, did you specifically have to use ignore_params to skip FSDP sharding of w1 and w2? (A sketch of what I mean is shown below.)

If you have any test scripts for FSDP, could you point me to them?
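
To make (2) concrete, here is a minimal sketch of the kind of configuration I mean, assuming FSDP's ignored_modules argument is the right knob and that the expert weights live in megablocks' ParallelMLP modules (model here is just a placeholder for whatever wraps them):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Hypothetical configuration: keep the expert-parallel weights (w1/w2) out of
# FSDP's flat parameters by listing the modules that own them.
expert_modules = [
    m for m in model.modules()
    if type(m).__name__ == "ParallelMLP"  # megablocks expert MLP layers
]
model = FSDP(model, ignored_modules=expert_modules)
```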

@tgale96 (Contributor, Author) commented on Mar 25, 2024

w1 is different across ranks in the expert model parallel group. You could choose to apply FSDP on top of this. For example, if you have 8-way expert model parallelism and 2-way data parallelism, you could use FSDP on the data parallel axis.

I think the above answers both your questions, hopefully? We don't currently have any tests for FSDP in this repo, unfortunately.
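
To sketch that layout: assuming 16 ranks arranged as 8-way expert model parallelism x 2-way data parallelism, laid out EP-major, you could build the data-parallel groups yourself and hand them to FSDP. The group construction and MyMoEModel below are illustrative placeholders, not megablocks APIs:

```python
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes 16 ranks: 8-way expert model parallelism x 2-way data parallelism,
# with ranks laid out EP-major (ranks 0-7 hold the 8 expert shards, etc.).
dist.init_process_group("nccl")
rank = dist.get_rank()
EP, DP = 8, 2

# Ranks that hold the same expert shard (same rank % EP) form one
# data-parallel group. new_group must be called on every rank for every group.
dp_group = None
for ep_rank in range(EP):
    ranks = [ep_rank + i * EP for i in range(DP)]
    group = dist.new_group(ranks)
    if rank in ranks:
        dp_group = group

# Wrap with FSDP only along the data-parallel axis: each expert shard is
# sharded across its 2-rank DP group, while different expert-parallel ranks
# keep different w1/w2 weights.
model = MyMoEModel().cuda()  # placeholder for the actual megablocks-based model
model = FSDP(model, process_group=dp_group)
```

With this, FSDP only all-gathers within each two-rank data-parallel group, so ranks in different expert-parallel positions never exchange their w1/w2.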
