
Refactoring class hierarchy for FSDP wrapping #34

Merged: 1 commit merged into main from fsdp_refactor on Oct 24, 2023
Conversation

@tgale96 (Contributor) commented on Oct 24, 2023

No description provided.

@mvpatel2000 merged commit 6a71b18 into main on Oct 24, 2023
@mvpatel2000 deleted the fsdp_refactor branch on Oct 24, 2023 at 22:40
@fabianlim commented on Mar 25, 2024

@tgale96 I wanted to ask more about this PR. I'm interested in using FSDP with megablocks. I haven't actually tried it yet, but I was wondering if I could clarify a couple of things about ParallelMLP:

  1. So self.mlp.w1 for different ranks will contain different weights, corresponding to the sharded experts. However, in the typical FSDP use case, all ranks' similarly named parameters should hold the same weights (they don't while flattened, but after the all-gather they should). So in this case, rank 0 and rank 1 have a similarly named parameter w1, but the weights are actually different. To my knowledge this is non-standard FSDP behavior, or am I misunderstanding something?
  2. As a follow-up to 1: do you require special FSDP settings so that w1 can hold different weights across ranks, and if so, which settings (e.g., sharding strategy, low_mem, etc.) did you use? For example, did you specifically have to use ignore_params to skip FSDP sharding of w1 and w2? (A sketch of what I mean is shown below.)

If you have any test scripts for FSDP, could you point me to them?
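
To make (2) concrete, here is a minimal sketch of the kind of configuration I mean, assuming FSDP's ignored_modules argument is the right knob and that the expert weights live in megablocks' ParallelMLP modules (model here is just a placeholder for whatever wraps them):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Hypothetical configuration: keep the expert-parallel weights (w1/w2) out of
# FSDP's flat parameters by listing the modules that own them.
expert_modules = [
    m for m in model.modules()
    if type(m).__name__ == "ParallelMLP"  # megablocks expert MLP layers
]
model = FSDP(model, ignored_modules=expert_modules)
```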

@tgale96 (Contributor, Author) commented on Mar 25, 2024

w1 is different across ranks in the expert model parallel group. You could choose to apply FSDP on top of this. For example, if you have 8-way expert model parallelism and 2-way data parallelism, you could use FSDP on the data parallel axis.

I think the above answers both your questions, hopefully? We don't currently have any tests for FSDP in this repo, unfortunately.
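
To sketch that layout: assuming 16 ranks arranged as 8-way expert model parallelism x 2-way data parallelism, laid out EP-major, you could build the data-parallel groups yourself and hand them to FSDP. The group construction and MyMoEModel below are illustrative placeholders, not megablocks APIs:

```python
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes 16 ranks: 8-way expert model parallelism x 2-way data parallelism,
# with ranks laid out EP-major (ranks 0-7 hold the 8 expert shards, etc.).
dist.init_process_group("nccl")
rank = dist.get_rank()
EP, DP = 8, 2

# Ranks that hold the same expert shard (same rank % EP) form one
# data-parallel group. new_group must be called on every rank for every group.
dp_group = None
for ep_rank in range(EP):
    ranks = [ep_rank + i * EP for i in range(DP)]
    group = dist.new_group(ranks)
    if rank in ranks:
        dp_group = group

# Wrap with FSDP only along the data-parallel axis: each expert shard is
# sharded across its 2-rank DP group, while different expert-parallel ranks
# keep different w1/w2 weights.
model = MyMoEModel().cuda()  # placeholder for the actual megablocks-based model
model = FSDP(model, process_group=dp_group)
```

With this, FSDP only all-gathers within each two-rank data-parallel group, so ranks in different expert-parallel positions never exchange their w1/w2.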
