Update lr_schedules.py #4563
Conversation
add cosine annealing scheduler
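For context, the schedule being proposed is the usual linear-warmup-then-cosine-decay curve, expressed as a multiplier on the optimizer's base learning rate. Below is a minimal sketch of that curve; the helper name and the `warmup_min_ratio`/`cos_min_ratio` parameters are illustrative and not necessarily the names used in the PR.

```python
import math

def warmup_cosine_ratio(step, warmup_num_steps, total_num_steps,
                        warmup_min_ratio=0.0, cos_min_ratio=0.0):
    """Return a multiplier on the base lr for the given step.

    Phase 1: linear warmup from warmup_min_ratio up to 1.0.
    Phase 2: cosine decay from 1.0 down to cos_min_ratio.
    """
    if step < warmup_num_steps:
        return warmup_min_ratio + (1.0 - warmup_min_ratio) * step / max(1, warmup_num_steps)
    progress = (step - warmup_num_steps) / max(1, total_num_steps - warmup_num_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return cos_min_ratio + (1.0 - cos_min_ratio) * cosine
```

A DeepSpeed-style scheduler would then multiply each param group's base lr by this ratio on every `step()` call.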
@microsoft-github-policy-service agree
@CoinCheung, thanks for the PR. A few items to address: to fix the formatting issues, use this guide; please add a unit test (example); and inspect the failing CI tests.
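A rough shape for such a unit test, modeled on the existing scheduler tests; the import path and the `total_num_steps`/`warmup_num_steps` argument names are assumptions based on this discussion, not confirmed API.

```python
import torch
from deepspeed.runtime.lr_schedules import WarmupCosineLR  # assumed location

def test_warmup_cosine_lr_decays():
    base_lr = 0.1
    model = torch.nn.Linear(4, 4)
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)
    # argument names below are assumptions based on the PR discussion
    sched = WarmupCosineLR(optimizer, total_num_steps=100, warmup_num_steps=10)

    lrs = []
    for _ in range(100):
        sched.step()
        lrs.append(optimizer.param_groups[0]["lr"])

    assert max(lrs) <= base_lr + 1e-8      # ratio-based: never exceeds the base lr
    assert lrs[-1] < lrs[len(lrs) // 2]    # lr decays during the cosine phase
```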
@tjruwase @wjessup @dfyz @manuelciosici I have no experience with PRs for DeepSpeed. What is the status of this now? Is there anything further I need to work on?
@CoinCheung, thanks for making the changes. We will review and merge once the CI passes.
Hi @tjruwase, I have made some fixes. Would you please launch the CI tests one more time?
@tjruwase Would you please launch CI one more time?
Hi @jeffra @mrwyattii I think the problem is not with my fix; it is an inference error, while my fix concerns the training learning rate scheduler. Can this fix be merged? Or are there other things I need to commit?
@CoinCheung, sorry for the delay. It seems the issue is with our CI system. Please bear with us while we resolve the problem.
Hi @tjruwase, what is the status of this thread?
@CoinCheung, I have restarted CI. Let's see how it goes.
Hi @tjruwase,
@CoinCheung, no, I don't think it is related to your changes.
Should WarmupCosineLR inherit from WarmupLR?
Yes, you are correct. It should. @CoinCheung, are you able to refactor your changes? Thanks!
Hi @tjruwase @kmn1024, I do not think WarmupCosineLR can be inherited from WarmupLR in this case, since they use different methods to determine the learning rates. For WarmupCosineLR, I use a "ratio of the original lr values", which I think is more principled, while WarmupLR uses specific lr values that have to be set explicitly. The reason why I feel using ratios is better: we do not need to set specific lr values in both the optimizer config and the scheduler config. From my experience of tuning models, this approach is less likely to cause the mistake where I change the optimizer lr but forget to change the scheduler. Also, in some papers, if I recall correctly, the authors claim that they use CosineLR to train their model and describe the annealed learning rate as a fraction of the peak value.
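To illustrate the ratio argument above with a tiny, purely hypothetical helper: when the schedule is expressed as a ratio, changing only the optimizer lr rescales the whole curve, and there are no absolute warmup/decay lr values to keep in sync.

```python
import math

def cosine_ratio(step, total_steps, min_ratio=0.0):
    # pure cosine decay expressed as a fraction of the base lr
    progress = min(1.0, step / max(1, total_steps))
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

for base_lr in (0.1, 0.01):                 # change only the optimizer lr...
    peak = base_lr * cosine_ratio(0, 100)
    floor = base_lr * cosine_ratio(100, 100)
    print(f"base_lr={base_lr}: peak={peak:.4f}, floor={floor:.6f}")
    # ...and both the peak and the floor of the schedule follow automatically.
```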
@CoinCheung, thanks for your response. I agree with the differences that you identify between WarmupLR and WarmupCosineLR, but to me these differences are simply in the implementation and logic. At a high level they are similar because they both provide two phases of lr changes: (1) an initial phase of warmup/increase, and (2) a final phase of no change or decay. Looking more closely, we observe significant similarity or duplication in many of the methods.
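For what it's worth, one possible shape for the refactor being discussed is a shared two-phase base class that owns the warmup logic, with subclasses overriding only the post-warmup behavior. The class and method names below are purely illustrative and do not match DeepSpeed's actual lr_schedules API.

```python
import math

class TwoPhaseLR:
    """Warmup followed by a subclass-defined second phase (sketch only)."""

    def __init__(self, optimizer, warmup_num_steps):
        self.optimizer = optimizer
        self.warmup_num_steps = warmup_num_steps
        self.base_lrs = [group["lr"] for group in optimizer.param_groups]
        self.last_step = 0

    def _warmup_ratio(self, step):
        return min(1.0, step / max(1, self.warmup_num_steps))

    def _post_warmup_ratio(self, step):
        return 1.0  # WarmupLR-like behavior: hold the lr after warmup

    def step(self):
        self.last_step += 1
        if self.last_step < self.warmup_num_steps:
            ratio = self._warmup_ratio(self.last_step)
        else:
            ratio = self._post_warmup_ratio(self.last_step)
        for group, base_lr in zip(self.optimizer.param_groups, self.base_lrs):
            group["lr"] = base_lr * ratio


class CosineTwoPhaseLR(TwoPhaseLR):
    """Only the post-warmup phase differs: cosine decay instead of a constant."""

    def __init__(self, optimizer, warmup_num_steps, total_num_steps):
        super().__init__(optimizer, warmup_num_steps)
        self.total_num_steps = total_num_steps

    def _post_warmup_ratio(self, step):
        progress = (step - self.warmup_num_steps) / max(
            1, self.total_num_steps - self.warmup_num_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

This keeps the two-phase stepping logic in one place, which is the duplication being pointed at above.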
@tjruwase Would it be acceptable if we change the args (the init args used to define the scheduler object) of WarmupLR? It has only one subclass, WarmupDecayLR, and I think its usage frequency is not very high.
add cosine annealing scheduler

this scheduler is widely used in image classification tasks, and many LLMs (e.g. LLaMA) also use it.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>