Update CodeLlama configs (and fix a couple Phi3 ones) #1358
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1358
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 32644da with merge base 367e9ab.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
batch_size: 2
gradient_accumulation_steps: 64
64 feels so high
16?
I think Salman also changed this default in some other config.
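For context on why 64 reads as high: with gradient accumulation, the number of samples behind each optimizer step is the product of the per-device batch size and the accumulation steps. A minimal sketch of that arithmetic follows; the helper name `effective_batch_size` and the `num_devices` parameter are illustrative and not part of torchtune.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to a single optimizer step (hypothetical helper)."""
    return batch_size * grad_accum_steps * num_devices


# Values from the diff under review (single device assumed).
current = effective_batch_size(batch_size=2, grad_accum_steps=64)
# Value floated in the review thread.
proposed = effective_batch_size(batch_size=2, grad_accum_steps=16)

print(current, proposed)  # 128 32
```

Dropping to 16 accumulation steps would mean four times as many optimizer steps per epoch for the same data.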
enable_activation_checkpointing: True
dtype: bf16

# Logging
output_dir: /tmp/codellama_qlora_finetune_output
Every config is like this, but it doesn't make sense to me that:
metric_logger.log_dir: /tmp/CodeLlama-7b-hf/logs
profiler.output_dir: /tmp/CodeLlama-7b-hf/profiling_outputs
but output_dir is not /tmp/CodeLlama-7b-hf/output
It's probably a conversation for another PR, but I thought it was worth the comment. Ideally we should have something like:
root_dir: /tmp/CodeLlama-7b-hf/
and every other path would start with f"{root_dir}/path".
Completely agree, but I think this should be handled in another PR
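The `root_dir` idea from this thread could look roughly like the sketch below, where every output location is derived from one root. This is only an illustration of the proposal, not how the configs currently work; `derive_paths` is a hypothetical helper, and the `output` subdirectory name is taken from the comment above.

```python
from pathlib import PurePosixPath


def derive_paths(root_dir: str) -> dict[str, str]:
    """Build all output locations from a single root (hypothetical helper)."""
    root = PurePosixPath(root_dir)
    return {
        "output_dir": str(root / "output"),
        "metric_logger.log_dir": str(root / "logs"),
        "profiler.output_dir": str(root / "profiling_outputs"),
    }


paths = derive_paths("/tmp/CodeLlama-7b-hf/")
for key, value in paths.items():
    print(f"{key}: {value}")
```

Changing the model or run name would then only require editing the root.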
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1358       +/-   ##
===========================================
+ Coverage   27.41%   72.33%   +44.92%
===========================================
  Files         269      269
  Lines       12598    12668      +70
===========================================
+ Hits         3454     9164    +5710
+ Misses       9144     3504    -5640
☔ View full report in Codecov by Sentry.
Context
What is the purpose of this PR? Is it to
Please link to any issues this PR addresses.
Changelog
Test plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.)
pre-commit install
pytest tests
pytest tests -m integration_test
CodeLlama Low Memory
DEBUG:torchtune.utils.logging:Setting manual seed to local seed 3903944230. Local seed is seed + rank = 3903944230 + 0
Writing logs to /tmp/CodeLlama-7b-hf/logs/log_1723844072.txt
INFO:torchtune.utils.logging:Model is initialized with precision torch.bfloat16.
Step 1 | loss:2.085338592529297 lr:2e-05 tokens_per_second_per_gpu:57.828881833956174
Step 2 | loss:1.8718312978744507 lr:2e-05 tokens_per_second_per_gpu:405.044978758513
Step 3 | loss:1.1732845306396484 lr:2e-05 tokens_per_second_per_gpu:655.9372371931469
Step 4 | loss:1.1244843006134033 lr:2e-05 tokens_per_second_per_gpu:417.49818620797487
Step 5 | loss:0.9739330410957336 lr:2e-05 tokens_per_second_per_gpu:414.73347397639026
Step 6 | loss:0.8440025448799133 lr:2e-05 tokens_per_second_per_gpu:345.44944748351764
Step 7 | loss:0.9451338052749634 lr:2e-05 tokens_per_second_per_gpu:408.6546853024294
Step 8 | loss:0.8629082441329956 lr:2e-05 tokens_per_second_per_gpu:402.3016704011016
Step 9 | loss:0.6810362935066223 lr:2e-05 tokens_per_second_per_gpu:698.9489335304891
Step 10 | loss:1.0359036922454834 lr:2e-05 tokens_per_second_per_gpu:962.2259327743811
Step 11 | loss:1.0432602167129517 lr:2e-05 tokens_per_second_per_gpu:477.88804137904646
Step 12 | loss:1.1174635887145996 lr:2e-05 tokens_per_second_per_gpu:542.9797193194343
Step 13 | loss:1.076729655265808 lr:2e-05 tokens_per_second_per_gpu:611.9434521584917
Step 14 | loss:0.8010684847831726 lr:2e-05 tokens_per_second_per_gpu:470.17973290037787
Step 15 | loss:0.658481240272522 lr:2e-05 tokens_per_second_per_gpu:269.35653279483876
Step 16 | loss:1.1415029764175415 lr:2e-05 tokens_per_second_per_gpu:631.8567192343737
Step 17 | loss:0.8821612596511841 lr:2e-05 tokens_per_second_per_gpu:490.6493295938473
Step 18 | loss:0.8164399266242981 lr:2e-05 tokens_per_second_per_gpu:301.15358739415996
Step 19 | loss:0.7248069643974304 lr:2e-05 tokens_per_second_per_gpu:519.1962898183754
Step 20 | loss:0.9358084201812744 lr:2e-05 tokens_per_second_per_gpu:660.4597814234097
Step 21 | loss:0.8048501014709473 lr:2e-05 tokens_per_second_per_gpu:666.6296560832325
Step 22 | loss:1.143164038658142 lr:2e-05 tokens_per_second_per_gpu:769.568167197113
Step 23 | loss:1.1290922164916992 lr:2e-05 tokens_per_second_per_gpu:1219.4012156567753
Step 24 | loss:1.0697393417358398 lr:2e-05 tokens_per_second_per_gpu:545.5558051467158
Step 25 | loss:0.8063280582427979 lr:2e-05 tokens_per_second_per_gpu:359.285878285295
Step 26 | loss:1.062750220298767 lr:2e-05 tokens_per_second_per_gpu:656.1408724187218
Step 27 | loss:0.5210894346237183 lr:2e-05 tokens_per_second_per_gpu:273.5258739551979
Step 28 | loss:0.8218112587928772 lr:2e-05 tokens_per_second_per_gpu:568.6577417563365
Step 29 | loss:1.244479775428772 lr:2e-05 tokens_per_second_per_gpu:1336.4574600978465
Step 30 | loss:1.2447890043258667 lr:2e-05 tokens_per_second_per_gpu:521.095371725802
Step 31 | loss:0.8250967264175415 lr:2e-05 tokens_per_second_per_gpu:392.23145658328417
Step 32 | loss:0.3593619763851166 lr:2e-05 tokens_per_second_per_gpu:276.93849295810804
Step 33 | loss:1.2440699338912964 lr:2e-05 tokens_per_second_per_gpu:652.5498746214473
Step 34 | loss:1.4573827981948853 lr:2e-05 tokens_per_second_per_gpu:761.1334689472617
Step 35 | loss:0.5185084939002991 lr:2e-05 tokens_per_second_per_gpu:662.7216828096517
Step 36 | loss:0.5164892077445984 lr:2e-05 tokens_per_second_per_gpu:403.35727900818614
Step 37 | loss:0.8080346584320068 lr:2e-05 tokens_per_second_per_gpu:269.13169064569587
Step 38 | loss:0.7987920641899109 lr:2e-05 tokens_per_second_per_gpu:377.070483710375
Step 39 | loss:0.9643632769584656 lr:2e-05 tokens_per_second_per_gpu:592.885798581854
Step 40 | loss:0.9670675992965698 lr:2e-05 tokens_per_second_per_gpu:678.1570778088312
Step 41 | loss:0.9631630778312683 lr:2e-05 tokens_per_second_per_gpu:491.6195249397957
Step 42 | loss:1.275192141532898 lr:2e-05 tokens_per_second_per_gpu:773.7280184089923
Step 43 | loss:0.9919266104698181 lr:2e-05 tokens_per_second_per_gpu:518.0839554874535
Step 44 | loss:0.6336700320243835 lr:2e-05 tokens_per_second_per_gpu:415.81514277781224
Step 45 | loss:0.9966105818748474 lr:2e-05 tokens_per_second_per_gpu:436.701503149646
Step 46 | loss:1.2072134017944336 lr:2e-05 tokens_per_second_per_gpu:668.3765244467639
Step 47 | loss:0.5609477162361145 lr:2e-05 tokens_per_second_per_gpu:472.7946838232861
Step 48 | loss:0.6716636419296265 lr:2e-05 tokens_per_second_per_gpu:445.38167123364786
Step 49 | loss:0.41811665892601013 lr:2e-05 tokens_per_second_per_gpu:293.3939222009525
Step 50 | loss:0.9627264142036438 lr:2e-05 tokens_per_second_per_gpu:650.2035280366949
Step 51 | loss:1.1030021905899048 lr:2e-05 tokens_per_second_per_gpu:977.9826905936832
Step 52 | loss:1.1045724153518677 lr:2e-05 tokens_per_second_per_gpu:933.5107038832978
Step 53 | loss:1.2393889427185059 lr:2e-05 tokens_per_second_per_gpu:955.0236048649714
CodeLlama LoRA
DEBUG:torchtune.utils.logging:Setting manual seed to local seed 1267783501. Local seed is seed + rank = 1267783501 + 0
Step 1 | loss:1.8680740594863892 lr:2.9999999999999997e-06 tokens_per_second_per_gpu:2030.7875336165516
Step 2 | loss:1.8970650434494019 lr:5.999999999999999e-06 tokens_per_second_per_gpu:2149.330709413384
Step 3 | loss:1.9390366077423096 lr:8.999999999999999e-06 tokens_per_second_per_gpu:2085.8422521166317
Step 4 | loss:1.9162888526916504 lr:1.1999999999999999e-05 tokens_per_second_per_gpu:2114.5447195274437
Step 5 | loss:1.907753825187683 lr:1.4999999999999999e-05 tokens_per_second_per_gpu:2095.7763255960494
Step 6 | loss:1.8729959726333618 lr:1.7999999999999997e-05 tokens_per_second_per_gpu:2069.613417774976
Step 7 | loss:1.9552496671676636 lr:2.1e-05 tokens_per_second_per_gpu:1939.1585725663876
Step 8 | loss:1.9385077953338623 lr:2.3999999999999997e-05 tokens_per_second_per_gpu:1858.9629032187406
Step 9 | loss:1.8523365259170532 lr:2.6999999999999996e-05 tokens_per_second_per_gpu:2195.57581977171
Step 10 | loss:1.9546726942062378 lr:2.9999999999999997e-05 tokens_per_second_per_gpu:1994.7247123545292
Step 11 | loss:1.8740235567092896 lr:3.2999999999999996e-05 tokens_per_second_per_gpu:2118.238096912975
Step 12 | loss:1.8383359909057617 lr:3.5999999999999994e-05 tokens_per_second_per_gpu:2197.90152113794
Step 13 | loss:2.0345747470855713 lr:3.9e-05 tokens_per_second_per_gpu:1877.6869817436573
Step 14 | loss:1.911726713180542 lr:4.2e-05 tokens_per_second_per_gpu:2007.7783187019204
Step 15 | loss:1.9028035402297974 lr:4.4999999999999996e-05 tokens_per_second_per_gpu:1895.249703103275
Step 16 | loss:1.9710290431976318 lr:4.7999999999999994e-05 tokens_per_second_per_gpu:2041.1004468042258
Step 17 | loss:1.9557075500488281 lr:5.1e-05 tokens_per_second_per_gpu:1874.5153177250622
Step 18 | loss:1.9354438781738281 lr:5.399999999999999e-05 tokens_per_second_per_gpu:2050.1571504333224
Step 19 | loss:1.8736217021942139 lr:5.6999999999999996e-05 tokens_per_second_per_gpu:1978.5571510447528
Step 20 | loss:1.8476310968399048 lr:5.9999999999999995e-05 tokens_per_second_per_gpu:1896.0540308909876
Step 21 | loss:1.8846901655197144 lr:6.299999999999999e-05 tokens_per_second_per_gpu:1918.0654141767343
Step 22 | loss:1.81448495388031 lr:6.599999999999999e-05 tokens_per_second_per_gpu:2010.7343984417776
Step 23 | loss:1.826982021331787 lr:6.9e-05 tokens_per_second_per_gpu:2055.7412644287633
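The learning rates logged above rise by about 3e-06 per step, which is consistent with a linear warmup. The sketch below reproduces that pattern under assumed settings: `peak_lr=3e-4` and `warmup_steps=100` are not stated in this PR and were chosen only because their ratio matches the logged increment; what happens after warmup is not modeled.

```python
import math


def linear_warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Learning rate during a linear warmup; holds at peak_lr afterwards."""
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


# Assumed values, chosen to match the 3e-06 per-step increment in the log.
PEAK_LR = 3e-4
WARMUP_STEPS = 100

for step in (1, 2, 23):
    print(step, linear_warmup_lr(step, PEAK_LR, WARMUP_STEPS))

# Compare against the logged value for step 23 (6.9e-05), allowing for float rounding.
assert math.isclose(linear_warmup_lr(23, PEAK_LR, WARMUP_STEPS), 6.9e-05, rel_tol=1e-9)
```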
CodeLlama QLoRA
DEBUG:torchtune.utils.logging:Setting manual seed to local seed 2886895280. Local seed is seed + rank = 2886895280 + 0
Step 1 | loss:1.8014029264450073 lr:2.9999999999999997e-06 tokens_per_second_per_gpu:721.4060361968283
Step 2 | loss:1.854410171508789 lr:5.999999999999999e-06 tokens_per_second_per_gpu:759.5298611535039
Step 3 | loss:1.9392845630645752 lr:8.999999999999999e-06 tokens_per_second_per_gpu:827.053160904784
Step 4 | loss:1.9257690906524658 lr:1.1999999999999999e-05 tokens_per_second_per_gpu:753.9394688012542
Step 5 | loss:1.8086130619049072 lr:1.4999999999999999e-05 tokens_per_second_per_gpu:828.4825061295975
Step 6 | loss:1.9061534404754639 lr:1.7999999999999997e-05 tokens_per_second_per_gpu:815.0526023027178
Step 7 | loss:1.8968294858932495 lr:2.1e-05 tokens_per_second_per_gpu:810.1216677308762
Step 8 | loss:2.0061087608337402 lr:2.3999999999999997e-05 tokens_per_second_per_gpu:727.242930325887
Step 9 | loss:1.8519814014434814 lr:2.6999999999999996e-05 tokens_per_second_per_gpu:836.9796225958347
Step 10 | loss:1.9740678071975708 lr:2.9999999999999997e-05 tokens_per_second_per_gpu:801.1279431836161
Step 11 | loss:1.9065635204315186 lr:3.2999999999999996e-05 tokens_per_second_per_gpu:663.1048502161601
Step 12 | loss:2.0135998725891113 lr:3.5999999999999994e-05 tokens_per_second_per_gpu:738.9745886313771
Step 13 | loss:1.88465416431427 lr:3.9e-05 tokens_per_second_per_gpu:786.0447719015209
Step 14 | loss:1.9898536205291748 lr:4.2e-05 tokens_per_second_per_gpu:732.6640302098344
Step 15 | loss:1.8982443809509277 lr:4.4999999999999996e-05 tokens_per_second_per_gpu:742.1825220464765
Step 16 | loss:1.7056714296340942 lr:4.7999999999999994e-05 tokens_per_second_per_gpu:871.5547839569865
Step 17 | loss:1.7609063386917114 lr:5.1e-05 tokens_per_second_per_gpu:836.1854038172031
Step 18 | loss:1.7653909921646118 lr:5.399999999999999e-05 tokens_per_second_per_gpu:771.0067845253074
Step 19 | loss:1.6984370946884155 lr:5.6999999999999996e-05 tokens_per_second_per_gpu:778.1078328988058
Step 20 | loss:1.7461347579956055 lr:5.9999999999999995e-05 tokens_per_second_per_gpu:711.8381555888702
Step 21 | loss:1.6953678131103516 lr:6.299999999999999e-05 tokens_per_second_per_gpu:692.6972659193829
Step 22 | loss:1.5796502828598022 lr:6.599999999999999e-05 tokens_per_second_per_gpu:740.7614109385659
Step 23 | loss:1.4881298542022705 lr:6.9e-05 tokens_per_second_per_gpu:843.1600175122009
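To compare the three runs, the pasted step lines can be parsed into numbers. The sketch below assumes the exact `Step N | loss:... lr:... tokens_per_second_per_gpu:...` layout shown in the logs above; `parse_steps` is a hypothetical helper, and it keeps the last occurrence if a step is printed more than once.

```python
import re
from statistics import mean

# Matches one step line in the format pasted in this PR.
STEP_RE = re.compile(
    r"Step (?P<step>\d+) \| loss:(?P<loss>\S+) lr:(?P<lr>\S+) "
    r"tokens_per_second_per_gpu:(?P<tps>\S+)"
)


def parse_steps(log_text: str) -> list[dict]:
    """Return one record per step, sorted by step number (hypothetical helper)."""
    rows = {}
    for m in STEP_RE.finditer(log_text):
        step = int(m["step"])
        rows[step] = {
            "step": step,
            "loss": float(m["loss"]),
            "lr": float(m["lr"]),
            "tps": float(m["tps"]),
        }
    return [rows[s] for s in sorted(rows)]


# First two QLoRA steps copied from the log above.
sample = """\
Step 1 | loss:1.8014029264450073 lr:2.9999999999999997e-06 tokens_per_second_per_gpu:721.4060361968283
Step 2 | loss:1.854410171508789 lr:5.999999999999999e-06 tokens_per_second_per_gpu:759.5298611535039
"""

records = parse_steps(sample)
avg_tps = mean(r["tps"] for r in records)
print(len(records), avg_tps)
```

Running it over each full log gives a mean throughput per run, which is an easier comparison than reading the per-step numbers.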
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Example of docstring:
torchtune/torchtune/modules/vision_transformer.py
Line 285 in 6a7951f
Example in our docs: https://pytorch.org/torchtune/main/tutorials/qat_finetune.html#applying-qat-to-llama3-models