
Update CodeLlama configs (and fix a couple Phi3 ones) #1358

Merged
3 commits merged from update-code-llama-configs into pytorch:main on Aug 17, 2024

Conversation

joecummings (Contributor) commented Aug 16, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

  • Update CodeLlama configs
  • Fix a couple Phi3 inconsistencies

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these, just ask and we will happily help. We also have a contributing page with some guidance on contributing.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)
CodeLlama Low Memory

DEBUG:torchtune.utils.logging:Setting manual seed to local seed 3903944230. Local seed is seed + rank = 3903944230 + 0
Writing logs to /tmp/CodeLlama-7b-hf/logs/log_1723844072.txt
INFO:torchtune.utils.logging:Model is initialized with precision torch.bfloat16.
Step 1 | loss:2.085338592529297 lr:2e-05 tokens_per_second_per_gpu:57.828881833956174
Step 2 | loss:1.8718312978744507 lr:2e-05 tokens_per_second_per_gpu:405.044978758513
Step 3 | loss:1.1732845306396484 lr:2e-05 tokens_per_second_per_gpu:655.9372371931469
Step 4 | loss:1.1244843006134033 lr:2e-05 tokens_per_second_per_gpu:417.49818620797487
Step 5 | loss:0.9739330410957336 lr:2e-05 tokens_per_second_per_gpu:414.73347397639026
Step 6 | loss:0.8440025448799133 lr:2e-05 tokens_per_second_per_gpu:345.44944748351764
Step 7 | loss:0.9451338052749634 lr:2e-05 tokens_per_second_per_gpu:408.6546853024294
Step 8 | loss:0.8629082441329956 lr:2e-05 tokens_per_second_per_gpu:402.3016704011016
Step 9 | loss:0.6810362935066223 lr:2e-05 tokens_per_second_per_gpu:698.9489335304891
Step 10 | loss:1.0359036922454834 lr:2e-05 tokens_per_second_per_gpu:962.2259327743811
Step 11 | loss:1.0432602167129517 lr:2e-05 tokens_per_second_per_gpu:477.88804137904646
Step 12 | loss:1.1174635887145996 lr:2e-05 tokens_per_second_per_gpu:542.9797193194343
Step 13 | loss:1.076729655265808 lr:2e-05 tokens_per_second_per_gpu:611.9434521584917
Step 14 | loss:0.8010684847831726 lr:2e-05 tokens_per_second_per_gpu:470.17973290037787
Step 15 | loss:0.658481240272522 lr:2e-05 tokens_per_second_per_gpu:269.35653279483876
Step 16 | loss:1.1415029764175415 lr:2e-05 tokens_per_second_per_gpu:631.8567192343737
Step 17 | loss:0.8821612596511841 lr:2e-05 tokens_per_second_per_gpu:490.6493295938473
Step 18 | loss:0.8164399266242981 lr:2e-05 tokens_per_second_per_gpu:301.15358739415996
Step 19 | loss:0.7248069643974304 lr:2e-05 tokens_per_second_per_gpu:519.1962898183754
Step 20 | loss:0.9358084201812744 lr:2e-05 tokens_per_second_per_gpu:660.4597814234097
Step 21 | loss:0.8048501014709473 lr:2e-05 tokens_per_second_per_gpu:666.6296560832325
Step 22 | loss:1.143164038658142 lr:2e-05 tokens_per_second_per_gpu:769.568167197113
Step 23 | loss:1.1290922164916992 lr:2e-05 tokens_per_second_per_gpu:1219.4012156567753
Step 24 | loss:1.0697393417358398 lr:2e-05 tokens_per_second_per_gpu:545.5558051467158
Step 25 | loss:0.8063280582427979 lr:2e-05 tokens_per_second_per_gpu:359.285878285295
Step 26 | loss:1.062750220298767 lr:2e-05 tokens_per_second_per_gpu:656.1408724187218
Step 27 | loss:0.5210894346237183 lr:2e-05 tokens_per_second_per_gpu:273.5258739551979
Step 28 | loss:0.8218112587928772 lr:2e-05 tokens_per_second_per_gpu:568.6577417563365
Step 29 | loss:1.244479775428772 lr:2e-05 tokens_per_second_per_gpu:1336.4574600978465
Step 30 | loss:1.2447890043258667 lr:2e-05 tokens_per_second_per_gpu:521.095371725802
Step 31 | loss:0.8250967264175415 lr:2e-05 tokens_per_second_per_gpu:392.23145658328417
Step 32 | loss:0.3593619763851166 lr:2e-05 tokens_per_second_per_gpu:276.93849295810804
Step 33 | loss:1.2440699338912964 lr:2e-05 tokens_per_second_per_gpu:652.5498746214473
Step 34 | loss:1.4573827981948853 lr:2e-05 tokens_per_second_per_gpu:761.1334689472617
Step 35 | loss:0.5185084939002991 lr:2e-05 tokens_per_second_per_gpu:662.7216828096517
Step 36 | loss:0.5164892077445984 lr:2e-05 tokens_per_second_per_gpu:403.35727900818614
Step 37 | loss:0.8080346584320068 lr:2e-05 tokens_per_second_per_gpu:269.13169064569587
Step 38 | loss:0.7987920641899109 lr:2e-05 tokens_per_second_per_gpu:377.070483710375
Step 39 | loss:0.9643632769584656 lr:2e-05 tokens_per_second_per_gpu:592.885798581854
Step 40 | loss:0.9670675992965698 lr:2e-05 tokens_per_second_per_gpu:678.1570778088312
Step 41 | loss:0.9631630778312683 lr:2e-05 tokens_per_second_per_gpu:491.6195249397957
Step 42 | loss:1.275192141532898 lr:2e-05 tokens_per_second_per_gpu:773.7280184089923
Step 43 | loss:0.9919266104698181 lr:2e-05 tokens_per_second_per_gpu:518.0839554874535
Step 44 | loss:0.6336700320243835 lr:2e-05 tokens_per_second_per_gpu:415.81514277781224
Step 45 | loss:0.9966105818748474 lr:2e-05 tokens_per_second_per_gpu:436.701503149646
Step 46 | loss:1.2072134017944336 lr:2e-05 tokens_per_second_per_gpu:668.3765244467639
Step 47 | loss:0.5609477162361145 lr:2e-05 tokens_per_second_per_gpu:472.7946838232861
Step 48 | loss:0.6716636419296265 lr:2e-05 tokens_per_second_per_gpu:445.38167123364786
Step 49 | loss:0.41811665892601013 lr:2e-05 tokens_per_second_per_gpu:293.3939222009525
Step 50 | loss:0.9627264142036438 lr:2e-05 tokens_per_second_per_gpu:650.2035280366949
Step 51 | loss:1.1030021905899048 lr:2e-05 tokens_per_second_per_gpu:977.9826905936832
Step 52 | loss:1.1045724153518677 lr:2e-05 tokens_per_second_per_gpu:933.5107038832978
Step 53 | loss:1.2393889427185059 lr:2e-05 tokens_per_second_per_gpu:955.0236048649714

CodeLlama LoRA

DEBUG:torchtune.utils.logging:Setting manual seed to local seed 1267783501. Local seed is seed + rank = 1267783501 + 0
Step 1 | loss:1.8680740594863892 lr:2.9999999999999997e-06 tokens_per_second_per_gpu:2030.7875336165516
Step 2 | loss:1.8970650434494019 lr:5.999999999999999e-06 tokens_per_second_per_gpu:2149.330709413384
Step 3 | loss:1.9390366077423096 lr:8.999999999999999e-06 tokens_per_second_per_gpu:2085.8422521166317
Step 4 | loss:1.9162888526916504 lr:1.1999999999999999e-05 tokens_per_second_per_gpu:2114.5447195274437
Step 5 | loss:1.907753825187683 lr:1.4999999999999999e-05 tokens_per_second_per_gpu:2095.7763255960494
Step 6 | loss:1.8729959726333618 lr:1.7999999999999997e-05 tokens_per_second_per_gpu:2069.613417774976
Step 7 | loss:1.9552496671676636 lr:2.1e-05 tokens_per_second_per_gpu:1939.1585725663876
Step 8 | loss:1.9385077953338623 lr:2.3999999999999997e-05 tokens_per_second_per_gpu:1858.9629032187406
Step 9 | loss:1.8523365259170532 lr:2.6999999999999996e-05 tokens_per_second_per_gpu:2195.57581977171
Step 10 | loss:1.9546726942062378 lr:2.9999999999999997e-05 tokens_per_second_per_gpu:1994.7247123545292
Step 11 | loss:1.8740235567092896 lr:3.2999999999999996e-05 tokens_per_second_per_gpu:2118.238096912975
Step 12 | loss:1.8383359909057617 lr:3.5999999999999994e-05 tokens_per_second_per_gpu:2197.90152113794
Step 13 | loss:2.0345747470855713 lr:3.9e-05 tokens_per_second_per_gpu:1877.6869817436573
Step 14 | loss:1.911726713180542 lr:4.2e-05 tokens_per_second_per_gpu:2007.7783187019204
Step 15 | loss:1.9028035402297974 lr:4.4999999999999996e-05 tokens_per_second_per_gpu:1895.249703103275
Step 16 | loss:1.9710290431976318 lr:4.7999999999999994e-05 tokens_per_second_per_gpu:2041.1004468042258
Step 17 | loss:1.9557075500488281 lr:5.1e-05 tokens_per_second_per_gpu:1874.5153177250622
Step 18 | loss:1.9354438781738281 lr:5.399999999999999e-05 tokens_per_second_per_gpu:2050.1571504333224
Step 19 | loss:1.8736217021942139 lr:5.6999999999999996e-05 tokens_per_second_per_gpu:1978.5571510447528
Step 20 | loss:1.8476310968399048 lr:5.9999999999999995e-05 tokens_per_second_per_gpu:1896.0540308909876
Step 21 | loss:1.8846901655197144 lr:6.299999999999999e-05 tokens_per_second_per_gpu:1918.0654141767343
Step 22 | loss:1.81448495388031 lr:6.599999999999999e-05 tokens_per_second_per_gpu:2010.7343984417776
Step 23 | loss:1.826982021331787 lr:6.9e-05 tokens_per_second_per_gpu:2055.7412644287633

CodeLlama QLoRA

DEBUG:torchtune.utils.logging:Setting manual seed to local seed 2886895280. Local seed is seed + rank = 2886895280 + 0
Step 1 | loss:1.8014029264450073 lr:2.9999999999999997e-06 tokens_per_second_per_gpu:721.4060361968283
Step 2 | loss:1.854410171508789 lr:5.999999999999999e-06 tokens_per_second_per_gpu:759.5298611535039
Step 3 | loss:1.9392845630645752 lr:8.999999999999999e-06 tokens_per_second_per_gpu:827.053160904784
Step 4 | loss:1.9257690906524658 lr:1.1999999999999999e-05 tokens_per_second_per_gpu:753.9394688012542
Step 5 | loss:1.8086130619049072 lr:1.4999999999999999e-05 tokens_per_second_per_gpu:828.4825061295975
Step 6 | loss:1.9061534404754639 lr:1.7999999999999997e-05 tokens_per_second_per_gpu:815.0526023027178
Step 7 | loss:1.8968294858932495 lr:2.1e-05 tokens_per_second_per_gpu:810.1216677308762
Step 8 | loss:2.0061087608337402 lr:2.3999999999999997e-05 tokens_per_second_per_gpu:727.242930325887
Step 9 | loss:1.8519814014434814 lr:2.6999999999999996e-05 tokens_per_second_per_gpu:836.9796225958347
Step 10 | loss:1.9740678071975708 lr:2.9999999999999997e-05 tokens_per_second_per_gpu:801.1279431836161
Step 11 | loss:1.9065635204315186 lr:3.2999999999999996e-05 tokens_per_second_per_gpu:663.1048502161601
Step 12 | loss:2.0135998725891113 lr:3.5999999999999994e-05 tokens_per_second_per_gpu:738.9745886313771
Step 13 | loss:1.88465416431427 lr:3.9e-05 tokens_per_second_per_gpu:786.0447719015209
Step 14 | loss:1.9898536205291748 lr:4.2e-05 tokens_per_second_per_gpu:732.6640302098344
Step 15 | loss:1.8982443809509277 lr:4.4999999999999996e-05 tokens_per_second_per_gpu:742.1825220464765
Step 16 | loss:1.7056714296340942 lr:4.7999999999999994e-05 tokens_per_second_per_gpu:871.5547839569865
Step 17 | loss:1.7609063386917114 lr:5.1e-05 tokens_per_second_per_gpu:836.1854038172031
Step 18 | loss:1.7653909921646118 lr:5.399999999999999e-05 tokens_per_second_per_gpu:771.0067845253074
Step 19 | loss:1.6984370946884155 lr:5.6999999999999996e-05 tokens_per_second_per_gpu:778.1078328988058
Step 20 | loss:1.7461347579956055 lr:5.9999999999999995e-05 tokens_per_second_per_gpu:711.8381555888702
Step 21 | loss:1.6953678131103516 lr:6.299999999999999e-05 tokens_per_second_per_gpu:692.6972659193829
Step 22 | loss:1.5796502828598022 lr:6.599999999999999e-05 tokens_per_second_per_gpu:740.7614109385659
Step 23 | loss:1.4881298542022705 lr:6.9e-05 tokens_per_second_per_gpu:843.1600175122009

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Example of docstring:


Example in our docs: https://pytorch.org/torchtune/main/tutorials/qat_finetune.html#applying-qat-to-llama3-models

  • I did not change any public API;
  • I have added an example to docs or docstrings;

pytorch-bot bot commented Aug 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1358

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 32644da with merge base 367e9ab:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Aug 16, 2024
joecummings linked an issue on Aug 16, 2024 that may be closed by this pull request

Review thread on the following config lines:
batch_size: 2
gradient_accumulation_steps: 64
Contributor:

64 feels so high

Contributor Author (joecummings):

16?

Contributor (@felipemello1) commented Aug 17, 2024:

I think Salman also changed this default in some other config too.
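For context on this thread: gradients are accumulated across gradient_accumulation_steps micro-batches before each optimizer step, so the effective batch size is batch_size * gradient_accumulation_steps, without the activation memory of a larger batch. A minimal sketch of the two settings under discussion (the arithmetic in the comments is illustrative, not a measured default):

# As reviewed: effective batch size = 2 * 64 = 128
batch_size: 2
gradient_accumulation_steps: 64

# As proposed in the reply: effective batch size = 2 * 16 = 32
batch_size: 2
gradient_accumulation_steps: 16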

enable_activation_checkpointing: True
dtype: bf16

# Logging
output_dir: /tmp/codellama_qlora_finetune_output
Contributor (on the output_dir line above):

Every config is like this, but it doesn't make sense to me that:

metric_logger.log_dir: /tmp/CodeLlama-7b-hf/logs
profiler.output_dir: /tmp/CodeLlama-7b-hf/profiling_outputs

while the output_dir is not /tmp/CodeLlama-7b-hf/output.

It's probably a conversation for another PR, but I thought it was worth the comment. Ideally we should have something like:

root_dir: /tmp/CodeLlama-7b-hf/

and every other path would start with f"{root_dir}/path".

Contributor Author (joecummings):

Completely agree, but I think this should be handled in another PR.
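A sketch of what the reviewer's proposal might look like, assuming OmegaConf-style ${...} interpolation (root_dir is the hypothetical key suggested above, not an existing field in these configs):

root_dir: /tmp/CodeLlama-7b-hf

output_dir: ${root_dir}/output
metric_logger:
  log_dir: ${root_dir}/logs
profiler:
  output_dir: ${root_dir}/profiling_outputs

Relocating a run would then mean editing one line rather than three parallel paths.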

joecummings merged commit 8bb3a6f into pytorch:main on Aug 17, 2024
17 checks passed
joecummings deleted the update-code-llama-configs branch on August 17, 2024 at 00:17
codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.33%. Comparing base (67f6a06) to head (32644da).
Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1358       +/-   ##
===========================================
+ Coverage   27.41%   72.33%   +44.92%     
===========================================
  Files         269      269               
  Lines       12598    12668       +70     
===========================================
+ Hits         3454     9164     +5710     
+ Misses       9144     3504     -5640     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Labels
CLA Signed

Projects
None yet

Development
Successfully merging this pull request may close these issues:
Rework code-llama configs to follow standard format

4 participants