
Adds max_seq_len to recipes and updates unit tests #1363

Merged: 5 commits merged into pytorch:main on Aug 28, 2024

Conversation

thomasjpfan (Contributor)

Context

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (Expose new feature in recipe yaml)

Adds task 1 of #1311.

Changelog

This PR adds max_seq_len to the recipe configs and adds unit tests for loading the config.
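For illustration, here is roughly what the added field looks like once loaded; this is a minimal sketch assuming OmegaConf-style YAML loading (which torchtune uses), not code from the PR itself.

```python
# Minimal sketch, assuming OmegaConf-style config loading; the YAML
# mirrors the recipe snippet shown later in this thread.
from omegaconf import OmegaConf

cfg = OmegaConf.create("""
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/Llama-2-7b-hf/tokenizer.model
  max_seq_len: null
""")

# YAML `null` loads as Python None, i.e. "no truncation limit"; a recipe
# can instead pin a value such as max_seq_len: 4096.
print(cfg.tokenizer.max_seq_len)  # None
```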

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)


pytorch-bot commented Aug 18, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1363


✅ No Failures

As of commit 4eeb1fd with merge base 8bb3a6f:
💚 Looks good so far! There are no failures yet. 💚


facebook-github-bot added the CLA Signed label on Aug 18, 2024.
felipemello1 self-requested a review on August 18, 2024.
RdoubleA (Contributor) left a comment:

Sorry for the delay, and I appreciate you doing this; it certainly improves the usability of all our configs. A few things:

```diff
@@ -21,6 +21,7 @@ checkpointer:
 tokenizer:
   _component_: torchtune.models.llama2.llama2_tokenizer
   path: /tmp/Llama-2-7b-hf/tokenizer.model
+  max_seq_len: null
```
RdoubleA (Contributor) commented:
max_seq_length is already defined below and is used instead of the one with the tokenizer. To be consistent, let's keep the max_seq_len here in the tokenizer, set it to 4096, remove the other one, and update the eleuther_eval recipe to use the tokenizer's max_seq_len. You simply would need to update the line here:

max_seq_length=self._cfg.max_seq_length,

to use the tokenizer's max_seq_len: self._cfg.tokenizer.max_seq_len
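Concretely, the suggested edit would look something like the following sketch; the surrounding recipe code isn't shown in this thread, so this is illustrative only.

```python
# Hedged sketch of the suggested change in the eleuther_eval recipe;
# the recipe internals are assumed, not copied from the PR.
from omegaconf import OmegaConf

cfg = OmegaConf.create("""
tokenizer:
  max_seq_len: 4096
""")

# Before: the recipe read a separate top-level field:
#     max_seq_length=self._cfg.max_seq_length,
# After: it reads the tokenizer's own field, so the limit lives in one place:
max_seq_length = cfg.tokenizer.max_seq_len
assert max_seq_length == 4096
```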

SalmanMohammadi (Collaborator) commented Aug 23, 2024:

> For `mistral/7B_full_ppo_low_memory.yaml` in particular, we'd probably want to change this line: https://github.com/pytorch/torchtune/blob/f9f75bb563ecae371492a9d49da4a9f514c081b3/recipes/configs/mistral/7B_full_ppo_low_memory.yaml#L36 to be consistent with the classifier, where max_seq_len=32768 (cc @SalmanMohammadi)

Text completion datasets don't defer to the tokenizer for max-seq-len truncation, right? If that's the case, it's fine to leave this as null - unless I'm missing something about why it should be consistent with the classifier?
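For context, the distinction being raised is where the truncation limit lives; the sketch below assumes the dataset builder takes its own max_seq_len (the dataset parameters here are an assumption, not confirmed by this thread).

```python
# Sketch of the two places a sequence-length limit can live in a recipe
# config; component path from the config under discussion, dataset
# parameter names assumed.
from omegaconf import OmegaConf

cfg = OmegaConf.create("""
tokenizer:
  max_seq_len: null          # tokenizer-level limit left unset
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  max_seq_len: 32768         # dataset-level truncation, matching the classifier
""")

# If the dataset truncates with its own setting rather than deferring to
# the tokenizer, leaving tokenizer.max_seq_len as null is fine.
assert cfg.tokenizer.max_seq_len is None
assert cfg.dataset.max_seq_len == 32768
```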

RdoubleA (Contributor) commented:

Hi @thomasjpfan, let us know if you still plan to carry this through or if we should wrap it up. Appreciate your work so far!

thomasjpfan (Contributor, Author) commented Aug 28, 2024:

> It seems like the mistral and phi3 configs were missed entirely? Any reason for that?

I had a quick script to automate adding max_seq_len and missed mistral & phi3 😅

Following #1363 (comment), I kept the tokenizer's max_seq_len in mistral/7B_full_ppo_low_memory.yaml set to null.

RdoubleA (Contributor) left a comment:

Looks great! Just one comment and it's ready to go

```python
ASSETS = Path(__file__).parent.parent.parent / "assets"


class TestInstantiateTokenizer:
```
RdoubleA (Contributor) commented:

Appreciate you adding this test, but I actually don't think it's necessary. It's simply testing that an object can be instantiated and has the correct parameters, which is already covered by the other tests in this directory.

thomasjpfan (Contributor, Author) replied:

I added the test mostly to check the null case, which we couldn't do with RMSNorm.

In 4eeb1fd (#1363), I simplified the test to only check the null case.
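For reference, the simplified null-case test might look roughly like this; the asset filename and the tokenizer's max_seq_len attribute are assumptions, not the PR's actual code.

```python
# Hedged sketch of a null-case instantiation test; asset name and
# assertion details are assumptions.
from pathlib import Path

from omegaconf import OmegaConf
from torchtune import config

ASSETS = Path(__file__).parent.parent.parent / "assets"


class TestInstantiateTokenizer:
    def test_max_seq_len_null(self):
        cfg = OmegaConf.create(
            {
                "_component_": "torchtune.models.llama2.llama2_tokenizer",
                "path": str(ASSETS / "m.model"),  # hypothetical test asset
                "max_seq_len": None,              # the null case under test
            }
        )
        tokenizer = config.instantiate(cfg)
        assert tokenizer.max_seq_len is None
```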

codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.06%. Comparing base (8bb3a6f) to head (f5da0ef).
Report is 23 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1363      +/-   ##
==========================================
- Coverage   72.33%   70.06%   -2.28%     
==========================================
  Files         269      268       -1     
  Lines       12668    12936     +268     
==========================================
- Hits         9164     9064     -100     
- Misses       3504     3872     +368     


RdoubleA merged commit ec21546 into pytorch:main on Aug 28, 2024.
20 checks passed