Adds max_seq_len to recipes and updates unit tests #1363
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1363
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 4eeb1fd with merge base 8bb3a6f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Sorry for the delay, and appreciate you doing this; it certainly improves the usability of all our configs. A few things:
- It seems like the mistral and phi3 configs were missed entirely? Any reason for that?
- For mistral/7B_full_ppo_low_memory.yaml in particular, we'd probably want to change this line: `max_seq_len: null`
```diff
@@ -21,6 +21,7 @@ checkpointer:
 tokenizer:
   _component_: torchtune.models.llama2.llama2_tokenizer
   path: /tmp/Llama-2-7b-hf/tokenizer.model
+  max_seq_len: null
```
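For reference, a minimal sketch of how this YAML block maps onto the tokenizer builder; the path is the placeholder from the diff, and the call assumes torchtune's `llama2_tokenizer` accepts these arguments:

```python
from torchtune.models.llama2 import llama2_tokenizer

# max_seq_len=None (YAML `null`) means no truncation; an integer value
# truncates tokenized sequences to that many tokens.
tokenizer = llama2_tokenizer(
    path="/tmp/Llama-2-7b-hf/tokenizer.model",  # placeholder path from the config
    max_seq_len=None,
)
```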
`max_seq_length` is already defined below and is used instead of the one on the tokenizer. To be consistent, let's keep the max_seq_len here in the tokenizer, set it to 4096, remove the other one, and update the eleuther_eval recipe to use the tokenizer's max_seq_len. You would simply need to update this line (torchtune/recipes/eleuther_eval.py, line 246 in f9f75bb): `max_seq_length=self._cfg.max_seq_length,` to use the tokenizer's max_seq_len: `self._cfg.tokenizer.max_seq_len`
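As a sketch, the suggested one-line change would look like this (file and line reference from the comment above; surrounding context not shown):

```diff
-            max_seq_length=self._cfg.max_seq_length,
+            max_seq_length=self._cfg.tokenizer.max_seq_len,
```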
Text completion datasets don't defer to the tokenizer for max seq length truncation, right? If that's the case it's fine to leave this as null, unless I'm missing something about why it should be consistent with the classifier?
Hi @thomasjpfan, let us know if you still plan to carry this through or if we should wrap it up. Appreciate your work so far!
I had a quick script to automate adding `max_seq_len` to the configs. Following #1363 (comment), I kept the `max_seq_len` on the tokenizer.
Looks great! Just one comment and it's ready to go.
```python
ASSETS = Path(__file__).parent.parent.parent / "assets"


class TestInstantiateTokenizer:
```
Appreciate you adding this test, but I actually don't think it's necessary. It simply tests that an object can be instantiated and has the correct parameters, which is already covered by the other tests in this directory.
I added the test mostly to check the null case, which we couldn't do with RMSNorm.
In 4eeb1fd (#1363), I simplified the test to only check the null case.
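For illustration, a hedged sketch of what such a null-case test could look like. The config keys mirror the YAML diff above; `config.instantiate` and the asserted `max_seq_len` attribute are assumptions about torchtune's API rather than the actual test from the PR, and running it would require a real tokenizer.model at the given path:

```python
from omegaconf import OmegaConf

from torchtune import config


def test_tokenizer_max_seq_len_null():
    # Tokenizer config with max_seq_len: null, as in the YAML diff above.
    cfg = OmegaConf.create(
        {
            "_component_": "torchtune.models.llama2.llama2_tokenizer",
            "path": "/tmp/Llama-2-7b-hf/tokenizer.model",  # placeholder path
            "max_seq_len": None,
        }
    )
    tokenizer = config.instantiate(cfg)
    # YAML null should arrive as Python None, i.e. no truncation.
    assert tokenizer.max_seq_len is None
```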
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1363      +/-   ##
==========================================
- Coverage   72.33%   70.06%    -2.28%
==========================================
  Files         269      268        -1
  Lines       12668    12936      +268
==========================================
- Hits         9164     9064      -100
- Misses       3504     3872      +368
```

☔ View full report in Codecov by Sentry.
Context
Addresses task 1 of #1311.

Changelog
This PR adds `max_seq_len` to the recipes and adds unit tests for loading the config.

Test plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.)
- run pre-commit hooks and linters (make sure you've first installed the hooks via `pre-commit install`)
- run unit tests via `pytest tests`
- run recipe tests via `pytest tests -m integration_test`
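Collected from the checklist above, the local verification commands would be roughly the following (`pre-commit run --all-files` is an assumed invocation for the hooks; the rest are taken verbatim from the checklist):

```bash
pre-commit install                 # install the git hooks once
pre-commit run --all-files         # assumed: run hooks/linters over the repo
pytest tests                       # unit tests
pytest tests -m integration_test   # recipe/integration tests
```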