
Adds max_seq_len to recipes and updates unit tests #1363

Merged: 5 commits merged into pytorch:main on Aug 28, 2024

Conversation

thomasjpfan (Contributor)

Context

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (Expose new feature in recipe yaml)

Adds task 1 of #1311.

Changelog

This PR adds max_seq_len to the recipe configs and adds unit tests for loading the config.
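For illustration, here is roughly what the added field looks like once loaded; this is a minimal sketch assuming OmegaConf-style YAML loading (which torchtune uses), not code from the PR itself.

```python
# Minimal sketch, assuming OmegaConf-style config loading; the YAML
# mirrors the recipe snippet shown later in this thread.
from omegaconf import OmegaConf

cfg = OmegaConf.create("""
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/Llama-2-7b-hf/tokenizer.model
  max_seq_len: null
""")

# YAML `null` loads as Python None, i.e. "no truncation limit"; a recipe
# can instead pin a value such as max_seq_len: 4096.
print(cfg.tokenizer.max_seq_len)  # None
```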

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)


pytorch-bot commented Aug 18, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1363


✅ No Failures

As of commit 4eeb1fd with merge base 8bb3a6f:
💚 Looks good so far! There are no failures yet. 💚


facebook-github-bot added the CLA Signed label on Aug 18, 2024.
felipemello1 self-requested a review on August 18, 2024.
RdoubleA (Contributor) left a comment:

Sorry for the delay, and I appreciate you doing this; it certainly improves the usability of all our configs. A few things:

```diff
@@ -21,6 +21,7 @@ checkpointer:
 tokenizer:
   _component_: torchtune.models.llama2.llama2_tokenizer
   path: /tmp/Llama-2-7b-hf/tokenizer.model
+  max_seq_len: null
```
RdoubleA (Contributor) commented:
max_seq_length is already defined below and is used instead of the one with the tokenizer. To be consistent, let's keep the max_seq_len here in the tokenizer, set it to 4096, remove the other one, and update the eleuther_eval recipe to use the tokenizer's max_seq_len. You simply would need to update the line here:

max_seq_length=self._cfg.max_seq_length,

to use the tokenizer's max_seq_len: self._cfg.tokenizer.max_seq_len
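Concretely, the suggested edit would look something like the following sketch; the surrounding recipe code isn't shown in this thread, so this is illustrative only.

```python
# Hedged sketch of the suggested change in the eleuther_eval recipe;
# the recipe internals are assumed, not copied from the PR.
from omegaconf import OmegaConf

cfg = OmegaConf.create("""
tokenizer:
  max_seq_len: 4096
""")

# Before: the recipe read a separate top-level field:
#     max_seq_length=self._cfg.max_seq_length,
# After: it reads the tokenizer's own field, so the limit lives in one place:
max_seq_length = cfg.tokenizer.max_seq_len
assert max_seq_length == 4096
```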

SalmanMohammadi (Collaborator) commented Aug 23, 2024:

> For `mistral/7B_full_ppo_low_memory.yaml` in particular, we'd probably want to change this line: https://github.com/pytorch/torchtune/blob/f9f75bb563ecae371492a9d49da4a9f514c081b3/recipes/configs/mistral/7B_full_ppo_low_memory.yaml#L36 to be consistent with the classifier, where max_seq_len=32768 (cc @SalmanMohammadi)

Text completion datasets don't defer to the tokenizer for max-seq-len truncation, right? If that's the case, it's fine to leave this as null - unless I'm missing something about why it should be consistent with the classifier?
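For context, the distinction being raised is where the truncation limit lives; the sketch below assumes the dataset builder takes its own max_seq_len (the dataset parameters here are an assumption, not confirmed by this thread).

```python
# Sketch of the two places a sequence-length limit can live in a recipe
# config; component path from the config under discussion, dataset
# parameter names assumed.
from omegaconf import OmegaConf

cfg = OmegaConf.create("""
tokenizer:
  max_seq_len: null          # tokenizer-level limit left unset
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  max_seq_len: 32768         # dataset-level truncation, matching the classifier
""")

# If the dataset truncates with its own setting rather than deferring to
# the tokenizer, leaving tokenizer.max_seq_len as null is fine.
assert cfg.tokenizer.max_seq_len is None
assert cfg.dataset.max_seq_len == 32768
```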

RdoubleA (Contributor) commented:

Hi @thomasjpfan, let us know if you still plan to carry this through or if we should wrap it up. Appreciate your work so far!

thomasjpfan (Contributor, Author) commented Aug 28, 2024:

> It seems like the mistral and phi3 configs were missed entirely? Any reason for that?

I had a quick script to automate adding max_seq_len and missed mistral & phi3 😅

Following #1363 (comment), I kept the tokenizer's max_seq_len in mistral/7B_full_ppo_low_memory.yaml set to null.

RdoubleA (Contributor) left a comment:

Looks great! Just one comment and it's ready to go

```python
ASSETS = Path(__file__).parent.parent.parent / "assets"


class TestInstantiateTokenizer:
```
RdoubleA (Contributor) commented:

Appreciate you adding this test, but I actually don't think it's necessary. It's simply testing that an object can be instantiated and has the correct parameters, which is already covered by the other tests in this directory.

thomasjpfan (Contributor, Author) replied:

I added the test mostly to check the null case, which we couldn't do with RMSNorm.

In 4eeb1fd (#1363), I simplified the test to only check the null case.
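For reference, the simplified null-case test might look roughly like this; the asset filename and the tokenizer's max_seq_len attribute are assumptions, not the PR's actual code.

```python
# Hedged sketch of a null-case instantiation test; asset name and
# assertion details are assumptions.
from pathlib import Path

from omegaconf import OmegaConf
from torchtune import config

ASSETS = Path(__file__).parent.parent.parent / "assets"


class TestInstantiateTokenizer:
    def test_max_seq_len_null(self):
        cfg = OmegaConf.create(
            {
                "_component_": "torchtune.models.llama2.llama2_tokenizer",
                "path": str(ASSETS / "m.model"),  # hypothetical test asset
                "max_seq_len": None,              # the null case under test
            }
        )
        tokenizer = config.instantiate(cfg)
        assert tokenizer.max_seq_len is None
```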

codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.06%. Comparing base (8bb3a6f) to head (f5da0ef).
Report is 23 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1363      +/-   ##
==========================================
- Coverage   72.33%   70.06%   -2.28%     
==========================================
  Files         269      268       -1     
  Lines       12668    12936     +268     
==========================================
- Hits         9164     9064     -100     
- Misses       3504     3872     +368     


RdoubleA merged commit ec21546 into pytorch:main on Aug 28, 2024.
20 checks passed