-
Notifications
You must be signed in to change notification settings - Fork 482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Interleaved image support in tokenizers #1138
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1138
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ No FailuresAs of commit fca9031 with merge base f158577 (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
recipes/generate.py
Outdated
@@ -14,7 +14,7 @@ | |||
|
|||
from torchtune import config, utils | |||
from torchtune.config._utils import _get_component_from_path | |||
from torchtune.data import ChatFormat, InstructTemplate, Message | |||
from torchtune.data import apply_chat_format, ChatFormat, InstructTemplate, Message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copycat copycat copycat
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1138 +/- ##
==========================================
+ Coverage 65.98% 66.14% +0.15%
==========================================
Files 194 194
Lines 9023 9042 +19
==========================================
+ Hits 5954 5981 +27
+ Misses 3069 3061 -8 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the design based on our discussions. Have a handful of comments but no huge concerns. Can you also fix the doc build job failure and update the PR summary to reflect the design we settled on here?
Adding support for image special tokens in tokenizers and Message dataclass across the library. Starting with Llama3Tokenizer as an example, as multimodal Flamingo will be using this.
This PR will enabled support for multiple images per sample, interleaved throughout the text. This is done by overhauling how content is represented in the
Message
class:For backwards compatibility, passing a string into the content field will automatically get casted into the list of dictionaries. Using dictionaries is visually clunky, but it enables the most flexibility for any non-text content that needs to be tokenized differently and future metadata fields that may affect tokenization.
We must also use a different paradigm for chat formats. Now, treat it as extra tags that are prepended/appended to the Message content list based on the role:
Changelog
Message
content field to be multimodal - can have text or image content in a list of dictionaries. Added properties for easily checking if there's image content or getting the text only contentMessage
content contains type==image, then add image special token in Llama3TokenizerTest Plan
Compared to a reference implementation token by token. Made this into new unit tests for text only messages, text + image, text + interleaved image, tool calls
Other considered approaches
There are other approaches I considered for enabling interleaved images in a message. The important workflow to consider is that users will have to use our utilities or create their own transforms to get their datasets into our Message format with interleaved image content. We should make this workflow as easy as possible.
1
Use an enum,
Media.IMAGE
, and makeMessage.content
aList[Union[str, Media]]
that combines texts and places images in between stringsPros: slightly less tedious to convert user datasets into this structure and also process this (versus dictionaries)
Cons: not easy to support image metadata or other fields that may affect image tokenization in the future
2
Keep everything as a string and allow special tokens in the text itself. Tokenizer splits on the indicated image token and adds the correct special token id and encodes the strings normally
Pros: More prompt template / chat format friendly, as we can still format the string as is. In other approaches, since the text is broken up, prompt templating is considerably more difficult. Easier to process string only content, no impact on text-only workflows
Cons: we are breaking our rule of no special tokens in the text