
Multimodal Granite Support #35579

Open · wants to merge 4 commits into main

Conversation

alex-jw-brooks

What does this PR do?

This PR adds compatibility for IBM's upcoming multimodal Granite models (which are based on LLaVA-NeXT). The main changes here are:

  • The vision feature layer, which is currently expected to be an integer, can now also be a list of integers; if a list of integers is provided, the image features from those layers are concatenated before the feature selection strategy is applied (see the sketch after this list)
  • The validation which breaks visual encoders with no CLS token (discussed a bit here) is removed. I also added a warning for when the feature packing fails with the default strategy.
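
A minimal sketch of the multi-layer behavior described above (select_image_features is a hypothetical helper for illustration, not the PR's actual function name):

import torch

def select_image_features(hidden_states, vision_feature_layer, strategy="default"):
    # hidden_states: tuple of per-layer tensors, each (batch, seq_len, hidden)
    if isinstance(vision_feature_layer, int):
        # Previous behavior: a single layer index.
        selected = hidden_states[vision_feature_layer]
    else:
        # New behavior: a list of layer indices, concatenated along the
        # hidden (feature) dimension.
        hs_pool = [hidden_states[idx] for idx in vision_feature_layer]
        selected = torch.cat(hs_pool, dim=-1)
    if strategy == "default":
        # Drop the CLS token at sequence position 0.
        selected = selected[:, 1:]
    return selected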

This change was applied in several places to keep the repository consistency checks happy and the configs consistent, but the multimodal Granite models themselves are instances of LlavaNextForConditionalGeneration. I added a test for each changed model to ensure that nothing breaks when a list of vision feature layers is provided. If there is a preferred path forward that avoids changing several models at once to add LLaVA-NeXT compatibility, I'm happy to revise this PR as needed.
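
As a usage sketch (the layer indices below are illustrative, not from an official checkpoint), the config-level change looks like:

from transformers import LlavaNextConfig

# Previously vision_feature_layer had to be a single int.
config_single = LlavaNextConfig(vision_feature_layer=-2)

# With this PR it may also be a list of layer indices, whose hidden
# states are concatenated along the feature dimension.
config_multi = LlavaNextConfig(vision_feature_layer=[-24, -20, -12, -1])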

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts, @qubvel, @zucchini-nlp

Signed-off-by: Alex-Brooks <[email protected]>

Support multiple image feature layers

Signed-off-by: Alex-Brooks <[email protected]>
@zucchini-nlp (Member) left a comment

Great, thanks for propagating the changes to all llava models. I think we can also modify vipllava for the sake of consistency.

Also, I left one question about the case where a list of vision feature layers is used with the default vision_feature_select_strategy. WDYT about cropping the CLS token per layer there, since that seems to be the most intuitive behavior? Or is Multimodal Granite not cropping CLS tokens?

Comment on lines +313 to +314
hs_pool = [image_outputs.hidden_states[layer_idx] for layer_idx in vision_feature_layer]
selected_image_feature = torch.cat(hs_pool, dim=-1)

hmm, I think in this case, when one has several layer indices and the "default" feature selection strategy, one wants to crop the CLS token of each layer. I realize this is not a feature used in any official checkpoints, but if we want to standardize vision_feature_layer to be a Union[int, List[int]], I think that is the expected behavior

WDYT?
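
A minimal sketch of that expected behavior (variable names mirror the snippet quoted above; this shows the reviewer's suggestion, not the PR's current code, and assumes torch is imported in the modeling file):

# Pool hidden states from each requested vision layer.
hs_pool = [image_outputs.hidden_states[layer_idx] for layer_idx in vision_feature_layer]

# With the "default" strategy, drop the CLS token (sequence position 0)
# from each layer before concatenating along the feature dimension.
if vision_feature_select_strategy == "default":
    hs_pool = [hs[:, 1:] for hs in hs_pool]

selected_image_feature = torch.cat(hs_pool, dim=-1)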

Comment on lines +686 to 696
try:
    image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
except RuntimeError as e:
    if vision_feature_select_strategy == "default":
        logger.warning_once(
            "Image feature shape does not line up with the provided patch size. "
            "You may be using the `default` vision_feature_select_strategy with a"
            " visual encoder that does not have CLS."
        )
    raise e
image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()

Personally, I don't think try/except is something we want in modeling code. Can we frame it as `if condition: raise Error`, to catch unwanted behavior?
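
One way to frame that suggested refactor (a hedged sketch, not the final code: `.view()` with an inferred -1 dim fails exactly when the element count is not divisible by the product of the fixed dims, so we can check that up front):

# Validate the shape explicitly instead of catching RuntimeError from .view().
num_positions = num_patch_height * num_patch_width * height * width
if image_feature.numel() % num_positions != 0:
    raise ValueError(
        "Image feature shape does not line up with the provided patch size. "
        "You may be using the `default` vision_feature_select_strategy with a"
        " visual encoder that does not have CLS."
    )
image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()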
