Multimodal Granite Support #35579
base: main
Conversation
Support multiple image feature layers
Signed-off-by: Alex-Brooks <[email protected]>
85173c7 to a7c495f
Great, thanks for propagating the changes to all llava models. I think we can also modify vipllava for the sake of consistency.
Also, left one question for cases when we have a list of vision feature layers together with the `default` `vision_feature_select_strategy`. WDYT, since that seems to be the most intuitive behavior? Or is Multimodal Granite not cropping CLS tokens?
hs_pool = [image_outputs.hidden_states[layer_idx] for layer_idx in vision_feature_layer]
selected_image_feature = torch.cat(hs_pool, dim=-1)
Hmm, I think in this case, when one has several layer indices and the `default` feature selection strategy, one wants to crop the CLS token of each layer. I realize this is not a feature used in any of the official checkpoints, but if we want to standardize `vision_feature_layer` to be a `Union[int, List[int]]`, I think that is the expected behavior. WDYT?
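For illustration, a minimal sketch of per-layer CLS cropping before concatenation; the function name, arguments, and tensor shapes are made up for this example and are not the PR's actual implementation:

```python
import torch

# A minimal sketch, assuming `hidden_states` is a tuple of per-layer tensors shaped
# (batch, seq_len, hidden), as returned with output_hidden_states=True.
def pool_image_features(hidden_states, vision_feature_layer, vision_feature_select_strategy="default"):
    if isinstance(vision_feature_layer, int):
        vision_feature_layer = [vision_feature_layer]

    pooled = []
    for layer_idx in vision_feature_layer:
        feature = hidden_states[layer_idx]
        if vision_feature_select_strategy == "default":
            # Crop the CLS token of *each* selected layer before concatenating,
            # rather than slicing only the concatenated result.
            feature = feature[:, 1:]
        pooled.append(feature)
    return torch.cat(pooled, dim=-1)


# Toy usage with made-up shapes: 4 layers, batch=1, 5 tokens (1 CLS + 4 patches), hidden=8.
hidden_states = tuple(torch.randn(1, 5, 8) for _ in range(4))
out = pool_image_features(hidden_states, vision_feature_layer=[-2, -1])
print(out.shape)  # torch.Size([1, 4, 16])
```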
try:
    image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
except RuntimeError as e:
    if vision_feature_select_strategy == "default":
        logger.warning_once(
            "Image feature shape does not line up with the provided patch size. "
            "You may be using the `default` vision_feature_select_strategy with a"
            " visual encoder that does not have CLS."
        )
    raise e
image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
Personally, I don't think try/except is something we want in modeling code. Can we frame it as `if condition: raise Error`, to catch unwanted behavior?
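For example, a rough sketch of that reframing; the divisibility check is my assumption about what makes the `.view()` call raise a RuntimeError, and the stand-in values are made up:

```python
import torch

# Toy stand-ins for the variables in the surrounding modeling code (made-up values).
num_patch_height, num_patch_width, height, width = 2, 2, 24, 24
image_feature = torch.randn(4, 24 * 24, 1024)

# `.view()` only fails when the element count does not factor into the requested
# dimensions, so an explicit check can replace the try/except.
expected_elems = num_patch_height * num_patch_width * height * width
if image_feature.numel() % expected_elems != 0:
    raise ValueError(
        "Image feature shape does not line up with the provided patch size. "
        "You may be using the `default` vision_feature_select_strategy with a "
        "visual encoder that does not have CLS."
    )
image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
```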
What does this PR do?
This PR adds compatibility for IBM's upcoming multimodal granite models (which are based on LLava Next). The main changes here are allowing `vision_feature_layer` to be a list of layer indices (whose hidden states are concatenated along the feature dimension) and making that work with the `default` strategy as well. This change was applied in a lot of places to keep the repository consistency checks happy and the configs consistent, but the multimodal granite models themselves are instances of LlavaNextForConditionalGeneration. I added a test for each changed model to ensure that things don't blow up if a list of vision feature layers is provided, but if there is another path forward that is preferred to changing several models at the same time to add compatibility with llava next, I'm happy to revise this PR as needed.
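For reference, a minimal usage sketch; the layer indices are illustrative, and it assumes this PR's `Union[int, List[int]]` typing for `vision_feature_layer`:

```python
from transformers import LlavaNextConfig

# With this change, `vision_feature_layer` can be a single index or a list of indices
# whose hidden states get concatenated along the feature dimension.
config = LlavaNextConfig(
    vision_feature_layer=[-24, -20, -12, -1],  # made-up values for illustration
    vision_feature_select_strategy="default",
)
print(config.vision_feature_layer)
```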
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@amyeroberts, @qubvel, @zucchini-nlp