
Fix Llava Next Check for Visual Encoders Without CLS #35262

Closed
wants to merge 2 commits

Conversation


@alex-jw-brooks commented Dec 13, 2024

What does this PR do?

Hello! I am working on a few PRs to enable some upcoming multimodal (granite) models from IBM, which are based on Llava Next.

Currently, the visual encoder in Llava Next is abstracted to allow for alternate visual encoders besides CLIP, but there is some validation here that causes issues if the chosen visual encoder does not include a CLS token in its features, e.g., Siglip, because the number of features is off by one regardless of the selection strategy: with `full`, it's off by one because there is no CLS, and with `default`, it's off by one because one of the non-CLS features gets sliced away when applying the selection strategy.

This PR updates the feature selection strategy logic in Llava Next such that if there is no CLS token, the selection strategy is always treated as `full`, since `default` doesn't make sense in this case (there is no CLS to remove).
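Roughly, the proposed behavior looks like this (a minimal sketch; the function name and the `vision_has_cls` flag are illustrative, not the actual modeling code):

```python
import torch

def select_image_features(image_features: torch.Tensor, strategy: str, vision_has_cls: bool) -> torch.Tensor:
    # With a CLS token present, "default" drops it and "full" keeps everything.
    if strategy == "default" and vision_has_cls:
        return image_features[:, 1:]
    # With no CLS token (e.g., Siglip), slicing under "default" would discard
    # a real patch feature, so the strategy is always treated as "full".
    return image_features
```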

I've added some unit tests that run the validation here on mocked outputs from the vision tower, covering the combinations of feature selection strategies and whether or not there should be a CLS token, and have also run our upcoming Siglip + Llava Next model to validate that this works as expected.
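Schematically, the tests look something like this (a sketch against the illustrative helper above, not the actual test code):

```python
import pytest
import torch

@pytest.mark.parametrize("strategy", ["default", "full"])
@pytest.mark.parametrize("has_cls", [True, False])
def test_feature_selection_shapes(strategy, has_cls):
    num_patches = 576  # e.g., a 24x24 patch grid
    seq_len = num_patches + (1 if has_cls else 0)
    mocked = torch.randn(1, seq_len, 1024)  # mocked vision tower output
    selected = select_image_features(mocked, strategy, vision_has_cls=has_cls)
    # Without CLS, both strategies should yield exactly the patch features;
    # with CLS, only "full" keeps the extra token.
    expected = num_patches + (1 if (has_cls and strategy == "full") else 0)
    assert selected.shape[1] == expected
```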

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts, @qubvel

@alex-jw-brooks changed the title from "Fix Llava Cext Check for Visual Encoders Without CLS" to "Fix Llava Next Check for Visual Encoders Without CLS" on Dec 13, 2024
@qubvel (Member) commented Dec 13, 2024

I suppose @zucchini-nlp was working on something relevant

@alex-jw-brooks (Author) commented

Sounds good, thanks! I went ahead and made the corresponding changes for Llava Next Video and the video selection strategy so that the behavior in this PR is consistent.

@zucchini-nlp (Member) left a comment

Thanks for submitting the PR! Indeed, I noticed the same thing when adding a new llava-next-video checkpoint based on SigLIP.

On the one hand the changes make sense, because the llava-next packing strategy expects images of a certain shape, but this also adds quite a bit of complexity to the code just to check the shape of the image features. That said, what if we simply do not check the `base_image_feature` shape at all, since we can be sure that vision backbones will return the required shapes? Anyone who wants to change the backbone or its config would have to make sure themselves that packing works for their use case.
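For reference, the kind of check in question looks roughly like this (a paraphrase; the function and variable names here are assumed, not the exact modeling code):

```python
def check_patch_consistency(base_image_feature, config):
    # Rough paraphrase of the shape check being discussed; the actual check
    # lives in the Llava Next modeling code.
    height = width = config.vision_config.image_size // config.vision_config.patch_size
    if height * width != base_image_feature.shape[0]:
        raise ValueError("The number of patches is not consistent with the image size.")
```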

That is the decision I made in my PR adding the new model.

Also excited to see IBM multimodal models, looking forward to the PR!

@alex-jw-brooks (Author) commented

Cool, thanks @zucchini-nlp, that all makes sense and I agree. Do you have any thoughts on how the default `vision_feature_select_strategy` should be handled for visual encoders without CLS? This PR changes the feature selection to always act as `full` if there appears to be no CLS token, to prevent non-CLS features from being removed by mistake.

I'm happy to make any changes, update the PR to just remove the validation for image features to decouple it from the model PR you are working on, or close this one if you'd prefer. Thank you for the guidance and awesome work!

@zucchini-nlp (Member) commented

@alex-jw-brooks

> Do you have any thoughts on how the default `vision_feature_select_strategy` should be handled for visual encoders without CLS?

Yes, I agree that in models like SigLIP it makes no sense to remove one token, but I believe we should leave it as the user's responsibility. What we can do is keep the explicit docstring from this PR, to make it clear how vision feature selection works.
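As a hypothetical sketch, the docstring note could read something like:

```python
vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
    The feature selection strategy used to select the vision feature from the vision
    backbone. Can be `"default"` or `"full"`. `"default"` drops the first (CLS) token;
    for backbones without a CLS token (e.g., Siglip), use `"full"`, otherwise a real
    patch feature would be discarded.
```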

Are you planning to integrate IBM Granite into transformers natively? If yes, I think it's better to make the necessary changes while adding the model (if still required). Otherwise, if you have code hosted on the Hub that calls the LlavaNext class, feel free to simply remove the check and refine the docstring in this PR :)

@alex-jw-brooks (Author) commented

Hi @zucchini-nlp, apologies for the delayed response. Yes, I am planning to integrate IBM multimodal granite models into Transformers natively! I had only split this apart because I thought it might be easier to review as a more atomic change 🙂

The change for multimodal granite models is a pretty simple one - for now, I will close this PR and make the docstring update + remove the validation in the upcoming one, which I plan to open in the next day or so! I will tag you on it as well, thanks for the quick feedback!
