
Fix Llava Next Check for Visual Encoders Without CLS #35262

Closed
wants to merge 2 commits

Conversation


@alex-jw-brooks commented Dec 13, 2024

What does this PR do?

Hello! I am working on a few PRs to enable some upcoming multimodal (granite) models from IBM, which are based on Llava Next.

Currently, the visual encoder in Llava Next is abstracted to allow for alternate visual encoders besides CLIP, but there is some validation here that causes issues if the chosen visual encoder does not include a CLS token in its features, e.g., Siglip, because the number of features is off by one regardless of the selection strategy: with `full`, it's off by one because there is no CLS, and with `default`, it's off by one because one of the non-CLS features gets sliced away when applying the selection strategy.

This PR updates the feature selection strategy logic in Llava Next such that if there is no CLS token, the selection strategy is always treated as `full`, since `default` doesn't make sense in this case (there is no CLS to remove).
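Roughly, the proposed behavior looks like this (a minimal sketch; the function name and the `vision_has_cls` flag are illustrative, not the actual modeling code):

```python
import torch

def select_image_features(image_features: torch.Tensor, strategy: str, vision_has_cls: bool) -> torch.Tensor:
    # With a CLS token present, "default" drops it and "full" keeps everything.
    if strategy == "default" and vision_has_cls:
        return image_features[:, 1:]
    # With no CLS token (e.g., Siglip), slicing under "default" would discard
    # a real patch feature, so the strategy is always treated as "full".
    return image_features
```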

I've added some unit tests that run the validation here on mocked outputs from the vision tower, covering the combinations of feature selection strategies and whether or not there should be a CLS token, and have also run our upcoming Siglip + Llava Next model to validate that this works as expected.
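Schematically, the tests look something like this (a sketch against the illustrative helper above, not the actual test code):

```python
import pytest
import torch

@pytest.mark.parametrize("strategy", ["default", "full"])
@pytest.mark.parametrize("has_cls", [True, False])
def test_feature_selection_shapes(strategy, has_cls):
    num_patches = 576  # e.g., a 24x24 patch grid
    seq_len = num_patches + (1 if has_cls else 0)
    mocked = torch.randn(1, seq_len, 1024)  # mocked vision tower output
    selected = select_image_features(mocked, strategy, vision_has_cls=has_cls)
    # Without CLS, both strategies should yield exactly the patch features;
    # with CLS, only "full" keeps the extra token.
    expected = num_patches + (1 if (has_cls and strategy == "full") else 0)
    assert selected.shape[1] == expected
```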

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts, @qubvel

@alex-jw-brooks changed the title from "Fix Llava Cext Check for Visual Encoders Without CLS" to "Fix Llava Next Check for Visual Encoders Without CLS" on Dec 13, 2024
@qubvel (Member) commented Dec 13, 2024

I suppose @zucchini-nlp was working on something relevant

@alex-jw-brooks (Author) commented

Sounds good, thanks! I went ahead and made the corresponding changes for Llava Next Video and the video selection strategy so that the behavior in this PR is consistent.

@zucchini-nlp (Member) left a comment

Thanks for submitting the PR! Indeed, I noticed the same thing when adding a new llava-next-video checkpoint based on SigLIP.

On the one hand the changes make sense, because the llava-next packing strategy expects images of a certain shape, but this also adds quite a bit of complexity to the code just to check the shape of the image features. That said, what if we simply do not check the `base_image_feature` shape at all, since we can be sure that vision backbones will return the required shapes? Anyone who wants to change the backbone or its config would have to make sure themselves that packing works for their use case.
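For reference, the kind of check in question looks roughly like this (a paraphrase; the function and variable names here are assumed, not the exact modeling code):

```python
def check_patch_consistency(base_image_feature, config):
    # Rough paraphrase of the shape check being discussed; the actual check
    # lives in the Llava Next modeling code.
    height = width = config.vision_config.image_size // config.vision_config.patch_size
    if height * width != base_image_feature.shape[0]:
        raise ValueError("The number of patches is not consistent with the image size.")
```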

That is the decision I made in my PR adding the new model.

Also excited to see IBM multimodal models, looking forward to the PR!

@alex-jw-brooks (Author) commented

Cool, thanks @zucchini-nlp, that all makes sense and I agree. Do you have any thoughts on how the default `vision_feature_select_strategy` should be handled for visual encoders without CLS? This PR changes the feature selection to always act as `full` if there appears to be no CLS token, to prevent non-CLS features from being removed by mistake.

I'm happy to make any changes, update the PR to just remove the validation for image features to decouple it from the model PR you are working on, or close this one if you'd prefer. Thank you for the guidance and awesome work!

@zucchini-nlp (Member) commented

@alex-jw-brooks

> Do you have any thoughts on how the default `vision_feature_select_strategy` should be handled for visual encoders without CLS?

Yes, I agree that in models like SigLIP it makes no sense to remove one token, but I believe we should leave it as the user's responsibility. What we can do is keep the explicit docstring from this PR, to make it clear how vision feature selection works.
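As a hypothetical sketch, the docstring note could read something like:

```python
vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
    The feature selection strategy used to select the vision feature from the vision
    backbone. Can be `"default"` or `"full"`. `"default"` drops the first (CLS) token;
    for backbones without a CLS token (e.g., Siglip), use `"full"`, otherwise a real
    patch feature would be discarded.
```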

Are you planning to integrate IBM Granite into transformers natively? If yes, I think it's better to make the necessary changes while adding the model (if still required). Otherwise, if you have code hosted on the Hub that calls the LlavaNext class, feel free to simply remove the check and refine the docstring in this PR :)

@alex-jw-brooks (Author) commented

Hi @zucchini-nlp, apologies for the delayed response. Yes, I am planning to integrate IBM multimodal granite models into Transformers natively! I had only split this apart because I thought it might be easier to review as a more atomic change 🙂

The change for multimodal granite models is a pretty simple one - for now, I will close this PR and make the docstring update + remove the validation in the upcoming one, which I plan to open in the next day or so! I will tag you on it as well, thanks for the quick feedback!
