Fix Llava Next Check for Visual Encoders Without CLS #35262
Conversation
I suppose @zucchini-nlp was working on something relevant.
Sounds good, thanks! I went ahead and made the corresponding changes for llava next video and the video feature selection strategy so that the behavior in this PR is consistent.
Thanks for submitting the PR! Indeed, I noticed the same thing when adding a new llava-next-video checkpoint based on SigLIP.

On one hand the changes make sense, because the llava-next packing strategy expects images of a certain shape, but they also add a lot of complexity to the code just for the sake of checking the image features' shape. That said, what if we simply do not check the `base_image_feature` shape, since we can be sure that vision backbones will return the required shapes? Anyone who wants to change the backbone or its config would have to make sure themselves that packing works for their use case.

That is the decision I made for my PR adding the new model.

Also excited to see IBM multimodal models, looking forward to a PR!
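For context, the check under discussion looks roughly like the following — a minimal sketch with assumed names, approximating (not reproducing) the upstream LLaVA-Next packing code:

```python
# Minimal sketch (assumed names, not the exact upstream code) of the
# kind of patch-count validation the review suggests dropping.
def validate_base_image_feature(num_image_features: int,
                                image_size: int,
                                patch_size: int) -> None:
    side = image_size // patch_size
    if side * side != num_image_features:
        raise ValueError(
            "The number of patches is not consistent with the image size."
        )

# CLIP ViT-L/14 at 336px yields a 24x24 grid; with "default" selection
# (CLS dropped) the check passes, but a CLS-free backbone such as
# SigLIP ends up off by one and fails under either strategy.
validate_base_image_feature(576, image_size=336, patch_size=14)  # passes
```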
Cool, thanks @zucchini-nlp, that all makes sense and I agree. Do you have any thoughts on how you'd like to proceed? I'm happy to make any changes: update the PR to just remove the validation for image features to decouple it from the model PR you are working on, or close this one if you'd prefer. Thank you for the guidance and awesome work!
Yes, I agree that in models like SigLIP it makes no sense to remove one token, but I believe we should leave it to the user's responsibility. What we can do is keep the explicit docstring from this PR, to make it clear how vision feature selection works. Are you planning to integrate IBM Granite into transformers natively? If yes, I think it's better to make the necessary changes while adding the model (if still required). Otherwise, if you have code hosted on the Hub which calls …
Hi @zucchini-nlp, apologies for the delayed response. Yes, I am planning to integrate IBM multimodal Granite models into Transformers natively! I had only split this apart because I thought it might be easier to review as a more atomic change 🙂 The change for multimodal Granite models is a pretty simple one. For now, I will close this PR and make the docstring update + remove the validation in the upcoming one, which I plan to open in the next day or so! I will tag you on it as well; thanks for the quick feedback!
What does this PR do?
Hello! I am working on a few PRs to enable some upcoming multimodal (granite) models from IBM, which are based on Llava Next.
Currently, the visual encoder in Llava Next is abstracted to allow for alternate visual encoders besides CLIP, but there is some validation that causes issues if the chosen visual encoder does not have a CLS token in its features, e.g., SigLIP, because the number of features is off by one regardless of selection strategy: if the strategy is `full`, it's off by one because there is no CLS, and if it's `default`, it's off by one because we slice off one of the non-CLS features when applying the selection strategy.
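To make the off-by-one concrete, here is a small arithmetic sketch (the 24x24 grid assumes CLIP ViT-L/14 at 336px; the CLS-free case stands in for encoders like SigLIP):

```python
# Illustration of the off-by-one described above; an explanatory
# sketch, not library code.
patches = 24 * 24              # 576 patch features the packer expects

clip_tokens = patches + 1      # 577: patch features plus one CLS token
assert clip_tokens - 1 == patches   # "default" drops the CLS -> correct
assert clip_tokens != patches       # "full" keeps the CLS -> one too many

siglip_like_tokens = patches   # 576: no CLS token emitted
assert siglip_like_tokens == patches          # "full" -> already correct
assert siglip_like_tokens - 1 == patches - 1  # "default" -> one too few
```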
This PR updates the feature selection strategy logic in llava next such that if there is no CLS token, the selection strategy is always treated as `full`, since `default` doesn't make sense in this case (there is no CLS to remove).

I've added some unit tests that run the validation on mocked outputs from the vision tower, covering the combinations of feature selection strategies and whether or not there should be a CLS token, and have also run our upcoming SigLIP + llava next model to validate that this works as expected.
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@amyeroberts, @qubvel