Multimodal Granite Support #35579
base: main
Conversation
Support multiple image feature layers
Signed-off-by: Alex-Brooks <[email protected]>
85173c7 to a7c495f
Great, thanks for propagating the changes to all llava models. I think we can also modify vipllava for the sake of consistency.
Also, left one question for cases when we have a list of vision feature layers together with the `default` `vision_feature_select_strategy`. WDYT, since that seems to be the most intuitive behavior? Or is Multimodal Granite not cropping CLS tokens?
hs_pool = [image_outputs.hidden_states[layer_idx] for layer_idx in vision_feature_layer]
selected_image_feature = torch.cat(hs_pool, dim=-1)
Hmm, I think in this case, when one has several layer indices and the `default` feature selection strategy, one wants to crop the CLS token of each layer. I realize this is not a feature used in any of the official checkpoints, but if we want to standardize `vision_feature_layer` to be a `Union[int, List[int]]`, I think that is the expected behavior. WDYT?
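For illustration, a minimal sketch of per-layer CLS cropping before concatenation; the function name, arguments, and tensor shapes are made up for this example and are not the PR's actual implementation:

```python
import torch

# A minimal sketch, assuming `hidden_states` is a tuple of per-layer tensors shaped
# (batch, seq_len, hidden), as returned with output_hidden_states=True.
def pool_image_features(hidden_states, vision_feature_layer, vision_feature_select_strategy="default"):
    if isinstance(vision_feature_layer, int):
        vision_feature_layer = [vision_feature_layer]

    pooled = []
    for layer_idx in vision_feature_layer:
        feature = hidden_states[layer_idx]
        if vision_feature_select_strategy == "default":
            # Crop the CLS token of *each* selected layer before concatenating,
            # rather than slicing only the concatenated result.
            feature = feature[:, 1:]
        pooled.append(feature)
    return torch.cat(pooled, dim=-1)


# Toy usage with made-up shapes: 4 layers, batch=1, 5 tokens (1 CLS + 4 patches), hidden=8.
hidden_states = tuple(torch.randn(1, 5, 8) for _ in range(4))
out = pool_image_features(hidden_states, vision_feature_layer=[-2, -1])
print(out.shape)  # torch.Size([1, 4, 16])
```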
try:
    image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
except RuntimeError as e:
    if vision_feature_select_strategy == "default":
        logger.warning_once(
            "Image feature shape does not line up with the provided patch size. "
            "You may be using the `default` vision_feature_select_strategy with a"
            " visual encoder that does not have CLS."
        )
    raise e
image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
Personally, I don't think try/except is something we want in modeling code. Can we frame it as `if condition: raise Error`, to catch unwanted behavior?
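For example, a rough sketch of that reframing; the divisibility check is my assumption about what makes the `.view()` call raise a RuntimeError, and the stand-in values are made up:

```python
import torch

# Toy stand-ins for the variables in the surrounding modeling code (made-up values).
num_patch_height, num_patch_width, height, width = 2, 2, 24, 24
image_feature = torch.randn(4, 24 * 24, 1024)

# `.view()` only fails when the element count does not factor into the requested
# dimensions, so an explicit check can replace the try/except.
expected_elems = num_patch_height * num_patch_width * height * width
if image_feature.numel() % expected_elems != 0:
    raise ValueError(
        "Image feature shape does not line up with the provided patch size. "
        "You may be using the `default` vision_feature_select_strategy with a "
        "visual encoder that does not have CLS."
    )
image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
```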
What does this PR do?
This PR adds compatibility for IBM's upcoming multimodal granite models (which are based on LLava Next). The main changes here are allowing `vision_feature_layer` to be a list of layer indices (whose hidden states are concatenated along the feature dimension) and making that work with the `default` strategy as well. This change was applied in a lot of places to keep the repository consistency checks happy and the configs consistent, but the multimodal granite models themselves are instances of LlavaNextForConditionalGeneration. I added a test for each changed model to ensure that things don't blow up if a list of vision feature layers is provided, but if there is another path forward that is preferred to changing several models at the same time to add compatibility with llava next, I'm happy to revise this PR as needed.
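For reference, a minimal usage sketch; the layer indices are illustrative, and it assumes this PR's `Union[int, List[int]]` typing for `vision_feature_layer`:

```python
from transformers import LlavaNextConfig

# With this change, `vision_feature_layer` can be a single index or a list of indices
# whose hidden states get concatenated along the feature dimension.
config = LlavaNextConfig(
    vision_feature_layer=[-24, -20, -12, -1],  # made-up values for illustration
    vision_feature_select_strategy="default",
)
print(config.vision_feature_layer)
```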
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@amyeroberts, @qubvel, @zucchini-nlp