Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for nested images to LLava and VipLLava #35558

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

yonigozlan
Copy link
Member

@yonigozlan yonigozlan commented Jan 7, 2025

What does this PR do?

This PR adds the functions make_flat_list_of_images , make_nested_list_of_images and make_batched_videos to image_utils, removing some unnecessarily duplicated code.
make_flat_list_of_images also replaces make_list_of_images in clip, blip, and siglip image processors, as it allows image-text-to-text models which use these image processors to support nested images inputs, while preserving BC.

Partially addresses #34545

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1
Copy link
Member

Flagging this PR too - I made some changes to the Llava/Pixtral processing for nested images here, so there might be some conflicts! #34801

Copy link
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super cool, thanks for cleaning this up. Looks much better now

Comment on lines 263 to 264
if is_valid_image(images):
output_images = [[images]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could it be that we get a 4D tensor as a batch of images?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right! Made some changes that should account for that

@yonigozlan yonigozlan force-pushed the uniformize-image-text-to-text-inputs-processing branch from 0319100 to 6f595da Compare January 9, 2025 17:10
@@ -209,6 +213,107 @@ def make_list_of_images(images, expected_ndims: int = 3) -> List[ImageInput]:
)


def make_flat_list_of_images(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also return a 4d array/tensor, as that's how it was originally implemented in processors that use this function, so the name might be a bit misleading?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if a 4D array is really necessary in processors where it is called? AFAIK we always iterate over each image, which mean in the end we'll anyway process one 3D image

In that case, we can only return an actual list of images

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants