add qwen2.5vl #35569

Open · wants to merge 3 commits into main

Conversation

ShuaiBai623 (Contributor)

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@minostauros (Contributor) left a comment:

Brought changes from #35466
Thanks for the new model!


def get_rope_index(
    self,
    input_ids: torch.LongTensor,

Suggested change:
-    input_ids: torch.LongTensor,
+    input_ids: Optional[torch.LongTensor] = None,
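
For context, a minimal standalone sketch (a hypothetical helper with assumed names, not the PR's code) of the fallback this default enables when a caller provides only inputs_embeds:

from typing import Optional

import torch

def pick_device(
    input_ids: Optional[torch.LongTensor] = None,
    attention_mask: Optional[torch.Tensor] = None,
) -> torch.device:
    # Hypothetical helper, not part of the PR: with input_ids optional, an
    # embeds-only forward pass can still resolve a tensor device from the mask.
    if input_ids is not None:
        return input_ids.device
    if attention_mask is not None:
        return attention_mask.device
    return torch.device("cpu")

print(pick_device(attention_mask=torch.ones(1, 4)))  # falls back to the mask's device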

if attention_mask is not None:
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    position_ids = position_ids.unsqueeze(0).expand(3, -1, -1).to(input_ids.device)

Suggested change:
-    position_ids = position_ids.unsqueeze(0).expand(3, -1, -1).to(input_ids.device)
+    position_ids = position_ids.unsqueeze(0).expand(3, -1, -1).to(attention_mask.device)

attention_mask = attention_mask.to(inputs_embeds.device)

# if we get 4D attention mask we cannot calculate rope deltas anymore. TODO @raushan fixme
if position_ids is None and input_ids is not None and (attention_mask is None or attention_mask.ndim == 2):

Suggested change:
-    if position_ids is None and input_ids is not None and (attention_mask is None or attention_mask.ndim == 2):
+    if position_ids is None and (attention_mask is None or attention_mask.ndim == 2):
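
Taken together, these suggestions let the position-id path run when only inputs_embeds is supplied. A minimal standalone sketch (assumed example shapes, not the PR's code) of the mask-only computation shown above:

import torch

# With input_ids absent, position ids are still derivable from a 2D mask, and
# taking the device from attention_mask avoids dereferencing a None input_ids
# while giving the same result whenever input_ids is present.
attention_mask = torch.tensor([[1, 1, 1, 0, 0]], dtype=torch.long)
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
position_ids = position_ids.unsqueeze(0).expand(3, -1, -1).to(attention_mask.device)
print(position_ids.shape)  # torch.Size([3, 1, 5])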

@molbap (Contributor) commented Jan 9, 2025

Hey @ShuaiBai623, thanks for the addition! 🎉 Before reviewing, the main thing here is that since Qwen2.5VL is very similar to Qwen2VL, it's best to use modular transformers: build a shorter modular_qwen_2_5_vl.py file and call python utils/modular_model_converter.py --files_to_parse src/transformers/models/qwen_2_5_vl/modular_qwen_2_5_vl.py, which will automatically generate the modeling, configuration, and processing files from the inheriting modules. See #34157, or https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next_video/modular_llava_next_video.py, for recent examples!
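
A hypothetical skeleton of what such a modular file could look like (the Qwen2_5_VL class names and overrides here are assumptions, not the PR's actual code); running the converter command quoted above then regenerates the full modeling, configuration, and processing files from it:

# modular_qwen_2_5_vl.py -- illustrative sketch only
from transformers.models.qwen2_vl.modeling_qwen2_vl import (
    Qwen2VLForConditionalGeneration,
    Qwen2VLModel,
)

class Qwen2_5_VLModel(Qwen2VLModel):
    # Inherit everything; override only the blocks that differ from Qwen2VL.
    pass

class Qwen2_5_VLForConditionalGeneration(Qwen2VLForConditionalGeneration):
    def get_rope_index(self, input_ids=None, attention_mask=None, **kwargs):
        # Qwen2.5VL-specific M-RoPE indexing would replace the parent's here.
        ...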
