Add new models (Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR) & refactor processors. #1001

Merged
merged 38 commits into from
Nov 26, 2024

Conversation

xenova
Collaborator

@xenova xenova commented Oct 30, 2024

New models

Janus (any-to-any)

This PR adds support for deepseek-ai/Janus-1.3B, a novel autoregressive framework that unifies multimodal understanding and generation.

In particular, it can do the following:

  • text+image to text:

    // Example code based on https://github.com/deepseek-ai/Janus/blob/main/inference.py
    import { AutoProcessor, MultiModalityCausalLM } from '@huggingface/transformers';
    
    // Load processor and model
    const model_id = 'onnx-community/Janus-1.3B-ONNX';
    const processor = await AutoProcessor.from_pretrained(model_id);
    const model = await MultiModalityCausalLM.from_pretrained(model_id, {
        dtype: {
            prepare_inputs_embeds: 'q4',
            language_model: 'q4f16',
            lm_head: 'fp16',
            gen_head: 'fp16',
            gen_img_embeds: 'fp16',
            image_decode: 'fp32',
        },
    });
    
    // Prepare inputs
    const conversation = [
        {
            role: "User",
            content: "<image_placeholder>\nConvert the formula into latex code.",
            images: ["https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/quadratic_formula.png"],
        },
    ];
    const inputs = await processor(conversation);
    
    // Generate response
    const outputs = await model.generate({
        ...inputs,
        max_new_tokens: 150,
        do_sample: false,
    });
    
    // Decode output (slice off the prompt tokens so that only the newly generated tokens are decoded)
    const new_tokens = outputs.slice(null, [inputs.input_ids.dims.at(-1), null]);
    const decoded = processor.tokenizer.batch_decode(new_tokens, { skip_special_tokens: true });
    console.log(decoded);

    Example output:

    Sure, here is the LaTeX code for the given formula:
    
    ```
    x = \frac{-b \pm \sqrt{b^2 - 4a c}}{2a}
    ```
    
    This code represents the mathematical expression for the variable \( x \).
    
  • text-to-image:

    // Example code based on https://github.com/deepseek-ai/Janus/blob/main/generation_inference.py
    import { AutoProcessor, MultiModalityCausalLM } from '@huggingface/transformers';
    
    // Load processor and model
    const model_id = 'onnx-community/Janus-1.3B-ONNX';
    const processor = await AutoProcessor.from_pretrained(model_id);
    const model = await MultiModalityCausalLM.from_pretrained(model_id, {
        dtype: {
            prepare_inputs_embeds: 'fp32',
            language_model: 'q4',
            lm_head: 'fp32',
            gen_head: 'fp32',
            gen_img_embeds: 'fp32',
            image_decode: 'fp32',
        },
    });
    
    // Prepare inputs
    const conversation = [
        {
            role: "User",
            content: "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
        },
    ];
    const inputs = await processor(conversation, {
        chat_template: "text_to_image",
    });
    
    // Generate response
    const num_image_tokens = processor.num_image_tokens;
    const outputs = await model.generate_images({
        ...inputs,
        min_new_tokens: num_image_tokens,
        max_new_tokens: num_image_tokens,
        do_sample: true,
    });
    
    // Save the generated image
    await outputs[0].save('test.png');

    Example outputs:

    [Images: eight generated samples, fox_1 to fox_8]
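
The decode step in the text+image-to-text example drops the prompt tokens before decoding, so only the newly generated tokens are printed. As a rough sketch of that slicing on plain nested arrays (this does not use the library's tensor class; `stripPrompt` and the token ids below are made up for illustration):

```javascript
// Hypothetical helper: keep every batch row, but drop the first
// `promptLength` tokens of each row (the prompt), leaving only new tokens.
function stripPrompt(sequences, promptLength) {
  return sequences.map((row) => row.slice(promptLength));
}

// Made-up token ids: a batch of two sequences, each starting with a 3-token prompt.
const generated = [
  [101, 7, 8, 42, 43],
  [101, 9, 10, 55, 56],
];
const newTokens = stripPrompt(generated, 3);
console.log(newTokens);
```

In the example itself, the prompt length comes from the last dimension of `inputs.input_ids`.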

This PR also refactors the way that processor classes load image/text preprocessors, aligning better with the Python transformers library.
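
As a loose illustration of the pattern being aligned with: in the Python library, a processor is a wrapper that bundles components such as a tokenizer and an image processor. The toy sketch below mimics that composition. Every class in it is hypothetical, and it is not the code from this PR.

```javascript
// Hypothetical stand-ins for real components; not the library's classes.
class ToyTokenizer {
  encode(text) {
    // Whitespace split, purely for illustration.
    return text.trim().split(/\s+/);
  }
}

class ToyImageProcessor {
  preprocess(image) {
    // Pretend preprocessing: just report the image dimensions.
    return { width: image.width, height: image.height };
  }
}

// A processor that declares its sub-components as static class fields
// and instantiates them in one place (assumed design, for illustration only).
class ToyProcessor {
  static tokenizer_class = ToyTokenizer;
  static image_processor_class = ToyImageProcessor;

  constructor() {
    this.tokenizer = new this.constructor.tokenizer_class();
    this.image_processor = new this.constructor.image_processor_class();
  }

  process(text, image) {
    return {
      tokens: this.tokenizer.encode(text),
      pixel_info: this.image_processor.preprocess(image),
    };
  }
}

const toyProcessor = new ToyProcessor();
console.log(toyProcessor.process('describe this image', { width: 384, height: 384 }));
```

Declaring the component classes on the processor keeps loading logic in one place, so a subclass only has to swap the static fields.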

@xenova xenova marked this pull request as draft October 30, 2024 16:19
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kungfooman
Contributor

This is an amazing PR, and I thought I'd give it a test in Chrome on Linux. I randomly see one of these two errors:

[screenshot of the first error]

Or simply:

[screenshot of the second error]

Is there anything special I need to do to test this in the browser? So far I couldn't get it to work.

By default it seems to pick WebGPU, but WebGPU is known to work very poorly on Linux, so I tried every other possibility:

const model = await MultiModalityCausalLM.from_pretrained(model_id, {
    dtype: {
        prepare_inputs_embeds: 'fp32',
        language_model: 'q4',
        lm_head: 'fp32',
        gen_head: 'fp32',
        gen_img_embeds: 'fp32',
        image_decode: 'fp32',
    },
    // Pick one: webnn-npu, webnn-gpu, webnn-cpu, webnn, webgpu, wasm
    device: 'wasm',
});

None worked 🙈

Thank you anyway, looking forward to getting this to work somehow!
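
For anyone repeating this experiment, trying each backend by hand can be automated. Below is a minimal sketch, assuming a caller-supplied `loadFn(device)` callback (hypothetical, for example a wrapper around `MultiModalityCausalLM.from_pretrained`). It only tries the devices in order and reports the failures; it does not fix the errors above.

```javascript
// Try each device in order and return the first one that loads.
// `loadFn` is a hypothetical callback supplied by the caller, e.g.
//   (device) => MultiModalityCausalLM.from_pretrained(model_id, { dtype, device })
async function loadWithFallback(devices, loadFn) {
  const errors = [];
  for (const device of devices) {
    try {
      const model = await loadFn(device);
      return { device, model };
    } catch (err) {
      // Record the failure and move on to the next candidate device.
      errors.push({ device, message: String(err?.message ?? err) });
    }
  }
  const summary = errors.map((e) => `${e.device}: ${e.message}`).join('; ');
  throw new Error(`No device could load the model (${summary})`);
}

// Device names taken from the comment above, in the order to try them.
const candidateDevices = ['webgpu', 'webnn-npu', 'webnn-gpu', 'webnn-cpu', 'webnn', 'wasm'];
```

Collecting the per-device error messages, rather than stopping at the first failure, makes it easier to report which backends fail and why.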

@pdufour
Contributor

pdufour commented Nov 13, 2024

This is great, will this PR support Qwen2-VL? 🙏

@xenova xenova mentioned this pull request Nov 20, 2024
@xenova
Collaborator Author

xenova commented Nov 20, 2024

> This is great, will this PR support Qwen2-VL? 🙏

Hey @pdufour, I was originally planning on doing this in a separate PR, but I've been following your work on getting it running (great work BTW!), so it might be possible to squeeze it into this PR! 👀

@xenova xenova changed the title [WIP] Add support for deepseek-ai/Janus-1.3B (any-to-any) Add new models (Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, MGP-STR) & refactor processors. Nov 26, 2024
@xenova xenova changed the title Add new models (Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, MGP-STR) & refactor processors. Add new models (Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR) & refactor processors. Nov 26, 2024
@xenova xenova marked this pull request as ready for review November 26, 2024 12:57
@xenova xenova merged commit e848907 into main Nov 26, 2024
4 checks passed
@xenova
Collaborator Author

xenova commented Nov 26, 2024

Merging to put out Transformers.js v3.1. Follow-up patches may be needed, but it's good to go for now imo!

@xenova xenova deleted the add-janus branch November 27, 2024 01:30