LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Code License Model License Python 3.10+

Paper Huggingface Models

💡 Introduction

Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models that co-designs the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long video supervised fine-tuning. However, training on long videos is computationally and memory intensive. To address this, we introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently parallelizes long video training and inference, enabling 2M context length training on 256 GPUs without any gradient checkpointing. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy on a 6,000-frame (more than 1 million tokens) video needle-in-a-haystack test. LongVILA-7B demonstrates strong accuracy on 9 popular video benchmarks, e.g., 65.1% on VideoMME with subtitles. In addition, MM-SP is 2.1x to 5.7x faster than ring-style sequence parallelism and 1.1x to 1.4x faster than Megatron with hybrid context and tensor parallelism. It also integrates seamlessly with Hugging Face Transformers.

Training Pipeline

Long video SFT Dataset

Multi-Modal Sequence Parallelism System

6000-frame Needle in the Haystack (More than 1M context)

Results on 9 benchmarks

Installation

./environment_setup.sh vila

Models

| Model | LLM Size | Context | Training frames | Link |
| --- | --- | --- | --- | --- |
| LongVILA-1.5B-256f | 1.5B | 65536 | 256 | qwen2-1.5b-longvila-256f |
| LongVILA-7B-256f | 7B | 131072 | 256 | qwen2-7b-longvila-256f |
| LongVILA-7B-1M | 7B | 1048576 | 2048 | qwen2-7b-longvila-1M |
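
The Context column is easier to compare against the 64k / 256k / 1M stage names when expressed in units of 1024 tokens. The sketch below does that conversion and also divides each context length by its training frame count. That quotient is only an upper bound on the context positions available per frame; the number of visual tokens LongVILA actually spends on a frame is not stated here, and the helper names are illustrative.

```python
# Values copied from the Models table: name -> (context length, training frames).
MODELS = {
    "LongVILA-1.5B-256f": (65536, 256),
    "LongVILA-7B-256f": (131072, 256),
    "LongVILA-7B-1M": (1048576, 2048),
}


def context_in_k(context_tokens: int) -> int:
    """Express a context length in units of 1024 tokens (64k, 128k, ...)."""
    return context_tokens // 1024


def per_frame_budget(context_tokens: int, frames: int) -> int:
    """Upper bound on context positions per frame (text tokens are ignored)."""
    return context_tokens // frames


for name, (context, frames) in MODELS.items():
    print(
        f"{name}: {context_in_k(context)}k context, "
        f"at most {per_frame_budget(context, frames)} positions per frame"
    )
```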

Datasets

Dataset Usage Link Comments
Stage4-LLM Context Extension 64k 256k 512k 1M Encode SlimPajama via Qwen2 tokenizer
Stage5-LongVideo SFT Data Source long videos from Shot2Story

Training

We conduct continued training (Stage4 and Stage5) on top of a VILA model, as follows.

Stage4: LLM Context Extension

This is the first stage of LongVILA training, in which we extend the LLM in the VILA model to long context using the SlimPajama dataset. For the 7B model, this stage runs on one 8xA100 node for 64k context extension and on at least two 8xA100 nodes for 256k context extension.

bash longvila/train/4_extend_llm_64k.sh [STAGE3_PATH] [OUTPUT_NAME] [DATA_FILE]

The script takes three arguments. STAGE3_PATH points to the trained VILA model. OUTPUT_NAME is the name of the folder under checkpoints that stores the final checkpoint. DATA_FILE is the file that stores the 64k-context SlimPajama data.

bash longvila/train/4_extend_llm_256k.sh [EXTENDED_64k_PATH] [OUTPUT_NAME] [DATA_FILE]

This script continues training progressively from 64k context to 256k context. EXTENDED_64k_PATH points to the OUTPUT_NAME of 4_extend_llm_64k.sh. DATA_FILE is the file that stores the 256k-context SlimPajama data. If you do not need to train on more than 256 frames (e.g., 512 or 1024 frames), you can skip this 256k context step.

The 512k and 1M training scripts follow the same steps.
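
The chaining between the 64k and 256k runs can be sketched as a small command builder. This is a minimal illustration, not part of the repository: it assumes that a run's output lands in `checkpoints/<OUTPUT_NAME>` relative to the working directory, and every path and output name passed in below is a hypothetical placeholder. It only prints the commands and does not launch training.

```python
import shlex
from pathlib import Path

# Assumption: each script writes its final checkpoint to ./checkpoints/<OUTPUT_NAME>.
CHECKPOINT_ROOT = Path("checkpoints")


def stage4_commands(stage3_path, data_64k, data_256k,
                    name_64k="longvila-64k", name_256k="longvila-256k"):
    """Return the two Stage4 command lines, feeding the 64k output into the 256k run."""
    extended_64k_path = CHECKPOINT_ROOT / name_64k
    return [
        ["bash", "longvila/train/4_extend_llm_64k.sh",
         stage3_path, name_64k, data_64k],
        ["bash", "longvila/train/4_extend_llm_256k.sh",
         str(extended_64k_path), name_256k, data_256k],
    ]


# Hypothetical inputs, for illustration only.
for cmd in stage4_commands("checkpoints/vila-stage3",
                           "data/slimpajama_64k",
                           "data/slimpajama_256k"):
    print(shlex.join(cmd))
```

Returning argument lists rather than strings keeps the commands safe to hand to `subprocess.run` later without extra quoting.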

Stage5: Long Supervised fine-tuning

This is the last stage of LongVILA training, in which we tune the model to follow long-video instructions. This stage runs on 32 8xH100 nodes for all configurations (i.e., 256 frames and 512 frames).

bash longvila/train/5_long_sft_256frames.sh [EXTENDED_64k_PATH] [OUTPUT_NAME]
bash longvila/train/5_long_sft_512frames.sh [EXTENDED_256k_PATH] [OUTPUT_NAME]

The scripts take two arguments. EXTENDED_64k_PATH and EXTENDED_256k_PATH point to the OUTPUT_NAME of the corresponding Stage4 script. OUTPUT_NAME is the name of the folder under checkpoints that stores the final checkpoint.

The 1024-frame and 2048-frame training scripts follow the same steps.
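
The pairing between a Stage5 frame count and the Stage4 checkpoint it starts from can be written down as a lookup. The sketch below covers only the two pairings spelled out above (256 frames from the 64k checkpoint, 512 frames from the 256k checkpoint); the 1024-frame and 2048-frame pairings are left out because their script names are not listed here. The function name and the checkpoint paths are hypothetical.

```python
# Pairings stated above: Stage5 frame count -> (script, Stage4 context it starts from).
STAGE5 = {
    256: ("longvila/train/5_long_sft_256frames.sh", "64k"),
    512: ("longvila/train/5_long_sft_512frames.sh", "256k"),
}


def stage5_command(frames, stage4_checkpoints, output_name):
    """Build the Stage5 command for `frames` using the matching Stage4 checkpoint."""
    if frames not in STAGE5:
        raise ValueError(f"no pairing listed for {frames} frames; check the training scripts")
    script, context = STAGE5[frames]
    if context not in stage4_checkpoints:
        raise KeyError(f"{frames}-frame SFT needs a {context} Stage4 checkpoint")
    return ["bash", script, stage4_checkpoints[context], output_name]


# Hypothetical checkpoint path and output name, for illustration only.
checkpoints = {"64k": "checkpoints/longvila-64k"}
print(" ".join(stage5_command(256, checkpoints, "longvila-sft-256f")))
```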

Note

💡Sequence Parallelism Configuration

To enable sequence parallelism, you can set the following parameters in the training script:

seq_parallel_size: The degree of sequence parallelism (SP). SP is disabled by default (value: -1).

seq_parallel_ring_size: The size of the communication process group that uses the optimized Ring Attention approach within SP. Ring Attention is disabled by default in SP.

seq_parallel_ring_type: The Ring Attention implementation. Supported values in 2D attention are ['ring_varlen', 'zigzag_ring_varlen']. Only takes effect when seq_parallel_ring_size > 1.
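
How the three parameters interact can be sketched with a small config object. This is an illustration only, not the repository's argument parser: the class name is made up, the defaults for `seq_parallel_ring_size` and `seq_parallel_ring_type` are assumptions (only the -1 default of `seq_parallel_size` is documented above), and treating a size of 1 as "disabled" is also an assumption.

```python
from dataclasses import dataclass

SUPPORTED_RING_TYPES = ("ring_varlen", "zigzag_ring_varlen")


@dataclass
class SeqParallelConfig:
    seq_parallel_size: int = -1                  # documented default: SP disabled
    seq_parallel_ring_size: int = -1             # assumption: -1 means Ring Attention off
    seq_parallel_ring_type: str = "ring_varlen"  # assumption: default is not documented

    @property
    def sp_enabled(self) -> bool:
        # Assumption: a degree of 1 is a no-op and counts as disabled.
        return self.seq_parallel_size > 1

    @property
    def ring_enabled(self) -> bool:
        # The ring type only matters when the ring group has more than one rank.
        return self.sp_enabled and self.seq_parallel_ring_size > 1

    def validate(self) -> None:
        if self.ring_enabled and self.seq_parallel_ring_type not in SUPPORTED_RING_TYPES:
            raise ValueError(f"unsupported ring type: {self.seq_parallel_ring_type!r}")


cfg = SeqParallelConfig(seq_parallel_size=8, seq_parallel_ring_size=4,
                        seq_parallel_ring_type="zigzag_ring_varlen")
cfg.validate()
print(cfg.sp_enabled, cfg.ring_enabled)
```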

Please note that when SP is enabled, we treat each group of seq_parallel_size GPUs as a single device, with the global batch size calculated as the product of the per-device batch size and the data parallelism size.
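
As a worked example of this rule, the sketch below computes the data parallelism size and the global batch size from the total GPU count. The function names are illustrative, gradient accumulation is deliberately ignored, and the requirement that the GPU count be divisible by `seq_parallel_size` is an assumption. The GPU count matches the 32 8xH100 nodes used in Stage5; the SP degree of 8 is a made-up example value.

```python
def data_parallel_size(world_size: int, seq_parallel_size: int) -> int:
    """Number of SP groups; each group acts as one data-parallel 'device'."""
    sp = seq_parallel_size if seq_parallel_size > 1 else 1  # -1 means SP is off
    if world_size % sp != 0:
        # Assumption: GPUs must split evenly into SP groups.
        raise ValueError("world_size must be divisible by seq_parallel_size")
    return world_size // sp


def global_batch_size(world_size: int, seq_parallel_size: int,
                      per_device_batch_size: int) -> int:
    """Per-device batch size times the data parallelism size."""
    return per_device_batch_size * data_parallel_size(world_size, seq_parallel_size)


gpus = 32 * 8  # 32 nodes with 8 GPUs each
print(global_batch_size(gpus, seq_parallel_size=8, per_device_batch_size=1))   # prints 32
print(global_batch_size(gpus, seq_parallel_size=-1, per_device_batch_size=1))  # prints 256
```

With SP enabled, raising `seq_parallel_size` shrinks the global batch size unless the per-device batch size is raised to compensate.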

Evaluations

Needle in the Haystack Experiments

bash scripts/eval/needle.sh LongVILA-7B-1M Efficient-Large-Model/qwen2-7b-longvila-1M $VIDEO_PATH 6000 300

Benchmarks

vila-eval -m Efficient-Large-Model/LongVILA-7B-256f -c auto -nf $NUM_VIDEO_FRAMES -t $TASKS

TASKS can be chosen from {lmms-videomme-256,lmms-videomme_w_subtitle-256,vnbench_val,lmms-activitynetqa,egoschema_test,egoschema_val,eventbench_val,lmms-longvideobench_val_v,lmms-perceptiontest_val_mc,lmms-mvbench,lmms-nextqa_mc_test}. We set NUM_VIDEO_FRAMES to 256 for videomme, 128 for vnbench, and 32 for the others.
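
The frame settings can be encoded once and reused to generate one `vila-eval` command per task. This is a minimal sketch: the helper names are illustrative, the matching on task-name substrings is an assumption about how the listed names map to the three settings, and the commands are printed rather than executed.

```python
TASKS = [
    "lmms-videomme-256", "lmms-videomme_w_subtitle-256", "vnbench_val",
    "lmms-activitynetqa", "egoschema_test", "egoschema_val", "eventbench_val",
    "lmms-longvideobench_val_v", "lmms-perceptiontest_val_mc", "lmms-mvbench",
    "lmms-nextqa_mc_test",
]


def num_video_frames(task: str) -> int:
    """256 for videomme tasks, 128 for vnbench, 32 for everything else."""
    if "videomme" in task:
        return 256
    if task.startswith("vnbench"):
        return 128
    return 32


def eval_command(model: str, task: str) -> list:
    """Build the vila-eval argument list for a single task."""
    return ["vila-eval", "-m", model, "-c", "auto",
            "-nf", str(num_video_frames(task)), "-t", task]


for task in TASKS:
    print(" ".join(eval_command("Efficient-Large-Model/LongVILA-7B-256f", task)))
```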

🔒 License

  • The code is released under the Apache 2.0 license as found in the LICENSE file.
  • The pretrained weights are released under the CC-BY-NC-SA-4.0 license.
  • The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:

Citations

@article{longvila,
      title={LongVILA: Scaling Long-Context Visual Language Models for Long Videos},
      author={Yukang Chen and Fuzhao Xue and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Yihui He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},
      year={2024},
      eprint={2408.10188},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement