
How to quantize NVILA with awq? #174

Open
Kibry-spin opened this issue Dec 30, 2024 · 2 comments
Comments

Kibry-spin commented Dec 30, 2024

(vila) kirdo@kirdo-System-Product-Name:~/LLM/llm-awq$ python -m awq.entry --model_path /home/kirdo/LLM/NVILA-8B-Video/ --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/$MODEL-w4-g128.pt
Quantization config: {'zero_point': True, 'q_group_size': 128}

  • Building model /home/kirdo/LLM/NVILA-8B-Video/
    [2024-12-30 19:26:13,027] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.05it/s]
    You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
    Repo card metadata block was not found. Setting CardData to empty.
    Token indices sequence length is longer than the specified maximum sequence length for this model (57053 > 16384). Running this sequence through the model will result in indexing errors
  • Split into 59 blocks
    Traceback (most recent call last):
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
    File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 352, in
    main()
    File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 293, in main
    model, enc = build_model_and_enc(args.model_path)
    File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 199, in build_model_and_enc
    awq_results = run_awq(
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
    File "/home/kirdo/LLM/llm-awq/awq/quantize/pre_quant.py", line 136, in run_awq
    model.llm(samples.to(next(model.parameters()).device))
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
    outputs = self.model(
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 871, in forward
    position_embeddings = self.rotary_emb(hidden_states, position_ids)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 163, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

I just followed the instructions at https://github.com/mit-han-lab/llm-awq to quantize NVILA-8B-Video, but when running the AWQ search, I got the error shown above.
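
The traceback shows that the call in awq/quantize/pre_quant.py (line 136) already moves the calibration samples to the device of the first model parameter, so the mismatch suggests some weights or buffers of the Qwen2 backbone (e.g. the rotary-embedding inv_freq) stayed on CPU while the samples landed on cuda:0. As a rough, hypothetical workaround sketch (not the repo's official fix; it assumes model and samples are the variables visible at that point in run_awq, as the traceback indicates), forcing the whole LLM and the samples onto one device would avoid this particular bmm error:

    # Hypothetical patch sketch around awq/quantize/pre_quant.py:136.
    # Assumes `model.llm` (the Qwen2 backbone) and `samples` exist at this
    # point, as shown in the traceback above.
    import torch

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.llm.to(device)             # moves weights *and* buffers (e.g. rotary inv_freq)
    model.llm(samples.to(device))    # run the calibration forward pass on a single device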

ys-2020 (Collaborator) commented Jan 8, 2025

It looks like some components of the model were unexpectedly allocated on CPU. @Louym Could you please take a look when convenient? Thanks.
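
One way to confirm which parts ended up on CPU (a generic PyTorch check, not anything specific to llm-awq; "model" below stands for whatever build_model_and_enc returns) is to list every parameter and buffer that is not on the GPU:

    # Generic diagnostic: report tensors that stayed on CPU.
    import torch

    def report_cpu_tensors(model: torch.nn.Module) -> None:
        for name, p in model.named_parameters():
            if p.device.type == "cpu":
                print("parameter on CPU:", name)
        for name, b in model.named_buffers():
            if b.device.type == "cpu":
                print("buffer on CPU:", name)

    # report_cpu_tensors(model)  # e.g. right after build_model_and_enc(args.model_path)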


Louym commented Jan 10, 2025

Thank you for reaching out. This seems a bit unusual. If you're looking to obtain the quantized weights or AWQ scales of the LLM part, you may need to append /llm to the --model_path. You should also use --vila-20 for NVILA models. You can refer to the instructions here to quantize NVILA easily.
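
For example (a hypothetical adaptation of the command from this issue, not a verified invocation; the /llm suffix and the --vila-20 flag follow the advice above, and exact flag spellings should be checked against the llm-awq README):

    # Set MODEL to a name of your choice for the output file, e.g.:
    MODEL=nvila-8b-video
    python -m awq.entry \
        --model_path /home/kirdo/LLM/NVILA-8B-Video/llm \
        --w_bit 4 --q_group_size 128 \
        --run_awq --vila-20 \
        --dump_awq awq_cache/$MODEL-w4-g128.pt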

I’ve tested our commands, and they appear to work without any errors. Could you provide more details about the issue you’re encountering?
