
How to quantize NVILA with awq? #174

Open
Kibry-spin opened this issue Dec 30, 2024 · 2 comments
Comments

Kibry-spin commented Dec 30, 2024

(vila) kirdo@kirdo-System-Product-Name:~/LLM/llm-awq$ python -m awq.entry --model_path /home/kirdo/LLM/NVILA-8B-Video/ --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/$MODEL-w4-g128.pt
Quantization config: {'zero_point': True, 'q_group_size': 128}

  • Building model /home/kirdo/LLM/NVILA-8B-Video/
    [2024-12-30 19:26:13,027] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.05it/s]
    You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
    Repo card metadata block was not found. Setting CardData to empty.
    Token indices sequence length is longer than the specified maximum sequence length for this model (57053 > 16384). Running this sequence through the model will result in indexing errors
  • Split into 59 blocks
    Traceback (most recent call last):
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
    File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 352, in
    main()
    File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 293, in main
    model, enc = build_model_and_enc(args.model_path)
    File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 199, in build_model_and_enc
    awq_results = run_awq(
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
    File "/home/kirdo/LLM/llm-awq/awq/quantize/pre_quant.py", line 136, in run_awq
    model.llm(samples.to(next(model.parameters()).device))
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
    outputs = self.model(
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 871, in forward
    position_embeddings = self.rotary_emb(hidden_states, position_ids)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
    File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 163, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

I just followed the instructions at https://github.com/mit-han-lab/llm-awq to quantize NVILA-8B-Video, but when running the AWQ search, I got the error shown above.
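
The traceback shows that the call in awq/quantize/pre_quant.py (line 136) already moves the calibration samples to the device of the first model parameter, so the mismatch suggests some weights or buffers of the Qwen2 backbone (e.g. the rotary-embedding inv_freq) stayed on CPU while the samples landed on cuda:0. As a rough, hypothetical workaround sketch (not the repo's official fix; it assumes model and samples are the variables visible at that point in run_awq, as the traceback indicates), forcing the whole LLM and the samples onto one device would avoid this particular bmm error:

    # Hypothetical patch sketch around awq/quantize/pre_quant.py:136.
    # Assumes `model.llm` (the Qwen2 backbone) and `samples` exist at this
    # point, as shown in the traceback above.
    import torch

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model.llm.to(device)             # moves weights *and* buffers (e.g. rotary inv_freq)
    model.llm(samples.to(device))    # run the calibration forward pass on a single device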

ys-2020 (Collaborator) commented Jan 8, 2025

It looks like some components of the model were unexpectedly allocated on CPU. @Louym Could you please take a look when convenient? Thanks.
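
One way to confirm which parts ended up on CPU (a generic PyTorch check, not anything specific to llm-awq; "model" below stands for whatever build_model_and_enc returns) is to list every parameter and buffer that is not on the GPU:

    # Generic diagnostic: report tensors that stayed on CPU.
    import torch

    def report_cpu_tensors(model: torch.nn.Module) -> None:
        for name, p in model.named_parameters():
            if p.device.type == "cpu":
                print("parameter on CPU:", name)
        for name, b in model.named_buffers():
            if b.device.type == "cpu":
                print("buffer on CPU:", name)

    # report_cpu_tensors(model)  # e.g. right after build_model_and_enc(args.model_path)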


Louym commented Jan 10, 2025

Thank you for reaching out. This seems a bit unusual. If you're looking to obtain the quantized weights or AWQ scales of the LLM part, you may need to append /llm to the --model_path. You should also use --vila-20 for NVILA models. You can refer to the instructions here to quantize NVILA easily.
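
For example (a hypothetical adaptation of the command from this issue, not a verified invocation; the /llm suffix and the --vila-20 flag follow the advice above, and exact flag spellings should be checked against the llm-awq README):

    # Set MODEL to a name of your choice for the output file, e.g.:
    MODEL=nvila-8b-video
    python -m awq.entry \
        --model_path /home/kirdo/LLM/NVILA-8B-Video/llm \
        --w_bit 4 --q_group_size 128 \
        --run_awq --vila-20 \
        --dump_awq awq_cache/$MODEL-w4-g128.pt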

I’ve tested our commands, and they appear to work without any errors. Could you provide more details about the issue you’re encountering?
