You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Building model /home/kirdo/LLM/NVILA-8B-Video/
[2024-12-30 19:26:13,027] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.05it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (57053 > 16384). Running this sequence through the model will result in indexing errors
Split into 59 blocks
Traceback (most recent call last):
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 352, in
main()
File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 293, in main
model, enc = build_model_and_enc(args.model_path)
File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 199, in build_model_and_enc
awq_results = run_awq(
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/kirdo/LLM/llm-awq/awq/quantize/pre_quant.py", line 136, in run_awq
model.llm(samples.to(next(model.parameters()).device))
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
outputs = self.model(
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 871, in forward
position_embeddings = self.rotary_emb(hidden_states, position_ids)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 163, in forward
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
Thank you for reaching out. It seems a bit unusual. If you're looking to obtain the quantized weights or AWQ scales of LLM part, you might need to add /llm to the --model_path. You shoould also use --vila-20 for NVILA models. You can refer to the instructions here to easily quantize NVILA.
I’ve tested our commands, and they appear to work without any errors. Could you provide more details about the issue you’re encountering?
(vila) kirdo@kirdo-System-Product-Name:~/LLM/llm-awq$ python -m awq.entry --model_path /home/kirdo/LLM/NVILA-8B-Video/ --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/$MODEL-w4-g128.pt
Quantization config: {'zero_point': True, 'q_group_size': 128}
[2024-12-30 19:26:13,027] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.05it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with
model.to('cuda')
.Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (57053 > 16384). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 352, in
main()
File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 293, in main
model, enc = build_model_and_enc(args.model_path)
File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 199, in build_model_and_enc
awq_results = run_awq(
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/kirdo/LLM/llm-awq/awq/quantize/pre_quant.py", line 136, in run_awq
model.llm(samples.to(next(model.parameters()).device))
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
outputs = self.model(
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 871, in forward
position_embeddings = self.rotary_emb(hidden_states, position_ids)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 163, in forward
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
I just follow the https://github.com/mit-han-lab/llm-awq.git to quantize NVILA-8B-Video,but when running awq search, i got an problem shown above.
The text was updated successfully, but these errors were encountered: