[Usage]: Failure to Init Qwen2.5-VL-7B-Instruct with inflight bnb quantization #12900

Closed
MotorBottle opened this issue Feb 7, 2025 · 11 comments · Fixed by #12905
Labels: usage (How to use vllm)

@MotorBottle commented Feb 7, 2025

Your current environment

docker vllm-openai:v0.7.2 with latest transformers installed

How would you like to use vllm

Hi, I'm trying to launch Qwen2.5-VL-7B-Instruct with bitsandbytes (bnb) in-flight quantization, but it fails with the following error:

 (AssertionError: param_data.shape == loaded_weight.shape) 

I was able to run this model at full precision with Docker. Below is how I launch the full-precision version:

# Note: qwen-vl-fixed is the vllm-openai:v0.7.2 image with a newer transformers installed, committed as a new image.
sudo docker run --runtime nvidia --gpus '"device=0,1"' --ipc=host -p 18434:8000 \
   -v hf_cache:/root/.cache/huggingface -d \
   --name qwen2.5-vl-7b \
   --entrypoint "python3" qwen-vl-fixed \
   -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-7B-Instruct \
   --tensor-parallel-size 2 --trust-remote-code --max-model-len 18000 --dtype half
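For context, a sketch of how an image like qwen-vl-fixed can be produced from the official v0.7.2 image (the exact transformers revision is an assumption; any recent build with Qwen2.5-VL support should do):

docker run -it --name qwen-vl-tmp --entrypoint bash vllm/vllm-openai:v0.7.2
# inside the container, install a transformers build that includes Qwen2.5-VL
pip install -U git+https://github.com/huggingface/transformers
exit
docker commit qwen-vl-tmp qwen-vl-fixed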

When I added --quantization bitsandbytes --load-format bitsandbytes to the docker command, launching the model with bnb 4-bit in-flight quantization failed. #12604 says this model is supported, and I wonder if the dtype is the cause of the error (my 2080 Ti Turing GPUs only support float16, not bfloat16).
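For reference, the failing launch was essentially the command above with the two quantization flags appended (a sketch; the container name here is illustrative, and the exact --tensor-parallel-size / --max-model-len values varied between attempts, as the logs below show):

sudo docker run --runtime nvidia --gpus '"device=0,1"' --ipc=host -p 18434:8000 \
   -v hf_cache:/root/.cache/huggingface -d \
   --name qwen2.5-vl-7b-bnb \
   --entrypoint "python3" qwen-vl-fixed \
   -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-7B-Instruct \
   --tensor-parallel-size 2 --trust-remote-code --max-model-len 18000 --dtype half \
   --quantization bitsandbytes --load-format bitsandbytes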

Below is the full error log:

INFO 02-07 05:08:11 __init__.py:190] Automatically detected platform cuda.

INFO 02-07 05:08:13 api_server.py:840] vLLM API server version 0.7.2

INFO 02-07 05:08:13 api_server.py:841] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-VL-7B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='bitsandbytes', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', max_model_len=12000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization='bitsandbytes', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)

INFO 02-07 05:08:13 api_server.py:206] Started engine process with PID 77

WARNING 02-07 05:08:17 config.py:2386] Casting torch.bfloat16 to torch.float16.

INFO 02-07 05:08:18 __init__.py:190] Automatically detected platform cuda.

WARNING 02-07 05:08:23 config.py:2386] Casting torch.bfloat16 to torch.float16.

INFO 02-07 05:08:24 config.py:542] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.

WARNING 02-07 05:08:24 config.py:621] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.

INFO 02-07 05:08:31 config.py:542] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.

WARNING 02-07 05:08:31 config.py:621] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.

INFO 02-07 05:08:33 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='Qwen/Qwen2.5-VL-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=12000, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 

INFO 02-07 05:08:34 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.

INFO 02-07 05:08:34 cuda.py:227] Using XFormers backend.

INFO 02-07 05:08:34 model_runner.py:1110] Starting to load model Qwen/Qwen2.5-VL-7B-Instruct...

INFO 02-07 05:08:35 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]

INFO 02-07 05:08:35 loader.py:1102] Loading weights with BitsAndBytes quantization.  May take a while ...

INFO 02-07 05:08:36 weight_utils.py:252] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]

ERROR 02-07 05:08:37 engine.py:389] 

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine

    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args

    return cls(ipc_path=ipc_path,

           ^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 75, in __init__

    self.engine = LLMEngine(*args, **kwargs)

                  ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 273, in __init__

    self.model_executor = executor_class(vllm_config=vllm_config, )

                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 51, in __init__

    self._init_executor()

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 42, in _init_executor

    self.collective_rpc("load_model")

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc

    answer = run_method(self.driver_worker, method, args, kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2220, in run_method

    return func(*args, **kwargs)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model

    self.model_runner.load_model()

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1112, in load_model

    self.model = get_model(vllm_config=self.vllm_config)

Process SpawnProcess-1:

                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model

    return loader.load_model(vllm_config=vllm_config)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1225, in load_model

    self._load_weights(model_config, model)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1135, in _load_weights

    loaded_weights = model.load_weights(qweight_iterator)

                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1124, in load_weights

    return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 235, in load_weights

    autoloaded_weights = set(self._load_module("", self.module, weights))

                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 196, in _load_module

    yield from self._load_module(prefix,

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 173, in _load_module

    loaded_params = module_load_weights(weights)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 515, in load_weights

    return loader.load_weights(weights)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 235, in load_weights

    autoloaded_weights = set(self._load_module("", self.module, weights))

                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 196, in _load_module

    yield from self._load_module(prefix,

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 173, in _load_module

    loaded_params = module_load_weights(weights)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 400, in load_weights

    weight_loader(param, loaded_weight, shard_id)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 589, in weight_loader

    assert param_data.shape == loaded_weight.shape

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AssertionError

Traceback (most recent call last):

  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap

    self.run()

  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run

    self._target(*self._args, **self._kwargs)

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine

    raise e

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine

    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args

    return cls(ipc_path=ipc_path,

           ^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 75, in __init__

    self.engine = LLMEngine(*args, **kwargs)

                  ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 273, in __init__

    self.model_executor = executor_class(vllm_config=vllm_config, )

                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 51, in __init__

    self._init_executor()

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 42, in _init_executor

    self.collective_rpc("load_model")

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc

    answer = run_method(self.driver_worker, method, args, kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2220, in run_method

    return func(*args, **kwargs)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model

    self.model_runner.load_model()

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1112, in load_model

    self.model = get_model(vllm_config=self.vllm_config)

                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model

    return loader.load_model(vllm_config=vllm_config)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1225, in load_model

    self._load_weights(model_config, model)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1135, in _load_weights

    loaded_weights = model.load_weights(qweight_iterator)

                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1124, in load_weights

    return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 235, in load_weights

    autoloaded_weights = set(self._load_module("", self.module, weights))

                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 196, in _load_module

    yield from self._load_module(prefix,

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 173, in _load_module

    loaded_params = module_load_weights(weights)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 515, in load_weights

    return loader.load_weights(weights)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 235, in load_weights

    autoloaded_weights = set(self._load_module("", self.module, weights))

                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 196, in _load_module

    yield from self._load_module(prefix,

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 173, in _load_module

    loaded_params = module_load_weights(weights)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 400, in load_weights

    weight_loader(param, loaded_weight, shard_id)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 589, in weight_loader

    assert param_data.shape == loaded_weight.shape

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AssertionError


Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]

Traceback (most recent call last):

  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>

    uvloop.run(run_server(args))

  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run

    return __asyncio.run(

           ^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run

    return runner.run(main)

           ^^^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run

    return self._loop.run_until_complete(task)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete

  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper

    return await main

           ^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server

    async with build_async_engine_client(args) as engine_client:

               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

    return await anext(self.gen)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client

    async with build_async_engine_client_from_engine_args(

               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

    return await anext(self.gen)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args

    raise RuntimeError(

RuntimeError: Engine process failed to start. See stack trace for the root cause.

MotorBottle added the usage label Feb 7, 2025
MotorBottle mentioned this issue Feb 7, 2025
MotorBottle marked this as a duplicate of #12899 Feb 7, 2025
jeejeelee self-assigned this Feb 7, 2025
@jeejeelee (Collaborator)

I will try to fix this ASAP.

@jlia0 commented Feb 7, 2025

@jeejeelee There is a related issue here as well: #12902

@jeejeelee (Collaborator)

@jlia0 @MotorBottle Could you please verify if #12905 can resolve your issue?

@jlia0 commented Feb 7, 2025

> @jlia0 @MotorBottle Could you please verify if #12905 can resolve your issue?

Much appreciated for the quick response and action. Tested working for the 7B model. [screenshot]

Is stream=true working for you? It seems like it only streams the output once the completion is done.
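(For reference, a minimal way to check streaming against vLLM's OpenAI-compatible endpoint; the host/port and model name below assume the deployment shown earlier and should be adjusted to yours:)

curl http://localhost:18434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen2.5-VL-7B-Instruct",
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "stream": true
        }'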

@MotorBottle (Author)

> @jlia0 @MotorBottle Could you please verify if #12905 can resolve your issue?

Sorry, I made a mistake with the deployment (I missed the bnb quant flags). The deployment is still unsuccessful. New console log below:

INFO 02-07 10:45:23 __init__.py:190] Automatically detected platform cuda.

INFO 02-07 10:45:24 api_server.py:840] vLLM API server version 0.7.2

INFO 02-07 10:45:24 api_server.py:841] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-VL-7B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='bitsandbytes', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization='bitsandbytes', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)

INFO 02-07 10:45:24 api_server.py:206] Started engine process with PID 77

WARNING 02-07 10:45:28 config.py:2386] Casting torch.bfloat16 to torch.float16.

INFO 02-07 10:45:29 __init__.py:190] Automatically detected platform cuda.

WARNING 02-07 10:45:34 config.py:2386] Casting torch.bfloat16 to torch.float16.

INFO 02-07 10:45:37 config.py:542] This model supports multiple tasks: {'generate', 'classify', 'embed', 'reward', 'score'}. Defaulting to 'generate'.

WARNING 02-07 10:45:37 config.py:621] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.

INFO 02-07 10:45:37 config.py:1401] Defaulting to use mp for distributed inference

INFO 02-07 10:45:42 config.py:542] This model supports multiple tasks: {'embed', 'generate', 'reward', 'classify', 'score'}. Defaulting to 'generate'.

WARNING 02-07 10:45:42 config.py:621] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.

INFO 02-07 10:45:42 config.py:1401] Defaulting to use mp for distributed inference

INFO 02-07 10:45:44 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='Qwen/Qwen2.5-VL-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 

WARNING 02-07 10:45:46 multiproc_worker_utils.py:300] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.

INFO 02-07 10:45:46 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager

(VllmWorkerProcess pid=352) INFO 02-07 10:45:46 multiproc_worker_utils.py:229] Worker ready; awaiting tasks

INFO 02-07 10:45:46 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.

INFO 02-07 10:45:46 cuda.py:227] Using XFormers backend.

(VllmWorkerProcess pid=352) INFO 02-07 10:45:46 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.

(VllmWorkerProcess pid=352) INFO 02-07 10:45:46 cuda.py:227] Using XFormers backend.

INFO 02-07 10:45:47 utils.py:950] Found nccl from library libnccl.so.2

(VllmWorkerProcess pid=352) INFO 02-07 10:45:47 utils.py:950] Found nccl from library libnccl.so.2

INFO 02-07 10:45:47 pynccl.py:69] vLLM is using nccl==2.21.5

(VllmWorkerProcess pid=352) INFO 02-07 10:45:47 pynccl.py:69] vLLM is using nccl==2.21.5

INFO 02-07 10:45:47 custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json

INFO 02-07 10:46:04 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json

WARNING 02-07 10:46:04 custom_all_reduce.py:145] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.

(VllmWorkerProcess pid=352) INFO 02-07 10:46:04 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json

(VllmWorkerProcess pid=352) WARNING 02-07 10:46:04 custom_all_reduce.py:145] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.

INFO 02-07 10:46:04 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_e9df16b1'), local_subscribe_port=56677, remote_subscribe_port=None)

INFO 02-07 10:46:04 model_runner.py:1110] Starting to load model Qwen/Qwen2.5-VL-7B-Instruct...

(VllmWorkerProcess pid=352) INFO 02-07 10:46:04 model_runner.py:1110] Starting to load model Qwen/Qwen2.5-VL-7B-Instruct...

INFO 02-07 10:46:04 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]

(VllmWorkerProcess pid=352) INFO 02-07 10:46:04 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]

INFO 02-07 10:46:04 loader.py:1102] Loading weights with BitsAndBytes quantization.  May take a while ...

(VllmWorkerProcess pid=352) INFO 02-07 10:46:04 loader.py:1102] Loading weights with BitsAndBytes quantization.  May take a while ...

INFO 02-07 10:46:05 weight_utils.py:252] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]

(VllmWorkerProcess pid=352) INFO 02-07 10:46:06 weight_utils.py:252] Using model weights format ['*.safetensors']

ERROR 02-07 10:46:07 engine.py:389] shape '[3, 16, 80, 1280]' is invalid for input of size 1228800

ERROR 02-07 10:46:07 engine.py:389] Traceback (most recent call last):

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine

ERROR 02-07 10:46:07 engine.py:389]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,

ERROR 02-07 10:46:07 engine.py:389]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args

ERROR 02-07 10:46:07 engine.py:389]     return cls(ipc_path=ipc_path,

ERROR 02-07 10:46:07 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 75, in __init__

ERROR 02-07 10:46:07 engine.py:389]     self.engine = LLMEngine(*args, **kwargs)

ERROR 02-07 10:46:07 engine.py:389]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 273, in __init__

ERROR 02-07 10:46:07 engine.py:389]     self.model_executor = executor_class(vllm_config=vllm_config, )

ERROR 02-07 10:46:07 engine.py:389]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 262, in __init__

ERROR 02-07 10:46:07 engine.py:389]     super().__init__(*args, **kwargs)

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 51, in __init__

ERROR 02-07 10:46:07 engine.py:389]     self._init_executor()

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/mp_distributed_executor.py", line 125, in _init_executor

ERROR 02-07 10:46:07 engine.py:389]     self._run_workers("load_model",

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers

ERROR 02-07 10:46:07 engine.py:389]     driver_worker_output = run_method(self.driver_worker, sent_method,

ERROR 02-07 10:46:07 engine.py:389]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2220, in run_method

ERROR 02-07 10:46:07 engine.py:389]     return func(*args, **kwargs)

ERROR 02-07 10:46:07 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model

ERROR 02-07 10:46:07 engine.py:389]     self.model_runner.load_model()

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1112, in load_model

ERROR 02-07 10:46:07 engine.py:389]     self.model = get_model(vllm_config=self.vllm_config)

ERROR 02-07 10:46:07 engine.py:389]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model

ERROR 02-07 10:46:07 engine.py:389]     return loader.load_model(vllm_config=vllm_config)

ERROR 02-07 10:46:07 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1225, in load_model

ERROR 02-07 10:46:07 engine.py:389]     self._load_weights(model_config, model)

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1135, in _load_weights

ERROR 02-07 10:46:07 engine.py:389]     loaded_weights = model.load_weights(qweight_iterator)

ERROR 02-07 10:46:07 engine.py:389]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1128, in load_weights

ERROR 02-07 10:46:07 engine.py:389]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)

ERROR 02-07 10:46:07 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 235, in load_weights

ERROR 02-07 10:46:07 engine.py:389]     autoloaded_weights = set(self._load_module("", self.module, weights))

ERROR 02-07 10:46:07 engine.py:389]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 196, in _load_module

ERROR 02-07 10:46:07 engine.py:389]     yield from self._load_module(prefix,

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 173, in _load_module

ERROR 02-07 10:46:07 engine.py:389]     loaded_params = module_load_weights(weights)

ERROR 02-07 10:46:07 engine.py:389]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 672, in load_weights

ERROR 02-07 10:46:07 engine.py:389]     loaded_weight = loaded_weight.view(3, visual_num_heads,

ERROR 02-07 10:46:07 engine.py:389]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 106, in __torch_function__

ERROR 02-07 10:46:07 engine.py:389]     return func(*args, **kwargs)

ERROR 02-07 10:46:07 engine.py:389]            ^^^^^^^^^^^^^^^^^^^^^

ERROR 02-07 10:46:07 engine.py:389] RuntimeError: shape '[3, 16, 80, 1280]' is invalid for input of size 1228800

Process SpawnProcess-1:

ERROR 02-07 10:46:07 multiproc_worker_utils.py:124] Worker VllmWorkerProcess pid 352 died, exit code: -15

INFO 02-07 10:46:07 multiproc_worker_utils.py:128] Killing local vLLM worker processes

Traceback (most recent call last):

  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap

    self.run()

  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run

    self._target(*self._args, **self._kwargs)

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine

    raise e

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine

    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args

    return cls(ipc_path=ipc_path,

           ^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 75, in __init__

    self.engine = LLMEngine(*args, **kwargs)

                  ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 273, in __init__

    self.model_executor = executor_class(vllm_config=vllm_config, )

                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 262, in __init__

    super().__init__(*args, **kwargs)

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 51, in __init__

    self._init_executor()

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/mp_distributed_executor.py", line 125, in _init_executor

    self._run_workers("load_model",

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers

    driver_worker_output = run_method(self.driver_worker, sent_method,

                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2220, in run_method

    return func(*args, **kwargs)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model

    self.model_runner.load_model()

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1112, in load_model

    self.model = get_model(vllm_config=self.vllm_config)

                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model

    return loader.load_model(vllm_config=vllm_config)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1225, in load_model

    self._load_weights(model_config, model)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1135, in _load_weights

    loaded_weights = model.load_weights(qweight_iterator)

                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1128, in load_weights

    return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 235, in load_weights

    autoloaded_weights = set(self._load_module("", self.module, weights))

                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 196, in _load_module

    yield from self._load_module(prefix,

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 173, in _load_module

    loaded_params = module_load_weights(weights)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 672, in load_weights

    loaded_weight = loaded_weight.view(3, visual_num_heads,

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 106, in __torch_function__

    return func(*args, **kwargs)

           ^^^^^^^^^^^^^^^^^^^^^

RuntimeError: shape '[3, 16, 80, 1280]' is invalid for input of size 1228800


Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:01<?, ?it/s]

[rank0]:[W207 10:46:08.409297655 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())

Traceback (most recent call last):

  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>

    uvloop.run(run_server(args))

  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run

    return __asyncio.run(

           ^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run

    return runner.run(main)

           ^^^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run

    return self._loop.run_until_complete(task)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete

  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper

    return await main

           ^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server

    async with build_async_engine_client(args) as engine_client:

               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

    return await anext(self.gen)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client

    async with build_async_engine_client_from_engine_args(

               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

    return await anext(self.gen)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args

    raise RuntimeError(

RuntimeError: Engine process failed to start. See stack trace for the root cause.

MotorBottle reopened this Feb 7, 2025
@MotorBottle (Author)

> @jlia0 @MotorBottle Could you please verify if #12905 can resolve your issue?

Full precision runs well with the modified code, but the quantized version still does not.

@MotorBottle (Author)

> > @jlia0 @MotorBottle Could you please verify if #12905 can resolve your issue?
>
> Much appreciated for the quick response and action. Tested working for the 7B model. [screenshot]
>
> Is stream=true working for you? It seems like it only streams the output once the completion is done.

Full precision model, yes.

@jeejeelee (Collaborator) commented Feb 8, 2025

Oh, I just remembered, we also need to modify something, see: #12604 (comment)

@MotorBottle (Author)

> Oh, I just remembered, we also need to modify something, see: #12604 (comment)

Could you specify what else needs to be modified on top of 0.7.2, besides the changes in #12905?

@jeejeelee (Collaborator)

#12944 can resolve the remaining issue

@MotorBottle (Author)

> #12944 can resolve the remaining issue

Confirmed working. Appreciated.
