[Bug]: Expected there to be 4 prompt updates corresponding to 4 image items, but instead found 3 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs #15338

Open
1 task done
moshilangzi opened this issue Mar 22, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@moshilangzi

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.34

Python version: 3.10.16 | packaged by conda-forge | (main, Dec  5 2024, 14:16:10) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.14.0-570.el9.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.8.61
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
GPU 2: NVIDIA GeForce RTX 4090
GPU 3: NVIDIA GeForce RTX 4090
GPU 4: NVIDIA GeForce RTX 4090
GPU 5: NVIDIA GeForce RTX 4090
GPU 6: NVIDIA GeForce RTX 4090
GPU 7: NVIDIA GeForce RTX 4090

Nvidia driver version: 570.86.16
cuDNN version: Probably one of the following:
/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9
/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn_adv.so.9
/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn_cnn.so.9
/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn_engines_precompiled.so.9
/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn_engines_runtime_compiled.so.9
/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn_graph.so.9
/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn_heuristic.so.9
/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn_ops.so.9
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            GenuineIntel
BIOS Vendor ID:                       Intel(R) Corporation
Model name:                           Intel(R) Xeon(R) Gold 6430
BIOS Model name:                      Intel(R) Xeon(R) Gold 6430
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   32
Socket(s):                            2
Stepping:                             8
CPU(s) scaling MHz:                   24%
CPU max MHz:                          3400.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            3 MiB (64 instances)
L1i cache:                            2 MiB (64 instances)
L2 cache:                             128 MiB (64 instances)
L3 cache:                             120 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-31,64-95
NUMA node1 CPU(s):                    32-63,96-127
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] jj-pytorchvideo==0.1.5
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] onnxruntime==1.21.0
[pip3] pynvml==12.0.0
[pip3] pytorch-lightning==2.5.1
[pip3] pytorch-wpe==0.0.1
[pip3] pyzmq==26.3.0
[pip3] sentence-transformers==3.4.1
[pip3] torch==2.6.0
[pip3] torch-complex==0.4.4
[pip3] torchao==0.9.0
[pip3] torchaudio==2.6.0
[pip3] torchdiffeq==0.2.5
[pip3] torchmetrics==1.7.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.50.0.dev0
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.2.0
[pip3] vector-quantize-pytorch==1.17.3
[pip3] x-transformers==2.1.37
[conda] jj-pytorchvideo           0.1.5                    pypi_0    pypi
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-ml-py              12.570.86                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pynvml                    12.0.0                   pypi_0    pypi
[conda] pytorch-lightning         2.5.1                    pypi_0    pypi
[conda] pytorch-wpe               0.0.1                    pypi_0    pypi
[conda] pyzmq                     26.3.0                   pypi_0    pypi
[conda] sentence-transformers     3.4.1                    pypi_0    pypi
[conda] torch                     2.6.0                    pypi_0    pypi
[conda] torch-complex             0.4.4                    pypi_0    pypi
[conda] torchao                   0.9.0                    pypi_0    pypi
[conda] torchaudio                2.6.0                    pypi_0    pypi
[conda] torchdiffeq               0.2.5                    pypi_0    pypi
[conda] torchmetrics              1.7.0                    pypi_0    pypi
[conda] torchvision               0.21.0                   pypi_0    pypi
[conda] transformers              4.50.0.dev0              pypi_0    pypi
[conda] transformers-stream-generator 0.0.5                    pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi
[conda] vector-quantize-pytorch   1.17.3                   pypi_0    pypi
[conda] x-transformers            2.1.37                   pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    0-31,64-95      0               N/A
GPU1    NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    0-31,64-95      0               N/A
GPU2    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     NODE    NODE    0-31,64-95      0               N/A
GPU3    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     NODE    NODE    0-31,64-95      0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     32-63,96-127    1               N/A
GPU5    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     32-63,96-127    1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     SYS     32-63,96-127    1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     32-63,96-127    1               N/A
NIC0    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      PIX
NIC1    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

CUDA_MODULE_LOADING=LAZY
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1

🐛 Describe the bug

CUDA_VISIBLE_DEVICES=0,1 vllm serve abhishekchohan/gemma-3-27b-it-quantized-W4A16 --limit-mm-per-prompt 'image=4' --max-model-len 16384 --port 11455 --tensor-parallel-size 2 --disable-frontend-multiprocessing
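
For reference, the same startup profiling path can be exercised offline with vLLM's LLM class. A minimal sketch (my own, not from the original report), assuming the same model and image limit:

from vllm import LLM

# Hedged reproduction sketch: constructing the engine runs the same multi-modal
# profiling/validation step that fails in the server log below. Raising the image
# limit past 3 is what triggers the error; no request needs to be sent.
llm = LLM(
    model="abhishekchohan/gemma-3-27b-it-quantized-W4A16",
    max_model_len=16384,
    tensor_parallel_size=2,
    limit_mm_per_prompt={"image": 4},
)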

With `image=3` the server starts fine, but when the image limit is greater than 3, the following error occurs during startup:
/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/transformers/utils/hub.py:106: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
INFO 03-23 00:51:57 [__init__.py:256] Automatically detected platform cuda.
INFO 03-23 00:51:58 [api_server.py:977] vLLM API server version 0.8.1
INFO 03-23 00:51:58 [api_server.py:978] args: Namespace(subparser='serve', model_tag='abhishekchohan/gemma-3-27b-it-quantized-W4A16', config='', host=None, port=11455, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=True, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='abhishekchohan/gemma-3-27b-it-quantized-W4A16', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=16384, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 4}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', 
worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f1180b35120>)
INFO 03-23 00:52:05 [config.py:583] This model supports multiple tasks: {'embed', 'generate', 'score', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 03-23 00:52:06 [config.py:1515] Defaulting to use mp for distributed inference
INFO 03-23 00:52:06 [config.py:1693] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 03-23 00:52:07 [api_server.py:166] V1 is enabled, but got --disable-frontend-multiprocessing. To disable frontend multiprocessing, set VLLM_USE_V1=0.
/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/transformers/utils/hub.py:106: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
INFO 03-23 00:52:12 [__init__.py:256] Automatically detected platform cuda.
INFO 03-23 00:52:15 [core.py:53] Initializing a V1 LLM engine (v0.8.1) with config: model='abhishekchohan/gemma-3-27b-it-quantized-W4A16', speculative_config=None, tokenizer='abhishekchohan/gemma-3-27b-it-quantized-W4A16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=abhishekchohan/gemma-3-27b-it-quantized-W4A16, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-23 00:52:15 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-23 00:52:15 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_fb3fe6ca'), local_subscribe_addr='ipc:///tmp/16744176-fc2a-4c3b-b41d-d61384e41cc5', remote_subscribe_addr=None, remote_addr_ipv6=False)
/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/transformers/utils/hub.py:106: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
INFO 03-23 00:52:19 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-23 00:52:21 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 03-23 00:52:21 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-8a254/VLLM_TRACE_FUNCTION_for_process_1044842_thread_140250240029696_at_2025-03-23_00:52:21.843371.log
WARNING 03-23 00:52:24 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f8d621d54e0>
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:24 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_6488a66e'), local_subscribe_addr='ipc:///tmp/b32a810d-c827-4a21-b4ee-ec4a4256fc43', remote_subscribe_addr=None, remote_addr_ipv6=False)
/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/transformers/utils/hub.py:106: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
INFO 03-23 00:52:28 [__init__.py:256] Automatically detected platform cuda.
WARNING 03-23 00:52:31 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 03-23 00:52:31 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-8a254/VLLM_TRACE_FUNCTION_for_process_1045201_thread_140466852819968_at_2025-03-23_00:52:31.020890.log
WARNING 03-23 00:52:33 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fbfd13ed210>
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:33 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_879ea3f8'), local_subscribe_addr='ipc:///tmp/78c07136-1218-4cb4-9b82-8731982b2ce7', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:34 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:34 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:34 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:34 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:34 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:34 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=1045201) WARNING 03-23 00:52:34 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=1044842) WARNING 03-23 00:52:34 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:34 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_0f0d6ce7'), local_subscribe_addr='ipc:///tmp/f1147929-3b1b-44ba-afcd-fb71098476b8', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:34 [parallel_state.py:967] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:34 [cuda.py:215] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:34 [parallel_state.py:967] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:34 [cuda.py:215] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=1045201) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(VllmWorker rank=0 pid=1044842) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:41 [gpu_model_runner.py:1164] Starting to load model abhishekchohan/gemma-3-27b-it-quantized-W4A16...
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:41 [gpu_model_runner.py:1164] Starting to load model abhishekchohan/gemma-3-27b-it-quantized-W4A16...
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:42 [config.py:3222] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:42 [config.py:3222] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:42 [compressed_tensors_wNa16.py:85] Using MarlinLinearKernel for CompressedTensorsWNA16
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:42 [compressed_tensors_wNa16.py:85] Using MarlinLinearKernel for CompressedTensorsWNA16
(VllmWorker rank=0 pid=1044842) 2025-03-23 00:52:48,586 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
(VllmWorker rank=1 pid=1045201) 2025-03-23 00:52:48,591 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:48 [topk_topp_sampler.py:38] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:48 [topk_topp_sampler.py:38] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.03it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:02<00:02, 1.06s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:03<00:01, 1.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00, 1.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00, 1.23s/it]
(VllmWorker rank=0 pid=1044842)
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:54 [loader.py:429] Loading weights took 5.12 seconds
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:54 [loader.py:429] Loading weights took 5.15 seconds
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:56 [gpu_model_runner.py:1176] Model loading took 8.2413 GB and 14.465126 seconds
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:56 [gpu_model_runner.py:1176] Model loading took 8.2413 GB and 14.544127 seconds
(VllmWorker rank=0 pid=1044842) INFO 03-23 00:52:56 [gpu_model_runner.py:1421] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
(VllmWorker rank=1 pid=1045201) INFO 03-23 00:52:56 [gpu_model_runner.py:1421] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 8 image items of the maximum feature size.
(VllmWorker rank=1 pid=1045201) Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
(VllmWorker rank=0 pid=1044842) Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.
ERROR 03-23 00:52:57 [core.py:340] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-23 00:52:57 [core.py:340] File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 332, in run_engine_core
ERROR 03-23 00:52:57 [core.py:340] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 03-23 00:52:57 [core.py:340] File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 287, in __init__
ERROR 03-23 00:52:57 [core.py:340] super().__init__(vllm_config, executor_class, log_stats)
ERROR 03-23 00:52:57 [core.py:340] File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 62, in __init__
ERROR 03-23 00:52:57 [core.py:340] num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 03-23 00:52:57 [core.py:340] File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 121, in _initialize_kv_caches
ERROR 03-23 00:52:57 [core.py:340] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 03-23 00:52:57 [core.py:340] File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 03-23 00:52:57 [core.py:340] output = self.collective_rpc("determine_available_memory")
ERROR 03-23 00:52:57 [core.py:340] File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 03-23 00:52:57 [core.py:340] raise e
ERROR 03-23 00:52:57 [core.py:340] File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 03-23 00:52:57 [core.py:340] raise result
ERROR 03-23 00:52:57 [core.py:340] RuntimeError: Expected there to be 4 prompt updates corresponding to 4 image items, but instead found 3 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs, or there is a problem with your implementation of merged multi-modal processor for this model (usually arising from an inconsistency between _call_hf_processor and _get_prompt_updates).
ERROR 03-23 00:52:57 [core.py:340]
CRITICAL 03-23 00:52:57 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@moshilangzi moshilangzi added the bug Something isn't working label Mar 22, 2025
@DarkLight1337
Member

Can you pull the latest code? It should be fixed by #14980

@moshilangzi
Author

Can you pull the latest code? It should be fixed by #14980

Still getting the error:
Traceback (most recent call last):
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1002, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message) # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1190, in launch_builtin_model
    await _launch_model()
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1125, in _launch_model
    subpool_address = await _launch_one_model(
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1083, in _launch_one_model
    subpool_address = await worker_ref.launch_builtin_model(
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message) # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xinference/core/utils.py", line 93, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message) # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xinference/core/model.py", line 466, in load
    self._model.load()
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 330, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 677, in from_engine_args
    return async_engine_cls.from_vllm_config(
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 650, in from_vllm_config
    return cls(
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 605, in __init__
    self.engine = self._engine_class(*args, **kwargs)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
    super().__init__(*args, **kwargs)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 283, in __init__
    self._initialize_kv_caches()
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 432, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 102, in determine_num_available_blocks
    results = self.collective_rpc("determine_num_available_blocks")
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 316, in collective_rpc
    return self._run_workers(method, *args, **(kwargs or {}))
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
    driver_worker_output = run_method(self.driver_worker, sent_method,
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/utils.py", line 2255, in run_method
    return func(*args, **kwargs)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/worker/multi_step_model_runner.py", line 669, in profile_run
    return self._base_model_runner.profile_run()
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
    self._dummy_run(max_num_batched_tokens, max_num_seqs)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1308, in _dummy_run
    .dummy_data_for_profiling(self.model_config,
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/inputs/registry.py", line 342, in dummy_data_for_profiling
    dummy_data = dummy_data_factory(seq_len)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/multimodal/profiling.py", line 214, in get_decoder_dummy_data
    ) = self.get_and_validate_mm_inputs(seq_len)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/multimodal/profiling.py", line 163, in get_and_validate_mm_inputs
    mm_inputs = self._get_dummy_mm_inputs(seq_len, mm_counts)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/multimodal/profiling.py", line 140, in _get_dummy_mm_inputs
    return self.processor.apply(
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1613, in apply
    self._validate_mm_placeholders(mm_placeholders, mm_item_counts)
  File "/home/anaconda3/envs/xinference14/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1536, in _validate_mm_placeholders
    raise RuntimeError(
RuntimeError: [address=0.0.0.0:38319, pid=3969381] Expected there to be 100 prompt updates corresponding to 100 image items, but instead found 3 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs, or there is a problem with your implementation of merged multi-modal processor for this model (usually arising from an inconsistency between `_call_hf_processor` and `_get_prompt_updates`).

@DarkLight1337
Member

Can you show your code? How did you send the prompt to the model?

@rattandeep1998

@DarkLight1337 I am also seeing a similar issue:

RuntimeError: Expected there to be 1 prompt updates corresponding to 1 image items, but instead found 0 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs, or there is a problem with your implementation of merged multi-modal processor for this model (usually arising from an inconsistency between `_call_hf_processor` and `_get_prompt_updates`).

I'm using the code here for prompting Gemma 3:

# Gemma 3
def run_gemma3(questions: list[str], modality: str) -> ModelRequestData:
    assert modality == "image"
    model_name = "google/gemma-3-4b-it"

    engine_args = EngineArgs(
        model=model_name,
        max_model_len=2048,
        max_num_seqs=2,
        mm_processor_kwargs={"do_pan_and_scan": True},
        disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
    )

    prompts = [("<bos><start_of_turn>user\n"
                f"<start_of_image>{question}<end_of_turn>\n"
                "<start_of_turn>model\n") for question in questions]

    return ModelRequestData(
        engine_args=engine_args,
        prompts=prompts,
    )
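
For context, prompts like the ones returned above are normally paired with the image data when calling vLLM. A rough, hedged sketch of that usage (not the commenter's actual code; the image path and sampling parameters are placeholders):

from PIL import Image
from vllm import LLM, SamplingParams

# Same engine arguments as in run_gemma3 above.
llm = LLM(
    model="google/gemma-3-4b-it",
    max_model_len=2048,
    max_num_seqs=2,
    mm_processor_kwargs={"do_pan_and_scan": True},
)

prompt = ("<bos><start_of_turn>user\n"
          "<start_of_image>Describe the image.<end_of_turn>\n"
          "<start_of_turn>model\n")
image = Image.open("example.jpg")  # placeholder path

# The prompt must contain exactly one <start_of_image> placeholder per image in
# multi_modal_data; a mismatch produces the "prompt updates" error reported above.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)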

@DarkLight1337
Member

DarkLight1337 commented Mar 28, 2025

Are you directly running the example script or did you adapt it into your own code? If it's the latter case, can you show your code? In particular, what does questions look like and how did you use ModelRequestData?

@rattandeep1998

I found the issue: it was breaking because of how the prompt was written in my code.
I used `textwrap.dedent(prompt).strip()` to remove the extra indentation and clean up the string.
After that, the error stopped appearing.
Thank you, @DarkLight1337!
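
For anyone hitting the same thing, a minimal sketch of that clean-up, assuming a multi-line prompt defined inside indented code (the prompt text here is illustrative, not the original):

import textwrap

# A multi-line string written inside an indented function picks up leading spaces,
# which can break matching of the <start_of_image> placeholder. Dedenting and
# stripping the prompt avoids that.
raw_prompt = """
    <bos><start_of_turn>user
    <start_of_image>Describe the image.<end_of_turn>
    <start_of_turn>model
"""
prompt = textwrap.dedent(raw_prompt).strip()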

@Cloudcatcher888

Cloudcatcher888 commented May 23, 2025

I hit the same problem in verl when calling vLLM:

  File "/mnt/zhangh/wangzhikai/verl/verl/trainer/main_ppo.py", line 64, in main
    run_ppo(config)
  File "/mnt/zhangh/wangzhikai/verl/verl/trainer/main_ppo.py", line 76, in run_ppo
    ray.get(runner.run.remote(config))
  File "/usr/local/conda/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/conda/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/conda/lib/python3.10/site-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=140431, ip=33.202.81.243, actor_id=9905c7938f5aac694dedcd2701000000, repr=<main_ppo.TaskRunner object at 0x7f62e318dc60>)
  File "/mnt/zhangh/wangzhikai/verl/verl/trainer/main_ppo.py", line 183, in run
    trainer.fit()
  File "/mnt/zhangh/wangzhikai/verl/verl/trainer/ppo/ray_trainer.py", line 908, in fit
    gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
  File "/mnt/zhangh/wangzhikai/verl/verl/single_controller/ray/base.py", line 49, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=142937, ip=33.202.81.243, actor_id=a50fead65c3605c4742ae7ba01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f8752abc3d0>)
  File "/mnt/zhangh/wangzhikai/verl/verl/single_controller/ray/base.py", line 466, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/mnt/zhangh/wangzhikai/verl/verl/single_controller/base/decorator.py", line 501, in inner
    return func(*args, **kwargs)
  File "/mnt/zhangh/wangzhikai/verl/verl/workers/fsdp_workers.py", line 614, in generate_sequences
    output = self.rollout.generate_sequences(prompts=prompts)
  File "/mnt/zhangh/wangzhikai/verl/verl/utils/debug/performance.py", line 78, in f
    return self.log(decorated_function, *args, **kwargs)
  File "/mnt/zhangh/wangzhikai/verl/verl/utils/debug/performance.py", line 88, in log
    output = func(*args, **kwargs)
  File "/usr/local/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/zhangh/wangzhikai/verl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 256, in generate_sequences
    outputs = self.inference_engine.generate(
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/utils.py", line 1131, in inner
    return fn(*args, **kwargs)
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 457, in generate
    self._validate_and_add_requests(
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1317, in _validate_and_add_requests
    self._add_request(
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1335, in _add_request
    self.llm_engine.add_request(
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 186, in add_request
    request = self.processor.process_inputs(request_id, prompt, params,
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/v1/engine/processor.py", line 201, in process_inputs
    processed_inputs: ProcessorInputs = self.input_preprocessor.preprocess(
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/inputs/preprocess.py", line 750, in preprocess
    return self._process_decoder_only_prompt(
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/inputs/preprocess.py", line 699, in _process_decoder_only_prompt
    prompt_comps = self._prompt_to_llm_inputs(
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/inputs/preprocess.py", line 347, in _prompt_to_llm_inputs
    return self._process_multimodal(
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/inputs/preprocess.py", line 275, in _process_multimodal
    return mm_processor.apply(prompt, mm_data, mm_processor_kwargs,
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1626, in apply
    self._validate_mm_placeholders(mm_placeholders, mm_item_counts)
  File "/usr/local/conda/lib/python3.10/site-packages/vllm/multimodal/processing.py", line 1535, in _validate_mm_placeholders
    raise RuntimeError(
RuntimeError: Expected there to be 4 prompt updates corresponding to 4 image items, but instead found 3 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs, or there is a problem with your implementation of merged multi-modal processor for this model (usually arising from an inconsistency between `_call_hf_processor` and `_get_prompt_updates`).

The rows of data look like this:

(TaskRunner pid=140431) {'data_source': 'PeijieWang/MV-MATH', 'ability': 'Metric Geometry', 'reward_model': {'ground_truth': '$\\sqrt{2}+1 \\# \\# 1+\\sqrt{2}$', 'style': 'rule'}, 'extra_info': {'analysis': 'Let the side length of the octagon be $a$.\n\nAccording to the problem, we have:\n\n\\[\n4 \\times \\frac{1}{2} \\cdot a \\cdot a + (2a + \\sqrt{2}a)^{2} = 8 + 4\\sqrt{2}\n\\]\n\nSimplifying, we get:\n\n\\[\n2a^{2} + (2a + \\sqrt{2}a)^{2} = 8 + 4\\sqrt{2}\n\\]\n\nExpanding the square:\n\n\\[\n2a^{2} + 4a^{2} + 4\\sqrt{2}a^{2} + 2a^{2} = 8 + 4\\sqrt{2}\n\\]\n\nCombining like terms:\n\n\\[\n8a^{2} + 4\\sqrt{2}a^{2} = 8 + 4\\sqrt{2}\n\\]\n\nFactoring out $a^{2}$:\n\n\\[\na^{2}(8 + 4\\sqrt{2}) = 8 + 4\\sqrt{2}\n\\]\n\nDividing both sides by $(8 + 4\\sqrt{2})$:\n\n\\[\na^{2} = 1\n\\]\n\nSince $a > 0$, we have:\n\n\\[\na = 1\n\\]\n\nTherefore, the length of $AB$ is:\n\n\\[\nAB = a + \\sqrt{2}a = 1 + \\sqrt{2}\n\\]\n\nThus, the answer is:\n\n\\[\n\\boxed{\\sqrt{2} + 1}\n\\]\n\n**Note:** This problem tests knowledge of geometric shapes and their properties, particularly the concept of area conservation when shapes are rearranged. The key to solving it is to set up an equation based on the area before and after the rearrangement, which is a common type of problem in middle school mathematics competitions.', 'difficulty': 'Medium', 'image_relavance': '1', 'is_multi_img': True}, 'images': [<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=299x294 at 0x7F61BC573EE0>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=306x291 at 0x7F61BC573820>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=277x266 at 0x7F61BC5735E0>]}
(TaskRunner pid=140431) {'data_source': 'PeijieWang/MV-MATH', 'ability': 'Descriptive Geometry', 'reward_model': {'ground_truth': 'A', 'style': 'rule'}, 'extra_info': {'analysis': 'Find the shape obtained from the top view, ensuring that all visible edges are represented in the top view.  \nSolution: From the top view, this geometric figure has only one layer and consists of 3 small squares, so the correct choice is A.', 'difficulty': 'Low', 'image_relavance': '1', 'is_multi_img': True}, 'images': [<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=254x195 at 0x7F61BC573700>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=206x80 at 0x7F61BC573280>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=192x128 at 0x7F61BC573160>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=75x129 at 0x7F61BC572EF0>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=205x132 at 0x7F61BC5702E0>]}
(TaskRunner pid=140431) {'data_source': 'PeijieWang/MV-MATH', 'ability': 'Combinatorial Geometry', 'reward_model': {'ground_truth': 'A', 'style': 'rule'}, 'extra_info': {'analysis': 'Solution: Since points \\( E, F, G, H \\) are the midpoints of the sides of rectangle \\( ABCD \\),\n\nTherefore, \\( AH = DH = BF = CF \\), \\( AE = BE = DG = CG \\), and \\( \\angle A = \\angle B = \\angle C = \\angle D = 90^\\circ \\),\n\nThus, triangles \\( \\triangle AEH \\cong \\triangle CGF \\cong \\triangle BEF \\cong \\triangle DGH \\) (by SAS),\n\nHence, \\( EH = EF = FG = GH \\),\n\nTherefore, quadrilateral \\( EFGH \\) is a rhombus, so option A is correct;\n\nSince quadrilateral \\( ABCD \\) is a rectangle,\n\nTherefore, \\( AD \\parallel BC \\),\n\nThus, \\( \\angle ECA = \\angle CAD \\),\n\nSince \\( \\angle CAE = \\angle CAD \\),\n\nTherefore, \\( \\angle CAE = \\angle ECA \\),\n\nHence, \\( EA = EC \\),\n\nIn triangles \\( \\triangle EAC \\) and \\( \\triangle FAC \\),\n\n\\[\n\\left\\{\n\\begin{array}{c}\n\\angle EAC = \\angle CAF \\\\\nAC = AC \\\\\n\\angle ACE = \\angle ACF\n\\end{array}\n\\right.\n\\]\n\nThus, \\( \\triangle EAC \\cong \\triangle FAC \\) (by ASA),\n\nTherefore, \\( AE = AF \\),\n\nHence, \\( AF = EC \\),\n\nSince \\( AF \\parallel EC \\),\n\nTherefore, quadrilateral \\( AECF \\) is a parallelogram,\n\nSince \\( AE = AF \\),\n\nTherefore, quadrilateral \\( AECF \\) is a rhombus, so option B is incorrect;\n\n(3) Since in rectangle \\( ABCD \\), \\( AB = 5 \\), \\( AD = 12 \\),\n\nTherefore, the area of rectangle \\( ABCD = AB \\cdot AD = 60 \\),\n\nAs shown in the figure:\n\n<image_4>\n\nSince \\( \\triangle AEH \\cong \\triangle CGF \\cong \\triangle BEF \\cong \\triangle DGH \\),\n\nTherefore, \\( S_{\\triangle AEH} = \\frac{1}{2} \\times \\frac{1}{2} \\times AB \\times \\frac{1}{2} \\times AD = \\frac{15}{2} \\),\n\nThus, the area of rhombus \\( EFGH = 60 - 4 \\times \\frac{15}{2} = 30 \\);\n\nAs shown in the figure:\n\n<image_5>\n\nLet \\( BE = x \\), then \\( AE = CE = BC - BE = 12 - x \\), in right triangle \\( \\triangle ABE \\), by the Pythagorean theorem:\n\n\\[\nx^{2} + 5^{2} = (12 - x)^{2},\n\\]\n\nSolving gives \\( x = \\frac{119}{24} \\),\n\nTherefore, \\( CE = 12 - \\frac{119}{24} = \\frac{169}{24} \\),\n\nThus, the area of rhombus \\( AECF = CE \\cdot AB = \\frac{169}{24} \\times 5 \\approx 35.21 \\),\n\nTherefore, the area of rhombus \\( EFGH \\) < the area of rhombus \\( AECF \\),\n\nHence, options C and D are both incorrect.\n\nTherefore, the correct answer is: A.\n\n【Key Insight】This question examines the determination and properties of a rhombus, the properties of a rectangle, the transformation of folding, the determination and properties of congruent triangles, and the Pythagorean theorem. The key to solving this problem lies in mastering the determination and properties of a rhombus.', 'difficulty': 'High', 'image_relavance': '1', 'is_multi_img': True}, 'images': [<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=418x243 at 0x7F61BC573B80>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=417x268 at 0x7F61BC573010>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=411x268 at 0x7F61BC5714E0>]}
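
Given rows like these, one hedged sanity check (my own sketch, not from verl; it assumes the Qwen2.5-VL chat template renders each image as a `<|vision_start|><|image_pad|><|vision_end|>` block) is to confirm that the rendered prompt contains one image placeholder per attached image before handing the batch to vLLM:

def check_image_placeholders(prompt: str, images: list) -> None:
    # Count Qwen2.5-VL image placeholders in the rendered prompt (token name assumed)
    # and compare against the number of PIL images attached to the row.
    num_placeholders = prompt.count("<|image_pad|>")
    if num_placeholders != len(images):
        raise ValueError(
            f"{len(images)} images but {num_placeholders} image placeholders in the prompt"
        )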

The launch shell script looks like this:

ENGINE=${1:-vllm}
# If you are using vllm<=0.6.3, you might need to set the following environment variable to avoid bugs:
# export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/mnt/zhangh/wangzhikai/verl/wzk_data/MV_MATH_train.parquet \
    data.val_files=/mnt/zhangh/wangzhikai/verl/wzk_data/MV_MATH_test.parquet \
    data.train_batch_size=64 \
    data.max_prompt_length=4096 \
    data.max_response_length=2048 \
    data.filter_overlong_prompts=False \
    data.truncation='error' \
    data.image_key=images \
    actor_rollout_ref.model.path=/mnt/zhangh/wangzhikai/9cdata/huggingface_model/hub/Qwen2.5-VL-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=16 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.01 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=$ENGINE \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.n=2 \
    +actor_rollout_ref.rollout.limit_images=20 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    custom_reward_function.path=/mnt/zhangh/wangzhikai/verl/wzk_data/custom_reward.py \
    custom_reward_function.name=my_reward_fn \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_grpo_example_mvmath' \
    trainer.experiment_name='qwen2_5_vl_7b_function_rm' \
    trainer.n_gpus_per_node=2 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=5 \
    trainer.total_epochs=15 $@
