[Bug]: vLLM CPU with PyTorch 2.7.0 crashes with RuntimeError: "reshape_and_cache_cpu_impl" not implemented for 'Half' #17225

Open

huydhn opened this issue Apr 26, 2025 · 3 comments · May be fixed by #18430
Labels
bug Something isn't working

Comments

@huydhn
Contributor

huydhn commented Apr 26, 2025

Your current environment

The output of `python collect_env.py`
ERROR! Intel® Extension for PyTorch* needs to work with PyTorch 2.6.*, but PyTorch 2.7.0+cpu is found. Please switch to the matching version and run again.
INFO 04-26 09:26:16 [__init__.py:239] Automatically detected platform cpu.
Collecting environment information...
PyTorch version: 2.7.0+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.12.10 (main, Apr  9 2025, 04:03:51) [Clang 20.1.0 ] (64-bit runtime)
Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               16
On-line CPU(s) list:                  0-15
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   8
Socket(s):                            1
Stepping:                             4
BogoMIPS:                             5999.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            256 KiB (8 instances)
L1i cache:                            256 KiB (8 instances)
L2 cache:                             8 MiB (8 instances)
L3 cache:                             24.8 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-15
Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.6.0
[pip3] numpy==2.2.5
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0+cpu
[pip3] torchaudio==2.7.0+cpu
[pip3] torchvision==0.22.0+cpu
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.1.dev6078+g06b877e.d20250426 (git sha: 06b877e, date: 20250426)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

🐛 Describe the bug

This issue was discovered while updating PyTorch to its latest 2.7.0 release. When serving the example model facebook/opt-125m on CPU, the server crashes with the following error:

ERROR! Intel® Extension for PyTorch* needs to work with PyTorch 2.6.*, but PyTorch 2.7.0+cpu is found. Please switch to the matching version and run again.
INFO 04-26 04:06:06 [__init__.py:239] Automatically detected platform cpu.
INFO 04-26 04:06:13 [api_server.py:1043] vLLM API server version 0.1.dev6077+g1a4cc8c.d20250426
INFO 04-26 04:06:13 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='facebook/opt-125m', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='facebook/opt-125m', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=None, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7fccec77bce0>)
INFO 04-26 04:06:23 [config.py:716] This model supports multiple tasks: {'classify', 'embed', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
WARNING 04-26 04:06:23 [arg_utils.py:1688] device type=cpu is not supported by the V1 Engine. Falling back to V0.
INFO 04-26 04:06:23 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 04-26 04:06:23 [cpu.py:106] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 04-26 04:06:23 [cpu.py:119] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 04-26 04:06:23 [api_server.py:246] Started engine process with PID 49
ERROR! Intel® Extension for PyTorch* needs to work with PyTorch 2.6.*, but PyTorch 2.7.0+cpu is found. Please switch to the matching version and run again.
INFO 04-26 04:06:26 [__init__.py:239] Automatically detected platform cpu.
INFO 04-26 04:06:29 [llm_engine.py:242] Initializing a V0 LLM engine (v0.1.dev6077+g1a4cc8c.d20250426) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 04-26 04:06:29 [cpu.py:45] Using Torch SDPA backend.
ERROR! Intel® Extension for PyTorch* needs to work with PyTorch 2.6.*, but PyTorch 2.7.0+cpu is found. Please switch to the matching version and run again.
INFO 04-26 04:06:29 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-26 04:06:29 [weight_utils.py:265] Using model weights format ['*.bin']
INFO 04-26 04:06:30 [weight_utils.py:281] Time spent downloading weights for facebook/opt-125m: 0.978391 seconds
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  8.88it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  8.86it/s]

INFO 04-26 04:06:30 [loader.py:458] Loading weights took 0.11 seconds
INFO 04-26 04:06:30 [executor_base.py:112] # cpu blocks: 910, # CPU blocks: 0
INFO 04-26 04:06:30 [executor_base.py:117] Maximum concurrency for 2048 tokens per request: 56.88x
INFO 04-26 04:06:31 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 0.42 seconds
INFO 04-26 04:06:31 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-26 04:06:31 [launcher.py:28] Available routes are:
INFO 04-26 04:06:31 [launcher.py:36] Route: /openapi.json, Methods: GET, HEAD
INFO 04-26 04:06:31 [launcher.py:36] Route: /docs, Methods: GET, HEAD
INFO 04-26 04:06:31 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-26 04:06:31 [launcher.py:36] Route: /redoc, Methods: GET, HEAD
INFO 04-26 04:06:31 [launcher.py:36] Route: /health, Methods: GET
INFO 04-26 04:06:31 [launcher.py:36] Route: /load, Methods: GET
INFO 04-26 04:06:31 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 04-26 04:06:31 [launcher.py:36] Route: /version, Methods: GET
INFO 04-26 04:06:31 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /pooling, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /score, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /rerank, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /invocations, Methods: POST
INFO 04-26 04:06:31 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [10]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO 04-26 04:07:15 [logger.py:39] Received request cmpl-06b4b51b5ef149d0982220686fabc781-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [2, 16033, 2659, 16, 10], lora_request: None, prompt_adapter_request: None.
INFO 04-26 04:07:15 [engine.py:310] Added request cmpl-06b4b51b5ef149d0982220686fabc781-0.
INFO:     127.0.0.1:52734 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 04-26 04:07:15 [engine.py:160] RuntimeError('"reshape_and_cache_cpu_impl" not implemented for \'Half\'')
ERROR 04-26 04:07:15 [engine.py:160] Traceback (most recent call last):
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 158, in start
ERROR 04-26 04:07:15 [engine.py:160]     self.run_engine_loop()
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 221, in run_engine_loop
ERROR 04-26 04:07:15 [engine.py:160]     request_outputs = self.engine_step()
ERROR 04-26 04:07:15 [engine.py:160]                       ^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 247, in engine_step
ERROR 04-26 04:07:15 [engine.py:160]     raise e
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 230, in engine_step
ERROR 04-26 04:07:15 [engine.py:160]     return self.engine.step()
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 1419, in step
ERROR 04-26 04:07:15 [engine.py:160]     outputs = self.model_executor.execute_model(
ERROR 04-26 04:07:15 [engine.py:160]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 299, in execute_model
ERROR 04-26 04:07:15 [engine.py:160]     driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 04-26 04:07:15 [engine.py:160]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model
ERROR 04-26 04:07:15 [engine.py:160]     return self.driver_worker.execute_model(execute_model_req)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 04-26 04:07:15 [engine.py:160]     output = self.model_runner.execute_model(
ERROR 04-26 04:07:15 [engine.py:160]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-26 04:07:15 [engine.py:160]     return func(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/worker/cpu_model_runner.py", line 659, in execute_model
ERROR 04-26 04:07:15 [engine.py:160]     hidden_states = model_executable(
ERROR 04-26 04:07:15 [engine.py:160]                     ^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 04-26 04:07:15 [engine.py:160]     return self._call_impl(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 04-26 04:07:15 [engine.py:160]     return forward_call(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/opt.py", line 390, in forward
ERROR 04-26 04:07:15 [engine.py:160]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-26 04:07:15 [engine.py:160]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-26 04:07:15 [engine.py:160]     return self.forward(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/opt.py", line 310, in forward
ERROR 04-26 04:07:15 [engine.py:160]     return self.decoder(input_ids,
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 04-26 04:07:15 [engine.py:160]     return self._call_impl(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 04-26 04:07:15 [engine.py:160]     return forward_call(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/opt.py", line 271, in forward
ERROR 04-26 04:07:15 [engine.py:160]     hidden_states = layer(hidden_states)
ERROR 04-26 04:07:15 [engine.py:160]                     ^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 04-26 04:07:15 [engine.py:160]     return self._call_impl(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 04-26 04:07:15 [engine.py:160]     return forward_call(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/opt.py", line 170, in forward
ERROR 04-26 04:07:15 [engine.py:160]     hidden_states = self.self_attn(hidden_states=hidden_states)
ERROR 04-26 04:07:15 [engine.py:160]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 04-26 04:07:15 [engine.py:160]     return self._call_impl(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 04-26 04:07:15 [engine.py:160]     return forward_call(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/opt.py", line 112, in forward
ERROR 04-26 04:07:15 [engine.py:160]     attn_output = self.attn(q, k, v)
ERROR 04-26 04:07:15 [engine.py:160]                   ^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 04-26 04:07:15 [engine.py:160]     return self._call_impl(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 04-26 04:07:15 [engine.py:160]     return forward_call(*args, **kwargs)
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/attention/layer.py", line 233, in forward
ERROR 04-26 04:07:15 [engine.py:160]     return torch.ops.vllm.unified_attention(
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
ERROR 04-26 04:07:15 [engine.py:160]     return self._op(*args, **(kwargs or {}))
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/attention/layer.py", line 379, in unified_attention
ERROR 04-26 04:07:15 [engine.py:160]     output = self.impl.forward(self, query, key, value, kv_cache,
ERROR 04-26 04:07:15 [engine.py:160]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/attention/backends/torch_sdpa.py", line 510, in forward
ERROR 04-26 04:07:15 [engine.py:160]     PagedAttention.write_to_paged_cache(
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/attention/ops/ipex_attn.py", line 62, in write_to_paged_cache
ERROR 04-26 04:07:15 [engine.py:160]     ops.reshape_and_cache(
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/vllm/_custom_ops.py", line 1343, in reshape_and_cache
ERROR 04-26 04:07:15 [engine.py:160]     torch.ops._C_cache_ops.reshape_and_cache(key, value, key_cache,
ERROR 04-26 04:07:15 [engine.py:160]   File "/opt/venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
ERROR 04-26 04:07:15 [engine.py:160]     return self._op(*args, **(kwargs or {}))
ERROR 04-26 04:07:15 [engine.py:160]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-26 04:07:15 [engine.py:160] RuntimeError: "reshape_and_cache_cpu_impl" not implemented for 'Half'
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [10]

This might be related to #11327 (comment). On the other hand, float32 seems fine: `vllm serve facebook/opt-125m --dtype float32 --block-size 16` works.
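
For reference, a minimal repro-vs-workaround sketch, assuming the same CPU build as in the logs above (with `dtype='auto'`, the engine config resolves to `dtype=torch.float16` for this model, which is what hits the missing Half kernel):

```shell
# Crashes on the first /v1/completions request with the Half-kernel RuntimeError:
vllm serve facebook/opt-125m

# Works: forces the float32 path of the CPU cache kernel:
vllm serve facebook/opt-125m --dtype float32 --block-size 16
```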

I'm not exactly sure whether this is related to the fact that XPU will be updated later, per the comment on #16859 (comment).

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
huydhn added the bug label Apr 26, 2025
@bigPYJ1151
Contributor

@huydhn Thanks for your help with the update!

I think this problem is due to the ipex version, as shown in the first line of the log.

In fact, I'm afraid the CPU backend has to skip torch 2.7 because of a performance issue on x86 CPUs: torch 2.7 disabled an optimized random generator due to accuracy problems (PR), and the fix PR hasn't been merged into 2.7; it targets 2.8.
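
A quick, hedged way to confirm the version mismatch described above (on this build, importing IPEX is what emits the "ERROR! Intel® Extension for PyTorch* needs to work with PyTorch 2.6.*" line seen at the top of the logs):

```shell
# Print the installed torch / IPEX pairing; they must share the same minor version.
python -c "import torch; print(torch.__version__)"                               # 2.7.0+cpu here
python -c "import intel_extension_for_pytorch as ipex; print(ipex.__version__)"  # 2.6.0 here -> mismatch
```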

@zzzyq

zzzyq commented Apr 29, 2025

My torch version is 2.6.0+cpu and I'm running on x86, but I'm also seeing this error.

@Thiago-Reis-Porto

Thiago-Reis-Porto commented May 8, 2025

Hi! I had the same problem; I solved it by changing the intel_extension_for_pytorch version in Dockerfile.cpu from 2.6.0 to 2.7.0.
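
For anyone applying the same fix outside Dockerfile.cpu, a hedged equivalent for a local environment (the exact pin and package index may differ depending on the vLLM CPU install instructions for your version):

```shell
# Align IPEX with torch 2.7.0+cpu, matching the Dockerfile.cpu change above.
pip install intel_extension_for_pytorch==2.7.0
```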
