[Bug]: RuntimeError: Engine loop has died with larger context lengths (>32k) #10002

sam-huang1223 · 2024-11-04T21:00:39Z

Your current environment

Running vLLM v0.6.3 on Kubernetes (EKS), on g6e.12xlarge instances (AWS GPU AMI), with a Llama-based model (72B params, FP8 weight and activation quantization).

Model Input Dumps

No response

🐛 Describe the bug

Even with the following environment variables set:

    VLLM_WORKER_MULTIPROC_METHOD: "spawn"
    VLLM_LOGGING_LEVEL: "DEBUG"
    VLLM_TRACE_FUNCTION: "1"
    NCCL_DEBUG: "TRACE"

I could not collect any more logs than this:

    ERROR 11-04 12:53:08 client.py:250] RuntimeError('Engine loop has died')
    ERROR 11-04 12:53:08 client.py:250] Traceback (most recent call last):
    ERROR 11-04 12:53:08 client.py:250]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 150, in run_heartbeat_loop
    ERROR 11-04 12:53:08 client.py:250]     await self._check_success(
    ERROR 11-04 12:53:08 client.py:250]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 314, in _check_success
    ERROR 11-04 12:53:08 client.py:250]     raise response
    ERROR 11-04 12:53:08 client.py:250] RuntimeError: Engine loop has died
    INFO:     10.9.147.84:47210 - "GET /metrics HTTP/1.1" 200 OK
    INFO:     10.9.147.84:47210 - "GET /metrics HTTP/1.1" 200 OK
    INFO:     10.9.147.84:47210 - "GET /metrics HTTP/1.1" 200 OK
    CRITICAL 11-04 12:53:11 launcher.py:72] AsyncLLMEngine has failed, terminating server process
    INFO:     10.9.150.232:38400 - "GET /health HTTP/1.1" 500 Internal Server Error
    INFO:     Shutting down
    INFO:     Waiting for connections to close. (CTRL+C to force quit)

It was working immediately before with an 18k-token prompt, then failed with a 38k-token prompt. I would appreciate some pointers on how to debug this further.
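One way to narrow down where between 18k and 38k tokens the failure starts is a binary search over prompt length. The sketch below is only an illustration: `probe` is a hypothetical callable (not part of vLLM) that would send a prompt of roughly `n` tokens to the server and return `True` on success, and it assumes failures are monotonic in prompt length. Since the engine dies on failure, a real probe would also need to restart the server between attempts.

```python
from typing import Callable


def find_failure_threshold(
    lo: int,
    hi: int,
    probe: Callable[[int], bool],
    tolerance: int = 512,
) -> int:
    """Return an approximate smallest prompt length at which probe() fails.

    Assumes probe(lo) succeeds, probe(hi) fails, and that every length
    above the threshold also fails.
    """
    while hi - lo > tolerance:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid  # this length still works, search higher
        else:
            hi = mid  # this length fails, search lower
    return hi


# Stand-in probe for illustration only: pretends anything above 32,000
# tokens kills the engine. Replace with a real request to the server.
def fake_probe(n: int) -> bool:
    return n <= 32_000


threshold = find_failure_threshold(18_000, 38_000, fake_probe)
print(f"failure starts at or below roughly {threshold} tokens")
```

The starting bounds come from the two prompt lengths in the report; the 32,000 cutoff in `fake_probe` is made up for the demonstration.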

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.


sam-huang1223 · 2024-11-04T21:03:14Z

vLLM args:

    - --kv-cache-dtype=fp8
    - --max-num-seqs=128
    - --max-num-batched-tokens=128000
    - --max-model-len=128000
    - --max-seq-len-to-capture=128000
    - --tensor-parallel-size=4
    - --enable-chunked-prefill
    - --enable-prefix-caching
    - --gpu-memory-utilization=0.9
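As a rough aid for reading these args: with chunked prefill enabled, `--max-num-batched-tokens` acts as the per-step token budget, so a prompt shorter than the budget can be prefilled in a single engine step. The sketch below is a simplification under stated assumptions: it ignores decode tokens from other running sequences sharing the budget and any prefix-cache hits, and the 8,192 value is just a hypothetical alternative for comparison. Whether a smaller budget avoids the error was not tested in this thread.

```python
import math


def prefill_steps(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Rough number of engine steps needed to prefill a single prompt.

    Simplification: assumes the whole per-step token budget is available
    to this one prompt.
    """
    return math.ceil(prompt_tokens / max_num_batched_tokens)


# With the budget from the args above, both prompts fit in one step.
print(prefill_steps(18_000, 128_000))
print(prefill_steps(38_000, 128_000))

# With a hypothetical smaller budget, the 38k prompt is split across steps.
print(prefill_steps(38_000, 8_192))
```

Under these assumptions, the 38k-token prompt is processed as one long step with the current settings, so each individual step takes longer than it would with a smaller budget.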

sam-huang1223 · 2024-11-05T14:25:41Z

It seems we can work around this issue by using a non-FP8 model. However, a basic quantization config like this one shouldn't be causing issues like this:

    from llmcompressor.modifiers.quantization import QuantizationModifier

    # Configure the simple PTQ quantization: FP8 dynamic scheme on all
    # Linear layers, leaving lm_head unquantized.
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )

jikunshang · 2024-11-07T03:10:43Z

Maybe this is because some steps are extremely slow: MQLLMEngineClient doesn't get a response for a while, so it throws this error. Please try increasing VLLM_RPC_TIMEOUT (default value is 10000).
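A minimal sketch of applying this suggestion when launching the server from Python. `VLLM_RPC_TIMEOUT` is read from the environment and is understood to be in milliseconds; the helper name `env_with_rpc_timeout` and the 60000 value are illustrative assumptions, not a recommended setting.

```python
import os
from typing import Mapping, Optional


def env_with_rpc_timeout(
    timeout_ms: int, base: Optional[Mapping[str, str]] = None
) -> dict:
    """Return a copy of the environment with VLLM_RPC_TIMEOUT set.

    timeout_ms: new timeout (assumed to be in milliseconds).
    base: environment to copy; defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["VLLM_RPC_TIMEOUT"] = str(timeout_ms)  # env values must be strings
    return env


# Illustrative value only: six times the 10000 default mentioned above.
env = env_with_rpc_timeout(60_000)
print(env["VLLM_RPC_TIMEOUT"])

# The resulting mapping can then be passed as env=... to whatever
# subprocess call starts the vLLM server.
```

In a Kubernetes deployment like the one in the report, the same variable would go alongside the other environment entries, for example `VLLM_RPC_TIMEOUT: "60000"`.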

Leon-Sander · 2024-11-13T17:41:08Z

> Maybe this is because some steps are extremely slow: MQLLMEngineClient doesn't get a response for a while, so it throws this error. Please try increasing VLLM_RPC_TIMEOUT (default value is 10000).

This solved it for me.

github-actions · 2025-02-12T01:58:38Z

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions · 2025-03-14T02:02:23Z

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

sam-huang1223 added the "bug" label (Something isn't working) on Nov 4, 2024.

Leon-Sander mentioned this issue on Nov 13, 2024 in "[Usage]: Adaptive Batching and number of concurrent requests #10269" (closed).

github-actions bot added the "stale" label (Over 90 days of inactivity) on Feb 12, 2025.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Mar 14, 2025.
