[Bug]: RuntimeError: Engine loop has died with larger context lengths (>32k) #10002


Closed
sam-huang1223 opened this issue Nov 4, 2024 · 6 comments
Labels: bug (Something isn't working), stale (Over 90 days of inactivity)


@sam-huang1223

Your current environment

Running vLLM v0.6.3 via k8s (EKS) on g6e.12xlarge instances (AWS GPU AMI) with a Llama-based model (72B params, FP8 weight + activation quantized).

Model Input Dumps

No response

🐛 Describe the bug

Even with

    VLLM_WORKER_MULTIPROC_METHOD: "spawn"
    VLLM_LOGGING_LEVEL: "DEBUG"
    VLLM_TRACE_FUNCTION: "1"
    NCCL_DEBUG: "TRACE"

I could not collect more logs than:

ERROR 11-04 12:53:08 client.py:250] RuntimeError('Engine loop has died')
ERROR 11-04 12:53:08 client.py:250] Traceback (most recent call last):
ERROR 11-04 12:53:08 client.py:250]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 150, in run_heartbeat_loop
ERROR 11-04 12:53:08 client.py:250]     await self._check_success(
ERROR 11-04 12:53:08 client.py:250]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/client.py", line 314, in _check_success
ERROR 11-04 12:53:08 client.py:250]     raise response
ERROR 11-04 12:53:08 client.py:250] RuntimeError: Engine loop has died
INFO:     10.9.147.84:47210 - "GET /metrics HTTP/1.1" 200 OK
INFO:     10.9.147.84:47210 - "GET /metrics HTTP/1.1" 200 OK
INFO:     10.9.147.84:47210 - "GET /metrics HTTP/1.1" 200 OK
CRITICAL 11-04 12:53:11 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO:     10.9.150.232:38400 - "GET /health HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for connections to close. (CTRL+C to force quit)

It was working immediately before with an 18k-token context prompt, but failed with a 38k-token context. I would appreciate some pointers on how to debug this further.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
sam-huang1223 added the bug label Nov 4, 2024
@sam-huang1223
Author

vLLM args:

    - --kv-cache-dtype=fp8
    - --max-num-seqs=128
    - --max-num-batched-tokens=128000
    - --max-model-len=128000
    - --max-seq-len-to-capture=128000
    - --tensor-parallel-size=4
    - --enable-chunked-prefill
    - --enable-prefix-caching
    - --gpu-memory-utilization=0.9

@sam-huang1223
Author

It seems like we can work around this issue by using a non-FP8 model; however, a basic quantization config shouldn't be causing issues like this:

from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
  targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

@jikunshang
Contributor

Maybe this is because some steps are extremely slow, so MQLLMEngineClient doesn't get a response for a while and throws this error. Please try increasing VLLM_RPC_TIMEOUT (the default value is 10000).
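
For example, in a Kubernetes deployment this could be set alongside the debug variables shown above; the 60000 below is only an illustrative value:

    VLLM_RPC_TIMEOUT: "60000"   # milliseconds; vLLM default is 10000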

@Leon-Sander

> Maybe this is because some steps are extremely slow, so MQLLMEngineClient doesn't get a response for a while and throws this error. Please try increasing VLLM_RPC_TIMEOUT (the default value is 10000).

This solved it for me

@github-actions

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Feb 12, 2025
@github-actions

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions bot closed this as not planned Mar 14, 2025