Motivation.
Currently, there appears to be no mechanism in vLLM to reject incoming requests based on the waiting queue length. Instead, all incoming requests are added to the queue. The waiting queue is implemented as an unbounded deque residing in CPU memory, where each element represents a pending request. In scenarios of service overload or when using low-performance GPUs, the queue length may grow indefinitely if users do not actively cancel their requests. This can lead to excessive memory consumption and eventually result in an out-of-memory (OOM) failure.
Waiting queue: vllm/vllm/v1/core/sched/scheduler.py, lines 90 to 92 at commit 6825d9a.
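For context, a paraphrase of what those lines define (a sketch, not the exact source at that commit):

from collections import deque

class Scheduler:
    def __init__(self) -> None:
        # Pending requests wait here, in host (CPU) memory. Nothing
        # bounds the deque, so under sustained overload it grows
        # without limit.
        self.waiting: deque = deque()
        self.requests: dict = {}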
To address this issue, we propose introducing a new mechanism to control the maximum length of the waiting queue. Once the queue reaches a specified threshold, any new incoming requests will be rejected immediately with an HTTP 503 (Service Unavailable) response.
Additionally, at a higher level, such as in Kubernetes, we can use Istio, Envoy, and other tools to perform load balancing based on whether the request returns a 503 status code.
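The same signal can also be used at the application level; a minimal sketch of what a caller might do (endpoint, port, and payload are illustrative):

import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # illustrative endpoint
    json={"model": "my-model", "prompt": "Hello"},
)
if resp.status_code == 503:
    # The replica's waiting queue is full; back off or retry
    # against another backend.
    print("Service unavailable, retrying elsewhere...")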
Proposed Change.

Introduce a parameter --max-waiting-queue-length. When the scheduler attempts to add a request to the waiting queue, it first checks whether the queue has reached its maximum length. If the queue is full, the request is rejected with an HTTP 503 error. See vllm/vllm/v1/core/sched/scheduler.py, lines 842 to 846 at commit 6825d9a.
For example:

def add_request(self, request: Request, dummy=False) -> None:
    """Adds a request to the waiting queue."""
    if (self.max_waiting_queue_length
            and len(self.waiting) >= self.max_waiting_queue_length):
        raise SchedulerWaitingQueueFullError(
            f"Scheduler waiting queue is full ({len(self.waiting)} >= "
            f"{self.max_waiting_queue_length}). "
            f"Cannot add request {request.request_id}.")
    self.waiting.append(request)
    self.requests[request.request_id] = request
    if self.log_stats:
        request.record_event(EngineCoreEventType.QUEUED)
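The flag itself could be plumbed through the server's CLI; a standalone sketch of the intended semantics (hypothetical wiring, not vLLM's actual EngineArgs code):

from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument(
    "--max-waiting-queue-length",
    type=int,
    default=None,  # None preserves today's behavior: an unbounded queue.
    help="Reject new requests with HTTP 503 once this many requests "
         "are already waiting in the scheduler queue.",
)
args = parser.parse_args(["--max-waiting-queue-length", "512"])
assert args.max_waiting_queue_length == 512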
When the waiting queue is full and the scheduler attempts to add a new request to it, a SchedulerWaitingQueueFullError is raised. The EngineCore sends this error to the API server via ZMQ. Upon detecting this error, the API server sets the HTTP status code to 503.
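On the API-server side, the final mapping step could look like this; a minimal sketch assuming the error is surfaced to the (FastAPI-based) server as an exception it can match on, with the handler wiring hypothetical:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

class SchedulerWaitingQueueFullError(Exception):
    """Raised by the scheduler when the waiting queue is at capacity."""

@app.exception_handler(SchedulerWaitingQueueFullError)
async def handle_queue_full(request: Request,
                            exc: SchedulerWaitingQueueFullError):
    # Translate the scheduler-side rejection into HTTP 503 so upstream
    # load balancers can route traffic away from this replica.
    return JSONResponse(status_code=503, content={"error": str(exc)})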
Feedback Period.

No response
CC List.
@DarkLight1337 @simon-mo @njhill
Any Other Things.
There is already a get_request_counts method in the scheduler. Perhaps we can reuse this interface.
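If get_request_counts reports the number of running and waiting requests, the limit check could be built on top of it instead of inspecting self.waiting directly; a sketch with an assumed (num_running, num_waiting) return value, which should be verified against the actual method:

def queue_is_full(scheduler, max_waiting_queue_length):
    # Hypothetical helper; assumes scheduler.get_request_counts()
    # returns a (num_running, num_waiting) tuple.
    if max_waiting_queue_length is None:
        return False
    _, num_waiting = scheduler.get_request_counts()
    return num_waiting >= max_waiting_queue_length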
In DP (Data Parallel) mode, there are multiple EngineCore processes, each maintaining its own waiting queue. Maybe this can be handled in the DP Coordinator? (Just a guess)

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.