Motivation.
Currently, there appears to be no mechanism in vLLM to reject incoming requests based on the waiting queue length. Instead, all incoming requests are added to the queue. The waiting queue is implemented as an unbounded deque residing in CPU memory, where each element represents a pending request. In scenarios of service overload or when using low-performance GPUs, the queue length may grow indefinitely if users do not actively cancel their requests. This can lead to excessive memory consumption and eventually result in an out-of-memory (OOM) failure.
Waiting queue: vllm/vllm/v1/core/sched/scheduler.py, lines 90 to 92 at commit 6825d9a.
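For context, a paraphrase of what those lines define (a sketch, not the exact source at that commit):

from collections import deque

class Scheduler:
    def __init__(self) -> None:
        # Pending requests wait here, in host (CPU) memory. Nothing
        # bounds the deque, so under sustained overload it grows
        # without limit.
        self.waiting: deque = deque()
        self.requests: dict = {}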
To address this issue, we propose introducing a new mechanism to control the maximum length of the waiting queue. Once the queue reaches a specified threshold, any new incoming requests will be rejected immediately with an HTTP 503 (Service Unavailable) response.
Additionally, at a higher level, such as in Kubernetes, we can use Istio, Envoy, and other tools to perform load balancing based on whether the request returns a 503 status code.
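The same signal can also be used at the application level; a minimal sketch of what a caller might do (endpoint, port, and payload are illustrative):

import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # illustrative endpoint
    json={"model": "my-model", "prompt": "Hello"},
)
if resp.status_code == 503:
    # The replica's waiting queue is full; back off or retry
    # against another backend.
    print("Service unavailable, retrying elsewhere...")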
Proposed Change.

Introduce a parameter --max-waiting-queue-length. When the scheduler attempts to add a request to the waiting queue, it first checks whether the queue has reached its maximum length. If the queue is full, the request is rejected with an HTTP 503 error. See vllm/vllm/v1/core/sched/scheduler.py, lines 842 to 846 at commit 6825d9a.
For example:

def add_request(self, request: Request, dummy=False) -> None:
    """Adds a request to the waiting queue."""
    if (self.max_waiting_queue_length
            and len(self.waiting) >= self.max_waiting_queue_length):
        raise SchedulerWaitingQueueFullError(
            f"Scheduler waiting queue is full ({len(self.waiting)} >= "
            f"{self.max_waiting_queue_length}). "
            f"Cannot add request {request.request_id}.")
    self.waiting.append(request)
    self.requests[request.request_id] = request
    if self.log_stats:
        request.record_event(EngineCoreEventType.QUEUED)
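The flag itself could be plumbed through the server's CLI; a standalone sketch of the intended semantics (hypothetical wiring, not vLLM's actual EngineArgs code):

from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument(
    "--max-waiting-queue-length",
    type=int,
    default=None,  # None preserves today's behavior: an unbounded queue.
    help="Reject new requests with HTTP 503 once this many requests "
         "are already waiting in the scheduler queue.",
)
args = parser.parse_args(["--max-waiting-queue-length", "512"])
assert args.max_waiting_queue_length == 512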
When the waiting queue is full and the scheduler attempts to add a new request to it, a SchedulerWaitingQueueFullError is raised. The EngineCore sends this error to the API server via ZMQ. Upon detecting this error, the API server sets the HTTP status code to 503.
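On the API-server side, the final mapping step could look like this; a minimal sketch assuming the error is surfaced to the (FastAPI-based) server as an exception it can match on, with the handler wiring hypothetical:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

class SchedulerWaitingQueueFullError(Exception):
    """Raised by the scheduler when the waiting queue is at capacity."""

@app.exception_handler(SchedulerWaitingQueueFullError)
async def handle_queue_full(request: Request,
                            exc: SchedulerWaitingQueueFullError):
    # Translate the scheduler-side rejection into HTTP 503 so upstream
    # load balancers can route traffic away from this replica.
    return JSONResponse(status_code=503, content={"error": str(exc)})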
Feedback Period.

No response
CC List.
@DarkLight1337 @simon-mo @njhill
Any Other Things.
There is already a get_request_counts method in the scheduler. Perhaps we can reuse this interface.
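If get_request_counts reports the number of running and waiting requests, the limit check could be built on top of it instead of inspecting self.waiting directly; a sketch with an assumed (num_running, num_waiting) return value, which should be verified against the actual method:

def queue_is_full(scheduler, max_waiting_queue_length):
    # Hypothetical helper; assumes scheduler.get_request_counts()
    # returns a (num_running, num_waiting) tuple.
    if max_waiting_queue_length is None:
        return False
    _, num_waiting = scheduler.get_request_counts()
    return num_waiting >= max_waiting_queue_length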
In DP (Data Parallel) mode, there are multiple EngineCore processes, each maintaining its own waiting queue. Maybe this can be handled in the DP Coordinator? (Just a guess)

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.