
[RFC]: Controlling the maximum length of the waiting queue #18826

chaunceyjiang opened this issue May 28, 2025 · 0 comments

Motivation.

Currently, there appears to be no mechanism in vLLM to reject incoming requests based on the waiting queue length. Instead, all incoming requests are added to the queue. The waiting queue is implemented as an unbounded deque residing in CPU memory, where each element represents a pending request. In scenarios of service overload or when using low-performance GPUs, the queue length may grow indefinitely if users do not actively cancel their requests. This can lead to excessive memory consumption and eventually result in an out-of-memory (OOM) failure.

Waiting Queue:

    # Priority queues for requests.
    self.waiting: deque[Request] = deque()
    self.running: list[Request] = []

To address this issue, we propose introducing a new mechanism to control the maximum length of the waiting queue. Once the queue reaches a specified threshold, any new incoming requests will be rejected immediately with an HTTP 503 (Service Unavailable) response.

Additionally, at a higher level (for example, in Kubernetes), tools such as Istio and Envoy can use the 503 responses as a signal to load-balance or retry requests on other replicas.

Proposed Change.

Introduce a parameter --max-waiting-queue-length. When the scheduler attempts to add a request to the waiting queue, it first checks whether the queue has reached its maximum length. If the queue is full, the request is rejected with an HTTP 503 error.
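
A minimal sketch of how the flag could be registered, using plain argparse for illustration (vLLM's actual engine-argument plumbing is not shown, and the choice of None as the "unbounded" default is an assumption of this proposal):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--max-waiting-queue-length",
        type=int,
        default=None,  # assumption: None preserves today's unbounded behavior
        help="Maximum number of requests allowed in the scheduler's waiting "
             "queue. Requests arriving while the queue is full are rejected "
             "with HTTP 503.")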

For reference, the current implementation of add_request:

    def add_request(self, request: Request) -> None:
        self.waiting.append(request)
        self.requests[request.request_id] = request
        if self.log_stats:
            request.record_event(EngineCoreEventType.QUEUED)

For example:

    def add_request(self, request: Request) -> None:
        """Adds a request to the waiting queue.

        Raises SchedulerWaitingQueueFullError if the queue has reached
        its configured maximum length.
        """
        if self.max_waiting_queue_length and \
                len(self.waiting) >= self.max_waiting_queue_length:
            raise SchedulerWaitingQueueFullError(
                f"Scheduler waiting queue is full ({len(self.waiting)} >= "
                f"{self.max_waiting_queue_length}). "
                f"Cannot add request {request.request_id}.")

        self.waiting.append(request)
        self.requests[request.request_id] = request
        if self.log_stats:
            request.record_event(EngineCoreEventType.QUEUED)

When the waiting queue is full and the scheduler attempts to add a new request to it, a SchedulerWaitingQueueFullError is raised. The EngineCore sends this error to the API server via ZMQ. Upon detecting this error, the API server sets the HTTP status code to 503.
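
As a sketch of the API-server side, the error could be mapped to a 503 via a FastAPI exception handler. This is illustrative only; the handler wiring and the local SchedulerWaitingQueueFullError definition here are assumptions, not vLLM's actual code:

    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    app = FastAPI()

    class SchedulerWaitingQueueFullError(Exception):
        """Assumed error type raised when the waiting queue is at capacity."""

    @app.exception_handler(SchedulerWaitingQueueFullError)
    async def queue_full_handler(request: Request,
                                 exc: SchedulerWaitingQueueFullError):
        # Surface overload as 503 so upstream load balancers (e.g. Envoy or
        # Istio) can retry the request on another replica.
        return JSONResponse(status_code=503, content={"error": str(exc)})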

Feedback Period.

No response

CC List.

@DarkLight1337 @simon-mo @njhill

Any Other Things.

  1. PR [Perf] API-server scaleout with many-to-many server-engine comms #17546 introduced a new interface, get_request_counts, in the scheduler. Perhaps we can reuse this interface (a sketch follows this list).
  2. In DP (Data Parallel) mode, there are multiple EngineCore processes, each maintaining its own waiting queue. Perhaps this can be handled in the DP Coordinator? (Just a guess.)
  3. Several other issues have requested this feature, such as Controlling max queue time #2901.
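
A hedged sketch of reusing that interface for the capacity check, as a hypothetical helper on the scheduler; the (num_waiting, num_running) return shape is assumed from PR #17546 and should be verified against the actual interface:

    def has_queue_capacity(self) -> bool:
        # Assumes get_request_counts() returns (num_waiting, num_running),
        # per the interface introduced in PR #17546.
        num_waiting, _ = self.get_request_counts()
        return (not self.max_waiting_queue_length
                or num_waiting < self.max_waiting_queue_length)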

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.