
We should encourage all InferencePool deployments to gracefully rollout and drain #549

Open
smarterclayton opened this issue Mar 20, 2025 · 2 comments


smarterclayton commented Mar 20, 2025

While we depend on upstream model servers to support proper graceful drain (entering a mode where the server terminates once all in-flight requests complete, usually bounded by a timeout, though not always for very long-running requests), our examples and docs should clearly describe and configure the pool members for graceful drain.

I.e. the classic (a rough pod-spec sketch follows this list):

  • Use a preStop hook to wait for load balancers to stop sending traffic (the right wait depends on the configuration of the fronting LB)
  • Respond to SIGTERM in the model server process (e.g. vLLM) by draining and exiting once all in-flight requests complete
    • Optionally let the drain be unbounded for extremely long requests or cases where the LB may have extremely long drain periods
    • Write good log messages
  • Ensure the readiness probe continues to pass as long as the model server is accepting requests (for scenarios where the service is still taking requests)
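
A minimal sketch of what that could look like in a Deployment's pod template, assuming a vLLM container serving on port 8000 with a /health endpoint; the image, sleep length, grace period, and probe timings are placeholders for illustration, not recommendations:

    # Hypothetical pod template fragment illustrating the drain sequence above.
    # All values are placeholders and would need tuning per gateway implementation.
    spec:
      terminationGracePeriodSeconds: 130   # must cover the preStop sleep plus request drain time
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest     # assumed image for illustration
        ports:
        - containerPort: 8000
        lifecycle:
          preStop:
            exec:
              # Sleep long enough for the fronting LB to observe the endpoint going
              # away and stop sending new traffic before SIGTERM is delivered.
              command: ["/bin/sh", "-c", "sleep 30"]
        readinessProbe:
          # Keep reporting ready while the server is still accepting requests.
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 2
          failureThreshold: 3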

We should work with upstream vLLM to ensure it shuts down gracefully, and our out-of-the-box examples should demonstrate it.

EDIT: vLLM does support drain on TERM

INFO 03-20 14:21:01 [launcher.py:74] Shutting down FastAPI HTTP server.
INFO:     Shutting down
INFO:     Waiting for connections to close. (CTRL+C to force quit)

So we are missing a preStop hook in our examples (will test).

@smarterclayton added the kind/bug label Mar 20, 2025
@smarterclayton
Contributor Author

/assign


smarterclayton commented Mar 20, 2025

Just some notes as I go:

  1. vLLM rejects new connections immediately, so we should be sleeping in a preStop hook until requests stop arriving and in-flight requests can finish.
  2. We should recommend gateways probe model servers aggressively, but the correct sleep for preStop is that probe interval plus propagation delay (some load balancers take extra time to propagate a probe failure, and that value may change unexpectedly or require experimentation). Each gateway implementation will have to recommend the right sleep interval, but out of the box we should be correct for the set of recommended deployments (a rough sketch of the arithmetic follows these notes).
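
A rough sketch of that arithmetic with made-up numbers (the real values depend on the gateway implementation and its health-check propagation behavior):

    # Hypothetical numbers only; each gateway implementation should publish its own.
    readinessProbe:
      periodSeconds: 2        # probe aggressively so routing changes are noticed quickly
      failureThreshold: 3
    lifecycle:
      preStop:
        exec:
          # sleep >= probe interval (2s, times the 3 failures some LBs require to
          # mark a backend unhealthy) + LB propagation delay (say 10s), padded up:
          command: ["/bin/sh", "-c", "sleep 20"]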

I will add a PR with an annotated gpu-deployment that serves as a reference for correct behavior in upstreams.
