@@ -46,26 +46,83 @@ spec:
         - containerPort: 8000
           name: http
           protocol: TCP
+        lifecycle:
+          preStop:
+            # vLLM stops accepting connections when it receives SIGTERM, so we sleep here
+            # to give upstream gateways a chance to take us out of rotation. The time we
+            # wait depends on how long it takes for all upstreams to completely remove us
+            # from rotation. Older or simpler load balancers might take upwards of 30s,
+            # but we expect our deployment to run behind a modern gateway like Envoy,
+            # which is designed to probe for readiness aggressively.
+            sleep:
+              # Upstream gateway health probes should be set to a low period, such as 5s;
+              # the tighter we can make that bound, the faster we release accelerators
+              # during controlled shutdowns.
+              seconds: 7
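
The preStop sleep only helps if the pod's termination grace period covers both the
sleep and the time needed to drain in-flight requests; otherwise the kubelet sends
SIGKILL with requests still in flight. A minimal sketch of that relationship, where
the 60s figure is an assumption and not part of this change:

    # Hypothetical pod spec fragment: grace period > preStop sleep (7s) + worst-case
    # time to drain the longest in-flight request.
    spec:
      terminationGracePeriodSeconds: 60
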
         livenessProbe:
-          failureThreshold: 240
           httpGet:
             path: /health
             port: http
             scheme: HTTP
-          initialDelaySeconds: 5
-          periodSeconds: 5
+          # vLLM's health check is simple, so we can probe it more aggressively. Liveness
+          # check endpoints should always be suitable for aggressive probing.
+          periodSeconds: 1
           successThreshold: 1
+          # vLLM has a very simple health implementation, which means that any failure is
+          # likely significant. However, a liveness-triggered restart requires the very
+          # large core model to be reloaded, so we should bias towards ensuring the
+          # server is definitely unhealthy rather than restarting immediately. Use 5
+          # consecutive failures as evidence of a serious problem.
+          failureThreshold: 5
           timeoutSeconds: 1
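
With periodSeconds: 1 and failureThreshold: 5, a hung server is restarted roughly
five to ten seconds after it stops answering, depending on whether each probe waits
out its 1s timeout. The /health endpoint can be exercised directly to sanity-check
the probe configuration; the pod name below is a placeholder:

    # Port-forward to a running vLLM pod and hit the same endpoint the kubelet probes.
    kubectl port-forward pod/<vllm-pod> 8000:8000 &
    curl -i http://localhost:8000/health   # expect HTTP 200 once the engine is up
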
         readinessProbe:
-          failureThreshold: 600
           httpGet:
             path: /health
             port: http
             scheme: HTTP
-          initialDelaySeconds: 5
-          periodSeconds: 5
+          # vLLM's health check is simple, so we can probe it more aggressively. Readiness
+          # check endpoints should always be suitable for aggressive probing, but may be
+          # slightly more expensive than liveness probes.
+          periodSeconds: 1
           successThreshold: 1
+          # vLLM has a very simple health implementation, which means that any failure is
+          # likely significant.
+          failureThreshold: 1
           timeoutSeconds: 1
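
With failureThreshold: 1 and periodSeconds: 1, a single failed probe marks the pod
unready and the endpoints controller withdraws it from the Service within a second
or two, which is what lets the upstream gateway stop sending traffic quickly. One
way to watch both transitions during a failure drill (the labels are assumptions
about how the Deployment and Service are named):

    # Watch the pod's ready condition and the EndpointSlice membership together.
    kubectl get pods -l app=vllm -w
    kubectl get endpointslices -l kubernetes.io/service-name=vllm -w
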
+        # We set a startup probe so that we don't begin directing traffic to this instance
+        # until the model is loaded.
+        startupProbe:
+          # The failure threshold is the point at which we believe startup will not happen
+          # at all, and is set to the maximum possible time we believe loading a model
+          # will take. In our default configuration we are downloading a model from
+          # HuggingFace, which may take a long time, and then the model must load into
+          # the accelerator. We choose 10 minutes as a reasonable maximum startup time
+          # before giving up and attempting to restart the pod.
+          #
+          # IMPORTANT: If the core model takes more than 10 minutes to load, pods will
+          # crash loop.
+          failureThreshold: 600
+          # Set the initial delay low so that if the base model changes to something
+          # smaller or an optimization is deployed, we don't wait unnecessarily.
+          initialDelaySeconds: 2
+          # A startup probe stops running once it succeeds, so we can probe even a
+          # moderately complex startup aggressively - this is a very important workload.
+          periodSeconds: 1
+          exec:
+            # Verify that our core model is loaded before we consider startup successful.
+            command:
+            - /bin/bash
+            - -c
+            - |
+              set -eu
+              models=$( curl -sf http://localhost:8000/v1/models )
+              if ! echo "${models}" | grep -q "$1"; then
+                echo "model not found"
+                exit 1
+              fi
+              echo "ok"
+            - ' '
+            - '"id":"meta-llama/Llama-2-7b-hf"'
         resources:
           limits:
             nvidia.com/gpu: 1