This user guide shows how to run benchmarks against a vLLM deployment, using both the Gateway API inference extension and a plain Kubernetes Service as load balancing strategies. The benchmark uses the Latency Profile Generator (LPG) tool to generate load and collect results.
Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the sample vLLM application and the inference extension.
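Before benchmarking, it is worth verifying that the gateway from that guide is programmed and has an address. This check assumes the gateway is named `inference-gateway`, as in the guide:

```bash
kubectl get gateway inference-gateway
```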
You are more likely to see the benefits of the inference extension when there is a decent number of replicas over which to make optimal routing decisions.

```bash
kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
```
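A quick way to confirm the scale-up succeeded is to check that all replicas report ready. The deployment name comes from `gpu-deployment.yaml`, so look it up rather than assuming it:

```bash
kubectl get deployments   # the vLLM deployment should show 8/8 READY once scaled
kubectl get pods          # each vLLM pod should eventually be Running and Ready
```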
As the baseline, let's also expose the vLLM deployment as a k8s service:

```bash
kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
```
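Provisioning the external IP can take a couple of minutes. One way to wait for it (the service created by `kubectl expose` is named after the vLLM deployment):

```bash
kubectl get services -w   # wait until EXTERNAL-IP changes from <pending> to an address
```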
The LPG benchmark tool works by sending traffic to the specified target IP and port and collecting results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.
- Check out the repo.

  ```bash
  git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
  cd gateway-api-inference-extension
  ```
- Get the target IP. The examples below show how to get the IP of a gateway or of a LoadBalancer k8s service.

  ```bash
  # Get gateway IP
  GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
  # Get LoadBalancer k8s service IP; the service created by `kubectl expose` above
  # is named after the vLLM deployment, so substitute that name here.
  SVC_IP=$(kubectl get service/<service-name> -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

  echo $GW_IP
  echo $SVC_IP
  ```
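  As a sanity check before a long run, you can send a test request to each target. This is a hedged sketch: it assumes the vLLM server exposes the OpenAI-compatible `/v1/models` endpoint, that the gateway listens on port 80, and that the k8s service listens on port 8081 (per the `kubectl expose` command above).

  ```bash
  # Hedged example: adjust ports and paths to your actual setup.
  curl -s http://${GW_IP}:80/v1/models
  curl -s http://${SVC_IP}:8081/v1/models
  ```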
- Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to your target IP. Feel free to adjust other parameters such as `request_rates` as well. For a complete list of LPG configurations, please refer to the LPG user guide.
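  For example, a one-liner substitution with GNU sed, assuming the manifest contains the literal `<target-ip>` placeholder and that you are targeting the gateway:

  ```bash
  # Hypothetical convenience command; edit the file by hand if you prefer.
  sed -i "s/<target-ip>/${GW_IP}/g" ./config/manifests/benchmark/benchmark.yaml
  ```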
- Start the benchmark tool.

  ```bash
  kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
  ```
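  To watch the benchmark as it runs, you can tail the LPG pod's logs. The pod name below is a placeholder; find the real one with `kubectl get pods` (the resource names are defined in `benchmark.yaml`):

  ```bash
  kubectl get pods                      # find the LPG benchmark pod name
  kubectl logs -f <benchmark-pod-name>  # placeholder; watch for the LPG_FINISHED line
  ```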
- Wait for the benchmark to finish and download the results. Use the `benchmark_id` environment variable to specify what this benchmark is for, for instance `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it prints a log line `LPG_FINISHED`; the script below watches for that log line and then starts downloading results.

  ```bash
  benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
  ```
- After the script finishes, you should see the benchmark results under the `./tools/benchmark/output/default-run/my-benchmark/results/json` folder.
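  For example, listing the downloaded result files (path taken from the step above):

  ```bash
  ls ./tools/benchmark/output/default-run/my-benchmark/results/json
  ```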
- You can specify a `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script. This is useful when you run benchmarks multiple times to get more statistically meaningful results and want to group them accordingly; see the example below.
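  For instance, two repeated runs of the same benchmark could be downloaded and grouped like this (the IDs here are illustrative values, not fixed names):

  ```bash
  benchmark_id='inference-extension' run_id='run1' ./tools/benchmark/download-benchmark-results.bash
  benchmark_id='inference-extension' run_id='run2' ./tools/benchmark/download-benchmark-results.bash
  ```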
- Update the `request_rates` to best suit your benchmark environment.
Please refer to the LPG user guide for a detailed list of configuration knobs.
This guide shows how to run the Jupyter notebook using VS Code.