
Commit 64ba0c6

Add instructions to run benchmarks (#480)
* Add instructions to run benchmarks
* Address comments
* Move benchmark guide to site-src and other cleanups
* Add source code link for the benchmark tool image
* Address nit
1 parent 950e036 commit 64ba0c6

9 files changed: +564 -0 lines changed

benchmark/README.md

+1
@@ -0,0 +1 @@
This folder contains resources to run performance benchmarks. Please follow the benchmark guide at https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark.

benchmark/benchmark.ipynb

+358
Large diffs are not rendered by default.

benchmark/download-benchmark-results.bash

+30
@@ -0,0 +1,30 @@
#!/bin/bash

# Downloads the benchmark result files from the benchmark tool pod.
download_benchmark_results() {
  # Wait until the benchmark tool logs the LPG_FINISHED marker, checking every 30s.
  until echo $(kubectl logs deployment/benchmark-tool -n ${namespace}) | grep -q -m 1 "LPG_FINISHED"; do sleep 30 ; done;
  benchmark_pod=$(kubectl get pods -l app=benchmark-tool -n ${namespace} -o jsonpath="{.items[0].metadata.name}")
  echo "Downloading JSON results from pod ${benchmark_pod}"
  # Remove the prompt dataset so it is not picked up by the *.json download loop below.
  kubectl exec ${benchmark_pod} -n ${namespace} -- rm -f ShareGPT_V3_unfiltered_cleaned_split.json
  # Make sure the local results directory exists before copying files into it.
  mkdir -p ${benchmark_output_dir}/results/json
  for f in $(kubectl exec ${benchmark_pod} -n ${namespace} -- /bin/sh -c 'ls' | grep json); do
    echo "Downloading json file ${f}"
    kubectl cp -n ${namespace} ${benchmark_pod}:$f ${benchmark_output_dir}/results/json/$f;
  done
}

# Env vars to be passed when calling this script.
# The id of the benchmark. This is needed to identify what the benchmark is for.
# It decides the filepath to save the results, which the Jupyter notebook later uses to assign
# the benchmark_id as data labels for plotting.
benchmark_id=${benchmark_id:-"inference-extension"}
# run_id can be used to group different runs of the same benchmarks for comparison.
run_id=${run_id:-"default-run"}
namespace=${namespace:-"default"}
output_dir=${output_dir:-'output'}

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
benchmark_output_dir=${SCRIPT_DIR}/${output_dir}/${run_id}/${benchmark_id}

echo "Saving benchmark results to ${benchmark_output_dir}/results/json/"
download_benchmark_results
# Clean up the benchmark tool deployment once results have been downloaded.
kubectl delete -f ${SCRIPT_DIR}/../config/manifests/benchmark/benchmark.yaml
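
A typical invocation sets the script's environment variables inline; the values below are illustrative (the script falls back to the defaults shown above when a variable is unset):

```bash
# Download results for a benchmark of the plain k8s Service baseline, grouped under run id "run1".
benchmark_id='k8s-svc' run_id='run1' namespace='default' ./benchmark/download-benchmark-results.bash
```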

benchmark/requirements.txt

+3
@@ -0,0 +1,3 @@
pandas
numpy
matplotlib

config/manifests/benchmark/benchmark.yaml

+60
@@ -0,0 +1,60 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: benchmark-tool
  name: benchmark-tool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: benchmark-tool
  template:
    metadata:
      labels:
        app: benchmark-tool
    spec:
      containers:
        # The following image was built from this source https://github.com/AI-Hypercomputer/inference-benchmark/tree/07628c9fe01b748f5a4cc9e5c2ee4234aaf47699
        - image: 'us-docker.pkg.dev/cloud-tpu-images/inference/inference-benchmark@sha256:1c100b0cc949c7df7a2db814ae349c790f034b4b373aaad145e77e815e838438'
          imagePullPolicy: Always
          name: benchmark-tool
          command:
            - bash
            - -c
            - ./latency_throughput_curve.sh
          env:
            - name: IP
              value: '<target-ip>'
            - name: REQUEST_RATES
              value: '10,20,30'
            - name: BENCHMARK_TIME_SECONDS
              value: '60'
            - name: TOKENIZER
              value: 'meta-llama/Llama-2-7b-hf'
            - name: MODELS
              value: 'meta-llama/Llama-2-7b-hf'
            - name: BACKEND
              value: vllm
            - name: PORT
              value: "8081"
            - name: INPUT_LENGTH
              value: "1024"
            - name: OUTPUT_LENGTH
              value: '2048'
            - name: FILE_PREFIX
              value: benchmark
            - name: PROMPT_DATASET_FILE
              value: ShareGPT_V3_unfiltered_cleaned_split.json
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: token
                  name: hf-token
          resources:
            limits:
              cpu: "2"
              memory: 20Gi
            requests:
              cpu: "2"
              memory: 20Gi
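
After replacing the `<target-ip>` placeholder, the manifest can be applied and the run observed from the pod logs; a minimal sketch, run from the repo root:

```bash
# Deploy the benchmark tool and follow its logs; the run is done when the
# LPG_FINISHED marker (which the download script also waits for) is printed.
kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
kubectl logs -f deployment/benchmark-tool
```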

+12

@@ -0,0 +1,12 @@
apiVersion: v1
kind: Service
metadata:
  name: my-pool-service
spec:
  ports:
    - port: 8081
      protocol: TCP
      targetPort: 8000
  selector:
    app: my-pool
  type: LoadBalancer
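
A quick sketch of using this manifest: apply it (the path placeholder below stands for wherever this file lives in your checkout) and read the external IP once the cloud load balancer has been provisioned:

```bash
# Apply the Service and print its external IP once the load balancer is ready.
kubectl apply -f <path-to-this-service-manifest>
kubectl get service my-pool-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```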

mkdocs.yml

+2
@@ -62,6 +62,8 @@ nav:
      - Adapter Rollout: guides/adapter-rollout.md
      - Metrics: guides/metrics.md
      - Implementer's Guide: guides/implementers.md
+  - Performance:
+    - Benchmark: performance/benchmark/index.md
   - Reference:
     - API Reference: reference/spec.md
     - API Types:
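
To preview the new nav entry locally, the docs site can be served with mkdocs; this is a sketch that assumes `mkdocs` and whatever theme/plugins `mkdocs.yml` references are already installed in your environment:

```bash
# Serve the docs locally and check that the Performance > Benchmark entry shows up in the nav.
mkdocs serve
```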

site-src/performance/benchmark/index.md

+98
@@ -0,0 +1,98 @@
# Benchmark

This user guide shows how to run benchmarks against a vLLM deployment, using both the Gateway API
inference extension and a plain Kubernetes Service as the load-balancing strategy. The
benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
tool to generate load and collect results.

## Prerequisites

### Deploy the inference extension and sample model server

Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
sample vLLM application and the inference extension.

### [Optional] Scale the sample vLLM deployment

You are more likely to see the benefits of the inference extension when there are a decent number of replicas, so that it can make better routing decisions.

```bash
kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
```
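
Optionally, verify that the additional replicas came up before benchmarking; the commands below are generic checks (the deployment and pod names come from `gpu-deployment.yaml`):

```bash
# Confirm the scaled deployment reports 8/8 ready replicas.
kubectl get deployments
kubectl get pods
```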

### Expose the model server via a k8s service

As the baseline, let's also expose the vLLM deployment as a k8s service:

```bash
kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
```
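
It can take a minute or two for the cloud provider to provision the load balancer. As a quick sanity check, confirm that the newly created service has been assigned an external IP before running the benchmark:

```bash
# Wait until the EXTERNAL-IP column of the exposed service is populated.
kubectl get services --watch
```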

## Run benchmark

The LPG benchmark tool works by sending traffic to the specified target IP and port, and collecting results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.

1. Check out the repo.

    ```bash
    git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
    cd gateway-api-inference-extension
    ```

1. Get the target IP. The examples below show how to get the IP of a gateway or of a LoadBalancer k8s service.

    ```bash
    # Get gateway IP
    GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
    # Get LoadBalancer k8s service IP (<service-name> is the service created by the kubectl expose step above)
    SVC_IP=$(kubectl get service/<service-name> -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

    echo $GW_IP
    echo $SVC_IP
    ```

1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to your target IP. Feel free to adjust other parameters such as `request_rates` as well. For a complete list of LPG configurations, please refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).

1. Start the benchmark tool. `kubectl apply -f ./config/manifests/benchmark/benchmark.yaml`

1. Wait for the benchmark to finish and download the results. Use the `benchmark_id` environment variable
to specify what this benchmark is for, for instance `inference-extension` or `k8s-svc` (a consolidated example that compares both targets follows this list). When the LPG tool finishes benchmarking, it prints a log line `LPG_FINISHED`;
the script below watches for that log line and then starts downloading the results.

    ```bash
    benchmark_id='my-benchmark' ./benchmark/download-benchmark-results.bash
    ```

1. After the script finishes, you should see the benchmark results under the `./benchmark/output/default-run/my-benchmark/results/json` folder.
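
To compare the gateway against the plain k8s Service baseline, the run/download cycle above can simply be repeated once per target. The sketch below is illustrative and reuses the `benchmark_id` values suggested above; it relies on the fact that the download script deletes the benchmark deployment when it finishes, so the manifest can be re-applied for the next target:

```bash
# Run 1: benchmark the inference gateway.
# First set <target-ip> to $GW_IP in ./config/manifests/benchmark/benchmark.yaml.
kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
benchmark_id='inference-extension' ./benchmark/download-benchmark-results.bash

# Run 2: benchmark the plain k8s Service.
# Now set <target-ip> to $SVC_IP and repeat with a different benchmark_id.
kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
benchmark_id='k8s-svc' ./benchmark/download-benchmark-results.bash
```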

### Tips

* You can specify the `run_id="runX"` environment variable when running the `./benchmark/download-benchmark-results.bash` script.
This is useful when you run benchmarks multiple times to get more statistically meaningful results and want to group the results accordingly.
* Update `request_rates` to values that best suit your benchmark environment.

### Advanced Benchmark Configurations

Please refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.

## Analyze the results

This guide shows how to run the Jupyter notebook using VS Code.

1. Create a python virtual environment.

    ```bash
    python3 -m venv .venv
    source .venv/bin/activate
    ```

1. Install the dependencies.

    ```bash
    pip install -r ./benchmark/requirements.txt
    ```

1. Open the notebook `./benchmark/benchmark.ipynb`, and run each cell. At the end you should
see a bar chart like the one below:

![alt text](example-bar-chart.png)
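
If you prefer not to use VS Code, the notebook can also be run from the command line; this assumes the `jupyter` package is installed into the same virtual environment (it is not part of `./benchmark/requirements.txt`):

```bash
# Install Jupyter into the active virtual environment and open the notebook in a browser.
pip install notebook
jupyter notebook ./benchmark/benchmark.ipynb
```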
