
Commit 265773a

Add instructions to run benchmarks
1 parent 910407e commit 265773a

8 files changed: +625, -0 lines

benchmark/Inference_Extension_Benchmark.ipynb (+358)

Large diffs are not rendered by default.

benchmark/README.md (+104)

# Benchmark

This user guide shows how to run benchmarks against a vLLM deployment, using both the Gateway API
inference extension and a Kubernetes Service as the load balancing strategy. The
benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
tool to generate load and collect results.
## Prerequisites

### Deploy the inference extension and sample model server

Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
sample vLLM application and the inference extension.

### [Optional] Scale the sample vLLM deployment

You are more likely to see the benefits of the inference extension when there is a decent number of
replicas across which it can make optimal routing decisions.

```bash
kubectl scale deployment my-pool --replicas=8
```
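To confirm the scale-out completed before you start benchmarking, you can wait for the rollout to finish (a minimal check; `my-pool` is the deployment scaled above):

```bash
# Blocks until all replicas are updated and available.
kubectl rollout status deployment/my-pool
```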
### Expose the model server via a k8s service

As the baseline, let's also expose the vLLM deployment as a k8s service by simply applying the yaml:

```bash
kubectl apply -f ./manifests/ModelServerService.yaml
```
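Since the Service is of type LoadBalancer, it can take a minute for an external IP to be assigned. A quick check (the jsonpath is the same one used in the benchmark steps below):

```bash
kubectl get service my-pool-service
# Prints just the external IP once the cloud load balancer has assigned one.
kubectl get service/my-pool-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```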
## Run benchmark

### Run benchmark using the inference extension as the load balancing strategy

1. Get the gateway IP:

    ```bash
    IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
    echo "Update the <gateway-ip> in ./manifests/BenchmarkInferenceExtension.yaml to: $IP"
    ```

1. Then update the `<gateway-ip>` in `./manifests/BenchmarkInferenceExtension.yaml` to the IP
   of the gateway (a `sed` one-liner for this is sketched after this list). Feel free to adjust
   other parameters such as `request_rates` as well.

1. Start the benchmark tool: `kubectl apply -f ./manifests/BenchmarkInferenceExtension.yaml`

1. Wait for the benchmark to finish and download the results. Use the `benchmark_id` environment variable
   to specify what this benchmark is for. In this case, the result is for the `inference-extension`. You
   can use any id you like.

    ```bash
    benchmark_id='inference-extension' ./download-benchmark-results.bash
    ```

1. After the script finishes, you should see the benchmark results under the `./output/default-run/inference-extension/results/json` folder.
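As referenced in step 2, a hedged sketch of substituting the placeholder in place rather than editing the file by hand (assumes GNU `sed` and that `$IP` was set by the command in step 1):

```bash
# Replace the <gateway-ip> placeholder in the benchmark manifest with the gateway IP.
sed -i "s/<gateway-ip>/${IP}/" ./manifests/BenchmarkInferenceExtension.yaml
```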
### Run benchmark using k8s service as the load balancing strategy

1. Get the service IP:

    ```bash
    IP=$(kubectl get service/my-pool-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    echo "Update the <svc-ip> in ./manifests/BenchmarkK8sService.yaml to: $IP"
    ```

1. Then update the `<svc-ip>` in `./manifests/BenchmarkK8sService.yaml` to the IP
   of the service. Feel free to adjust other parameters such as `request_rates` as well.

1. Start the benchmark tool: `kubectl apply -f ./manifests/BenchmarkK8sService.yaml`

1. Wait for the benchmark to finish and download the results (a sketch for watching progress follows this list).

    ```bash
    benchmark_id='k8s-svc' ./download-benchmark-results.bash
    ```

1. After the script finishes, you should see the benchmark results under the `./output/default-run/k8s-svc/results/json` folder.
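While waiting for either run to finish, you can follow the benchmark tool's logs; `download-benchmark-results.bash` polls these same logs for the `LPG_FINISHED` marker before copying results:

```bash
kubectl logs deployment/benchmark-tool -f
```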
### Tips

* You can set the `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
  This is useful when you run benchmarks multiple times and want to group the results accordingly, as shown in the example below.
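For example, to keep a second round of results separate from the default run (`run2` is an arbitrary id):

```bash
# Results land under ./output/run2/inference-extension/results/json
run_id='run2' benchmark_id='inference-extension' ./download-benchmark-results.bash
```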
## Analyze the results

This guide shows how to run the jupyter notebook using vscode.

1. Create a python virtual environment.

    ```bash
    python3 -m venv .venv
    source .venv/bin/activate
    ```

1. Install the dependencies.

    ```bash
    pip install -r requirements.txt
    ```

1. Open the notebook `Inference_Extension_Benchmark.ipynb`, and run each cell. At the end you should
   see a bar chart like below:

![alt text](image.png)
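If you prefer to run the notebook outside vscode, a minimal sketch using Jupyter directly (note that `jupyter` is not listed in `requirements.txt`, so install it into the same virtual environment first):

```bash
pip install jupyter
jupyter notebook Inference_Extension_Benchmark.ipynb
```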
benchmark/download-benchmark-results.bash (+29)

#!/bin/bash

# Downloads the benchmark result files from the benchmark tool pod.
download_benchmark_results() {
  until echo $(kubectl logs deployment/benchmark-tool -n ${namespace}) | grep -q -m 1 "LPG_FINISHED"; do sleep 30 ; done;
  benchmark_pod=$(kubectl get pods -l app=benchmark-tool -n ${namespace} -o jsonpath="{.items[0].metadata.name}")
  echo "Downloading JSON results from pod ${benchmark_pod}"
  # Remove the prompt dataset so it does not match the json filter below.
  kubectl exec ${benchmark_pod} -n ${namespace} -- rm -f ShareGPT_V3_unfiltered_cleaned_split.json
  # List the result files in the pod's working directory and copy each one locally.
  for f in $(kubectl exec ${benchmark_pod} -n ${namespace} -- /bin/sh -c 'ls' | grep json); do
    echo "Downloading json file ${f}"
    kubectl cp -n ${namespace} ${benchmark_pod}:$f ${benchmark_output_dir}/results/json/$f;
  done
}

# Env vars to be passed when calling this script.
# The id of the benchmark. This is needed to identify what the benchmark is for.
# It decides the filepath to save the results, which later is used by the jupyter notebook to assign
# the benchmark_id as data labels for plotting.
benchmark_id=${benchmark_id:-"inference-extension"}
# run_id can be used to group different runs of the same benchmarks for comparison.
run_id=${run_id:-"default-run"}
namespace=${namespace:-"default"}
output_dir=${output_dir:-'output'}

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
benchmark_output_dir=${SCRIPT_DIR}/${output_dir}/${run_id}/${benchmark_id}

# Ensure the local output directory exists before copying files into it.
mkdir -p ${benchmark_output_dir}/results/json/

echo "Saving benchmark results to ${benchmark_output_dir}/results/json/"
download_benchmark_results

benchmark/image.png (59.6 KB)

benchmark/manifests/BenchmarkInferenceExtension.yaml (+60)
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: benchmark-tool
  name: benchmark-tool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: benchmark-tool
  template:
    metadata:
      labels:
        app: benchmark-tool
    spec:
      containers:
      - image: 'us-docker.pkg.dev/cloud-tpu-images/inference/inference-benchmark@sha256:1c100b0cc949c7df7a2db814ae349c790f034b4b373aaad145e77e815e838438'
        imagePullPolicy: Always
        name: benchmark-tool
        command:
        - bash
        - -c
        - ./latency_throughput_curve.sh
        env:
        - name: IP
          value: '<gateway-ip>'
          # value: 'envoy-default-inference-gateway-6454a873.envoy-gateway-system.svc.cluster.local'
        - name: REQUEST_RATES
          value: '40,80,120,160,200'
        - name: BENCHMARK_TIME_SECONDS
          value: '60'
        - name: TOKENIZER
          value: 'meta-llama/Llama-2-7b-hf'
        - name: MODELS
          value: 'meta-llama/Llama-2-7b-hf'
        - name: BACKEND
          value: vllm
        - name: PORT
          value: "8081"
        - name: INPUT_LENGTH
          value: "1024"
        - name: OUTPUT_LENGTH
          value: '2048'
        - name: FILE_PREFIX
          value: benchmark
        - name: PROMPT_DATASET_FILE
          value: ShareGPT_V3_unfiltered_cleaned_split.json
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: hf-token
        resources:
          limits:
            cpu: "2"
            memory: 20Gi
          requests:
            cpu: "2"
            memory: 20Gi
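This manifest reads the Hugging Face token from a Secret named `hf-token` with key `token`. If that Secret is not already present in your namespace (the prerequisite guide may already create it), a minimal sketch of creating it, assuming your token is exported as `HF_TOKEN`:

```bash
kubectl create secret generic hf-token --from-literal=token="${HF_TOKEN}"
```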

benchmark/manifests/BenchmarkK8sService.yaml (+59)
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: benchmark-tool
  name: benchmark-tool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: benchmark-tool
  template:
    metadata:
      labels:
        app: benchmark-tool
    spec:
      containers:
      - image: 'us-docker.pkg.dev/cloud-tpu-images/inference/inference-benchmark@sha256:1c100b0cc949c7df7a2db814ae349c790f034b4b373aaad145e77e815e838438'
        imagePullPolicy: Always
        name: benchmark-tool
        command:
        - bash
        - -c
        - ./latency_throughput_curve.sh
        env:
        - name: IP
          value: 'my-pool-service.default.svc.cluster.local'
        - name: REQUEST_RATES
          value: '40,80,120,160,200'
        - name: BENCHMARK_TIME_SECONDS
          value: '60'
        - name: TOKENIZER
          value: 'meta-llama/Llama-2-7b-hf'
        - name: MODELS
          value: 'meta-llama/Llama-2-7b-hf'
        - name: BACKEND
          value: vllm
        - name: PORT
          value: "8081"
        - name: INPUT_LENGTH
          value: "1024"
        - name: OUTPUT_LENGTH
          value: '2048'
        - name: FILE_PREFIX
          value: benchmark
        - name: PROMPT_DATASET_FILE
          value: ShareGPT_V3_unfiltered_cleaned_split.json
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: hf-token
        resources:
          limits:
            cpu: "2"
            memory: 20Gi
          requests:
            cpu: "2"
            memory: 20Gi

benchmark/manifests/ModelServerService.yaml (+12)
apiVersion: v1
kind: Service
metadata:
  name: my-pool-service
spec:
  ports:
  - port: 8081
    protocol: TCP
    targetPort: 8000
  selector:
    app: my-pool
  type: LoadBalancer
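Before benchmarking against this Service, you can optionally sanity-check that vLLM answers through it. This is a hedged sketch: it assumes the sample deployment serves an OpenAI-compatible API on the container port (8000), which the Service maps to 8081.

```bash
# Forward the Service port locally, then list the served models.
kubectl port-forward service/my-pool-service 8081:8081 &
sleep 2  # give port-forward a moment to establish
curl -s http://localhost:8081/v1/models
```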

benchmark/requirements.txt (+3)

pandas
numpy
matplotlib
