
Commit 53f383c

Polishing to the adapter rollouts guide
1 parent 88c20f1 commit 53f383c

File tree: 6 files changed, +149 -124 lines


mkdocs.yml (+1)

```diff
@@ -56,6 +56,7 @@ nav:
   - Guides:
     - User Guides:
       - Getting started: guides/index.md
+      - Adapter Rollout: guides/adapter-rollout.md
     - Implementer's Guide: guides/implementers.md
   - Reference:
     - API Reference: reference/spec.md
```

pkg/manifests/inferencemodel.yaml (+1 -3)

```diff
@@ -15,7 +15,5 @@ spec:
     kind: InferencePool
     name: vllm-llama2-7b-pool
   targetModels:
-  - name: tweet-summary-0
-    weight: 50
   - name: tweet-summary-1
-    weight: 50
+    weight: 100
```
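
To roll this change out to an existing cluster, re-apply the manifest and check the result (a quick sketch; the plural resource name `inferencemodels` is assumed from the CRD):

```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/inferencemodel.yaml
kubectl get inferencemodels
```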

pkg/manifests/vllm/deployment-with-syncer.yaml (+5 -13)

```diff
@@ -43,7 +43,6 @@ spec:
         - "--max-cpu-loras"
         - "12"
         - "--lora-modules"
-        - '{"name": "tweet-summary-0", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}'
         - '{"name": "tweet-summary-1", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}'
         env:
         - name: PORT
@@ -95,7 +94,7 @@ spec:
       - name: lora-adapter-syncer
         tty: true
         stdin: true
-        image: us-central1-docker.pkg.dev/ahg-gke-dev/jobset2/lora-syncer:6dc97be
+        image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
         restartPolicy: Always
         imagePullPolicy: Always
         env:
@@ -117,29 +116,22 @@ spec:
         emptyDir: {}
       - name: config-volume
         configMap:
-          name: dynamic-lora-config
+          name: vllm-llama2-7b-adapters
 
 ---
 
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: dynamic-lora-config
+  name: vllm-llama2-7b-adapters
 data:
   configmap.yaml: |
     vLLMLoRAConfig:
-      name: sql-loras-llama
+      name: vllm-llama2-7b
       port: 8000
       ensureExist:
         models:
-        - base-model: meta-llama/Llama-2-7b-hf
-          id: tweet-summary-0
-          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
         - base-model: meta-llama/Llama-2-7b-hf
           id: tweet-summary-1
           source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
-      ensureNotExist:
-        models:
-        - base-model: meta-llama/Llama-2-7b-hf
-          id: tweet-summary-2
-          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
+
```
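
To verify the renamed ConfigMap is picked up, follow the syncer sidecar's logs after applying the manifest. A sketch, assuming the deployment shares the pool's name `vllm-llama2-7b-pool`:

```bash
# Watch the sidecar reconcile the adapter ConfigMap against the vLLM server.
kubectl logs deploy/vllm-llama2-7b-pool -c lora-adapter-syncer -f
```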

site-src/guides/adapter-rollout.md (new file, +126)

# Adapter Rollout

The goal of this guide is to demonstrate how to roll out a new adapter version.

## **Requirements**

Follow the steps in the [main guide](index.md).

## **Safely roll out v2 adapter**

### Load the new adapter version to the model servers

This guide leverages the LoRA syncer sidecar to dynamically manage adapters within a vLLM deployment, enabling users to add or remove them through a shared ConfigMap.

Modify the LoRA syncer ConfigMap to initiate loading of the new adapter version.

```bash
kubectl edit configmap vllm-llama2-7b-adapters
```

Change the ConfigMap to match the following (note the new entry under `models`):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama2-7b-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama2-7b
      port: 8000
      ensureExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-1
          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-2
          source: mahimairaja/tweet-summarization-llama-2-finetuned
```

The new adapter version is applied to the model servers live, without requiring a restart.

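For visibility into what the syncer automates, vLLM exposes runtime LoRA endpoints that can be exercised by hand. A minimal sketch, assuming `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` is set on the model server and a port-forward to one vLLM pod on `localhost:8000` (the adapter name and source mirror the ConfigMap above):

```bash
# Load the new adapter on one server directly (what ensureExist drives);
# depending on the vLLM version, lora_path may need to be a local path
# rather than a Hugging Face repo id.
curl -X POST localhost:8000/v1/load_lora_adapter \
  -H 'Content-Type: application/json' \
  -d '{"lora_name": "tweet-summary-2", "lora_path": "mahimairaja/tweet-summarization-llama-2-finetuned"}'

# Confirm the adapter now appears in the server's model list.
curl -s localhost:8000/v1/models | grep tweet-summary-2
```
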
### Direct traffic to the new adapter version

Modify the InferenceModel to configure a canary rollout with traffic splitting. In this example, 10% of traffic for the tweet-summary model will be sent to the new ***tweet-summary-2*** adapter.

```bash
kubectl edit inferencemodel tweet-summary
```

Change the InferenceModel to match the following:

```yaml
spec:
  modelName: tweet-summary
  targetModels:
  - name: tweet-summary-1
    weight: 90
  - name: tweet-summary-2
    weight: 10
```

The above configuration means that, on average, one in every ten requests is sent to the new version. Try it out:

1. Get the gateway IP:
```bash
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=8081
```

2. Send a few requests as follows:
```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
```
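
To check the split empirically, tally which adapter serves a batch of requests. A sketch, assuming the completion response's `model` field echoes the resolved target model and that `jq` is installed:

```bash
# Send 100 requests and count responses per adapter; expect roughly 90/10.
for i in $(seq 1 100); do
  curl -s ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
    "model": "tweet-summary",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 10,
    "temperature": 0
  }' | jq -r '.model'
done | sort | uniq -c
```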

### Finish the rollout

Modify the InferenceModel to direct 100% of the traffic to the latest version of the adapter.

```yaml
spec:
  modelName: tweet-summary
  targetModels:
  - name: tweet-summary-2
    weight: 100
```

Unload the older versions from the servers by updating the LoRA syncer ConfigMap to list the older version under the `ensureNotExist` list:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama2-7b-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama2-7b
      port: 8000
      ensureExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-2
          source: mahimairaja/tweet-summarization-llama-2-finetuned
      ensureNotExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-1
          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
```

With this, all requests should be served by the new adapter version.
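
To confirm the unload once the syncer reconciles, list the models a server pod reports; `tweet-summary-1` should no longer appear. A sketch (the pod name is a placeholder):

```bash
# Port-forward one model-server pod, then list its served models.
kubectl port-forward pod/<vllm-pod-name> 8000:8000 &
curl -s localhost:8000/v1/models | jq -r '.data[].id'
# Expected: the base model and tweet-summary-2 only.
```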

site-src/guides/dynamic-lora.md (-93)

This file was deleted.

site-src/guides/index.md (+16 -15)

````diff
@@ -2,16 +2,16 @@
 
 This quickstart guide is intended for engineers familiar with k8s and model servers (vLLM in this instance). The goal of this guide is to get a first, single InferencePool up and running!
 
-### Requirements
+## **Requirements**
 - Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher
 - A cluster with:
   - Support for Services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running). For example, with Kind,
     you can follow [these steps](https://kind.sigs.k8s.io/docs/user/loadbalancer).
   - 3 GPUs to run the sample model server. Adjust the number of replicas in `./manifests/vllm/deployment.yaml` as needed.
 
-### Steps
+## **Steps**
 
-1. **Deploy Sample Model Server**
+### Deploy Sample Model Server
 
    Create a Hugging Face secret to download the model [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf). Ensure that the token grants access to this model.
    Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.
@@ -20,30 +20,29 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/vllm/deployment.yaml
    ```
 
-
-
-
-1. **Install the Inference Extension CRDs:**
+### Install the Inference Extension CRDs
 
    ```sh
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.1.0/manifests.yaml
 
-1. **Deploy InferenceModel**
+### Deploy InferenceModel
 
    Deploy the sample InferenceModel which is configured to load balance traffic between the `tweet-summary-0` and `tweet-summary-1`
    [LoRA adapters](https://docs.vllm.ai/en/latest/features/lora.html) of the sample model server.
    ```bash
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/inferencemodel.yaml
    ```
-1. **Update Envoy Gateway Config to enable Patch Policy**
+
+### Update Envoy Gateway Config to enable Patch Policy
 
    Our custom LLM Gateway ext-proc is patched into the existing envoy gateway via `EnvoyPatchPolicy`. To enable this feature, we must extend the Envoy Gateway config map. To do this, simply run:
    ```bash
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/enable_patch_policy.yaml
    kubectl rollout restart deployment envoy-gateway -n envoy-gateway-system
    ```
    Additionally, if you would like to enable the admin interface, you can uncomment the admin lines and run this again.
-1. **Deploy Gateway**
+
+### Deploy Gateway
 
    ```bash
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/gateway.yaml
@@ -56,26 +55,28 @@
    NAME                CLASS               ADDRESS        PROGRAMMED   AGE
    inference-gateway   inference-gateway   <MY_ADDRESS>   True         22s
    ```
-1. **Deploy the Inference Extension and InferencePool**
+### Deploy the Inference Extension and InferencePool
 
    ```bash
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/ext_proc.yaml
    ```
-1. **Deploy Envoy Gateway Custom Policies**
+### Deploy Envoy Gateway Custom Policies
 
    ```bash
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/extension_policy.yaml
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/patch_policy.yaml
    ```
   > **_NOTE:_** This is also per InferencePool, and will need to be configured to support the new pool should you wish to experiment further.
-1. **OPTIONALLY**: Apply Traffic Policy
+
+### **OPTIONALLY**: Apply Traffic Policy
 
    For high-traffic benchmarking you can apply this manifest to avoid any defaults that can cause timeouts/errors.
 
    ```bash
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/traffic_policy.yaml
    ```
-1. **Try it out**
+
+### Try it out
 
    Wait until the gateway is ready.
 
@@ -89,4 +90,4 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
    "max_tokens": 100,
    "temperature": 0
    }'
-   ```
+   ```
````
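
For the quickstart's "Try it out" step, one way to block until the gateway is ready is to wait on its condition (a sketch, assuming the Gateway reports the standard Gateway API `Programmed` condition):

```bash
kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=120s
```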
