
Commit 23cd81b

ahg-grramkumar1 authored and committed

Fixes to the adapter rollouts guide (kubernetes-sigs#338)

* Polish the adapter rollouts guide
* Make all guides use the same deployment so that we can tell one story as the user navigates through the guides
* Addressed comments

1 parent 8588a1c commit 23cd81b

File tree

7 files changed: +187 -278 lines

mkdocs.yml (+1)

```diff
@@ -56,6 +56,7 @@ nav:
 - Guides:
   - User Guides:
     - Getting started: guides/index.md
+    - Adapter Rollout: guides/adapter-rollout.md
   - Implementer's Guide: guides/implementers.md
 - Reference:
   - API Reference: reference/spec.md
```

pkg/manifests/inferencemodel.yaml (+1 -10)

```diff
@@ -1,21 +1,12 @@
 apiVersion: inference.networking.x-k8s.io/v1alpha1
 kind: InferenceModel
 metadata:
-  labels:
-    app.kubernetes.io/name: api
-    app.kubernetes.io/managed-by: kustomize
   name: inferencemodel-sample
 spec:
   modelName: tweet-summary
   criticality: Critical
   poolRef:
-    # this is the default val:
-    group: inference.networking.x-k8s.io
-    # this is the default val:
-    kind: InferencePool
     name: vllm-llama2-7b-pool
   targetModels:
-  - name: tweet-summary-0
-    weight: 50
   - name: tweet-summary-1
-    weight: 50
+    weight: 100
```

pkg/manifests/vllm/deployment-with-syncer.yaml (-145)

This file was deleted.

pkg/manifests/vllm/deployment.yaml (+35 -14)

```diff
@@ -1,16 +1,3 @@
-apiVersion: v1
-kind: Service
-metadata:
-  name: vllm-llama2-7b-pool
-spec:
-  selector:
-    app: vllm-llama2-7b-pool
-  ports:
-  - protocol: TCP
-    port: 8000
-    targetPort: 8000
-  type: ClusterIP
----
 apiVersion: apps/v1
 kind: Deployment
 metadata:
@@ -39,7 +26,7 @@ spec:
         - "8000"
         - "--enable-lora"
         - "--max-loras"
-        - "4"
+        - "2"
         - "--max-cpu-loras"
         - "12"
         - "--lora-modules"
@@ -53,6 +40,8 @@ spec:
               secretKeyRef:
                 name: hf-token
                 key: token
+        - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
+          value: "true"
         ports:
         - containerPort: 8000
           name: http
@@ -89,6 +78,19 @@ spec:
           name: shm
         - name: adapters
           mountPath: "/adapters"
+      initContainers:
+        - name: lora-adapter-syncer
+          tty: true
+          stdin: true
+          image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
+          restartPolicy: Always
+          imagePullPolicy: Always
+          env:
+            - name: DYNAMIC_LORA_ROLLOUT_CONFIG
+              value: "/config/configmap.yaml"
+          volumeMounts: # DO NOT USE subPath, dynamic configmap updates don't work on subPaths
+            - name: config-volume
+              mountPath: /config
       restartPolicy: Always
       schedulerName: default-scheduler
       terminationGracePeriodSeconds: 30
@@ -100,3 +102,22 @@ spec:
           medium: Memory
       - name: adapters
         emptyDir: {}
+      - name: config-volume
+        configMap:
+          name: vllm-llama2-7b-adapters
+---
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: vllm-llama2-7b-adapters
+data:
+  configmap.yaml: |
+    vLLMLoRAConfig:
+      name: vllm-llama2-7b
+      port: 8000
+      ensureExist:
+        models:
+        - base-model: meta-llama/Llama-2-7b-hf
+          id: tweet-summary-1
+          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
```

site-src/guides/adapter-rollout.md (new file, +133)

# Adapter Rollout

The goal of this guide is to show how to roll out a new adapter version.

## **Prerequisites**

Follow the steps in the [main guide](index.md).

## **Safely roll out the v2 adapter**

### Load the new adapter version onto the model servers

This guide leverages the LoRA syncer sidecar to dynamically manage adapters within a vLLM deployment, enabling users to add or remove them through a shared ConfigMap.

Modify the LoRA syncer ConfigMap to initiate loading of the new adapter version:

```bash
kubectl edit configmap vllm-llama2-7b-adapters
```

Change the ConfigMap to match the following (note the new entry under `models`):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama2-7b-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama2-7b-adapters
      port: 8000
      ensureExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-1
          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-2
          source: mahimairaja/tweet-summarization-llama-2-finetuned
```

The new adapter version is applied to the model servers live, without requiring a restart.
### Direct traffic to the new adapter version

Modify the InferenceModel to configure a canary rollout with traffic splitting. In this example, 10% of traffic for the `tweet-summary` model will be sent to the new ***tweet-summary-2*** adapter.

```bash
kubectl edit inferencemodel inferencemodel-sample
```

Change the `targetModels` list in the InferenceModel to match the following:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: tweet-summary
  criticality: Critical
  poolRef:
    name: vllm-llama2-7b-pool
  targetModels:
  - name: tweet-summary-1
    weight: 90
  - name: tweet-summary-2
    weight: 10
```

The above configuration means roughly one in every ten requests should be sent to the new version. Try it out:
1. Get the gateway IP:

   ```bash
   IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=8081
   ```

2. Send a few requests as follows:

   ```bash
   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
   "model": "tweet-summary",
   "prompt": "Write as if you were a critic: San Francisco",
   "max_tokens": 100,
   "temperature": 0
   }'
   ```
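With a 90/10 split, roughly one request in ten should land on the new adapter. A quick way to build intuition for what the gateway's weighted selection does is to simulate the draw offline (a sketch only; `pick_target` is an illustrative model of a 90/10 weighted choice, not the endpoint-picker extension's actual code):

```python
import random

def pick_target(rng, targets):
    """Pick a target model name according to its traffic weight."""
    names = [name for name, _ in targets]
    weights = [weight for _, weight in targets]
    return rng.choices(names, weights=weights, k=1)[0]

targets = [("tweet-summary-1", 90), ("tweet-summary-2", 10)]
rng = random.Random(0)  # seeded so the simulation is reproducible
counts = {"tweet-summary-1": 0, "tweet-summary-2": 0}
for _ in range(10_000):
    counts[pick_target(rng, targets)] += 1
print(counts)  # close to 9000 / 1000
```

Over many requests the observed proportions converge on the configured weights, which is why a handful of test requests may not show an exact 9:1 ratio.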
### Finish the rollout

Modify the InferenceModel to direct 100% of the traffic to the latest version of the adapter:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: tweet-summary
  criticality: Critical
  poolRef:
    name: vllm-llama2-7b-pool
  targetModels:
  - name: tweet-summary-2
    weight: 100
```
Unload the older version from the servers by updating the LoRA syncer ConfigMap to list it under the `ensureNotExist` list:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama2-7b-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama2-7b-adapters
      port: 8000
      ensureExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-2
          source: mahimairaja/tweet-summarization-llama-2-finetuned
      ensureNotExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-1
          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
```
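Conceptually, the syncer reconciles this ConfigMap against the adapters vLLM currently has loaded: anything under `ensureExist` that is missing gets loaded, and anything under `ensureNotExist` that is present gets unloaded. A minimal Python sketch of that reconciliation logic (illustrative only; `reconcile` and its data shapes are not the syncer's real API):

```python
def reconcile(loaded, ensure_exist, ensure_not_exist):
    """Compute which adapter ids to load and which to unload.

    loaded           -- set of adapter ids currently registered with vLLM
    ensure_exist     -- list of model entries (dicts with an "id" key)
    ensure_not_exist -- list of model entries that must not stay loaded
    """
    want = {m["id"] for m in ensure_exist}
    forbid = {m["id"] for m in ensure_not_exist}
    to_load = want - loaded          # desired but not yet loaded
    to_unload = forbid & loaded      # loaded but explicitly forbidden
    return to_load, to_unload

# The state at this point in the guide: v1 is loaded, the ConfigMap now
# asks for v2 and forbids v1.
to_load, to_unload = reconcile(
    loaded={"tweet-summary-1"},
    ensure_exist=[{"id": "tweet-summary-2"}],
    ensure_not_exist=[{"id": "tweet-summary-1"}],
)
print(to_load, to_unload)  # {'tweet-summary-2'} {'tweet-summary-1'}
```

Because the reconciliation is driven by the shared ConfigMap, every replica of the deployment converges on the same adapter set without a rollout restart.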
With this, all requests should be served by the new adapter version.
