The goal of this guide is to demonstrate how to roll out a new adapter version.
## **Requirements**
Follow the steps in the [main guide](index.md).
## **Safely roll out the v2 adapter**
### Load the new adapter version to the model servers
This guide leverages the LoRA syncer sidecar to dynamically manage adapters within a vLLM deployment, enabling users to add or remove them through a shared ConfigMap.
Modify the LoRA syncer ConfigMap to initiate loading of the new adapter version.
```bash
kubectl edit configmap vllm-llama2-7b-adapters
```
Change the ConfigMap to match the following (note the new entry under models):
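
The exact contents depend on the LoRA syncer image in use; the snippet below is a minimal sketch, assuming the sidecar reads a `models` list of adapter IDs and sources from this ConfigMap. The data key, list schema, and adapter sources are assumptions for illustration; only the ConfigMap name comes from the command above, and the adapter IDs match the ones used later in this guide.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama2-7b-adapters
data:
  # Key name and schema are assumptions about the LoRA syncer sidecar;
  # adapter sources are placeholders.
  adapters.yaml: |
    models:
    - id: tweet-summary-1
      source: <hf-repo-or-path-for-the-v1-adapter>
    - id: tweet-summary-2              # new entry: the v2 adapter
      source: <hf-repo-or-path-for-the-v2-adapter>
```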
The new adapter version is applied to the model servers live, without requiring a restart.
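
To confirm the load, one option is to query a model server directly. This is a sketch: the deployment name `vllm-llama2-7b` and port 8000 are assumptions, and it relies on vLLM's OpenAI-compatible `/v1/models` endpoint listing the served adapters.

```bash
# Port-forward a model server pod (deployment name and port are assumptions).
kubectl port-forward deploy/vllm-llama2-7b 8000:8000 &

# The new adapter should appear in the served model list once the syncer applies it.
curl -s localhost:8000/v1/models
```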
### Direct traffic to the new adapter version
Modify the InferenceModel to configure a canary rollout with traffic splitting. In this example, 10% of traffic for the tweet-summary model will be sent to the new ***tweet-summary-2*** adapter.
```bash
kubectl edit inferencemodel tweet-summary
```
Change the InferenceModel to match the following:
```yaml
model:
  name: tweet-summary
  targetModels:
  - targetModelName: tweet-summary-1
    weight: 90
  - targetModelName: tweet-summary-2
    weight: 10
```
The above configuration means that, on average, one in every ten requests is sent to the new adapter version. Try it out:
1. Get the gateway IP:
```bash
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=8081
```
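
2. Send a test request through the gateway. This is a sketch, assuming the gateway fronts the model servers' OpenAI-compatible completions API; the prompt and `max_tokens` value are arbitrary.

```bash
# Request the tweet-summary model; the gateway splits traffic across the
# configured target adapters.
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "tweet-summary",
  "prompt": "Summarize canary rollouts in one tweet.",
  "max_tokens": 100
}'
```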
From `site-src/guides/index.md`:
This quickstart guide is intended for engineers familiar with k8s and model servers (vLLM in this instance). The goal of this guide is to get a first, single InferencePool up and running!
## **Requirements**
- Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher
- A cluster with:
    - Support for Services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running; see the check sketched after this list.) For example, with Kind, you can follow [these steps](https://kind.sigs.k8s.io/docs/user/loadbalancer).
    - 3 GPUs to run the sample model server. Adjust the number of replicas in `./manifests/vllm/deployment.yaml` as needed.
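
A quick way to check the Envoy Gateway and `LoadBalancer` requirements (a sketch, assuming Envoy Gateway was installed into its default `envoy-gateway-system` namespace):

```bash
# The Envoy Gateway control plane pods should be Running.
kubectl get pods -n envoy-gateway-system

# If LoadBalancer Services are supported, the gateway's Service
# eventually receives an external address.
kubectl get svc -n envoy-gateway-system
```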
## **Steps**
### Deploy Sample Model Server
Create a Hugging Face secret to download the model [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf). Ensure that the token grants access to this model.
Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.
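
A sketch of these two steps: the secret name `hf-token`, its key, and the `HF_TOKEN` environment variable are assumptions, while the manifest path is the one referenced in the requirements above.

```bash
# Create the Hugging Face token secret (name and key are assumptions; the
# vLLM deployment must reference the same secret).
kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN

# Deploy the sample vLLM model server.
kubectl apply -f ./manifests/vllm/deployment.yaml
```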
### Update Envoy Gateway Config to enable Patch Policy
Our custom LLM Gateway ext-proc is patched into the existing Envoy Gateway via `EnvoyPatchPolicy`. To enable this feature, we must extend the Envoy Gateway ConfigMap. To do this, simply run:
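
The commands below are a sketch, assuming a default Envoy Gateway install (ConfigMap `envoy-gateway-config` in namespace `envoy-gateway-system`); the relevant setting is `extensionApis.enableEnvoyPatchPolicy`.

```bash
# Open the Envoy Gateway configuration and, in the EnvoyGateway config, add:
#   extensionApis:
#     enableEnvoyPatchPolicy: true
kubectl edit configmap envoy-gateway-config -n envoy-gateway-system

# Restart the controller so it picks up the change.
kubectl rollout restart deployment envoy-gateway -n envoy-gateway-system
```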