The goal of this guide is to demonstrate how to roll out a new adapter version.
Before starting, follow the steps in the main guide.
This guide leverages the LoRA syncer sidecar to dynamically manage adapters within a vLLM deployment, enabling users to add or remove them through a shared ConfigMap.
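For orientation, the wiring looks roughly like the pod-spec fragment below: the shared ConfigMap is mounted into a syncer sidecar that runs alongside the vLLM container. This is an illustrative sketch only; the container images, sidecar name, and mount path are placeholders, not the exact values from the main guide's manifests.

```yaml
# Illustrative pod-spec fragment; images, sidecar name, and mount path are
# placeholders rather than the exact values used by the main guide's deployment.
containers:
- name: vllm                       # main model server
  image: vllm/vllm-openai:latest   # placeholder image
  ports:
  - containerPort: 8000
- name: lora-adapter-syncer        # sidecar that reconciles adapters from the shared config
  image: <lora-syncer-image>       # placeholder; use the image from the main guide
  volumeMounts:
  - name: config-volume
    mountPath: /config             # placeholder mount path
volumes:
- name: config-volume
  configMap:
    name: vllm-llama2-7b-adapters  # the ConfigMap edited in the steps below
```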
Modify the LoRA syncer ConfigMap to initiate loading of the new adapter version.
```bash
kubectl edit configmap vllm-llama2-7b-adapters
```
Change the ConfigMap to match the following (note the new entry under models):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama2-7b-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama2-7b-adapters
      port: 8000
      ensureExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-1
          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-2
          source: mahimairaja/tweet-summarization-llama-2-finetuned
```
The new adapter version is applied to the model servers live, without requiring a restart.
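To confirm the new adapter was picked up, you can query the vLLM OpenAI-compatible `/v1/models` endpoint on one of the model server pods. This is a quick sketch; the pod label selector is an assumption, so adjust it to match your deployment, and it assumes `jq` is installed.

```bash
# Port-forward to one of the vLLM pods (label selector is an assumption;
# adjust it to match your deployment).
kubectl port-forward "$(kubectl get pod -l app=vllm-llama2-7b -o name | head -n 1)" 8000:8000 &

# Once the port-forward is established, list the served models;
# both tweet-summary-1 and tweet-summary-2 should appear.
curl -s localhost:8000/v1/models | jq -r '.data[].id'
```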
Modify the InferenceModel to configure a canary rollout with traffic splitting. In this example, 10% of traffic for the tweet-summary model will be sent to the new tweet-summary-2 adapter.
```bash
kubectl edit inferencemodel inferencemodel-sample
```
Change the targetModels list in the InferenceModel to match the following:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: tweet-summary
  criticality: Critical
  poolRef:
    name: vllm-llama2-7b-pool
  targetModels:
  - name: tweet-summary-1
    weight: 90
  - name: tweet-summary-2
    weight: 10
```
With this configuration, roughly one in every ten requests is sent to the new version. Try it out:
- Get the gateway IP:

  ```bash
  IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=8081
  ```
- Send a few requests as follows:

  ```bash
  curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
    "model": "tweet-summary",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0
  }'
  ```
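To get a rough sense of the 90/10 split, you can send a batch of requests and tally which adapter served each one. This is a sketch that assumes the response's `model` field reflects the target adapter selected by the gateway and that `jq` is installed.

```bash
# Send 20 requests and count how many were handled by each adapter.
# Assumes the response's "model" field reflects the selected target adapter.
for i in $(seq 1 20); do
  curl -s ${IP}:${PORT}/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "tweet-summary", "prompt": "Summarize: San Francisco", "max_tokens": 10, "temperature": 0}' \
    | jq -r '.model'
done | sort | uniq -c
```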
Modify the InferenceModel to direct 100% of the traffic to the latest version of the adapter.
```yaml
spec:
  modelName: tweet-summary
  targetModels:
  - name: tweet-summary-2
    weight: 100
```
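As an alternative to `kubectl edit`, the same change can be applied non-interactively. A merge patch replaces the whole targetModels list, so the sketch below directs all traffic to tweet-summary-2; it assumes the InferenceModel is named inferencemodel-sample, as in the manifest above.

```bash
# Non-interactive alternative to `kubectl edit`: a merge patch replaces the
# entire targetModels list, sending 100% of traffic to tweet-summary-2.
kubectl patch inferencemodel inferencemodel-sample --type=merge \
  -p '{"spec":{"targetModels":[{"name":"tweet-summary-2","weight":100}]}}'
```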
Unload the older version from the servers by updating the LoRA syncer ConfigMap to list it under the ensureNotExist list:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama2-7b-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama2-7b-adapters
      port: 8000
      ensureExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-2
          source: mahimairaja/tweet-summarization-llama-2-finetuned
      ensureNotExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: tweet-summary-1
          source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
```
With this, all requests should be served by the new adapter version.
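You can verify the unload the same way as before. The sketch below reuses the port-forward and label-selector assumptions from the earlier check, and again assumes the response's `model` field reflects the target adapter.

```bash
# tweet-summary-1 should no longer appear in the served model list
# (same port-forward and label-selector assumptions as the earlier check).
curl -s localhost:8000/v1/models | jq -r '.data[].id'

# A request through the gateway should now always report tweet-summary-2
# (assuming the response's "model" field reflects the target adapter).
curl -s ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' \
  -d '{"model": "tweet-summary", "prompt": "Summarize: San Francisco", "max_tokens": 10, "temperature": 0}' \
  | jq -r '.model'
```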