Getting started with Gateway API Inference Extension with dynamic LoRA updates on vLLM

The goal of this guide is to get a single InferencePool running with vLLM and to demonstrate dynamic LoRA adapter updates.

Requirements

- Envoy Gateway v1.2.1 or higher
- A cluster with:
  - Support for Services of type LoadBalancer. (This can be validated by ensuring your Envoy Gateway is up and running.) For example, with Kind, you can follow these steps.
  - 3 GPUs to run the sample model server. Adjust the number of replicas in ./manifests/vllm/deployment.yaml as needed.

Steps

**Deploy the sample vLLM model server with dynamic LoRA updates enabled and the dynamic LoRA syncer sidecar.** Redeploy the vLLM deployment with the dynamic LoRA adapter enabled, along with the LoRA syncer sidecar and its ConfigMap.

The rest of the steps are the same as in the general setup.

Safely roll out the v2 adapter

1. Update the LoRA syncer ConfigMap to make the new adapter version (here, tweet-summary-2) available on the model servers.

        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: dynamic-lora-config
        data:
          configmap.yaml: |
            vLLMLoRAConfig:
              name: sql-loras-llama
              port: 8000
              ensureExist:
                models:
                - base-model: meta-llama/Llama-2-7b-hf
                  id: tweet-summary-0
                  source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
                - base-model: meta-llama/Llama-2-7b-hf
                  id: tweet-summary-1
                  source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
                - base-model: meta-llama/Llama-2-7b-hf
                  id: tweet-summary-2
                  source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
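Before shifting traffic, it can help to confirm that every adapter listed under ensureExist is actually being served. The following is a minimal sketch, not part of the syncer itself. It assumes the vLLM OpenAI-compatible server lists loaded LoRA adapters in its `/v1/models` response, and that port 8000 of a model server pod is reachable at `http://localhost:8000` (for example through a port-forward). The URL and the helper names are hypothetical. Running the script prints the adapters that are not yet loaded.

```python
import json
from urllib.request import urlopen


def served_model_ids(models_response):
    """Collect the model ids from an OpenAI-style /v1/models response body."""
    return {entry["id"] for entry in models_response.get("data", [])}


def missing_adapters(models_response, expected_ids):
    """Return the expected adapter ids that the server does not list, sorted."""
    return sorted(set(expected_ids) - served_model_ids(models_response))


def fetch_models(base_url="http://localhost:8000", timeout=5.0):
    """Fetch /v1/models; base_url is an assumption (e.g. a local port-forward)."""
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return json.load(resp)


# Adapter ids taken from the ensureExist list in the ConfigMap above.
EXPECTED = ["tweet-summary-0", "tweet-summary-1", "tweet-summary-2"]

if __name__ == "__main__":
    print("not yet loaded:", missing_adapters(fetch_models(), EXPECTED))
```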
2. Configure a canary rollout with a traffic split using LLMService. In this example, 40% of the traffic for the tweet-summary model is sent to the new ***tweet-summary-2*** adapter, with the remaining 60% split between tweet-summary-0 (20%) and tweet-summary-1 (40%).

```yaml
model:
    name: tweet-summary
    targetModels:
    - targetModelName: tweet-summary-0
      weight: 20
    - targetModelName: tweet-summary-1
      weight: 40
    - targetModelName: tweet-summary-2
      weight: 40
```

3. Finish the rollout by sending 100% of the traffic to the new version.

```yaml
model:
    name: tweet-summary
    targetModels:
    - targetModelName: tweet-summary-2
      weight: 100
```
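The percentages in the rollout steps follow from the weights if each weight is treated as relative to the sum of all weights in targetModels. The sketch below works through that arithmetic and simulates a weighted pick. It assumes this relative-weight interpretation and is not the extension's actual routing code; the function names are hypothetical.

```python
import random


def traffic_shares(target_models):
    """Map each target model name to weight / sum(weights)."""
    total = sum(m["weight"] for m in target_models)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {m["targetModelName"]: m["weight"] / total for m in target_models}


def pick_target(target_models, rng=random):
    """Pick one target model at random, proportionally to its weight."""
    names = [m["targetModelName"] for m in target_models]
    weights = [m["weight"] for m in target_models]
    return rng.choices(names, weights=weights, k=1)[0]


# The canary split from step 2.
canary = [
    {"targetModelName": "tweet-summary-0", "weight": 20},
    {"targetModelName": "tweet-summary-1", "weight": 40},
    {"targetModelName": "tweet-summary-2", "weight": 40},
]

# The final state from step 3: a single target receives everything.
final = [{"targetModelName": "tweet-summary-2", "weight": 100}]

canary_shares = traffic_shares(canary)
final_shares = traffic_shares(final)
```

Under this interpretation the weights do not have to sum to 100; weights of 1, 2, and 2 would produce the same split as 20, 40, and 40.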

4. Remove the old adapter versions (tweet-summary-0 and tweet-summary-1) from the dynamic LoRA ConfigMap by moving them from ensureExist to ensureNotExist.

        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: dynamic-lora-config
        data:
          configmap.yaml: |
            vLLMLoRAConfig:
              name: sql-loras-llama
              port: 8000
              ensureExist:
                models:
                - base-model: meta-llama/Llama-2-7b-hf
                  id: tweet-summary-2
                  source: vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
              ensureNotExist:
                models:
                - base-model: meta-llama/Llama-2-7b-hf
                  id: tweet-summary-1
                  source: gs://[HUGGING FACE PATH]
                - base-model: meta-llama/Llama-2-7b-hf
                  id: tweet-summary-0
                  source: gs://[HUGGING FACE PATH]
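One way to read the ensureExist and ensureNotExist lists is as a desired state that gets reconciled against the adapters currently loaded on a model server. The sketch below illustrates that idea only. It is an assumption about how such a reconciliation could work, not the syncer sidecar's actual implementation, and `plan_sync` is a hypothetical helper. It also rejects a config that lists the same id in both sections, which is an easy mistake to make when moving entries between them.

```python
def plan_sync(ensure_exist, ensure_not_exist, loaded_ids):
    """Return (to_load, to_unload) adapter ids for one model server.

    ensure_exist / ensure_not_exist are lists of dicts with an "id" key,
    mirroring the models lists in the ConfigMap; loaded_ids is the set of
    adapter ids the server currently reports.
    """
    wanted = [m["id"] for m in ensure_exist]
    unwanted = [m["id"] for m in ensure_not_exist]

    conflict = sorted(set(wanted) & set(unwanted))
    if conflict:
        raise ValueError(f"ids listed in both sections: {conflict}")

    loaded = set(loaded_ids)
    to_load = [i for i in wanted if i not in loaded]      # missing, must be added
    to_unload = [i for i in unwanted if i in loaded]      # present, must be removed
    return to_load, to_unload


# State after step 3: all three adapters are still loaded on the server.
loaded_now = {"tweet-summary-0", "tweet-summary-1", "tweet-summary-2"}

# The id fields from the ConfigMap in step 4.
exist = [{"id": "tweet-summary-2"}]
not_exist = [{"id": "tweet-summary-1"}, {"id": "tweet-summary-0"}]

to_load, to_unload = plan_sync(exist, not_exist, loaded_now)
```

With the step 4 config, nothing new needs loading and the two old adapters are scheduled for removal. Running the same plan a second time against the updated server state yields two empty lists, so the operation is safe to repeat.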
