Dynamic lora load/unload sidecar #31

Merged: 34 commits, merged on Nov 18, 2024

Commits
14e4b10
Dynamic lora load/unload sidecar
coolkp Oct 23, 2024
bcfee4a
Formatting
coolkp Oct 23, 2024
cb45fe2
Resolve README comments
coolkp Oct 30, 2024
62da988
Address comments on sidecar, store updates in memory, rename base field
coolkp Oct 30, 2024
56cffc2
Address comments in example deployment
coolkp Oct 30, 2024
5cbaeef
Address comments in example deployment
coolkp Oct 30, 2024
5a03f98
base model is optional
coolkp Oct 30, 2024
1af2df4
Check health of server before querying
coolkp Nov 5, 2024
5b51182
Check health of server before querying
coolkp Nov 5, 2024
cc1e686
Docstrings
coolkp Nov 5, 2024
926a71c
Mock health check in tests
coolkp Nov 5, 2024
cb3c9b2
Refactor configmap, switch to watchfiles to detect symbolic link targ…
coolkp Nov 7, 2024
3140610
Refactor configmap, switch to watchfiles to detect symbolic link targ…
coolkp Nov 7, 2024
65cea88
Modify unittests
coolkp Nov 8, 2024
8012ea3
Change example host and port to be explicit
coolkp Nov 8, 2024
ba00b85
Change example sidecar name
coolkp Nov 8, 2024
c8d9c10
Add warning about using subPath
coolkp Nov 8, 2024
828348d
Add screenshots
coolkp Nov 8, 2024
ec40820
Add screenshots
coolkp Nov 8, 2024
1aba325
Add testing results
coolkp Nov 9, 2024
b30051a
Add testing results
coolkp Nov 9, 2024
c5d2527
Add config validation
coolkp Nov 11, 2024
d0d01e1
Add config documentation
coolkp Nov 11, 2024
b4867b6
Add config documentation
coolkp Nov 11, 2024
e60b434
Add config validation
coolkp Nov 11, 2024
bea4068
Add config validation
coolkp Nov 11, 2024
100f636
Make reconciling non blocking
coolkp Nov 11, 2024
c24ff35
Move under tools
coolkp Nov 12, 2024
472b545
Move under tools
coolkp Nov 12, 2024
5354a47
Document usage of sidecar, available by default from 1.29
coolkp Nov 13, 2024
bc2ce32
Document usage of sidecar, available by default from 1.29
coolkp Nov 13, 2024
f82d8b2
Document usage of sidecar, available by default from 1.29
coolkp Nov 13, 2024
e01ec51
Update tools/dynamic-lora-sidecar/README.md
coolkp Nov 16, 2024
28779e7
Update tools/dynamic-lora-sidecar/README.md
coolkp Nov 16, 2024
1 change: 1 addition & 0 deletions tools/dynamic-lora-sidecar/.gitignore
@@ -0,0 +1 @@
sidecar/__pycache__/
23 changes: 23 additions & 0 deletions tools/dynamic-lora-sidecar/Dockerfile
@@ -0,0 +1,23 @@
# Test stage: install dependencies and run the unit tests at image build time
FROM python:3.9-slim-buster AS test

WORKDIR /dynamic-lora-reconciler-test
COPY requirements.txt .
COPY sidecar/* .
RUN pip install -r requirements.txt
RUN python -m unittest discover || exit 1

# Runtime stage
FROM python:3.10-slim-buster

WORKDIR /dynamic-lora-reconciler

RUN python3 -m venv /opt/venv

ENV PATH="/opt/venv/bin:$PATH"

RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY sidecar/* .

CMD ["python", "sidecar.py"]
71 changes: 71 additions & 0 deletions tools/dynamic-lora-sidecar/README.md
@@ -0,0 +1,71 @@
# Dynamic LoRA Adapter Sidecar for vLLM

This is a sidecar-based tool that helps roll out new LoRA adapters to a set of running vLLM model servers. The user deploys the sidecar alongside a vLLM server and, through a ConfigMap, declares which LoRA adapters the running vLLM servers should be configured with. The sidecar watches the ConfigMap and sends load/unload requests to the vLLM container to actuate that intent.

## Overview

The sidecar continuously monitors a ConfigMap mounted as a YAML configuration file. This file defines the desired state of LoRA adapters. Each adapter entry includes:

- **id:** Unique identifier for the adapter.
- **source:** Path (remote or local) to the adapter's source files.
- **base-model:** (Optional) The base model to which the adapter applies.

Adapters listed under `ensureExist` are loaded onto the server; adapters listed under `ensureNotExist` are unloaded from it.

The sidecar uses the vLLM server's API to load or unload adapters based on this configuration. It also periodically reconciles the adapters registered on the vLLM server with the desired state defined in the ConfigMap, ensuring consistency.
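
For reference, the requests the sidecar issues look roughly like the following sketch; the endpoint paths and payloads should be verified against your vLLM version, and the host, port, and adapter values below are taken from the examples in this repo:

```bash
# Load an adapter (the server must run with VLLM_ALLOW_RUNTIME_LORA_UPDATING=true).
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "sql-lora-v1", "lora_path": "yard1/llama-2-7b-sql-lora-test"}'

# Unload the same adapter.
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "sql-lora-v1"}'
```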

## Features

- **Dynamic Loading and Unloading:** Load and unload LoRA adapters without restarting the vLLM server.
- **Continuous Reconciliation:** Ensures the vLLM server's state matches the desired configuration.
- **ConfigMap Integration:** Leverages Kubernetes ConfigMaps for easy configuration management.
- **Easy Deployment:** Provides a sample deployment YAML for quick setup.

## Repository Contents

- **`sidecar.py`:** Python script for the sidecar container.
- **`Dockerfile`:** Dockerfile to build the sidecar image.
- **`configmap.yaml`:** Example ConfigMap YAML file.
- **`deployment.yaml`:** Example Kubernetes deployment YAML.

## Usage

1. **Build the Docker image:**
   ```bash
   docker build -t <your-image-name> .
   ```
2. **Create a ConfigMap:**
   ```bash
   kubectl create configmap name-of-your-configmap --from-file=your-file.yaml
   ```
3. **Mount the ConfigMap and configure the sidecar in your pod:**
   ```yaml
   volumeMounts: # DO NOT USE subPath
   - name: config-volume
     mountPath: /config
   ```
   Do not use `subPath`: ConfigMap updates are not reflected in files mounted via `subPath`.

The example [deployment](deployment.yaml) uses a [sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/), i.e. an `initContainer` with `restartPolicy` set to `Always`. This is a beta feature enabled by default since Kubernetes 1.29; on 1.28 it must be enabled explicitly, and prior to 1.28 sidecar containers are not officially supported.
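
For reference, the sidecar portion of the example deployment looks like this (excerpted from [deployment.yaml](deployment.yaml); `<SIDECAR_IMAGE>` is a placeholder for the image you built above):

```yaml
  initContainers:
  - name: lora-adapter-syncer
    image: <SIDECAR_IMAGE>
    restartPolicy: Always   # restartPolicy: Always turns this initContainer into a sidecar
    env:
    - name: DYNAMIC_LORA_ROLLOUT_CONFIG
      value: "/config/configmap.yaml"
    volumeMounts:           # DO NOT USE subPath
    - name: config-volume
      mountPath: /config
  volumes:
  - name: config-volume
    configMap:
      name: dynamic-lora-config
```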

## Configuration Fields
- `vLLMLoRAConfig` [**required**] Base key.
  - `host` [*optional*] Model server's host. Defaults to `localhost`.
  - `port` [*optional*] Model server's port. Defaults to `8000`.
  - `name` [*optional*] Name of this config.
  - `ensureExist` [*optional*] List of models that must exist on the specified model server.
    - `models` [**required**] (list)
      - `id` [**required**] Unique id of the LoRA adapter.
      - `source` [**required**] Path (remote or local) to the LoRA adapter.
      - `base-model` [*optional*] Base model for the LoRA adapter.
  - `ensureNotExist` [*optional*] List of models that must not exist on the specified model server.
    - `models` [**required**] (list)
      - `id` [**required**] Unique id of the LoRA adapter.
      - `source` [**required**] Path (remote or local) to the LoRA adapter.
      - `base-model` [*optional*] Base model for the LoRA adapter.
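
A minimal configuration using these fields might look like the following; the host and port shown are the documented defaults, and the adapter entries mirror the example ConfigMap in [deployment.yaml](deployment.yaml):

```yaml
vLLMLoRAConfig:
  name: sql-loras-llama
  host: localhost
  port: 8000
  ensureExist:
    models:
    - id: sql-lora-v1
      source: yard1/llama-2-7b-sql-lora-test
      base-model: meta-llama/Llama-2-7b-hf
  ensureNotExist:
    models:
    - id: sql-lora-v2
      source: yard1/llama-2-7b-sql-lora-test
      base-model: meta-llama/Llama-2-7b-hf
```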




## Screenshots & Testing
The sidecar was tested with the Deployment and ConfigMap specified in this repo. Here are screen grabs of the logs from the sidecar and the vLLM server. One can verify that the adapters were loaded by querying `/v1/models` and looking at the vLLM logs.
![lora-adapter-syncer](screenshots/lora-syncer-sidecar.png)
![config map change](screenshots/configmap-change.png)
![vllm-logs](screenshots/vllm-logs.png)
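
To check which adapters are currently registered, query the model server's OpenAI-compatible models endpoint; the host and port below assume the example deployment's defaults:

```bash
curl http://localhost:8000/v1/models | python -m json.tool
```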
127 changes: 127 additions & 0 deletions tools/dynamic-lora-sidecar/deployment.yaml
@@ -0,0 +1,127 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama-deployment
spec:
replicas: 1
selector:
matchLabels:
app: llama-server
template:
metadata:
labels:
app: llama-server
ai.gke.io/model: LLaMA2_7B
ai.gke.io/inference-server: vllm
examples.ai.gke.io/source: model-garden
spec:
shareProcessNamespace: true
containers:
- name: inference-server
image: vllm/vllm-openai:v0.6.3.post1
resources:
requests:
cpu: 5
memory: 20Gi
ephemeral-storage: 40Gi
nvidia.com/gpu : 1
limits:
cpu: 5
memory: 20Gi
ephemeral-storage: 40Gi
nvidia.com/gpu : 1
command: ["/bin/sh", "-c"]
args:
- vllm serve meta-llama/Llama-2-7b-hf
- --host=0.0.0.0
- --port=8000
- --tensor-parallel-size=1
- --swap-space=16
- --gpu-memory-utilization=0.95
- --max-model-len=2048
- --max-num-batched-tokens=4096
- --disable-log-stats
        - --enable-lora
- --max-loras=5
env:
- name: DEPLOY_SOURCE
value: UI_NATIVE_MODEL
- name: MODEL_ID
value: "Llama2-7B"
- name: AIP_STORAGE_URI
value: "gs://vertex-model-garden-public-us/llama2/llama2-7b-hf"
- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
value: "true"
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token # The name of your Kubernetes Secret
key: token # The specific key within the Secret
- name: DYNAMIC_LORA_ROLLOUT_CONFIG
value: "/config/configmap.yaml"
volumeMounts:
- mountPath: /dev/shm
name: dshm
initContainers:
- name: lora-adapter-syncer
tty: true
stdin: true
image: <SIDECAR_IMAGE>
restartPolicy: Always
imagePullPolicy: Always
env:
- name: DYNAMIC_LORA_ROLLOUT_CONFIG
value: "/config/configmap.yaml"
volumeMounts: # DO NOT USE subPath
- name: config-volume
mountPath: /config
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: config-volume
configMap:
name: dynamic-lora-config

---
apiVersion: v1
kind: Service
metadata:
name: llama-service
spec:
selector:
app: llama-server
type: ClusterIP
ports:
- protocol: TCP
port: 8000
targetPort: 8000

---

apiVersion: v1
kind: ConfigMap
metadata:
name: dynamic-lora-config
data:
configmap.yaml: |
vLLMLoRAConfig:
host: modelServerHost
name: sql-loras-llama
port: modelServerPort
ensureExist:
models:
- base-model: meta-llama/Llama-2-7b-hf
id: sql-lora-v1
source: yard1/llama-2-7b-sql-lora-test
- base-model: meta-llama/Llama-2-7b-hf
id: sql-lora-v3
source: yard1/llama-2-7b-sql-lora-test
- base-model: meta-llama/Llama-2-7b-hf
id: sql-lora-v4
source: yard1/llama-2-7b-sql-lora-test
ensureNotExist:
models:
- base-model: meta-llama/Llama-2-7b-hf
id: sql-lora-v2
source: yard1/llama-2-7b-sql-lora-test
6 changes: 6 additions & 0 deletions tools/dynamic-lora-sidecar/requirements.txt
@@ -0,0 +1,6 @@
aiohttp
jsonschema
pyyaml
requests
watchfiles
watchdog
Binary screenshot files added under `tools/dynamic-lora-sidecar/screenshots/` (referenced in the README) could not be rendered in the diff view.