Dynamic lora load/unload sidecar #31

Merged: 34 commits, merged on Nov 18, 2024

Commits
14e4b10
Dynamic lora load/unload sidecar
coolkp Oct 23, 2024
bcfee4a
Formatting
coolkp Oct 23, 2024
cb45fe2
Resolve README comments
coolkp Oct 30, 2024
62da988
Address comments on sidecar, store updates in memory, rename base field
coolkp Oct 30, 2024
56cffc2
Address comments in example deployment
coolkp Oct 30, 2024
5cbaeef
Address comments in example deployment
coolkp Oct 30, 2024
5a03f98
base model is optional
coolkp Oct 30, 2024
1af2df4
Check health of server before querying
coolkp Nov 5, 2024
5b51182
Check health of server before querying
coolkp Nov 5, 2024
cc1e686
Docstrings
coolkp Nov 5, 2024
926a71c
Mock health check in tests
coolkp Nov 5, 2024
cb3c9b2
Refactor configmap, switch to watchfiles to detect symbolic link targ…
coolkp Nov 7, 2024
3140610
Refactor configmap, switch to watchfiles to detect symbolic link targ…
coolkp Nov 7, 2024
65cea88
Modify unittests
coolkp Nov 8, 2024
8012ea3
Change example host and port to be explicit
coolkp Nov 8, 2024
ba00b85
Change example sidecar name
coolkp Nov 8, 2024
c8d9c10
Add warning about using subPath
coolkp Nov 8, 2024
828348d
Add screenshots
coolkp Nov 8, 2024
ec40820
Add screenshots
coolkp Nov 8, 2024
1aba325
Add testing results
coolkp Nov 9, 2024
b30051a
Add testing results
coolkp Nov 9, 2024
c5d2527
Add config validation
coolkp Nov 11, 2024
d0d01e1
Add config documentation
coolkp Nov 11, 2024
b4867b6
Add config documentation
coolkp Nov 11, 2024
e60b434
Add config validation
coolkp Nov 11, 2024
bea4068
Add config validation
coolkp Nov 11, 2024
100f636
Make reconciling non blocking
coolkp Nov 11, 2024
c24ff35
Move under tools
coolkp Nov 12, 2024
472b545
Move under tools
coolkp Nov 12, 2024
5354a47
Document usage of sidecar, available by default from 1.29
coolkp Nov 13, 2024
bc2ce32
Document usage of sidecar, available by default from 1.29
coolkp Nov 13, 2024
f82d8b2
Document usage of sidecar, available by default from 1.29
coolkp Nov 13, 2024
e01ec51
Update tools/dynamic-lora-sidecar/README.md
coolkp Nov 16, 2024
28779e7
Update tools/dynamic-lora-sidecar/README.md
coolkp Nov 16, 2024
1 change: 1 addition & 0 deletions tools/dynamic-lora-sidecar/.gitignore
@@ -0,0 +1 @@
sidecar/__pycache__/
23 changes: 23 additions & 0 deletions tools/dynamic-lora-sidecar/Dockerfile
@@ -0,0 +1,23 @@
# Test stage: install dependencies and run the unit tests at image build time
FROM python:3.9-slim-buster AS test

WORKDIR /dynamic-lora-reconciler-test
COPY requirements.txt .
COPY sidecar/* .
RUN pip install -r requirements.txt
RUN python -m unittest discover || exit 1

# Runtime stage
FROM python:3.10-slim-buster

WORKDIR /dynamic-lora-reconciler

RUN python3 -m venv /opt/venv

ENV PATH="/opt/venv/bin:$PATH"

RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY sidecar/* .

CMD ["python", "sidecar.py"]
71 changes: 71 additions & 0 deletions tools/dynamic-lora-sidecar/README.md
@@ -0,0 +1,71 @@
# Dynamic LoRA Adapter Sidecar for vLLM

This is a sidecar-based tool that helps roll out new LoRA adapters to a set of running vLLM model servers. The user deploys the sidecar alongside a vLLM server and, through a ConfigMap, declares which LoRA adapters the running vLLM servers should be configured with. The sidecar watches the ConfigMap and sends load/unload requests to the vLLM container to actuate that intent.

## Overview

The sidecar continuously monitors a ConfigMap mounted as a YAML configuration file. This file defines the desired state of LoRA adapters. Each adapter entry includes:

- **id:** Unique identifier for the adapter.
- **source:** Path (remote or local) to the adapter's source files.
- **base-model:** (Optional) The base model to which the adapter applies.

Adapters listed under `ensureExist` are loaded onto the server; adapters listed under `ensureNotExist` are unloaded from it.

The sidecar uses the vLLM server's API to load or unload adapters based on this configuration. It also periodically reconciles the adapters registered on the vLLM server with the desired state defined in the ConfigMap, ensuring consistency.
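
For reference, the requests the sidecar issues look roughly like the following sketch; the endpoint paths and payloads should be verified against your vLLM version, and the host, port, and adapter values below are taken from the examples in this repo:

```bash
# Load an adapter (the server must run with VLLM_ALLOW_RUNTIME_LORA_UPDATING=true).
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "sql-lora-v1", "lora_path": "yard1/llama-2-7b-sql-lora-test"}'

# Unload the same adapter.
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "sql-lora-v1"}'
```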

## Features

- **Dynamic Loading and Unloading:** Load and unload LoRA adapters without restarting the vLLM server.
- **Continuous Reconciliation:** Ensures the vLLM server's state matches the desired configuration.
- **ConfigMap Integration:** Leverages Kubernetes ConfigMaps for easy configuration management.
- **Easy Deployment:** Provides a sample deployment YAML for quick setup.

## Repository Contents

- **`sidecar.py`:** Python script for the sidecar container.
- **`Dockerfile`:** Dockerfile to build the sidecar image.
- **`configmap.yaml`:** Example ConfigMap YAML file.
- **`deployment.yaml`:** Example Kubernetes deployment YAML.

## Usage

1. **Build the Docker image:**
   ```bash
   docker build -t <your-image-name> .
   ```
2. **Create a ConfigMap:**
   ```bash
   kubectl create configmap name-of-your-configmap --from-file=your-file.yaml
   ```
3. **Mount the ConfigMap and configure the sidecar in your pod:**
   ```yaml
   volumeMounts: # DO NOT USE subPath
   - name: config-volume
     mountPath: /config
   ```
   Do not use `subPath`: ConfigMap updates are not reflected in files mounted via `subPath`.

The example [deployment](deployment.yaml) uses a [sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/), i.e. an `initContainer` with `restartPolicy` set to `Always`. This is a beta feature enabled by default since Kubernetes 1.29; on 1.28 it must be enabled explicitly, and prior to 1.28 sidecar containers are not officially supported.
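
For reference, the sidecar portion of the example deployment looks like this (excerpted from [deployment.yaml](deployment.yaml); `<SIDECAR_IMAGE>` is a placeholder for the image you built above):

```yaml
  initContainers:
  - name: lora-adapter-syncer
    image: <SIDECAR_IMAGE>
    restartPolicy: Always   # restartPolicy: Always turns this initContainer into a sidecar
    env:
    - name: DYNAMIC_LORA_ROLLOUT_CONFIG
      value: "/config/configmap.yaml"
    volumeMounts:           # DO NOT USE subPath
    - name: config-volume
      mountPath: /config
  volumes:
  - name: config-volume
    configMap:
      name: dynamic-lora-config
```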

## Configuration Fields
- `vLLMLoRAConfig` [**required**] Base key.
  - `host` [*optional*] Model server's host. Defaults to `localhost`.
  - `port` [*optional*] Model server's port. Defaults to `8000`.
  - `name` [*optional*] Name of this config.
  - `ensureExist` [*optional*] List of models that must exist on the specified model server.
    - `models` [**required**] (list)
      - `id` [**required**] Unique id of the LoRA adapter.
      - `source` [**required**] Path (remote or local) to the LoRA adapter.
      - `base-model` [*optional*] Base model for the LoRA adapter.
  - `ensureNotExist` [*optional*] List of models that must not exist on the specified model server.
    - `models` [**required**] (list)
      - `id` [**required**] Unique id of the LoRA adapter.
      - `source` [**required**] Path (remote or local) to the LoRA adapter.
      - `base-model` [*optional*] Base model for the LoRA adapter.
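
A minimal configuration using these fields might look like the following; the host and port shown are the documented defaults, and the adapter entries mirror the example ConfigMap in [deployment.yaml](deployment.yaml):

```yaml
vLLMLoRAConfig:
  name: sql-loras-llama
  host: localhost
  port: 8000
  ensureExist:
    models:
    - id: sql-lora-v1
      source: yard1/llama-2-7b-sql-lora-test
      base-model: meta-llama/Llama-2-7b-hf
  ensureNotExist:
    models:
    - id: sql-lora-v2
      source: yard1/llama-2-7b-sql-lora-test
      base-model: meta-llama/Llama-2-7b-hf
```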




## Screenshots & Testing
The sidecar was tested with the Deployment and ConfigMap specified in this repo. Here are screen grabs of the logs from the sidecar and the vLLM server. One can verify that the adapters were loaded by querying `/v1/models` and looking at the vLLM logs.
![lora-adapter-syncer](screenshots/lora-syncer-sidecar.png)
![config map change](screenshots/configmap-change.png)
![vllm-logs](screenshots/vllm-logs.png)
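
To check which adapters are currently registered, query the model server's OpenAI-compatible models endpoint; the host and port below assume the example deployment's defaults:

```bash
curl http://localhost:8000/v1/models | python -m json.tool
```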
127 changes: 127 additions & 0 deletions tools/dynamic-lora-sidecar/deployment.yaml
@@ -0,0 +1,127 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama-deployment
spec:
replicas: 1
selector:
matchLabels:
app: llama-server
template:
metadata:
labels:
app: llama-server
ai.gke.io/model: LLaMA2_7B
ai.gke.io/inference-server: vllm
examples.ai.gke.io/source: model-garden
spec:
shareProcessNamespace: true
containers:
- name: inference-server
image: vllm/vllm-openai:v0.6.3.post1
resources:
requests:
cpu: 5
memory: 20Gi
ephemeral-storage: 40Gi
nvidia.com/gpu : 1
limits:
cpu: 5
memory: 20Gi
ephemeral-storage: 40Gi
nvidia.com/gpu : 1
command: ["/bin/sh", "-c"]
args:
- vllm serve meta-llama/Llama-2-7b-hf
- --host=0.0.0.0
- --port=8000
- --tensor-parallel-size=1
- --swap-space=16
- --gpu-memory-utilization=0.95
- --max-model-len=2048
- --max-num-batched-tokens=4096
- --disable-log-stats
        - --enable-lora
- --max-loras=5
env:
- name: DEPLOY_SOURCE
value: UI_NATIVE_MODEL
- name: MODEL_ID
value: "Llama2-7B"
- name: AIP_STORAGE_URI
value: "gs://vertex-model-garden-public-us/llama2/llama2-7b-hf"
- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
value: "true"
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token # The name of your Kubernetes Secret
key: token # The specific key within the Secret
- name: DYNAMIC_LORA_ROLLOUT_CONFIG
value: "/config/configmap.yaml"
volumeMounts:
- mountPath: /dev/shm
name: dshm
initContainers:
- name: lora-adapter-syncer
tty: true
stdin: true
image: <SIDECAR_IMAGE>
restartPolicy: Always
imagePullPolicy: Always
env:
- name: DYNAMIC_LORA_ROLLOUT_CONFIG
value: "/config/configmap.yaml"
volumeMounts: # DO NOT USE subPath
- name: config-volume
mountPath: /config
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: config-volume
configMap:
name: dynamic-lora-config

---
apiVersion: v1
kind: Service
metadata:
name: llama-service
spec:
selector:
app: llama-server
type: ClusterIP
ports:
- protocol: TCP
port: 8000
targetPort: 8000

---

apiVersion: v1
kind: ConfigMap
metadata:
name: dynamic-lora-config
data:
configmap.yaml: |
vLLMLoRAConfig:
host: modelServerHost
name: sql-loras-llama
port: modelServerPort
ensureExist:
models:
- base-model: meta-llama/Llama-2-7b-hf
id: sql-lora-v1
source: yard1/llama-2-7b-sql-lora-test
- base-model: meta-llama/Llama-2-7b-hf
id: sql-lora-v3
source: yard1/llama-2-7b-sql-lora-test
- base-model: meta-llama/Llama-2-7b-hf
id: sql-lora-v4
source: yard1/llama-2-7b-sql-lora-test
ensureNotExist:
models:
- base-model: meta-llama/Llama-2-7b-hf
id: sql-lora-v2
source: yard1/llama-2-7b-sql-lora-test
6 changes: 6 additions & 0 deletions tools/dynamic-lora-sidecar/requirements.txt
@@ -0,0 +1,6 @@
aiohttp
jsonschema
pyyaml
requests
watchfiles
watchdog
Binary screenshot files added under `tools/dynamic-lora-sidecar/screenshots/` (referenced in the README) could not be rendered in the diff view.