Commit 7208cff

added cpu based example (#436)

* added cpu based example to quickstart
* removed quickstart cleanup instructions

Signed-off-by: Nir Rozenbaum <[email protected]>

1 parent 45e9533 commit 7208cff

5 files changed: +134 -12 lines changed

config/manifests/ext_proc.yaml

+3-3
@@ -44,11 +44,11 @@ apiVersion: inference.networking.x-k8s.io/v1alpha2
 kind: InferencePool
 metadata:
   labels:
-  name: vllm-llama2-7b-pool
+  name: my-pool
 spec:
   targetPortNumber: 8000
   selector:
-    app: vllm-llama2-7b-pool
+    app: my-pool
   extensionRef:
     name: inference-gateway-ext-proc
 ---
@@ -75,7 +75,7 @@ spec:
         imagePullPolicy: Always
         args:
         - -poolName
-        - "vllm-llama2-7b-pool"
+        - "my-pool"
         - -v
         - "3"
         - -grpcPort
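The rename only holds together because the InferencePool's `selector` and the model-server Deployment's pod labels move in lockstep; this commit updates both sides of that contract. A quick, illustrative way to sanity-check the pairing after applying the manifests (assuming the Inference Extension CRDs are installed):

```bash
# Pods backing the pool must carry the label the InferencePool selects on.
kubectl get pods -l app=my-pool

# The InferencePool itself should show the matching selector.
kubectl get inferencepool my-pool -o yaml | grep -A2 "selector:"
```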

config/manifests/inferencemodel.yaml

+1-1
@@ -6,7 +6,7 @@ spec:
   modelName: tweet-summary
   criticality: Critical
   poolRef:
-    name: vllm-llama2-7b-pool
+    name: my-pool
   targetModels:
   - name: tweet-summary-1
     weight: 100
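As a rough illustration of how this maps to traffic: the InferenceModel exposes the client-facing `modelName` `tweet-summary` and routes it to the pool's target `tweet-summary-1` (weight 100), so requests through the gateway address `tweet-summary`, not the adapter name. In the sketch below, `GATEWAY_IP` and `GATEWAY_PORT` are placeholders for whatever address the Gateway ends up exposing:

```bash
# Illustrative only: an OpenAI-style completions request routed by modelName.
# GATEWAY_IP / GATEWAY_PORT are placeholders for the gateway's exposed address.
curl -i http://${GATEWAY_IP}:${GATEWAY_PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "tweet-summary", "prompt": "Summarize: inference gateways route by model name.", "max_tokens": 50}'
```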
config/manifests/vllm/cpu-deployment.yaml

+101

@@ -0,0 +1,101 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: my-pool
+spec:
+  replicas: 3
+  selector:
+    matchLabels:
+      app: my-pool
+  template:
+    metadata:
+      labels:
+        app: my-pool
+    spec:
+      containers:
+      - name: lora
+        image: "seedjeffwan/vllm-cpu-env:bb392af4-20250203"
+        imagePullPolicy: Always
+        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
+        args:
+        - "--model"
+        - "Qwen/Qwen2.5-1.5B-Instruct"
+        - "--port"
+        - "8000"
+        - "--enable-lora"
+        - "--lora-modules"
+        - '{"name": "tweet-summary-0", "path": "/adapters/hub/models--ai-blond--Qwen-Qwen2.5-Coder-1.5B-Instruct-lora/snapshots/9cde18d8ed964b0519fb481cca6acd936b2ca811"}'
+        - '{"name": "tweet-summary-1", "path": "/adapters/hub/models--ai-blond--Qwen-Qwen2.5-Coder-1.5B-Instruct-lora/snapshots/9cde18d8ed964b0519fb481cca6acd936b2ca811"}'
+        env:
+        - name: PORT
+          value: "8000"
+        - name: HUGGING_FACE_HUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token
+              key: token
+        - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
+          value: "true"
+        ports:
+        - containerPort: 8000
+          name: http
+          protocol: TCP
+        livenessProbe:
+          failureThreshold: 240
+          httpGet:
+            path: /health
+            port: http
+            scheme: HTTP
+          initialDelaySeconds: 5
+          periodSeconds: 5
+          successThreshold: 1
+          timeoutSeconds: 1
+        readinessProbe:
+          failureThreshold: 600
+          httpGet:
+            path: /health
+            port: http
+            scheme: HTTP
+          initialDelaySeconds: 5
+          periodSeconds: 5
+          successThreshold: 1
+          timeoutSeconds: 1
+        volumeMounts:
+        - mountPath: /data
+          name: data
+        - mountPath: /dev/shm
+          name: shm
+        - name: adapters
+          mountPath: "/adapters"
+      initContainers:
+      - name: adapter-loader
+        image: ghcr.io/tomatillo-and-multiverse/adapter-puller:demo
+        command: ["python"]
+        args:
+        - ./pull_adapters.py
+        - --adapter
+        - ai-blond/Qwen-Qwen2.5-Coder-1.5B-Instruct-lora
+        - --duplicate-count
+        - "4"
+        env:
+        - name: HF_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token
+              key: token
+        - name: HF_HOME
+          value: /adapters
+        volumeMounts:
+        - name: adapters
+          mountPath: "/adapters"
+      restartPolicy: Always
+      schedulerName: default-scheduler
+      terminationGracePeriodSeconds: 30
+      volumes:
+      - name: data
+        emptyDir: {}
+      - name: shm
+        emptyDir:
+          medium: Memory
+      - name: adapters
+        emptyDir: {}
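Once this manifest is applied, a couple of rough sanity checks are possible (a sketch, assuming the Deployment keeps the name `my-pool` and lands in the current namespace): wait for the rollout, then ask vLLM's OpenAI-compatible endpoint which models it serves; the base model and the two `tweet-summary-*` adapters should be listed. CPU startup is considerably slower than GPU, hence the generous timeout.

```bash
# Wait for all replicas to become ready (image pull and model load are slow on CPU).
kubectl rollout status deployment/my-pool --timeout=15m

# Port-forward one pod and list the served models, including the LoRA adapters.
kubectl port-forward deployment/my-pool 8000:8000 &
sleep 2
curl -s localhost:8000/v1/models
kill %1  # stop the background port-forward
```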

config/manifests/vllm/deployment.yaml → config/manifests/vllm/gpu-deployment.yaml (renamed)

+3-3

@@ -1,16 +1,16 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: vllm-llama2-7b-pool
+  name: my-pool
 spec:
   replicas: 3
   selector:
     matchLabels:
-      app: vllm-llama2-7b-pool
+      app: my-pool
   template:
     metadata:
       labels:
-        app: vllm-llama2-7b-pool
+        app: my-pool
     spec:
       containers:
       - name: lora
site-src/guides/index.md

+26-5
@@ -5,19 +5,40 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
 ## **Prerequisites**
 - Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher
 - A cluster with:
-  - Support for Services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running). For example, with Kind,
-  you can follow [these steps](https://kind.sigs.k8s.io/docs/user/loadbalancer).
-  - 3 GPUs to run the sample model server. Adjust the number of replicas in `./config/manifests/vllm/deployment.yaml` as needed.
+  - Support for services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running).
+  For example, with Kind, you can follow [these steps](https://kind.sigs.k8s.io/docs/user/loadbalancer).
 
 ## **Steps**
 
 ### Deploy Sample Model Server
 
+This quickstart guide contains two options for setting up the model server:
+
+1. GPU-based model server.
+   Requirements: a Hugging Face access token that grants access to the model [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf).
+
+1. CPU-based model server (not using GPUs).
+   Requirements: a Hugging Face access token that grants access to the model [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct).
+
+Choose one of these options and follow the steps below. Please do not deploy both, as the deployments have the same name and will overwrite each other.
+
+#### GPU-Based Model Server
+
+For this setup, you will need 3 GPUs to run the sample model server. Adjust the number of replicas in `./config/manifests/vllm/gpu-deployment.yaml` as needed.
 Create a Hugging Face secret to download the model [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf). Ensure that the token grants access to this model.
 Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.
 ```bash
 kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2
-kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/deployment.yaml
+kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
+```
+
+#### CPU-Based Model Server
+
+Create a Hugging Face secret to download the model [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct). Ensure that the token grants access to this model.
+Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.
+```bash
+kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Qwen
+kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/cpu-deployment.yaml
 ```
 
 ### Install the Inference Extension CRDs
@@ -49,7 +70,7 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
 ```bash
 kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/gateway.yaml
 ```
-> **_NOTE:_** This file couples together the gateway infra and the HTTPRoute infra for a convenient, quick startup. Creating additional/different InferencePools on the same gateway will require an additional set of: `Backend`, `HTTPRoute`, the resources included in the `./manifests/gateway/ext-proc.yaml` file, and an additional `./manifests/gateway/patch_policy.yaml` file. ***Should you choose to experiment, familiarity with xDS and Envoy are very useful.***
+> **_NOTE:_** This file couples together the gateway infra and the HTTPRoute infra for a convenient, quick startup. Creating additional/different InferencePools on the same gateway will require an additional set of: `Backend`, `HTTPRoute`, the resources included in the `./config/manifests/gateway/ext-proc.yaml` file, and an additional `./config/manifests/gateway/patch_policy.yaml` file. ***Should you choose to experiment, familiarity with xDS and Envoy are very useful.***
 
 Confirm that the Gateway was assigned an IP address and reports a `Programmed=True` status:
 ```bash
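The `Programmed=True` check that the guide asks for can be scripted roughly as below. This is a sketch: the Gateway's actual name and namespace come from `gateway.yaml` and are not shown in this diff, so `inference-gateway` is assumed here.

```bash
# Assumed Gateway name; adjust to whatever gateway.yaml actually creates.
kubectl get gateway inference-gateway

# Address assigned to the Gateway and its Programmed condition.
kubectl get gateway inference-gateway \
  -o jsonpath='{.status.addresses[0].value}{"\n"}{.status.conditions[?(@.type=="Programmed")].status}{"\n"}'
```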
