Commit 976b9b6

haitwang-cloud authored and sumitd2 committed
[Doc]: Add deploying_with_k8s guide (vllm-project#8451)
Signed-off-by: Sumit Dubey <[email protected]>
1 parent 434dbe7 commit 976b9b6

2 files changed: +176 -0 lines changed

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -79,6 +79,7 @@ Documentation

    serving/openai_compatible_server
    serving/deploying_with_docker
+   serving/deploying_with_k8s
    serving/distributed_serving
    serving/metrics
    serving/env_vars

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
.. _deploying_with_k8s:

Deploying with Kubernetes
==========================

Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.

Prerequisites
-------------
Before you begin, ensure that you have the following:

- A running Kubernetes cluster
- The NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``), available at https://github.com/NVIDIA/k8s-device-plugin/
- Available GPU resources in your cluster
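
To confirm that the device plugin is advertising GPUs to the scheduler, you can list each node's allocatable ``nvidia.com/gpu`` count. This is a minimal sketch: the function name ``list_gpu_nodes`` is illustrative and not part of the guide.

.. code-block:: bash

    # Print every node with the number of GPUs it can allocate to pods.
    # Nodes that expose no nvidia.com/gpu resource show <none> in the GPU column.
    list_gpu_nodes() {
        kubectl get nodes \
            -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
    }

Calling ``list_gpu_nodes`` should show at least one node with a non-empty GPU column before you continue.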

Deployment Steps
----------------

1. **Create a PVC, Secret, and Deployment for vLLM**

The PVC stores the model cache. It is optional: you can use ``hostPath`` or another storage option instead.

.. code-block:: yaml

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: mistral-7b
      namespace: default
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: default
      volumeMode: Filesystem

The Secret is optional and only required for accessing gated models. You can skip this step if you are not using gated models.

.. code-block:: yaml

    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token-secret
      namespace: default
    type: Opaque
    # stringData accepts the plain-text token; values under "data" must be base64-encoded.
    stringData:
      token: "REPLACE_WITH_TOKEN"
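
As an alternative to writing the manifest by hand, the Secret can be created directly with ``kubectl``. The sketch below is illustrative: the function names are assumptions, not part of the guide, while the Secret name, key, and namespace match the manifest above. The ``encode_secret_value`` helper covers the case where you use a Secret's ``data`` field, whose values must be base64-encoded.

.. code-block:: bash

    # Base64-encode a value for use under a Secret's "data" field.
    # printf avoids the trailing newline that echo would add; tr removes line wrapping.
    encode_secret_value() {
        printf '%s' "$1" | base64 | tr -d '\n'
    }

    # Create the Secret from a plain-text token (kubectl encodes it for you).
    # Usage: create_hf_secret "<your Hugging Face token>"
    create_hf_secret() {
        kubectl create secret generic hf-token-secret \
            --namespace default \
            --from-literal=token="$1"
    }

Creating the Secret this way keeps the token out of any manifest file that might be committed to version control.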

Create a deployment file for vLLM to run the model server. The following example deploys the ``Mistral-7B-Instruct-v0.3`` model:

.. code-block:: yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mistral-7b
      namespace: default
      labels:
        app: mistral-7b
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mistral-7b
      template:
        metadata:
          labels:
            app: mistral-7b
        spec:
          volumes:
          - name: cache-volume
            persistentVolumeClaim:
              claimName: mistral-7b
          # vLLM needs to access the host's shared memory for tensor parallel inference.
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "2Gi"
          containers:
          - name: mistral-7b
            image: vllm/vllm-openai:latest
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
            ]
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "10"
                memory: 20G
                nvidia.com/gpu: "1"
              requests:
                cpu: "2"
                memory: 6G
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
            - name: shm
              mountPath: /dev/shm
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 5

2. **Create a Kubernetes Service for vLLM**

Next, create a Kubernetes Service file to expose the ``mistral-7b`` deployment:

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: mistral-7b
      namespace: default
    spec:
      ports:
      - name: http-mistral-7b
        port: 80
        protocol: TCP
        targetPort: 8000
      # The label selector must match the pod labels set in the Deployment. It is also useful for the prefix caching feature.
      selector:
        app: mistral-7b
      sessionAffinity: None
      type: ClusterIP

3. **Deploy and Test**

Apply the deployment and service configurations using ``kubectl apply -f <filename>``:

.. code-block:: console

    kubectl apply -f deployment.yaml
    kubectl apply -f service.yaml
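
Downloading and loading the model can take several minutes, so it helps to wait for the rollout to finish before testing. The following sketch uses standard ``kubectl`` subcommands; the function name and the 600-second timeout are assumptions chosen for illustration.

.. code-block:: bash

    # Block until the Deployment's pods are ready, then show their status.
    # The 600s timeout is an arbitrary example value; adjust it to your model size.
    wait_for_vllm() {
        kubectl rollout status deployment/mistral-7b --namespace default --timeout=600s \
            && kubectl get pods --namespace default -l app=mistral-7b
    }

If the rollout stalls, ``kubectl logs deployment/mistral-7b`` shows the server's startup output.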

To test the deployment, run the following ``curl`` command from inside the cluster. The Service is of type ``ClusterIP``, so its DNS name resolves only within the cluster. The ``model`` field must match the model served by the Deployment:

.. code-block:: console

    curl http://mistral-7b.default.svc.cluster.local/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "mistralai/Mistral-7B-Instruct-v0.3",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
        }'

If the service is correctly deployed, you should receive a response from the vLLM model.
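
The ``svc.cluster.local`` hostname resolves only inside the cluster. To test from a workstation, one option is to forward a local port to the Service. The sketch below assumes local port 8080 is free and that the Deployment serves ``mistralai/Mistral-7B-Instruct-v0.3``; the function names are illustrative and not part of the guide.

.. code-block:: bash

    # Build a JSON body for /v1/completions.
    # The arguments are inserted verbatim, so they must not contain
    # double quotes or backslashes.
    build_completion_request() {
        printf '{"model": "%s", "prompt": "%s", "max_tokens": 7, "temperature": 0}' "$1" "$2"
    }

    # Forward local port 8080 to the Service's port 80 (runs until interrupted).
    forward_vllm() {
        kubectl port-forward --namespace default svc/mistral-7b 8080:80
    }

    # Send a completion request through the forwarded port.
    query_vllm() {
        curl -s http://localhost:8080/v1/completions \
            -H "Content-Type: application/json" \
            -d "$(build_completion_request "mistralai/Mistral-7B-Instruct-v0.3" "$1")"
    }

Run ``forward_vllm`` in one terminal, then ``query_vllm "San Francisco is a"`` in another.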

Conclusion
----------
Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models that use GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
