This quickstart guide is intended for engineers familiar with k8s and model servers (vLLM in this instance). The goal of this guide is to get a first, single InferencePool up and running!
Requirements:
- Envoy Gateway v1.2.1 or higher
- A cluster with:
    - Support for Services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running.) For example, with Kind, you can follow these steps.
    - 3 GPUs to run the sample model server. Adjust the number of replicas in `./manifests/vllm/deployment.yaml` as needed.
Create a Hugging Face secret to download the model meta-llama/Llama-2-7b-hf. Ensure that the token grants access to this model. Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.
kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/deployment.yaml
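Before moving on, you may want to confirm that the model server pods come up and become ready; pulling the image and downloading the model can take several minutes. The deployment name comes from the manifest above, so the checks below stay generic:
kubectl get pods # look for the vLLM pods created by the deployment above
kubectl wait --for=condition=Ready pods --all --timeout=15m # optionally block until all pods in the namespace are ready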
Install the InferenceExtension CRDs (InferencePool and InferenceModel):
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.1.0/manifests.yaml
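As a quick sanity check, you can confirm the CRDs were registered; this project's API group is inference.networking.x-k8s.io, so a simple filter works:
kubectl get crd | grep inference # expect InferencePool and InferenceModel CRD entries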
Deploy the sample InferenceModel, which is configured to load balance traffic between the `tweet-summary-0` and `tweet-summary-1` LoRA adapters of the sample model server.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencemodel.yaml
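For orientation, an InferenceModel of this shape looks roughly like the sketch below: a modelName that clients put in requests, a poolRef pointing at the InferencePool, and weighted targetModels naming the two LoRA adapters. This is an illustrative sketch based on the description above, not a copy of the applied manifest; the resource name, pool name, weights, and apiVersion are assumptions, so inspect the real object with kubectl get inferencemodels -o yaml.
apiVersion: inference.networking.x-k8s.io/v1alpha1 # version may differ per release
kind: InferenceModel
metadata:
  name: inferencemodel-sample # hypothetical name for illustration
spec:
  modelName: tweet-summary # model name clients send in the request body
  poolRef:
    name: my-inference-pool # hypothetical; use the InferencePool created by ext_proc.yaml
  targetModels: # traffic is load balanced across the two LoRA adapters
  - name: tweet-summary-0
    weight: 50
  - name: tweet-summary-1
    weight: 50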
Our custom LLM Gateway ext-proc is patched into the existing Envoy Gateway via `EnvoyPatchPolicy`. To enable this feature, we must extend the Envoy Gateway config map. To do this, run:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/enable_patch_policy.yaml
kubectl rollout restart deployment envoy-gateway -n envoy-gateway-system
Additionally, if you would like to enable the admin interface, uncomment the admin lines in enable_patch_policy.yaml and run the two commands above again.
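Roughly, the first command updates the envoy-gateway-config ConfigMap so that the EnvoyGateway configuration enables the EnvoyPatchPolicy extension API, and the second restarts the controller so it picks up the change. The relevant fragment of that configuration looks approximately like this (see enable_patch_policy.yaml for the authoritative content):
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyGateway
extensionApis:
  enableEnvoyPatchPolicy: true # allows EnvoyPatchPolicy resources to be reconciled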
Next, deploy the Gateway:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/gateway.yaml
NOTE: This file couples together the gateway infra and the HTTPRoute infra for a convenient, quick startup. Creating additional/different InferencePools on the same gateway will require an additional set of: `Backend`, `HTTPRoute`, the resources included in the `./manifests/gateway/ext-proc.yaml` file, and an additional `./manifests/gateway/patch_policy.yaml` file. Should you choose to experiment, familiarity with xDS and Envoy is very useful.
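After applying gateway.yaml, you can also check that the HTTPRoute it creates was accepted by the Gateway (the route and gateway names come from that manifest):
kubectl get httproute # the route should list the inference gateway as its parent
kubectl get httproute -o yaml # look for Accepted=True under status.parents[].conditions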
Confirm that the Gateway was assigned an IP address and reports a `Programmed=True` status:
$ kubectl get gateway inference-gateway
NAME CLASS ADDRESS PROGRAMMED AGE
inference-gateway inference-gateway <MY_ADDRESS> True 22s
Deploy the Inference Extension (ext-proc) and the InferencePool:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/ext_proc.yaml
Then apply the Envoy Gateway policies that wire the Gateway to the extension:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/extension_policy.yaml
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/patch_policy.yaml
NOTE: This is also per InferencePool, and will need to be configured to support the new pool should you wish to experiment further.
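Before sending traffic, it is also worth confirming that the extension (ext-proc) deployment created by ext_proc.yaml is up; its exact name and labels come from that manifest, so the checks below are deliberately generic:
kubectl get deployments # look for the ext-proc / inference extension deployment from ext_proc.yaml
kubectl get pods | grep -i ext-proc # assumes the pod name contains "ext-proc"; adjust to the name in the manifest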
For high-traffic benchmarking, you can optionally apply this manifest to override defaults that can otherwise cause timeouts/errors.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/traffic_policy.yaml
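For context, Envoy Gateway's BackendTrafficPolicy is the kind of resource typically used to relax such limits. The sketch below is illustrative only (the name, target, and timeout value are assumptions, not copied from the manifest above), but it shows the general shape of a policy that lengthens request timeouts for long-running completions:
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: high-traffic-timeouts # hypothetical name for illustration
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: <your-httproute> # placeholder; use the route created by gateway.yaml
  timeout:
    http:
      requestTimeout: 300s # generous timeout for long LLM completions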
Wait until the gateway is ready.
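A simple way to wait, assuming the Gateway keeps the inference-gateway name used above, is to block on the same Programmed condition:
kubectl wait gateway/inference-gateway --for=condition=Programmed --timeout=5m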
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=8081
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'