## Quickstart

This quickstart guide is intended for engineers familiar with k8s and model servers (vLLM in this instance). The goal of this guide is to get a first, single InferencePool up and running!

### Requirements

- Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher
- A cluster with:
  - Support for Services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running; see the quick check after this list.) For example, with Kind, you can follow [these steps](https://kind.sigs.k8s.io/docs/user/loadbalancer).
  - 3 GPUs to run the sample model server. Adjust the number of replicas in `./manifests/vllm/deployment.yaml` as needed.
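
Before moving on, it is worth confirming that the Envoy Gateway control plane is healthy. A quick sanity check (the `envoy-gateway-system` namespace and `envoy-gateway` deployment name match the default install used later in this guide):

```bash
# Verify the Envoy Gateway controller is running and Available.
kubectl get pods -n envoy-gateway-system
kubectl wait --for=condition=Available deployment/envoy-gateway \
  -n envoy-gateway-system --timeout=120s
```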

### Steps

1. **Deploy Sample Model Server**

   Create a Hugging Face secret to download the model [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf). Ensure that the token grants access to this model.
   Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.
   ```bash
   kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/vllm/deployment.yaml
   ```
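
   The model download can take several minutes before the server reports ready. One hedged way to watch for it (the deployment name below is an assumption; check `deployment.yaml` for the actual name):

   ```bash
   # Watch the vLLM pods start up; pulling the Llama 2 weights takes a while.
   kubectl get pods -w
   # Assumed deployment name "vllm-llama2-7b-pool" -- adjust to match the manifest.
   kubectl wait --for=condition=Available deployment/vllm-llama2-7b-pool --timeout=15m
   ```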

1. **Install the Inference Extension CRDs**

   ```bash
   kubectl apply -k https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd
   ```
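
   To confirm the CRDs registered with the API server, a quick check (grepping for the project's API group name, which is an assumption here):

   ```bash
   kubectl get crd | grep -i inference
   ```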

1. **Deploy InferenceModel**

   Deploy the sample InferenceModel, which is configured to load balance traffic between the `tweet-summary-0` and `tweet-summary-1`
   [LoRA adapters](https://docs.vllm.ai/en/latest/features/lora.html) of the sample model server.
   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/inferencemodel.yaml
   ```
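
   For orientation, the resource being applied is shaped roughly like the sketch below. This is a hedged approximation: the `apiVersion`, resource names, pool reference, and weights are assumptions, and `inferencemodel.yaml` is authoritative.

   ```bash
   # Illustrative only -- do not apply; see inferencemodel.yaml for the real spec.
   cat <<'EOF'
   apiVersion: inference.networking.x-k8s.io/v1alpha1   # assumed API group/version
   kind: InferenceModel
   metadata:
     name: inferencemodel-sample
   spec:
     modelName: tweet-summary            # client-facing model name
     poolRef:
       name: vllm-llama2-7b-pool         # assumed InferencePool name
     targetModels:                       # traffic split across the LoRA adapters
     - name: tweet-summary-0
       weight: 50
     - name: tweet-summary-1
       weight: 50
   EOF
   ```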

1. **Update Envoy Gateway Config to enable Patch Policy**

   Our custom LLM Gateway ext-proc is patched into the existing Envoy Gateway via `EnvoyPatchPolicy`. To enable this feature, we must extend the Envoy Gateway config map. To do this, simply run:
   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/enable_patch_policy.yaml
   kubectl rollout restart deployment envoy-gateway -n envoy-gateway-system
   ```
   Additionally, if you would like to enable the admin interface, you can uncomment the admin lines and run this again.
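
   Under the hood, this manifest updates the Envoy Gateway ConfigMap so the controller accepts `EnvoyPatchPolicy` resources. The relevant toggle looks roughly like the sketch below (based on the Envoy Gateway configuration API; the `envoy-gateway-config` ConfigMap name is an assumption, and the manifest itself is authoritative):

   ```bash
   # Illustrative only -- the key setting inside the ConfigMap's EnvoyGateway config:
   cat <<'EOF'
   apiVersion: gateway.envoyproxy.io/v1alpha1
   kind: EnvoyGateway
   extensionApis:
     enableEnvoyPatchPolicy: true   # allows EnvoyPatchPolicy resources to be reconciled
   EOF
   ```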

1. **Deploy Gateway**

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/gateway.yaml
   ```
   > **_NOTE:_** This file couples together the gateway infra and the HTTPRoute infra for a convenient, quick startup. Creating additional/different InferencePools on the same gateway will require an additional set of: `Backend`, `HTTPRoute`, the resources included in the `./manifests/gateway/ext-proc.yaml` file, and an additional `./manifests/gateway/patch_policy.yaml` file. ***Should you choose to experiment, familiarity with xDS and Envoy is very useful.***

   Confirm that the Gateway was assigned an IP address and reports a `Programmed=True` status:
   ```bash
   $ kubectl get gateway inference-gateway
   NAME                CLASS               ADDRESS        PROGRAMMED   AGE
   inference-gateway   inference-gateway   <MY_ADDRESS>   True         22s
   ```
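
   If you prefer to block until the Gateway is ready rather than polling by hand, `kubectl wait` can watch the same condition:

   ```bash
   # Waits for the standard Gateway API "Programmed" condition.
   kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=120s
   ```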

1. **Deploy the Inference Extension and InferencePool**

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/ext_proc.yaml
   ```
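
   This manifest deploys the ext-proc extension alongside the InferencePool that selects the vLLM pods. As a hedged sketch of the pool resource (the field names, port, and labels below are assumptions; `ext_proc.yaml` is authoritative):

   ```bash
   # Illustrative only -- do not apply; see ext_proc.yaml for the real spec.
   cat <<'EOF'
   apiVersion: inference.networking.x-k8s.io/v1alpha1   # assumed API group/version
   kind: InferencePool
   metadata:
     name: vllm-llama2-7b-pool     # assumed name, matching the sketch above
   spec:
     targetPortNumber: 8000        # assumed vLLM serving port
     selector:
       app: vllm-llama2-7b-pool    # assumed label on the model server pods
   EOF
   ```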

1. **Deploy Envoy Gateway Custom Policies**

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/extension_policy.yaml
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/patch_policy.yaml
   ```
   > **_NOTE:_** These policies are also per-InferencePool, and will need to be configured to support the new pool should you wish to experiment further.

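   The extension policy is what tells Envoy Gateway to route requests through the ext-proc service. A hedged sketch of its shape (the resource names, route reference, and port are assumptions; `extension_policy.yaml` is authoritative):

   ```bash
   # Illustrative only -- do not apply; see extension_policy.yaml for the real spec.
   cat <<'EOF'
   apiVersion: gateway.envoyproxy.io/v1alpha1
   kind: EnvoyExtensionPolicy
   metadata:
     name: ext-proc-policy              # assumed name
   spec:
     targetRefs:
     - group: gateway.networking.k8s.io
       kind: HTTPRoute
       name: llm-route                  # assumed HTTPRoute name from gateway.yaml
     extProc:
     - backendRefs:
       - name: inference-gateway-ext-proc   # assumed ext-proc Service name
         port: 9002                         # assumed gRPC port
       processingMode:
         request:
           body: Buffered                   # ext-proc inspects the body to pick an endpoint
   EOF
   ```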

1. **OPTIONALLY**: Apply Traffic Policy

   For high-traffic benchmarking, you can apply this manifest to override defaults that can otherwise cause timeouts/errors.

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/traffic_policy.yaml
   ```
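
   A hedged sketch of the kind of policy involved (the resource names and timeout value are assumptions; `traffic_policy.yaml` is authoritative):

   ```bash
   # Illustrative only -- do not apply; see traffic_policy.yaml for the real spec.
   cat <<'EOF'
   apiVersion: gateway.envoyproxy.io/v1alpha1
   kind: BackendTrafficPolicy
   metadata:
     name: high-traffic-route-policy   # assumed name
   spec:
     targetRefs:
     - group: gateway.networking.k8s.io
       kind: HTTPRoute
       name: llm-route       # assumed HTTPRoute name
     timeout:
       http:
         requestTimeout: 24h # assumed; long generations can exceed default request timeouts
   EOF
   ```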

1. **Try it out**

   Wait until the gateway is ready.

   ```bash
   IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
   PORT=8081

   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
   "model": "tweet-summary",
   "prompt": "Write as if you were a critic: San Francisco",
   "max_tokens": 100,
   "temperature": 0
   }'
   ```
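
   A successful request returns an OpenAI-style completions payload. The shape below is illustrative only (the field values are invented for illustration, not captured output):

   ```bash
   # Example response shape (illustrative):
   # HTTP/1.1 200 OK
   # {
   #   "id": "cmpl-...",
   #   "object": "text_completion",
   #   "model": "tweet-summary",
   #   "choices": [{"index": 0, "text": "...", "finish_reason": "length"}],
   #   "usage": {"prompt_tokens": 11, "completion_tokens": 100, "total_tokens": 111}
   # }
   ```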

Please refer to our Getting started guide here: https://gateway-api-inference-extension.sigs.k8s.io/guides/