diff --git a/README.md b/README.md
index 5bc53bde..85185a2b 100644
--- a/README.md
+++ b/README.md
@@ -8,25 +8,13 @@ This extension is intented to provide value to multiplexed LLM services on a sha
 This project is currently in development.
 
-For more rapid testing, our PoC is in the `./examples/` dir.
-
-
 ## Getting Started
 
-**Install the CRDs into the cluster:**
-
-```sh
-make install
-```
-
-**Delete the APIs(CRDs) from the cluster:**
+Follow this [README](./pkg/README.md) to get the inference-extension up and running on your cluster!
 
-```sh
-make uninstall
-```
+## Website
 
-**Deploying the ext-proc image**
-Refer to this [README](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/README.md) on how to deploy the Ext-Proc image.
+Detailed documentation is available on our website: https://gateway-api-inference-extension.sigs.k8s.io/
 
 ## Contributing
 
diff --git a/examples/placeholder.md b/examples/placeholder.md
new file mode 100644
index 00000000..e69de29b
diff --git a/pkg/README.md b/pkg/README.md
index dc376a79..80991c66 100644
--- a/pkg/README.md
+++ b/pkg/README.md
@@ -1,7 +1,11 @@
 ## Quickstart
 
+This quickstart guide is intended for engineers familiar with k8s and model servers (vLLM in this instance). The goal of this guide is to get your first InferencePool up and running!
+
 ### Requirements
-The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher.
+  - Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher
+  - A cluster that has built-in support for `ServiceType=LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running.)
+    - For example, with Kind, you can follow these steps: https://kind.sigs.k8s.io/docs/user/loadbalancer
 
 ### Steps
 
@@ -11,21 +15,27 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
    Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.
    ```bash
    kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2
-   kubectl apply -f ../examples/poc/manifests/vllm/vllm-lora-deployment.yaml
+   kubectl apply -f ./manifests/vllm/vllm-lora-deployment.yaml
+   ```
+
+1. **Install the CRDs into the cluster:**
+
+   ```sh
+   kubectl apply -f config/crd/bases
    ```
 
 1. **Deploy InferenceModel and InferencePool**
 
    Deploy a sample InferenceModel and InferencePool configuration based on the vLLM deployments mentioned above.
    ```bash
-   kubectl apply -f ../examples/poc/manifests/inferencepool-with-model.yaml
+   kubectl apply -f ./manifests/inferencepool-with-model.yaml
    ```
 
 1. **Update Envoy Gateway Config to enable Patch Policy**
 
    Our custom LLM Gateway ext-proc is patched into the existing envoy gateway via `EnvoyPatchPolicy`. To enable this feature, we must extend the Envoy Gateway config map. To do this, simply run:
    ```bash
-   kubectl apply -f ./manifests/enable_patch_policy.yaml
+   kubectl apply -f ./manifests/gateway/enable_patch_policy.yaml
    kubectl rollout restart deployment envoy-gateway -n envoy-gateway-system
    ```
    Additionally, if you would like to enable the admin interface, you can uncomment the admin lines and run this again.
@@ -33,8 +43,12 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
 1. **Deploy Gateway**
 
    ```bash
-   kubectl apply -f ./manifests/gateway.yaml
+   kubectl apply -f ./manifests/gateway/gateway.yaml
    ```
+   > **_NOTE:_** This file couples together the gateway infra and the HTTPRoute infra for a convenient, quick startup. Creating additional/different InferencePools on the same gateway will require an additional set of: `Backend`, `HTTPRoute`, the resources included in the `./manifests/gateway/ext-proc.yaml` file, and an additional `./manifests/gateway/patch_policy.yaml` file. ***Should you choose to experiment, familiarity with xDS and Envoy is very useful.***
+
+
 
 1. **Deploy Ext-Proc**
 
@@ -45,8 +59,17 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
 1. **Deploy Envoy Gateway Custom Policies**
 
    ```bash
-   kubectl apply -f ./manifests/extension_policy.yaml
-   kubectl apply -f ./manifests/patch_policy.yaml
+   kubectl apply -f ./manifests/gateway/extension_policy.yaml
+   kubectl apply -f ./manifests/gateway/patch_policy.yaml
+   ```
+   > **_NOTE:_** This is also per-InferencePool, and will need to be configured to support the new pool should you wish to experiment further.
+
+1. **OPTIONALLY**: Apply Traffic Policy
+
+   For high-traffic benchmarking you can apply this manifest to avoid default settings that can cause timeouts/errors under load.
+
+   ```bash
+   kubectl apply -f ./manifests/gateway/traffic_policy.yaml
    ```
 
 1. **Try it out**
@@ -63,10 +86,4 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
    "max_tokens": 100,
    "temperature": 0
    }'
-   ```
-
-## Scheduling Package in Ext Proc
-The scheduling package implements request scheduling algorithms for load balancing requests across backend pods in an inference gateway. The scheduler ensures efficient resource utilization while maintaining low latency and prioritizing critical requests. It applies a series of filters based on metrics and heuristics to select the best pod for a given request.
-
-# Flowchart
-Scheduling Algorithm
\ No newline at end of file
+   ```
\ No newline at end of file
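An aside on the `ServiceType=LoadBalancer` requirement called out in the quickstart above: a quick way to validate it is to confirm that the Envoy Gateway control plane is healthy and that, once a `Gateway` is created, its generated Envoy Service receives an external address. This is only an illustrative check, assuming the default `envoy-gateway-system` install namespace:

```bash
# Check that the Envoy Gateway controller is up and running.
kubectl get pods -n envoy-gateway-system

# After the "Deploy Gateway" step below, the generated Envoy Service should
# show an EXTERNAL-IP rather than <pending>. If it stays <pending>, the
# cluster has no LoadBalancer support (see the Kind link in the requirements).
kubectl get svc -n envoy-gateway-system
```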
diff --git a/pkg/manifests/enable_patch_policy.yaml b/pkg/manifests/gateway/enable_patch_policy.yaml
similarity index 88%
rename from pkg/manifests/enable_patch_policy.yaml
rename to pkg/manifests/gateway/enable_patch_policy.yaml
index c1d00e9a..1e9818a1 100644
--- a/pkg/manifests/enable_patch_policy.yaml
+++ b/pkg/manifests/gateway/enable_patch_policy.yaml
@@ -5,6 +5,7 @@ metadata:
   namespace: envoy-gateway-system
 data:
 # This manifest's main purpose is to set `enabledEnvoyPatchPolicy` to `true`.
+# This only needs to be run once on your cluster (unless you'd like to change something later, e.g. enabling the admin dashboard).
 # Any field under `admin` is optional, and only for enabling the admin endpoints, for debugging.
 # Admin Interface: https://www.envoyproxy.io/docs/envoy/latest/operations/admin
 # PatchPolicy docs: https://gateway.envoyproxy.io/docs/tasks/extensibility/envoy-patch-policy/#enable-envoypatchpolicy
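For orientation, the effect of the `enable_patch_policy.yaml` manifest above is roughly the following: the `envoy-gateway-config` ConfigMap carries the `EnvoyGateway` configuration, and the patch-policy extension API is switched on under `extensionApis`. This is a minimal sketch based on the upstream Envoy Gateway docs linked in the comments above, not a copy of the repo's manifest, so treat the file itself as the source of truth:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-gateway-config
  namespace: envoy-gateway-system
data:
  envoy-gateway.yaml: |
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyGateway
    provider:
      type: Kubernetes
    gateway:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    # Switches on the EnvoyPatchPolicy extension API used by this project.
    extensionApis:
      enableEnvoyPatchPolicy: true
```

As the added comment notes, this only needs to be applied once per cluster, followed by a rollout restart of the `envoy-gateway` deployment.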
diff --git a/pkg/manifests/extension_policy.yaml b/pkg/manifests/gateway/extension_policy.yaml
similarity index 100%
rename from pkg/manifests/extension_policy.yaml
rename to pkg/manifests/gateway/extension_policy.yaml
diff --git a/pkg/manifests/gateway.yaml b/pkg/manifests/gateway/gateway.yaml
similarity index 100%
rename from pkg/manifests/gateway.yaml
rename to pkg/manifests/gateway/gateway.yaml
diff --git a/pkg/manifests/patch_policy.yaml b/pkg/manifests/gateway/patch_policy.yaml
similarity index 100%
rename from pkg/manifests/patch_policy.yaml
rename to pkg/manifests/gateway/patch_policy.yaml
diff --git a/pkg/manifests/traffic_policy.yaml b/pkg/manifests/gateway/traffic_policy.yaml
similarity index 100%
rename from pkg/manifests/traffic_policy.yaml
rename to pkg/manifests/gateway/traffic_policy.yaml
diff --git a/examples/poc/manifests/inferencepool-with-model.yaml b/pkg/manifests/inferencepool-with-model.yaml
similarity index 100%
rename from examples/poc/manifests/inferencepool-with-model.yaml
rename to pkg/manifests/inferencepool-with-model.yaml
diff --git a/examples/poc/manifests/vllm/vllm-lora-deployment.yaml b/pkg/manifests/vllm/vllm-lora-deployment.yaml
similarity index 100%
rename from examples/poc/manifests/vllm/vllm-lora-deployment.yaml
rename to pkg/manifests/vllm/vllm-lora-deployment.yaml
diff --git a/pkg/scheduling.md b/pkg/scheduling.md
new file mode 100644
index 00000000..99223ad2
--- /dev/null
+++ b/pkg/scheduling.md
@@ -0,0 +1,5 @@
+## Scheduling Package in Ext Proc
+The scheduling package implements request scheduling algorithms for load balancing requests across backend pods in an inference gateway. The scheduler ensures efficient resource utilization while maintaining low latency and prioritizing critical requests. It applies a series of filters based on metrics and heuristics to select the best pod for a given request.
+
+### Flowchart
+Scheduling Algorithm
\ No newline at end of file
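To make the filter-chain description in the new `pkg/scheduling.md` concrete, here is a rough Go sketch of the idea. This is not the scheduling package's actual API; the type names, fields, and thresholds below are invented purely for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// PodMetrics is a hypothetical snapshot of the signals a filter might consult.
type PodMetrics struct {
	Name             string
	QueueLength      int     // pending requests on the model server
	KVCacheUsage     float64 // fraction of KV cache in use (0.0 - 1.0)
	HasRequestedLoRA bool    // whether the requested LoRA adapter is already loaded
}

// Filter narrows a candidate set; returning an empty slice means "no opinion",
// so the scheduler keeps the previous candidate set.
type Filter func(pods []PodMetrics) []PodMetrics

// schedule applies filters in order, keeping the last non-empty candidate set,
// then picks the first remaining pod.
func schedule(pods []PodMetrics, filters []Filter) (PodMetrics, error) {
	candidates := pods
	for _, f := range filters {
		if next := f(candidates); len(next) > 0 {
			candidates = next
		}
	}
	if len(candidates) == 0 {
		return PodMetrics{}, errors.New("no pods available")
	}
	return candidates[0], nil
}

func main() {
	pods := []PodMetrics{
		{Name: "vllm-0", QueueLength: 12, KVCacheUsage: 0.9},
		{Name: "vllm-1", QueueLength: 2, KVCacheUsage: 0.4, HasRequestedLoRA: true},
		{Name: "vllm-2", QueueLength: 3, KVCacheUsage: 0.5},
	}
	filters := []Filter{
		// Prefer pods that already have the requested LoRA adapter loaded.
		func(ps []PodMetrics) []PodMetrics {
			var out []PodMetrics
			for _, p := range ps {
				if p.HasRequestedLoRA {
					out = append(out, p)
				}
			}
			return out
		},
		// Drop pods with a saturated KV cache or a long queue.
		func(ps []PodMetrics) []PodMetrics {
			var out []PodMetrics
			for _, p := range ps {
				if p.KVCacheUsage < 0.8 && p.QueueLength < 10 {
					out = append(out, p)
				}
			}
			return out
		},
	}
	best, err := schedule(pods, filters)
	if err != nil {
		fmt.Println("scheduling failed:", err)
		return
	}
	fmt.Println("routing request to", best.Name) // vllm-1 in this toy example
}
```

The actual package applies more filters and heuristics than this, including the criticality-aware prioritization mentioned above; see the scheduling package source for the authoritative behavior.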