# Endpoint Picker Protocol

The Endpoint Picker, or EPP, is a core component of the inference extension. Ultimately it's
responsible for picking an endpoint from the `InferencePool`. A reference implementation can be
found [here](../../../pkg/ext-proc/).

## Proxy Protocol

This is the protocol between the EPP and the proxy (e.g., Envoy).

The EPP MUST implement the Envoy
[external processing service](https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor) protocol.

For each HTTP request, the EPP MUST communicate the picked model server endpoint to the proxy by
adding the `target-pod` HTTP header to the request, or otherwise return an error.
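
As a non-normative illustration, a minimal handler for this exchange might look like the sketch
below. It assumes the Go ext_proc bindings from `github.com/envoyproxy/go-control-plane`; the
package and type names and the endpoint address are placeholders, and the actual picking logic is
omitted.

```go
// Sketch only: respond to request headers by setting the `target-pod` header
// to a picked endpoint. Endpoint selection is omitted; the address is a placeholder.
package epp

import (
	"io"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
)

type picker struct{}

// Process handles the ext_proc bidirectional stream from the proxy.
func (p *picker) Process(srv extprocv3.ExternalProcessor_ProcessServer) error {
	for {
		req, err := srv.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		// Only the request-headers phase is handled in this sketch.
		if _, ok := req.Request.(*extprocv3.ProcessingRequest_RequestHeaders); !ok {
			continue
		}
		resp := &extprocv3.ProcessingResponse{
			Response: &extprocv3.ProcessingResponse_RequestHeaders{
				RequestHeaders: &extprocv3.HeadersResponse{
					Response: &extprocv3.CommonResponse{
						HeaderMutation: &extprocv3.HeaderMutation{
							SetHeaders: []*corev3.HeaderValueOption{{
								Header: &corev3.HeaderValue{
									Key:   "target-pod",
									Value: "10.1.2.3:8000", // placeholder for the picked endpoint
								},
							}},
						},
					},
				},
			},
		}
		if err := srv.Send(resp); err != nil {
			return err
		}
	}
}
```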

## Model Server Protocol

This is the protocol between the EPP and the model servers.

### Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs.
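
For illustration only, a request body sent to the Chat API endpoint (`/v1/chat/completions`) could
look like the following; the model name is a hypothetical example:

```json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "user", "content": "What is an InferencePool?"}
  ],
  "max_tokens": 128
}
```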

### Metrics Reporting

The inference extension scrapes metrics from the model servers to make optimal request scheduling
decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. The exact
metric names do not need to match the recommended names here; however, the metric types and
semantics MUST follow this doc.

Note that the requirements here are aligned with the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort.

The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
into the reference endpoint picker implementation.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` |
| KVCacheUtilization | Gauge | The current KV cache utilization in percentage. | `vllm:gpu_cache_usage_perc` |
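
For illustration only, a Prometheus scrape of a vLLM server exposing these metrics might include
lines like the following; the `model_name` label and the sample values are hypothetical:

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.27
```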

### LoRA Adapter Serving

Model servers that support dynamic LoRA serving can benefit from the LoRA affinity algorithm. Note
the current algorithm in the reference EPP is highly biased towards vLLM's current dynamic LoRA
implementation.

The model servers MUST support serving a LoRA adapter specified in the `model` argument of the
request, provided the requested adapter is valid.
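
For example (illustrative only; `adapter1` is a hypothetical adapter name), a client selects an
adapter by naming it in the `model` field of a Completions request:

```json
{
  "model": "adapter1",
  "prompt": "Summarize the following text: ...",
  "max_tokens": 64
}
```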

The model server MUST expose the following LoRA adapter metrics via the same Prometheus endpoint
(an example scrape line is shown after this list):

* Metric name implemented in vLLM: `vllm:lora_requests_info`
* Metric type: Gauge
* Metric value: The last updated timestamp (so the EPP can find the latest).
* Metric labels:
  * `max_lora`: The maximum number of adapters that can be loaded to GPU memory to serve a batch.
    Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the
    requested adapter. Example: `"max_lora": "8"`.
  * `running_lora_adapters`: A comma-separated list of adapters that are currently loaded in GPU
    memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`
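
For illustration only, a scrape line for this metric could look like the following; the label
values are hypothetical, and the metric value is the last-updated timestamp described above:

```
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="8",running_lora_adapters="adapter1, adapter2"} 1.7271e+09
```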