Commit ee46fd9

Add Endpoint Picker Protocol Proposal (#164)
* Add model server protocol proposal
* Remove future work and focus on current release
* address comments
* document current lora metrics
1 parent 95ae8da commit ee46fd9

1 file changed: +65 -0 lines changed

# Endpoint Picker Protocol

The Endpoint Picker, or EPP, is a core component of the inference extension. Ultimately it's
responsible for picking an endpoint from the `InferencePool`. A reference implementation can be
found [here](../../../pkg/ext-proc/).

## Proxy Protocol

This is the protocol between the EPP and the proxy (e.g., Envoy).

The EPP MUST implement the Envoy
[external processing service](https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor) protocol.

For each HTTP request, the EPP MUST communicate the picked model server endpoint to the proxy by
adding the `target-pod` HTTP header to the request, or otherwise return an error.
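
As an illustration only (not the reference implementation), a response carrying the picked
endpoint could be built with the Envoy `go-control-plane` ext_proc types roughly as follows; the
function name and the pod-selection logic are hypothetical:

```go
// Sketch: build an ext_proc response that pins the request to a picked
// endpoint via the `target-pod` request header. The picking logic is elided.
package epp

import (
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
)

// buildHeadersResponse is a hypothetical helper; targetPod would come from
// the EPP's scheduling decision.
func buildHeadersResponse(targetPod string) *extprocv3.ProcessingResponse {
	return &extprocv3.ProcessingResponse{
		Response: &extprocv3.ProcessingResponse_RequestHeaders{
			RequestHeaders: &extprocv3.HeadersResponse{
				Response: &extprocv3.CommonResponse{
					HeaderMutation: &extprocv3.HeaderMutation{
						SetHeaders: []*corev3.HeaderValueOption{{
							Header: &corev3.HeaderValue{
								Key:   "target-pod",
								Value: targetPod, // e.g. "10.0.0.12:8000"
							},
						}},
					},
				},
			},
		},
	}
}
```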

## Model Server Protocol

This is the protocol between the EPP and the model servers.

### Inference API Protocol

The model server MUST implement OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs.
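
For example, a conforming model server would accept a standard Chat request like the following
sketch (the host, port, and model name are illustrative):

```go
// Sketch: send an OpenAI-style Chat Completions request to a model server.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body := []byte(`{
	  "model": "my-model",
	  "messages": [{"role": "user", "content": "Hello"}]
	}`)
	resp, err := http.Post("http://model-server:8000/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // OpenAI-style JSON completion
}
```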

### Metrics Reporting

The inference extension scrapes metrics from the model servers to make optimal request scheduling
decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. The
exact metric names don't necessarily need to match the recommended names here; however, the
metric types and semantics MUST follow this doc.

Note the requirements here are aligned with the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort.

The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
into the reference endpoint picker implementation. A scraping sketch follows the table.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` |
| KVCacheUtilization | Gauge | The current KV cache utilization in percentage. | `vllm:gpu_cache_usage_perc` |
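
A minimal sketch of scraping these two gauges, assuming vLLM's metric names and an illustrative
endpoint URL (handling of absent samples is elided):

```go
// Sketch: scrape and parse the two required gauges from a model server's
// Prometheus endpoint.
package main

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	resp, err := http.Get("http://model-server:8000/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// Both metrics are gauges; assume each family has at least one sample.
	if mf, ok := families["vllm:num_requests_waiting"]; ok {
		fmt.Println("TotalQueuedRequests:", mf.GetMetric()[0].GetGauge().GetValue())
	}
	if mf, ok := families["vllm:gpu_cache_usage_perc"]; ok {
		fmt.Println("KVCacheUtilization:", mf.GetMetric()[0].GetGauge().GetValue())
	}
}
```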

### LoRA Adapter Serving

Model servers that support dynamic LoRA serving can benefit from the LoRA affinity algorithm. Note
that the current algorithm in the reference EPP is highly biased towards vLLM's current dynamic
LoRA implementation.

The model servers MUST support serving a LoRA adapter specified in the `model` argument of the
request, provided the requested adapter is valid.

The model server MUST expose the following LoRA adapter metrics via the same Prometheus endpoint
(a parsing sketch follows this list):

* Metric name implemented in vLLM: `vllm:lora_requests_info`
* Metric type: Gauge
* Metric value: The last updated timestamp (so the EPP can find the latest).
* Metric labels:
  * `max_lora`: The maximum number of adapters that can be loaded to GPU memory to serve a batch.
    Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the
    requested adapter. Example: `"max_lora": "8"`.
  * `running_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU
    memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`
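
Given a scraped `vllm:lora_requests_info` family (obtained as in the metrics sketch above), the
labels could be extracted roughly as follows; the helper and type names are hypothetical, not
from the reference EPP:

```go
// Sketch: parse the LoRA labels from the freshest vllm:lora_requests_info
// sample, where the gauge value is a last-updated timestamp.
package epp

import (
	"strings"

	dto "github.com/prometheus/client_model/go"
)

// loraState is a hypothetical holder for the parsed labels.
type loraState struct {
	MaxLoRA         string
	RunningAdapters []string
}

func parseLoRAInfo(mf *dto.MetricFamily) loraState {
	// Keep the sample with the largest value, i.e. the latest update.
	var latest *dto.Metric
	for _, m := range mf.GetMetric() {
		if latest == nil || m.GetGauge().GetValue() > latest.GetGauge().GetValue() {
			latest = m
		}
	}
	var s loraState
	if latest == nil {
		return s
	}
	for _, lp := range latest.GetLabel() {
		switch lp.GetName() {
		case "max_lora":
			s.MaxLoRA = lp.GetValue()
		case "running_lora_adapters":
			// Comma separated list; trim the space around each adapter name.
			for _, a := range strings.Split(lp.GetValue(), ",") {
				s.RunningAdapters = append(s.RunningAdapters, strings.TrimSpace(a))
			}
		}
	}
	return s
}
```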
