
Commit 1bb383c

Add model server protocol proposal
1 parent adad31c commit 1bb383c

1 file changed: +72 -0 lines changed

Diff for: docs/proposals/003-model-server-protocol/protocol.md

@@ -0,0 +1,72 @@
# Model Server Protocol for Gateway API Inference Extension
## Inference API Protocol
The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs. In the future we are open to
supporting more API protocols.

More specifically, the extension makes intelligent request scheduling decisions based on
information from the request body, such as the `model` field.
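
For illustration only, a minimal OpenAI-compatible chat request could look like the sketch below;
the endpoint prefix, model name, and message content are placeholders. The extension inspects
fields such as `model` when scheduling the request.

```
POST ${server_endpoint}/v1/chat/completions
{
    "model": "my-model-or-lora-adapter",
    "messages": [
        {"role": "user", "content": "Hello!"}
    ]
}
```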
## Metrics Reporting
The inference extension scrapes metrics from the model servers to make optimal request scheduling
decisions. The PREFERRED metrics format is Prometheus. We do not intend to dictate the exact metric
naming and format, especially if the corresponding metric already exists. We will leverage the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort to bring as much unification as possible across model server communities.

We also list the corresponding metrics in vLLM, which is already integrated with the inference
extension; we are working on integrating with more model servers. An example scrape output is
shown after the table below.
| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` |
| KVCacheUtilization | Gauge | The current KV cache utilization as a percentage. | `vllm:gpu_cache_usage_perc` |
| MaxActiveModels | Gauge | Maximum number of models/adapters that can be loaded into GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveModels and cannot load the requested model/adapter. | `vllm:lora_requests_info.max_lora` |
| ActiveModels | String (can be a label of a Prometheus Gauge metric) | Comma-separated list of models/adapters that are currently loaded into GPU memory; new requests for these models/adapters therefore do not require evicting another adapter. | `vllm:lora_requests_info.running_lora_adapters` |
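
As a rough sketch of what the extension scrapes, a Prometheus exposition from a vLLM server might
include lines like the ones below; the values are illustrative, and vLLM may attach additional
labels not shown here.

```
# TotalQueuedRequests
vllm:num_requests_waiting 3.0
# KVCacheUtilization
vllm:gpu_cache_usage_perc 0.42
# MaxActiveModels and ActiveModels, exposed as labels on a Gauge
vllm:lora_requests_info{max_lora="4",running_lora_adapters="adapter-a,adapter-b"} 1.0
```

Per the table above, the extension reads `max_lora` and `running_lora_adapters` from the labels of
`vllm:lora_requests_info`.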
The following metrics MAY be needed in the future for further optimization.
| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch. | `vllm:num_tokens_running` |
| TotalQueuedTokens | Gauge | The current total number of tokens in the queued requests. | `vllm:num_tokens_waiting` (needs to be added) |
| MaxTokenCapacity | Gauge | The total size of the KV cache in number of tokens. | `vllm:max_token_capacity` <br> NOTE: This info is already available indirectly in the [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric, and could also be added [here](https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/engine/llm_engine.py#L1588). |
| AvailableModels | String | All the models/adapters that the model server is able to serve; requests for other models/adapters may return an error. | Already available from the `/models` API. |
| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds, where W will be decided by simulation/benchmarking. In time series metrics the latency is typically reported as a Histogram, and the average can be derived from it (see the query sketch after this table). | `vllm:time_to_first_token_seconds` |
| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds, where W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` |
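
As a sketch of how the average could be derived from such a Histogram, a PromQL query over a
window `W` (the `5m` below is just a placeholder for W) can divide the rate of the `_sum` series by
the rate of the `_count` series that Prometheus exposes for every Histogram:

```
rate(vllm:time_per_output_token_seconds_sum[5m])
  /
rate(vllm:time_per_output_token_seconds_count[5m])
```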
## LoRA Adapter Serving
### Dynamic LoRA Serving
Model servers that support dynamic LoRA serving can gain additional benefit from the inference
extension's LoRA affinity algorithm. Generally we expect model servers to:

* Support running multiple LoRA adapters in parallel in the same decode batch.
* Dynamically load/unload adapters between GPU memory and host memory depending on the requested
  adapters in the current batch.
#### Register/Unregister Adapters
Model servers SHOULD have APIs to dynamically register/unregister models (usually LoRA adapters). This enables platform teams to multiplex multiple LoRA adapters on shared model servers and to dynamically roll out LoRA adapters.

NOTE: This is not a strict requirement from the inference extension, but a critical feature for CI/CD integration.

While we don’t intend to dictate how model servers should implement this API, a reference REST API can look like this:

```
POST ${server_endpoint}/adapters/{adapter-id}
{
        "path": "path/to/my/adapter"
}

DELETE ${server_endpoint}/adapters/{adapter-id}
```
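
For example, a CI/CD-driven rollout against this sketch could register a new adapter version and
then retire the previous one; the adapter ids and artifact path below are purely hypothetical.

```
POST ${server_endpoint}/adapters/my-adapter-v2
{
        "path": "s3://my-bucket/adapters/my-adapter-v2"
}

DELETE ${server_endpoint}/adapters/my-adapter-v1
```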
