# Model Server Protocol for Gateway API Inference Extension

## Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs. In the future we are open to
supporting more API protocols.

This is required because the extension makes intelligent request-scheduling decisions based on
information in the request body, such as the `model` field.
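
As an illustrative sketch (the endpoint path follows OpenAI’s Chat API; the model name and message
content are placeholders, not mandated by this protocol), a request the extension might inspect
could look like:

```
POST ${server_endpoint}/v1/chat/completions
{
  "model": "my-lora-adapter",
  "messages": [
    {"role": "user", "content": "Write a short review of a ramen shop."}
  ]
}
```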

## Metrics Reporting

The inference extension scrapes metrics from the model servers to make optimal request-scheduling
decisions. The PREFERRED metrics format is Prometheus. We do not intend to dictate the exact metric
naming and format, especially if a corresponding metric already exists. We will leverage the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort to bring as much unification as possible across model server communities.

The tables below also list the corresponding metrics in vLLM, which is already integrated with the
inference extension. We are working on integrating with more model servers.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` |
| KVCacheUtilization | Gauge | The current KV cache utilization as a percentage. | `vllm:gpu_cache_usage_perc` |
| MaxActiveModels | Gauge | Maximum number of models/adapters that can be loaded into GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveModels and cannot load the requested model/adapter. | `vllm:lora_requests_info.max_lora` |
| ActiveModels | String (can be a label of a Prometheus Gauge metric) | Comma-separated list of models/adapters that are currently loaded into GPU memory; new requests for these models/adapters therefore do not require evicting other models/adapters. | `vllm:lora_requests_info.running_lora_adapters` |
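
As a hedged illustration, a Prometheus scrape of a vLLM server exposing these metrics might contain
lines like the following (values, label sets, and adapter names are examples only and may differ
across vLLM versions):

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-2-7b-hf"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-2-7b-hf"} 0.42
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="4",running_lora_adapters="adapter-a,adapter-b"} 1.0
```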

The following metrics MAY be needed in the future for further optimization.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch. | `vllm:num_tokens_running` |
| TotalQueuedTokens | Gauge | The current total number of tokens in the queued requests. | `vllm:num_tokens_waiting` (needs to be added) |
| MaxTokenCapacity | Gauge | The total size of the KV cache in number of tokens. | `vllm:max_token_capacity` <br> NOTE: This info is already available indirectly in the [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric, and it has also been proposed that a dedicated metric can be added [here](https://github.com/vllm-project/vllm/blob/22f5851b807376a836eb3551903c7fc6c81eaa9b/vllm/engine/llm_engine.py#L1588). |
| AvailableModels | String | All the models/adapters that the model server is able to serve; requests for other models may return an error. | Already available from the `/models` API. |
| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds, where W will be decided by simulation/benchmarking. In time-series metrics, latency is typically reported as a Histogram, and the average can be derived from it. | `vllm:time_to_first_token_seconds` |
| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds, where W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` |

## LoRA Adapter Serving

### Dynamic LoRA Serving

Model servers that support dynamic LoRA serving can gain additional benefit from the inference
extension's LoRA affinity algorithm. Generally, we expect model servers to:

* Support running multiple LoRA adapters in parallel in the same decode batch.
* Dynamically load/unload adapters between GPU memory and host memory depending on the requested
  adapters in the current batch.

#### Register/Unregister Adapters

Model servers SHOULD have APIs to dynamically register/unregister models (usually LoRA adapters).
This enables platform teams to multiplex multiple LoRA adapters on shared model servers and to
dynamically roll out LoRA adapters.

NOTE: This is not a strict requirement from the inference extension, but it is a critical feature
for CI/CD integration.

While we don't intend to dictate how model servers should implement this API, a reference REST API
could look like this:

```
POST ${server_endpoint}/adapters/{adapter-id}
{
  "path": "path/to/my/adapter"
}

DELETE ${server_endpoint}/adapters/{adapter-id}
```
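
For example, under the reference API above, a CI/CD pipeline might register and later unregister an
adapter like this (the adapter id and path are placeholders):

```
curl -X POST ${server_endpoint}/adapters/my-adapter-v2 \
  -H "Content-Type: application/json" \
  -d '{"path": "path/to/my/adapter"}'

curl -X DELETE ${server_endpoint}/adapters/my-adapter-v2
```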