From a5e340c066037efc9aae4526077c87b69ec527ad Mon Sep 17 00:00:00 2001
From: Cong Liu
Date: Mon, 6 Jan 2025 15:10:36 -0800
Subject: [PATCH 1/4] Add model server protocol proposal

---
 .../003-model-server-protocol/protocol.md | 106 ++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 docs/proposals/003-model-server-protocol/protocol.md

diff --git a/docs/proposals/003-model-server-protocol/protocol.md b/docs/proposals/003-model-server-protocol/protocol.md
new file mode 100644
index 00000000..f4b09b5a
--- /dev/null
+++ b/docs/proposals/003-model-server-protocol/protocol.md
@@ -0,0 +1,106 @@
+# Model Server Protocol for Gateway API Inference Extension

## Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) API. In the future we are open to
supporting more API protocols.

<details>
+Why? +The extension makes intelligent request scheduling decisions based on certain information from the +request body, such as the `model` field. +
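
As an illustration only, an OpenAI-style chat request carrying the `model` field that the extension
inspects might look like the sketch below; `${server_endpoint}` and the model name are placeholders,
and the path and body shape simply follow OpenAI's Chat API:

```
POST ${server_endpoint}/v1/chat/completions
{
  "model": "my-lora-adapter",
  "messages": [
    {"role": "user", "content": "Summarize the following text ..."}
  ]
}
```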
+ +## Metrics Reporting + +The inference extension scrapes metrics from the model servers to make optimal request scheduling +decisions. The model servers SHOULD provide the following metrics via a Prometheus endpoint. While +the metric names may differ slightly in different model servers, the metric types MUST be the same. +We will align with the +[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) +effort. + +We also show the metrics in vLLM, which is already integrated into the inference extension. We will +add more model server once they are integrated. + +| Metric | Type | Description | vLLM metric | +| ----- | ---- | ---- | ---- | +| TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| +| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| + + +### Future Metrics +The following metrics MAY be needed in the future for further optimization. + +| Metric |Type | Description | vLLM metric | +| ----- | ---- | ---- | ---- | +| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch.| `vllm:num_tokens_running`| +| TotalQueuedTokens| Gauge | The current total number of tokens in the queued requests.| `vllm:num_tokens_waiting` (need to be added)| +| MaxTokenCapacity| Gauge | The total size of the KV cache in number of tokens.| `vllm:max_token_capacity`
NOTE: This info is available indirectly in [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric already , and also proposed in [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk). | +| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds. W will be decided by simulation/benchmarking. In time series metric the latency is typically reported as Histogram and we can derive the average from the Histogram. | `vllm:time_to_first_token_seconds` | +| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds. W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` | + +## LoRA Adapter Serving + +### Dynamic LoRA Serving + +Model servers that support dynamic LoRA serving can gain additional benefit from the inference +extension's LoRA affinity algorithm. While dynamic LoRA serving is quite new and evolving, and there +is no common standard, the inference extension generally expects the following behavior. + +* Support running multiple LoRA adapters in parallel in the same decode batch. +* Dynamically load/unload adapters in GPU memory from/to a cahe (e.g., in host memory) depending on + the requested adapters in the current batch. + +The model server SHOULD expose the following information via an API: + +* AdapterConfig + * LoRAEnabled: Whether dynamic LoRA serving is enabled. + * MaxActiveAdapter: Maximum number of adapters that can be loaded to GPU memory to serve a batch. + Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the + requested adapter. In vLLM, this is currently exposed as a string label `max_lora` in the + `vllm:lora_requests_info` metric. +* AdapterState + * ActiveAdapters: A list of adapters that are currently loaded in GPU memory and ready to servce + requests. In vLLM, this is currently exposed as a comma separated string label `running_lora_adapters` + in the `vllm:lora_requests_info` metric. + +The API MAY look like this: +``` +GET ${server_endpoint}/adapters/info +``` + +And the response MAY look like this: +``` +{ + "config": { + "enabled": true, + "maxActiveAdapters": 4, + }, + "state": { + "activeAdapters": ["adapter1", "adapter2"] + } +} +``` + +#### Dynamically Register/Unregister Adapters + +Model servers SHOULD have APIs to dynamically register/unregister models (usually LoRA adapters). +This enables platform teams to multiplex multiple LoRA adapters on shared model servers and +dynamically rollout LoRA adapters. + +NOTE this is not a strict requirement from the inference extension, but a critical feature for CI/CD +integration. 
+ +While we don’t intend to dictate how model servers should implement this API, a reference REST API +MAY look this: + +``` +POST ${server_endpoint}/adapters/{adapter-id} +{ +        "path": "path/to/my/adapter" +} + +DELETE ${server_endpoint}/adapters/{adapter-id} +``` From cbc26394e404fd8a1c8a000e0b90ad27c5c342b3 Mon Sep 17 00:00:00 2001 From: Cong Liu Date: Thu, 23 Jan 2025 13:25:19 -0800 Subject: [PATCH 2/4] Remove future work and focus on current release --- .../003-model-server-protocol/protocol.md | 72 +++++-------------- 1 file changed, 19 insertions(+), 53 deletions(-) diff --git a/docs/proposals/003-model-server-protocol/protocol.md b/docs/proposals/003-model-server-protocol/protocol.md index f4b09b5a..d393b0d5 100644 --- a/docs/proposals/003-model-server-protocol/protocol.md +++ b/docs/proposals/003-model-server-protocol/protocol.md @@ -15,63 +15,47 @@ request body, such as the `model` field. ## Metrics Reporting The inference extension scrapes metrics from the model servers to make optimal request scheduling -decisions. The model servers SHOULD provide the following metrics via a Prometheus endpoint. While +decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. While the metric names may differ slightly in different model servers, the metric types MUST be the same. We will align with the [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) effort. We also show the metrics in vLLM, which is already integrated into the inference extension. We will -add more model server once they are integrated. +add more model servers once they are integrated. | Metric | Type | Description | vLLM metric | | ----- | ---- | ---- | ---- | | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| - -### Future Metrics -The following metrics MAY be needed in the future for further optimization. - -| Metric |Type | Description | vLLM metric | -| ----- | ---- | ---- | ---- | -| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch.| `vllm:num_tokens_running`| -| TotalQueuedTokens| Gauge | The current total number of tokens in the queued requests.| `vllm:num_tokens_waiting` (need to be added)| -| MaxTokenCapacity| Gauge | The total size of the KV cache in number of tokens.| `vllm:max_token_capacity`
NOTE: This info is available indirectly in [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric already , and also proposed in [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk). | -| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds. W will be decided by simulation/benchmarking. In time series metric the latency is typically reported as Histogram and we can derive the average from the Histogram. | `vllm:time_to_first_token_seconds` | -| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds. W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` | - -## LoRA Adapter Serving - -### Dynamic LoRA Serving +## [Experimental] LoRA Adapter Serving Model servers that support dynamic LoRA serving can gain additional benefit from the inference -extension's LoRA affinity algorithm. While dynamic LoRA serving is quite new and evolving, and there -is no common standard, the inference extension generally expects the following behavior. +extension's LoRA affinity algorithm. As dynamic LoRA serving is quite new and evolving, this part is considered experimental and subject to changes in future releases. + +The inference extension expects the following behavior from compatible model servers. * Support running multiple LoRA adapters in parallel in the same decode batch. -* Dynamically load/unload adapters in GPU memory from/to a cahe (e.g., in host memory) depending on +* Dynamically load/unload adapters in GPU memory from/to a cache (e.g., in host memory) depending on the requested adapters in the current batch. -The model server SHOULD expose the following information via an API: +The model server MUST expose the following LoRA adapter information via a RESTful API with response in JSON : -* AdapterConfig - * LoRAEnabled: Whether dynamic LoRA serving is enabled. - * MaxActiveAdapter: Maximum number of adapters that can be loaded to GPU memory to serve a batch. +* `Config` + * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled. + * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the - requested adapter. In vLLM, this is currently exposed as a string label `max_lora` in the - `vllm:lora_requests_info` metric. -* AdapterState - * ActiveAdapters: A list of adapters that are currently loaded in GPU memory and ready to servce - requests. In vLLM, this is currently exposed as a comma separated string label `running_lora_adapters` - in the `vllm:lora_requests_info` metric. - -The API MAY look like this: + requested adapter. +* `State` + * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and ready to serve + requests. + +This is an example API endpoint and response: ``` GET ${server_endpoint}/adapters/info ``` -And the response MAY look like this: ``` { "config": { @@ -84,23 +68,5 @@ And the response MAY look like this: } ``` -#### Dynamically Register/Unregister Adapters - -Model servers SHOULD have APIs to dynamically register/unregister models (usually LoRA adapters). -This enables platform teams to multiplex multiple LoRA adapters on shared model servers and -dynamically rollout LoRA adapters. 
- -NOTE this is not a strict requirement from the inference extension, but a critical feature for CI/CD -integration. - -While we don’t intend to dictate how model servers should implement this API, a reference REST API -MAY look this: - -``` -POST ${server_endpoint}/adapters/{adapter-id} -{ -        "path": "path/to/my/adapter" -} - -DELETE ${server_endpoint}/adapters/{adapter-id} -``` +NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where +`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma separated string label `running_lora_adapters`. \ No newline at end of file From c4700052594b0f0eb6596c6db7893ae2eff304a5 Mon Sep 17 00:00:00 2001 From: Cong Liu Date: Mon, 27 Jan 2025 18:16:22 -0800 Subject: [PATCH 3/4] address comments --- .../003-model-server-protocol/protocol.md | 75 +++++++++++-------- 1 file changed, 45 insertions(+), 30 deletions(-) diff --git a/docs/proposals/003-model-server-protocol/protocol.md b/docs/proposals/003-model-server-protocol/protocol.md index d393b0d5..a9a3c96f 100644 --- a/docs/proposals/003-model-server-protocol/protocol.md +++ b/docs/proposals/003-model-server-protocol/protocol.md @@ -1,55 +1,68 @@ -# Model Server Protocol for Gateway API Inference Extension +# Endpoint Picker Protocol -## Inference API Protocol +The Endpoint Picker, or EPP, is a core component of the inference extension. Ultimately it's +responsible for picking an endpoint from the `InferencePool`. A reference implementation can be +found [here](../../../pkg/ext-proc/). -The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions) -and [Chat](https://platform.openai.com/docs/api-reference/chat) API. In the future we are open to -supporting more API protocols. +## Proxy Protocol + +This is the protocol between the EPP and the proxy (e.g, Envoy). + +The EPP MUST implement the Envoy +[external processing service](https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor)protocol. + +For each HTTP request, the EPP MUST communicate to the proxy the picked model server endpoint, via +adding the `target-pod` HTTP header in the request, or otherwise return an error. + +## Model Server Protocol -
-Why? -The extension makes intelligent request scheduling decisions based on certain information from the -request body, such as the `model` field. -
+This is the protocol between the EPP and the model servers. -## Metrics Reporting +### Inference API Protocol + +The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions) +and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs. + +### Metrics Reporting The inference extension scrapes metrics from the model servers to make optimal request scheduling -decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. While -the metric names may differ slightly in different model servers, the metric types MUST be the same. -We will align with the +decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. The exact +metric names don't necessarily need to be the same as the recommended names here, however the +metric types and semantics MUST follow this doc. + +Note the requirements here are aligned with the [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) effort. -We also show the metrics in vLLM, which is already integrated into the inference extension. We will -add more model servers once they are integrated. +The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated +into the reference endpoint picker implementation. | Metric | Type | Description | vLLM metric | | ----- | ---- | ---- | ---- | | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| -## [Experimental] LoRA Adapter Serving -Model servers that support dynamic LoRA serving can gain additional benefit from the inference -extension's LoRA affinity algorithm. As dynamic LoRA serving is quite new and evolving, this part is considered experimental and subject to changes in future releases. +### LoRA Adapter Serving -The inference extension expects the following behavior from compatible model servers. +Model servers that support dynamic LoRA serving can benefit from the LoRA affinity algorithm. Note +the current algorithm in the reference EPP is highly biased towards vLLM's current dynamic LoRA +implementation. -* Support running multiple LoRA adapters in parallel in the same decode batch. -* Dynamically load/unload adapters in GPU memory from/to a cache (e.g., in host memory) depending on - the requested adapters in the current batch. +The model servers MUST support serving a LoRA adapter specified in the `model` argument of the +request, provided the requested adapter is valid. -The model server MUST expose the following LoRA adapter information via a RESTful API with response in JSON : +The model server MUST expose the following LoRA adapter information via a RESTful API with response +in JSON : * `Config` * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled. - * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to serve a batch. - Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the - requested adapter. + * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to + serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot + load the requested adapter. 
* `State` - * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and ready to serve - requests. + * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and + ready to serve requests. This is an example API endpoint and response: ``` @@ -69,4 +82,6 @@ GET ${server_endpoint}/adapters/info ``` NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where -`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma separated string label `running_lora_adapters`. \ No newline at end of file +`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma +separated string label `running_lora_adapters`. We will use [this issue](https://github.com/vllm-project/vllm/issues/10086) +to track integration efforts with vLLM. \ No newline at end of file From 00c6e6188c34166017ac257213c8f00826b9ccaa Mon Sep 17 00:00:00 2001 From: Cong Liu Date: Tue, 28 Jan 2025 15:52:13 -0800 Subject: [PATCH 4/4] document current lora metrics --- .../003-model-server-protocol/protocol.md | 44 +++++-------------- 1 file changed, 11 insertions(+), 33 deletions(-) diff --git a/docs/proposals/003-model-server-protocol/protocol.md b/docs/proposals/003-model-server-protocol/protocol.md index a9a3c96f..3ce38344 100644 --- a/docs/proposals/003-model-server-protocol/protocol.md +++ b/docs/proposals/003-model-server-protocol/protocol.md @@ -52,36 +52,14 @@ implementation. The model servers MUST support serving a LoRA adapter specified in the `model` argument of the request, provided the requested adapter is valid. -The model server MUST expose the following LoRA adapter information via a RESTful API with response -in JSON : - -* `Config` - * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled. - * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to - serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot - load the requested adapter. -* `State` - * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and - ready to serve requests. - -This is an example API endpoint and response: -``` -GET ${server_endpoint}/adapters/info -``` - -``` -{ - "config": { - "enabled": true, - "maxActiveAdapters": 4, - }, - "state": { - "activeAdapters": ["adapter1", "adapter2"] - } -} -``` - -NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where -`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma -separated string label `running_lora_adapters`. We will use [this issue](https://github.com/vllm-project/vllm/issues/10086) -to track integration efforts with vLLM. \ No newline at end of file +The model server MUST expose the following LoRA adapter metrics via the same Prometheus endpoint: + +* Metric name implemented in vLLM: `vllm:lora_requests_info` +* Metric type: Gauge +* Metric value: The last updated timestamp (so the EPP can find the latest). +* Metric labels: + * `max_lora`: The maximum number of adapters that can be loaded to GPU memory to serve a batch. + Requests will be queued if the model server has reached MaxActiveAdapter and canno load the + requested adapter. Example: `"max_lora": "8"`. + * `running_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU + memory and ready to serve requests. 
Example: `"running_lora_adapters": "adapter1, adapter2"`
\ No newline at end of file
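
As an illustration only, a Prometheus scrape from a model server that meets the metrics
requirements in this series (the queue and KV cache gauges plus the LoRA info gauge from
PATCH 4/4) might look like the sketch below; the values, adapter names, and timestamp are made up,
and the metric names shown are vLLM's, while the protocol allows different names as long as the
types and semantics match:

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 3
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.72
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="8",running_lora_adapters="adapter1,adapter2"} 1.738e+09
```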