From a5e340c066037efc9aae4526077c87b69ec527ad Mon Sep 17 00:00:00 2001
From: Cong Liu
Date: Mon, 6 Jan 2025 15:10:36 -0800
Subject: [PATCH 1/4] Add model server protocol proposal

---
 .../003-model-server-protocol/protocol.md | 106 ++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 docs/proposals/003-model-server-protocol/protocol.md

diff --git a/docs/proposals/003-model-server-protocol/protocol.md b/docs/proposals/003-model-server-protocol/protocol.md
new file mode 100644
index 00000000..f4b09b5a
--- /dev/null
+++ b/docs/proposals/003-model-server-protocol/protocol.md
@@ -0,0 +1,106 @@
+# Model Server Protocol for Gateway API Inference Extension

## Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) API. In the future we are open to
supporting more API protocols.

<details>
+Why? +The extension makes intelligent request scheduling decisions based on certain information from the +request body, such as the `model` field. +
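
As an illustration only, an OpenAI-style chat request carrying the `model` field that the extension
inspects might look like the sketch below; `${server_endpoint}` and the model name are placeholders,
and the path and body shape simply follow OpenAI's Chat API:

```
POST ${server_endpoint}/v1/chat/completions
{
  "model": "my-lora-adapter",
  "messages": [
    {"role": "user", "content": "Summarize the following text ..."}
  ]
}
```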
+ +## Metrics Reporting + +The inference extension scrapes metrics from the model servers to make optimal request scheduling +decisions. The model servers SHOULD provide the following metrics via a Prometheus endpoint. While +the metric names may differ slightly in different model servers, the metric types MUST be the same. +We will align with the +[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) +effort. + +We also show the metrics in vLLM, which is already integrated into the inference extension. We will +add more model server once they are integrated. + +| Metric | Type | Description | vLLM metric | +| ----- | ---- | ---- | ---- | +| TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| +| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| + + +### Future Metrics +The following metrics MAY be needed in the future for further optimization. + +| Metric |Type | Description | vLLM metric | +| ----- | ---- | ---- | ---- | +| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch.| `vllm:num_tokens_running`| +| TotalQueuedTokens| Gauge | The current total number of tokens in the queued requests.| `vllm:num_tokens_waiting` (need to be added)| +| MaxTokenCapacity| Gauge | The total size of the KV cache in number of tokens.| `vllm:max_token_capacity`
NOTE: This info is available indirectly in [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric already , and also proposed in [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk). | +| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds. W will be decided by simulation/benchmarking. In time series metric the latency is typically reported as Histogram and we can derive the average from the Histogram. | `vllm:time_to_first_token_seconds` | +| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds. W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` | + +## LoRA Adapter Serving + +### Dynamic LoRA Serving + +Model servers that support dynamic LoRA serving can gain additional benefit from the inference +extension's LoRA affinity algorithm. While dynamic LoRA serving is quite new and evolving, and there +is no common standard, the inference extension generally expects the following behavior. + +* Support running multiple LoRA adapters in parallel in the same decode batch. +* Dynamically load/unload adapters in GPU memory from/to a cahe (e.g., in host memory) depending on + the requested adapters in the current batch. + +The model server SHOULD expose the following information via an API: + +* AdapterConfig + * LoRAEnabled: Whether dynamic LoRA serving is enabled. + * MaxActiveAdapter: Maximum number of adapters that can be loaded to GPU memory to serve a batch. + Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the + requested adapter. In vLLM, this is currently exposed as a string label `max_lora` in the + `vllm:lora_requests_info` metric. +* AdapterState + * ActiveAdapters: A list of adapters that are currently loaded in GPU memory and ready to servce + requests. In vLLM, this is currently exposed as a comma separated string label `running_lora_adapters` + in the `vllm:lora_requests_info` metric. + +The API MAY look like this: +``` +GET ${server_endpoint}/adapters/info +``` + +And the response MAY look like this: +``` +{ + "config": { + "enabled": true, + "maxActiveAdapters": 4, + }, + "state": { + "activeAdapters": ["adapter1", "adapter2"] + } +} +``` + +#### Dynamically Register/Unregister Adapters + +Model servers SHOULD have APIs to dynamically register/unregister models (usually LoRA adapters). +This enables platform teams to multiplex multiple LoRA adapters on shared model servers and +dynamically rollout LoRA adapters. + +NOTE this is not a strict requirement from the inference extension, but a critical feature for CI/CD +integration. 
+ +While we don’t intend to dictate how model servers should implement this API, a reference REST API +MAY look this: + +``` +POST ${server_endpoint}/adapters/{adapter-id} +{ +        "path": "path/to/my/adapter" +} + +DELETE ${server_endpoint}/adapters/{adapter-id} +``` From cbc26394e404fd8a1c8a000e0b90ad27c5c342b3 Mon Sep 17 00:00:00 2001 From: Cong Liu Date: Thu, 23 Jan 2025 13:25:19 -0800 Subject: [PATCH 2/4] Remove future work and focus on current release --- .../003-model-server-protocol/protocol.md | 72 +++++-------------- 1 file changed, 19 insertions(+), 53 deletions(-) diff --git a/docs/proposals/003-model-server-protocol/protocol.md b/docs/proposals/003-model-server-protocol/protocol.md index f4b09b5a..d393b0d5 100644 --- a/docs/proposals/003-model-server-protocol/protocol.md +++ b/docs/proposals/003-model-server-protocol/protocol.md @@ -15,63 +15,47 @@ request body, such as the `model` field. ## Metrics Reporting The inference extension scrapes metrics from the model servers to make optimal request scheduling -decisions. The model servers SHOULD provide the following metrics via a Prometheus endpoint. While +decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. While the metric names may differ slightly in different model servers, the metric types MUST be the same. We will align with the [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) effort. We also show the metrics in vLLM, which is already integrated into the inference extension. We will -add more model server once they are integrated. +add more model servers once they are integrated. | Metric | Type | Description | vLLM metric | | ----- | ---- | ---- | ---- | | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| - -### Future Metrics -The following metrics MAY be needed in the future for further optimization. - -| Metric |Type | Description | vLLM metric | -| ----- | ---- | ---- | ---- | -| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch.| `vllm:num_tokens_running`| -| TotalQueuedTokens| Gauge | The current total number of tokens in the queued requests.| `vllm:num_tokens_waiting` (need to be added)| -| MaxTokenCapacity| Gauge | The total size of the KV cache in number of tokens.| `vllm:max_token_capacity`
NOTE: This info is available indirectly in [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric already , and also proposed in [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk). | -| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds. W will be decided by simulation/benchmarking. In time series metric the latency is typically reported as Histogram and we can derive the average from the Histogram. | `vllm:time_to_first_token_seconds` | -| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds. W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` | - -## LoRA Adapter Serving - -### Dynamic LoRA Serving +## [Experimental] LoRA Adapter Serving Model servers that support dynamic LoRA serving can gain additional benefit from the inference -extension's LoRA affinity algorithm. While dynamic LoRA serving is quite new and evolving, and there -is no common standard, the inference extension generally expects the following behavior. +extension's LoRA affinity algorithm. As dynamic LoRA serving is quite new and evolving, this part is considered experimental and subject to changes in future releases. + +The inference extension expects the following behavior from compatible model servers. * Support running multiple LoRA adapters in parallel in the same decode batch. -* Dynamically load/unload adapters in GPU memory from/to a cahe (e.g., in host memory) depending on +* Dynamically load/unload adapters in GPU memory from/to a cache (e.g., in host memory) depending on the requested adapters in the current batch. -The model server SHOULD expose the following information via an API: +The model server MUST expose the following LoRA adapter information via a RESTful API with response in JSON : -* AdapterConfig - * LoRAEnabled: Whether dynamic LoRA serving is enabled. - * MaxActiveAdapter: Maximum number of adapters that can be loaded to GPU memory to serve a batch. +* `Config` + * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled. + * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the - requested adapter. In vLLM, this is currently exposed as a string label `max_lora` in the - `vllm:lora_requests_info` metric. -* AdapterState - * ActiveAdapters: A list of adapters that are currently loaded in GPU memory and ready to servce - requests. In vLLM, this is currently exposed as a comma separated string label `running_lora_adapters` - in the `vllm:lora_requests_info` metric. - -The API MAY look like this: + requested adapter. +* `State` + * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and ready to serve + requests. + +This is an example API endpoint and response: ``` GET ${server_endpoint}/adapters/info ``` -And the response MAY look like this: ``` { "config": { @@ -84,23 +68,5 @@ And the response MAY look like this: } ``` -#### Dynamically Register/Unregister Adapters - -Model servers SHOULD have APIs to dynamically register/unregister models (usually LoRA adapters). -This enables platform teams to multiplex multiple LoRA adapters on shared model servers and -dynamically rollout LoRA adapters. 
- -NOTE this is not a strict requirement from the inference extension, but a critical feature for CI/CD -integration. - -While we don’t intend to dictate how model servers should implement this API, a reference REST API -MAY look this: - -``` -POST ${server_endpoint}/adapters/{adapter-id} -{ -        "path": "path/to/my/adapter" -} - -DELETE ${server_endpoint}/adapters/{adapter-id} -``` +NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where +`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma separated string label `running_lora_adapters`. \ No newline at end of file From c4700052594b0f0eb6596c6db7893ae2eff304a5 Mon Sep 17 00:00:00 2001 From: Cong Liu Date: Mon, 27 Jan 2025 18:16:22 -0800 Subject: [PATCH 3/4] address comments --- .../003-model-server-protocol/protocol.md | 75 +++++++++++-------- 1 file changed, 45 insertions(+), 30 deletions(-) diff --git a/docs/proposals/003-model-server-protocol/protocol.md b/docs/proposals/003-model-server-protocol/protocol.md index d393b0d5..a9a3c96f 100644 --- a/docs/proposals/003-model-server-protocol/protocol.md +++ b/docs/proposals/003-model-server-protocol/protocol.md @@ -1,55 +1,68 @@ -# Model Server Protocol for Gateway API Inference Extension +# Endpoint Picker Protocol -## Inference API Protocol +The Endpoint Picker, or EPP, is a core component of the inference extension. Ultimately it's +responsible for picking an endpoint from the `InferencePool`. A reference implementation can be +found [here](../../../pkg/ext-proc/). -The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions) -and [Chat](https://platform.openai.com/docs/api-reference/chat) API. In the future we are open to -supporting more API protocols. +## Proxy Protocol + +This is the protocol between the EPP and the proxy (e.g, Envoy). + +The EPP MUST implement the Envoy +[external processing service](https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor)protocol. + +For each HTTP request, the EPP MUST communicate to the proxy the picked model server endpoint, via +adding the `target-pod` HTTP header in the request, or otherwise return an error. + +## Model Server Protocol -
-Why? -The extension makes intelligent request scheduling decisions based on certain information from the -request body, such as the `model` field. -
+This is the protocol between the EPP and the model servers. -## Metrics Reporting +### Inference API Protocol + +The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions) +and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs. + +### Metrics Reporting The inference extension scrapes metrics from the model servers to make optimal request scheduling -decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. While -the metric names may differ slightly in different model servers, the metric types MUST be the same. -We will align with the +decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. The exact +metric names don't necessarily need to be the same as the recommended names here, however the +metric types and semantics MUST follow this doc. + +Note the requirements here are aligned with the [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) effort. -We also show the metrics in vLLM, which is already integrated into the inference extension. We will -add more model servers once they are integrated. +The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated +into the reference endpoint picker implementation. | Metric | Type | Description | vLLM metric | | ----- | ---- | ---- | ---- | | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| -## [Experimental] LoRA Adapter Serving -Model servers that support dynamic LoRA serving can gain additional benefit from the inference -extension's LoRA affinity algorithm. As dynamic LoRA serving is quite new and evolving, this part is considered experimental and subject to changes in future releases. +### LoRA Adapter Serving -The inference extension expects the following behavior from compatible model servers. +Model servers that support dynamic LoRA serving can benefit from the LoRA affinity algorithm. Note +the current algorithm in the reference EPP is highly biased towards vLLM's current dynamic LoRA +implementation. -* Support running multiple LoRA adapters in parallel in the same decode batch. -* Dynamically load/unload adapters in GPU memory from/to a cache (e.g., in host memory) depending on - the requested adapters in the current batch. +The model servers MUST support serving a LoRA adapter specified in the `model` argument of the +request, provided the requested adapter is valid. -The model server MUST expose the following LoRA adapter information via a RESTful API with response in JSON : +The model server MUST expose the following LoRA adapter information via a RESTful API with response +in JSON : * `Config` * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled. - * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to serve a batch. - Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the - requested adapter. + * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to + serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot + load the requested adapter. 
* `State` - * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and ready to serve - requests. + * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and + ready to serve requests. This is an example API endpoint and response: ``` @@ -69,4 +82,6 @@ GET ${server_endpoint}/adapters/info ``` NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where -`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma separated string label `running_lora_adapters`. \ No newline at end of file +`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma +separated string label `running_lora_adapters`. We will use [this issue](https://github.com/vllm-project/vllm/issues/10086) +to track integration efforts with vLLM. \ No newline at end of file From 00c6e6188c34166017ac257213c8f00826b9ccaa Mon Sep 17 00:00:00 2001 From: Cong Liu Date: Tue, 28 Jan 2025 15:52:13 -0800 Subject: [PATCH 4/4] document current lora metrics --- .../003-model-server-protocol/protocol.md | 44 +++++-------------- 1 file changed, 11 insertions(+), 33 deletions(-) diff --git a/docs/proposals/003-model-server-protocol/protocol.md b/docs/proposals/003-model-server-protocol/protocol.md index a9a3c96f..3ce38344 100644 --- a/docs/proposals/003-model-server-protocol/protocol.md +++ b/docs/proposals/003-model-server-protocol/protocol.md @@ -52,36 +52,14 @@ implementation. The model servers MUST support serving a LoRA adapter specified in the `model` argument of the request, provided the requested adapter is valid. -The model server MUST expose the following LoRA adapter information via a RESTful API with response -in JSON : - -* `Config` - * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled. - * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to - serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot - load the requested adapter. -* `State` - * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and - ready to serve requests. - -This is an example API endpoint and response: -``` -GET ${server_endpoint}/adapters/info -``` - -``` -{ - "config": { - "enabled": true, - "maxActiveAdapters": 4, - }, - "state": { - "activeAdapters": ["adapter1", "adapter2"] - } -} -``` - -NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where -`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma -separated string label `running_lora_adapters`. We will use [this issue](https://github.com/vllm-project/vllm/issues/10086) -to track integration efforts with vLLM. \ No newline at end of file +The model server MUST expose the following LoRA adapter metrics via the same Prometheus endpoint: + +* Metric name implemented in vLLM: `vllm:lora_requests_info` +* Metric type: Gauge +* Metric value: The last updated timestamp (so the EPP can find the latest). +* Metric labels: + * `max_lora`: The maximum number of adapters that can be loaded to GPU memory to serve a batch. + Requests will be queued if the model server has reached MaxActiveAdapter and canno load the + requested adapter. Example: `"max_lora": "8"`. + * `running_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU + memory and ready to serve requests. 
Example: `"running_lora_adapters": "adapter1, adapter2"`
\ No newline at end of file
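
As an illustration only, a Prometheus scrape from a model server that meets the metrics
requirements in this series (the queue and KV cache gauges plus the LoRA info gauge from
PATCH 4/4) might look like the sketch below; the values, adapter names, and timestamp are made up,
and the metric names shown are vLLM's, while the protocol allows different names as long as the
types and semantics match:

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 3
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.72
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="8",running_lora_adapters="adapter1,adapter2"} 1.738e+09
```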