This document defines the protocol between the EPP (Endpoint Picker) and the model servers.
The model server MUST implement OpenAI’s Completions and Chat APIs.
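For illustration, a Chat API request to such a server is a `POST /v1/chat/completions` call with a JSON body like the one below; the model name and message content are placeholders, not part of this protocol:

```json
{
  "model": "base-model",
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ]
}
```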
The inference extension scrapes metrics from the model servers to make optimal request scheduling decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. The exact metric names don't need to match the recommended names here; however, the metric types and semantics MUST follow this doc.
Note the requirements here are aligned with the model server metrics standardization effort.
The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated into the reference endpoint picker implementation.
| Metric | Type | Description | vLLM metric |
| --- | --- | --- | --- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` |
| KVCacheUtilization | Gauge | The current KV cache utilization as a percentage. | `vllm:gpu_cache_usage_perc` |
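For illustration, a scrape of a vLLM server's Prometheus endpoint would include lines like the following; the metric values shown are examples only:

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc 0.42
```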
Model servers that support dynamic LoRA serving can benefit from the LoRA affinity algorithm. Note that the algorithm in the reference EPP is currently highly biased towards vLLM's dynamic LoRA implementation.
The model servers MUST support serving a LoRA adapter specified in the `model` argument of the request, provided the requested adapter is valid.
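For example, to target a dynamically served LoRA adapter instead of the base model, the request names the adapter in the `model` field; `adapter1` here is a placeholder adapter name:

```json
{
  "model": "adapter1",
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ]
}
```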
The model server MUST expose the following LoRA adapter metrics via the same Prometheus endpoint:
- Metric name implemented in vLLM: `vllm:lora_requests_info`
- Metric type: Gauge
- Metric value: The last updated timestamp (so the EPP can find the latest).
- Metric labels:
  - `max_lora`: The maximum number of adapters that can be loaded to GPU memory to serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the requested adapter. Example: `"max_lora": "8"`.
  - `running_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`.
  - `waiting_lora_adapters`: A comma separated list of adapters that are waiting to be served. Example: `"waiting_lora_adapters": "adapter1, adapter2"`.
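Putting the pieces together, a single scrape line for this metric might look like the following sketch; the adapter names and the timestamp value are illustrative:

```
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="8",running_lora_adapters="adapter1, adapter2",waiting_lora_adapters="adapter3"} 1.7154e+09
```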