Redesign EPP Metrics Pipeline to be Model Server Agnostic #461
Conversation
Hi @BenjaminBraunDev. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for gateway-api-inference-extension ready!
I would think it is better to define a MetricsFetcher interface; for each inference framework, there should be an implementation.
Passing the metric names from args looks like hardcoding; it could be difficult if we later want to support multiple inference engines simultaneously.
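For illustration, a rough sketch of that suggested interface (the names, fields, and signature here are assumptions for discussion, not code from this PR):

```go
// Hypothetical sketch of the suggested abstraction; not code from this PR.
package metrics

import "context"

// Metrics holds the values EPP load balances on (field names are illustrative).
type Metrics struct {
	WaitingQueueSize    int
	KVCacheUsagePercent float64
	ActiveLoRAAdapters  []string
}

// MetricsFetcher would have one implementation per inference framework
// (vLLM, Triton, JetStream, ...), hiding each server's metric naming.
type MetricsFetcher interface {
	// FetchMetrics scrapes one model server endpoint and translates its
	// exposed metrics into the common load-balancing representation.
	FetchMetrics(ctx context.Context, endpoint string) (*Metrics, error)
}
```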
Removed intermediate metrics and Triton support (for now, until we make the changes in the trtllm backend). Made the metrics mapping 1-to-1 with the metrics we support for load balancing, and made the vLLM metric families the defaults. Also, we should update the Quickstart Guide to account for the movement of ext_proc into the vllm directory here, and also because it's out of date with some of the file paths. For instance, the vLLM deployment was broken into 2 separate yamls, so this command doesn't work anymore:
We should change that, and here change (I moved this here since currently a new ext_proc.yaml will need to be made for each model server, setting the appropriate metric flags. I think we should keep vLLM-specific things inside a vllm directory.)
to
It's true that most of the string parsing logic isn't being run unless labels are present. There are 2 phases we're planning to go through:
For this PR (and for initial Triton support, once they have the compound metrics we need) we're on step 1, but if we can get model servers on board, we could have vLLM, Triton, JetStream, etc. add a prometheus gauge metric like:
Then we could do away with this parsing on our end.
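As a rough illustration of that idea (the metric and label names below are made up for discussion, not an agreed standard), a model server could publish one gauge family whose samples map directly onto the load-balancing signals:

```go
// Hypothetical sketch: a standardized gauge a model server could expose so
// EPP needs no per-server translation. Names are illustrative only.
package main

import "github.com/prometheus/client_golang/prometheus"

var lbInfo = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "inference_load_balancing_info",
		Help: "Load-balancing signals exposed directly by the model server.",
	},
	[]string{"signal"},
)

func main() {
	prometheus.MustRegister(lbInfo)
	// The server sets the values itself; EPP would then read them verbatim
	// instead of parsing server-specific metric families.
	lbInfo.WithLabelValues("kv_cache_usage_percent").Set(0.42)
	lbInfo.WithLabelValues("queue_size").Set(3)
}
```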
/ok-to-test
Can you rebase pls?
Note: While it does make the diff more difficult to read, I moved the metrics out of the vllm folder since it would result in multiple backend/metrics packages, and the code in
Pls keep the yaml files where they are for now, the guides reference them.
Moved these files back.
Thanks for the quick back and forth, this is perfect! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, BenjaminBraunDev

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
This is a redesign of how EPP extracts metrics from its model server endpoints, in a way that works with any model server that supports prometheus metrics.
Additionally, so long as the model server implements the OpenAI API protocol, EPP can send inference requests and load balance across InferenceModel objects regardless of the underlying model server architecture.
What problem is being solved?
The difficult part of enabling EPP to collect metrics for any model server is the translation of metrics from what the model server exposes into the set of metrics we load balance on. While many model servers export their metrics in prometheus format, there is no consistent naming scheme or interface to follow.
The goal of this redesign is to provide a set of assignable model-server-side prometheus metrics and automatically map them to the `datastore.Metrics` type fields that EPP uses to load balance.

Why the complexity?
In some cases, a model server's prometheus metric might directly correspond to one of the load-balancing metrics, but other times we need multiple prometheus metrics to derive a single load-balancing metric. Therefore, some form of mapping from the set of common prometheus metrics a server might provide to the load-balancing metrics we need is necessary.
To address this, we provide EPP with several potential runtime flags that define what metrics it has to work with for the model server endpoints it's serving. From there, we check if the flags provided are adequate to derive all the required load balancing metrics. For example:
In order for EPP to derive `KVCacheUsagePercent`, a model server must provide either:

- `KVCacheUsagePercent` (vLLM provides this directly)

OR

- `UsedKVCacheBlocks` and `MaxKVCacheBlocks` (Triton has no percentage metric, so provides both of these instead)

Each of these is provided as the latest value of some prometheus metric family, optionally filtered on a set of key/value labels. Going forward, to incorporate a new model server it can either provide a family/labels for the direct metric, or alternatively a family/labels for each of the constituent metrics, and EPP will derive the metric from there.
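As a minimal sketch of the derivation just described (function and parameter names are illustrative, not the PR's actual code):

```go
package metrics

import "fmt"

// kvCacheUsagePercent is an illustrative sketch of the derivation above: prefer
// a direct percentage metric when the server exposes one (vLLM style), otherwise
// derive it from used/max KV cache block counts (Triton style). A nil pointer
// means the corresponding prometheus metric was not configured or not scraped.
func kvCacheUsagePercent(direct, usedBlocks, maxBlocks *float64) (float64, error) {
	switch {
	case direct != nil:
		return *direct, nil
	case usedBlocks != nil && maxBlocks != nil && *maxBlocks > 0:
		return *usedBlocks / *maxBlocks, nil
	default:
		return 0, fmt.Errorf("configured metrics cannot derive KVCacheUsagePercent")
	}
}
```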
Breakdown of Changes
`backend/metrics.go` and new `backend/metrics_spec.go`

`backend` now has a single metrics.go for all model servers, and has a new field in its `PodMetricsClient` implementation called `MetricMapping`. This MetricMapping is a set of predefined MetricSpec objects that represent what constituent metrics a model server might provide; the idea is that any model server should be able to provide most of these, but perhaps not all, and we should be robust to that without requiring any input from the model server side.

From there, the logic for knowing whether we have sufficient prometheus metrics to load balance, and how to derive those load balancing metrics, is in `metrics.go`. To clarify, the goal is not to process prometheus metrics based on the specific model server, but rather to abstract away what model server we are scraping and scrape purely based on the subset of prometheus metrics we are given in the EPP runtime flags. This allows us to avoid writing a new metrics.go (and compiling a separate EPP image) for each model server we want to support, only requiring code changes if a new model server isn't able to provide metrics that "span" the load balancing metrics.
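Roughly, the shape of these types can be pictured like this (a sketch of the idea only; the actual definitions in `backend/metrics_spec.go` may differ):

```go
package backend

// MetricSpec (sketch) names one prometheus metric family, optionally narrowed
// by a set of labels that a scraped sample must carry.
type MetricSpec struct {
	FamilyName string
	Labels     map[string]string // empty means "any sample in the family"
}

// MetricMapping (sketch) lists the constituent metrics a model server might
// provide. Any given server typically fills only a subset; EPP then checks
// whether that subset is enough to derive every load-balancing metric.
type MetricMapping struct {
	TotalQueuedRequests *MetricSpec
	KVCacheUsagePercent *MetricSpec // direct percentage (vLLM style)
	UsedKVCacheBlocks   *MetricSpec // paired with MaxKVCacheBlocks (Triton style)
	MaxKVCacheBlocks    *MetricSpec
	LoraRequestInfo     *MetricSpec
}
```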
`main.go`

Here we add flags for the prometheus family/labels. The format of these flags is modeled after prometheus strings themselves, that being a metric family followed by a list of key-value pair labels in curly braces:
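As an illustration of that format only (the flag name and parsing helper below are hypothetical, not taken from main.go), a spec like `some_family{label_a="x",label_b="y"}` could be split into its family and label filters roughly as follows:

```go
package main

import (
	"flag"
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical flag illustrating the format; the real flag names differ.
var kvCacheUsageMetric = flag.String("kvCacheUsagePercentMetric", "",
	`Prometheus metric spec for KV cache usage, in family{key="value",...} form.`)

var specRe = regexp.MustCompile(`^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?$`)

// parseMetricSpec splits a flag value into the metric family name and the
// label filters a scraped sample must match.
func parseMetricSpec(s string) (family string, labels map[string]string, ok bool) {
	m := specRe.FindStringSubmatch(strings.TrimSpace(s))
	if m == nil {
		return "", nil, false
	}
	labels = map[string]string{}
	if m[2] != "" {
		for _, pair := range strings.Split(m[2], ",") {
			kv := strings.SplitN(pair, "=", 2)
			if len(kv) != 2 {
				return "", nil, false
			}
			labels[strings.TrimSpace(kv[0])] = strings.Trim(strings.TrimSpace(kv[1]), `"`)
		}
	}
	return m[1], labels, true
}

func main() {
	flag.Parse()
	family, labels, ok := parseMetricSpec(*kvCacheUsageMetric)
	fmt.Printf("family=%q labels=%v valid=%v\n", family, labels, ok)
}
```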
From there, `metrics.go` will scrape the latest metric in that family that has at least all the specified labels. For example, for Triton, the EPP flags look like this:

(from args section of `config/manifests/triton/ext_proc.yaml`)
Whereas vLLM has separate metric families for each of its metrics (except lora metrics, which are handled differently) and the flags look like this:
(from args section of `config/manifests/vllm/ext_proc.yaml`)
In either case, we can deploy EPP for either Triton or vLLM using the same image, only changing the flags.
Limitations
Currently, this still requires EPP to be re-run to switch which model server is being inferenced; the next logical step is to allow the gateway to choose between EPP services based on the request, or perhaps to allow a single running instance of EPP to support multiple model server types.
Additionally, if a model server has new prometheus metrics that can still be derived into our load balancing metrics in some way, we would have to add that logic in `metrics.go`.

LoRA metrics are also handled differently and are treated as a special case for vLLM. There is a single metric family, and for each entry the value is a timestamp, with the actual metric information being in the labels of that entry. If a model server happens to use this same system as vLLM, it can provide a metric family name for the `-loraRequestInfoMetric` flag, and it will be handled in the same way vLLM is; however, it's unclear if this is a standard approach for LoRA metrics.
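For illustration, reading such a LoRA-info style metric could look roughly like this (the label name is an assumption and may not match what vLLM actually exposes):

```go
package backend

import (
	"strings"

	dto "github.com/prometheus/client_model/go"
)

// activeAdapters is a sketch of handling a vLLM-style LoRA info metric: the
// gauge value is only a timestamp, and the loaded adapters are encoded in a
// label. The label name "running_lora_adapters" is illustrative.
func activeAdapters(family *dto.MetricFamily) []string {
	var latest *dto.Metric
	for _, m := range family.GetMetric() {
		// The value is a timestamp, so the largest value is the newest entry.
		if latest == nil || m.GetGauge().GetValue() > latest.GetGauge().GetValue() {
			latest = m
		}
	}
	if latest == nil {
		return nil
	}
	for _, label := range latest.GetLabel() {
		if label.GetName() == "running_lora_adapters" {
			return strings.Split(label.GetValue(), ",")
		}
	}
	return nil
}
```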