Redesign EPP Metrics Pipeline to be Model Server Agnostic #461
Conversation
Hi @BenjaminBraunDev. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for gateway-api-inference-extension ready!
I would think it is better to define a MetricsFetcher interface; for each inference framework, there should be an implementation.
Passing the metric names from args looks like hardcoding; it could be difficult if we later want to support multiple inference engines simultaneously.
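For illustration, a rough sketch of that suggested interface (the names, fields, and signature here are assumptions for discussion, not code from this PR):

```go
// Hypothetical sketch of the suggested abstraction; not code from this PR.
package metrics

import "context"

// Metrics holds the values EPP load balances on (field names are illustrative).
type Metrics struct {
	WaitingQueueSize    int
	KVCacheUsagePercent float64
	ActiveLoRAAdapters  []string
}

// MetricsFetcher would have one implementation per inference framework
// (vLLM, Triton, JetStream, ...), hiding each server's metric naming.
type MetricsFetcher interface {
	// FetchMetrics scrapes one model server endpoint and translates its
	// exposed metrics into the common load-balancing representation.
	FetchMetrics(ctx context.Context, endpoint string) (*Metrics, error)
}
```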
Removed intermediate metrics and Triton support (for now, until we make the changes in the trtllm backend). Made the metrics mapping 1-to-1 with the metrics we support for load balancing, and made the vLLM metric families the defaults. Also, we should update the Quickstart Guide to account for the movement of ext_proc into the vllm directory here, and also because it's out of date with some of the file paths. For instance, the vLLM deployment was broken into 2 separate yamls, so this command doesn't work anymore:
We should change that, and here change (I moved this here since currently a new ext_proc.yaml will need to be made for each model server, setting the appropriate metric flags. I think we should keep vLLM-specific things inside a vllm directory.)
to
It's true that most of the string parsing logic isn't being run unless labels are present. There are 2 phases we're planning to go through:
For this PR (and for initial Triton support, once they have the compound metrics we need) we're on step 1, but if we can get model servers on board, we could have vLLM, Triton, JetStream, etc. add a prometheus gauge metric like:
Then we could do away with this parsing on our end.
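As a rough illustration of that idea (the metric and label names below are made up for discussion, not an agreed standard), a model server could publish one gauge family whose samples map directly onto the load-balancing signals:

```go
// Hypothetical sketch: a standardized gauge a model server could expose so
// EPP needs no per-server translation. Names are illustrative only.
package main

import "github.com/prometheus/client_golang/prometheus"

var lbInfo = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "inference_load_balancing_info",
		Help: "Load-balancing signals exposed directly by the model server.",
	},
	[]string{"signal"},
)

func main() {
	prometheus.MustRegister(lbInfo)
	// The server sets the values itself; EPP would then read them verbatim
	// instead of parsing server-specific metric families.
	lbInfo.WithLabelValues("kv_cache_usage_percent").Set(0.42)
	lbInfo.WithLabelValues("queue_size").Set(3)
}
```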
/ok-to-test
Can you rebase pls?
Note: While it does make the diff more difficult to read, I moved the metrics out of the vllm folder since it would result in multiple backend/metrics packages, and the code in
Pls keep the yaml files where they are for now, the guides reference them.
Moved these files back.
Thanks for the quick back and forth, this is perfect! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, BenjaminBraunDev

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
This is a redesign of how EPP extracts metrics from its model server endpoints, in a way that works with any model server that supports prometheus metrics.
Additionally, so long as the model server implements the OpenAI API protocol, EPP can send inference requests and load balance across InferenceModel objects regardless of the underlying model server architecture.
What problem is being solved?
The difficult part of enabling EPP to collect metrics for any model server is the translation of metrics from what the model server exposes into the set of metrics we load balance on. While many model servers export their metrics in prometheus format, there is no consistent naming scheme or interface to follow.
The goal of this redesign is to provide a set of assignable model-server-side prometheus metrics and automatically map them to the `datastore.Metrics` type fields that EPP uses to load balance.

Why the complexity?
In some cases, a model server's prometheus metric might directly correspond to one of the load-balancing metrics, but other times we need multiple prometheus metrics to derive a single load-balancing metric. Therefore, some form of mapping from the set of common prometheus metrics a server might provide to the load-balancing metrics we need is necessary.
To address this, we provide EPP with several potential runtime flags that define what metrics it has to work with for the model server endpoints it's serving. From there, we check if the flags provided are adequate to derive all the required load balancing metrics. For example:
In order for EPP to derive `KVCacheUsagePercent`, a model server must provide either:

- `KVCacheUsagePercent` (vLLM provides this directly)

OR

- `UsedKVCacheBlocks` and `MaxKVCacheBlocks` (Triton has no percentage metric, so provides both of these instead)

Each of these is provided as the latest value of some prometheus metric family, optionally filtered on a set of key/value labels. Going forward, to incorporate a new model server it can either provide a family/labels for the direct metric, or alternatively a family/labels for each of the constituent metrics, and EPP will derive the metric from there.
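As a minimal sketch of the derivation just described (function and parameter names are illustrative, not the PR's actual code):

```go
package metrics

import "fmt"

// kvCacheUsagePercent is an illustrative sketch of the derivation above: prefer
// a direct percentage metric when the server exposes one (vLLM style), otherwise
// derive it from used/max KV cache block counts (Triton style). A nil pointer
// means the corresponding prometheus metric was not configured or not scraped.
func kvCacheUsagePercent(direct, usedBlocks, maxBlocks *float64) (float64, error) {
	switch {
	case direct != nil:
		return *direct, nil
	case usedBlocks != nil && maxBlocks != nil && *maxBlocks > 0:
		return *usedBlocks / *maxBlocks, nil
	default:
		return 0, fmt.Errorf("configured metrics cannot derive KVCacheUsagePercent")
	}
}
```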
Breakdown of Changes
`backend/metrics.go` and new `backend/metrics_spec.go`

`backend` now has a single metrics.go for all model servers, and has a new field in its `PodMetricsClient` implementation called `MetricMapping`. This MetricMapping is a set of predefined MetricSpec objects that represent what constituent metrics a model server might provide; the idea is that any model server should be able to provide most of these, but perhaps not all, and we should be robust to that without requiring any input from the model server side.

From there, the logic for knowing whether we have sufficient prometheus metrics to load balance, and how to derive those load balancing metrics, is in `metrics.go`. To clarify, the goal is not to process prometheus metrics based on the specific model server, but rather to abstract away what model server we are scraping and scrape purely based on the subset of prometheus metrics we are given in the EPP runtime flags. This allows us to avoid writing a new metrics.go (and compiling a separate EPP image) for each model server we want to support, only requiring code changes if a new model server isn't able to provide metrics that "span" the load balancing metrics.
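Roughly, the shape of these types can be pictured like this (a sketch of the idea only; the actual definitions in `backend/metrics_spec.go` may differ):

```go
package backend

// MetricSpec (sketch) names one prometheus metric family, optionally narrowed
// by a set of labels that a scraped sample must carry.
type MetricSpec struct {
	FamilyName string
	Labels     map[string]string // empty means "any sample in the family"
}

// MetricMapping (sketch) lists the constituent metrics a model server might
// provide. Any given server typically fills only a subset; EPP then checks
// whether that subset is enough to derive every load-balancing metric.
type MetricMapping struct {
	TotalQueuedRequests *MetricSpec
	KVCacheUsagePercent *MetricSpec // direct percentage (vLLM style)
	UsedKVCacheBlocks   *MetricSpec // paired with MaxKVCacheBlocks (Triton style)
	MaxKVCacheBlocks    *MetricSpec
	LoraRequestInfo     *MetricSpec
}
```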
`main.go`

Here we add flags for the prometheus family/labels. The format of these flags is modeled after prometheus strings themselves, that being a metric family followed by a list of key-value pair labels in curly braces:
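As an illustration of that format only (the flag name and parsing helper below are hypothetical, not taken from main.go), a spec like `some_family{label_a="x",label_b="y"}` could be split into its family and label filters roughly as follows:

```go
package main

import (
	"flag"
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical flag illustrating the format; the real flag names differ.
var kvCacheUsageMetric = flag.String("kvCacheUsagePercentMetric", "",
	`Prometheus metric spec for KV cache usage, in family{key="value",...} form.`)

var specRe = regexp.MustCompile(`^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?$`)

// parseMetricSpec splits a flag value into the metric family name and the
// label filters a scraped sample must match.
func parseMetricSpec(s string) (family string, labels map[string]string, ok bool) {
	m := specRe.FindStringSubmatch(strings.TrimSpace(s))
	if m == nil {
		return "", nil, false
	}
	labels = map[string]string{}
	if m[2] != "" {
		for _, pair := range strings.Split(m[2], ",") {
			kv := strings.SplitN(pair, "=", 2)
			if len(kv) != 2 {
				return "", nil, false
			}
			labels[strings.TrimSpace(kv[0])] = strings.Trim(strings.TrimSpace(kv[1]), `"`)
		}
	}
	return m[1], labels, true
}

func main() {
	flag.Parse()
	family, labels, ok := parseMetricSpec(*kvCacheUsageMetric)
	fmt.Printf("family=%q labels=%v valid=%v\n", family, labels, ok)
}
```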
From there, `metrics.go` will scrape the latest metric in that family that has at least all the specified labels. For example, for Triton, the EPP flags look like this:

(from args section of `config/manifests/triton/ext_proc.yaml`)
Whereas vLLM has separate metric families for each of its metrics (except lora metrics, which are handled differently) and the flags look like this:
(from args section of `config/manifests/vllm/ext_proc.yaml`)
In either case, we can deploy EPP for either Triton or vLLM using the same image, only changing the flags.
Limitations
Currently, this still requires EPP to be re-run to switch which model server is being inferenced; the next logical step is to allow the gateway to choose between EPP services based on the request, or perhaps to allow a single running instance of EPP to support multiple model server types.
Additionally, if a model server has new prometheus metrics that can still be derived into our load balancing metrics in some way, we would have to add that logic in `metrics.go`.

LoRA metrics are also handled differently and are treated as a special case for vLLM. There is a single metric family, and for each entry the value is a timestamp, with the actual metric information being in the labels of that entry. If a model server happens to use this same system as vLLM, it can provide a metric family name for the `-loraRequestInfoMetric` flag, and it will be handled in the same way vLLM is; however, it's unclear if this is a standard approach for LoRA metrics.
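For illustration, reading such a LoRA-info style metric could look roughly like this (the label name is an assumption and may not match what vLLM actually exposes):

```go
package backend

import (
	"strings"

	dto "github.com/prometheus/client_model/go"
)

// activeAdapters is a sketch of handling a vLLM-style LoRA info metric: the
// gauge value is only a timestamp, and the loaded adapters are encoded in a
// label. The label name "running_lora_adapters" is illustrative.
func activeAdapters(family *dto.MetricFamily) []string {
	var latest *dto.Metric
	for _, m := range family.GetMetric() {
		// The value is a timestamp, so the largest value is the newest entry.
		if latest == nil || m.GetGauge().GetValue() > latest.GetGauge().GetValue() {
			latest = m
		}
	}
	if latest == nil {
		return nil
	}
	for _, label := range latest.GetLabel() {
		if label.GetName() == "running_lora_adapters" {
			return strings.Split(label.GetValue(), ",")
		}
	}
	return nil
}
```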