Each pod has independent loops to refresh metrics #460

Merged
merged 5 commits into kubernetes-sigs:main from per-pod-refresh
Mar 10, 2025

Conversation

liu-cong
Contributor

@liu-cong liu-cong commented Mar 6, 2025

Fixes #99
Close #223

This PR creates independent metrics refresh loops per pod, so that a slow or bad pod won't affect others. This PR is built on top of #223. Credits to @spacewander for the initial effort!

  • Added a new backend/metrics package, which is mainly responsible for scraping metrics from the backend servers.
  • Removed provider.go, whose responsibilities were overly broad.
    • Backend metrics scraping is now done by the PodMetrics implementation, see below.
    • Printing metrics debug logs is now moved to the backend/metrics package.
    • The logic for flushing EPP metrics is now moved to the metrics package.
  • Changed PodMetrics to an interface, and added an implementation that periodically updates metrics for a single pod, which was previously done in provider.go.
  • The DataStore now purely stores the PodMetrics; cleaned up Pod operations in the DataStore, making the interface much cleaner.
  • Used atomic pointer updates for high-performance metrics updates (see the sketch after this list).
  • Fixed various data races; some were in tests only.
  • Removed benchmark.go; we are not using it currently, and it's outdated. Created Add EPP micro benchmark #462 to follow up on doing it properly.
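A minimal sketch of the atomic-pointer pattern referenced above, based on the fields visible in the review diff below (the Metrics fields and method names here are illustrative, not necessarily the exact ones in the PR):

package metrics

import (
	"sync/atomic"
	"unsafe"
)

// Metrics is an illustrative stand-in for the scraped backend metrics.
type Metrics struct {
	WaitingQueueSize    int
	KVCacheUsagePercent float64
}

// podMetrics keeps the latest scrape behind an atomically swapped pointer,
// so readers never block on a refresh loop that is mid-update.
type podMetrics struct {
	metrics unsafe.Pointer // stores a *Metrics
}

// UpdateMetrics publishes a fresh snapshot by swapping the pointer; the
// stored *Metrics is treated as immutable, so no mutex is needed to read it.
func (pm *podMetrics) UpdateMetrics(m *Metrics) {
	atomic.StorePointer(&pm.metrics, unsafe.Pointer(m))
}

// GetMetrics returns the most recently published snapshot.
func (pm *podMetrics) GetMetrics() *Metrics {
	return (*Metrics)(atomic.LoadPointer(&pm.metrics))
}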

Benchmark Results

The benchmark results below show that the updated EPP has slightly better e2e latency than the latest v0.2 release, and is similarly effective at keeping kv-cache utilization lower compared to the baseline k8s Service. This confirms that no regression is introduced by this PR.

[image: benchmark comparison chart]

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 6, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and Jeffwan March 6, 2025 23:30
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 6, 2025

netlify bot commented Mar 6, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 1ef6a91
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67cf24f77a98d90008c85d49
😎 Deploy Preview: https://deploy-preview-460--gateway-api-inference-extension.netlify.app

@liu-cong liu-cong marked this pull request as ready for review March 7, 2025 00:31
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 7, 2025
@k8s-ci-robot k8s-ci-robot requested a review from danehans March 7, 2025 00:31
Contributor

@ahg-g ahg-g left a comment


I think the PodMetricRefresher is a detail of PodMetrics, not the other way around. If we do that, then we limit the exported types to one, PodMetrics. Here is the suggestion in more detail:

Move PodMetrics to the metrics pkg, and wrap it behind an interface such as this:

type PodMetrics interface {
  GetObj() *Pod
  SetObj(pod *Pod)
  GetMetrics() *Metrics
  SetMetrics(m *Metrics)
  Status() Status
  Start(m *Metrics)
  Stop(m *Metrics)
}

type Status int

const (
  Fresh Status = 0
  Stale Status = 1
)

type podMetrics struct {
  obj       *Pod
  metrics   *Metrics
  refresher *MetricsRefresher
}

type PodMetricsFactory struct {
  ParentCtx       context.Context
  MetricsClient   PodMetricsClient
  RefreshInterval time.Duration
}

func (f *PodMetricsFactory) New(pod *v1.Pod) PodMetrics {
  // ...
}

The datastore owns updating the Obj part (i.e., it is the only component that calls SetObj), while the refresher owns updating the Metrics part (SetMetrics). This clear separation of concerns (plus limiting updates to be done only by replacing the reference, not mutating its individual fields) makes it thread-safe and keeps the responsibilities aligned.

The datastore would only receive one parameter, a PodMetricsFactory. With this, the dependencies are now cleaner (Datastore -> Metrics, because it stores the PodMetrics objects), and this also frees the datastore from all parameters related to metrics scraping.
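A rough sketch of what that wiring could look like, building on the types sketched above (the datastore method names are hypothetical, and the mutex that would guard the map is omitted for brevity):

package datastore

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// NewDatastore takes only the factory; the metrics client, refresh interval,
// and parent context all stay encapsulated inside PodMetricsFactory.
func NewDatastore(pmf *PodMetricsFactory) *datastore {
	return &datastore{
		pmf:  pmf,
		pods: make(map[types.NamespacedName]PodMetrics),
	}
}

type datastore struct {
	pmf  *PodMetricsFactory
	pods map[types.NamespacedName]PodMetrics // guarded by a mutex in practice
}

// PodAddIfNotExist lets the factory build a self-refreshing PodMetrics; the
// datastore itself knows nothing about how scraping is done.
func (ds *datastore) PodAddIfNotExist(pod *corev1.Pod) {
	key := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
	if _, ok := ds.pods[key]; !ok {
		ds.pods[key] = ds.pmf.New(pod)
	}
}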

@liu-cong
Contributor Author

liu-cong commented Mar 7, 2025

RE: I think the PodMetricRefresher is a detail of PodMetrics

I agree. Ideally I would do this in incremental steps,

  1. Refactor PodMetrics to an interface
  2. Do this PR

I looked at this, but there are many places that directly use the PodMetrics object, so this would result in a massive PR. So I decided to do 2 first to minimize merge conflicts, as that requires minimal refactoring, and I can use the change in #223 directly. Now that the changes are encapsulated in the datastore package, I should be able to do 1 pretty easily.

@ahg-g
Contributor

ahg-g commented Mar 7, 2025

Can you do 1 in this PR as a separate commit? I am happy to review the resulting big PR; most of the changes will be updates to how the various places access PodMetrics, so it should be easy to review. What I was thinking is that the datastore itself should have very limited changes.

Contributor Author

@liu-cong liu-cong left a comment


This unfortunately becomes a massive PR. But I believe it cleans up a bunch of things and makes each package's responsibilities much clearer.

Contributor

@ahg-g ahg-g left a comment


This is great!

As a follow-up, maybe we should pick a different name for backend/metrics, because we have an epp/metrics pkg as well; each is concerned with something completely different but uses the same name, so when searching the code it gets confusing.

pod unsafe.Pointer // stores a *Pod
metrics unsafe.Pointer // stores a *Metrics
pmc PodMetricsClient
targetPort int32 // metrics endpoint port in the pod
Contributor


If the port on the pool changes, we will need to update this. To avoid that, I suggest adding a GetPool function to the Datastore interface you defined in this pkg to get the port directly, instead of storing it here and risking it going stale.

Contributor Author


We should separate the concerns. It's the reconciler's responsibility to terminate a stale PodMetrics and create a new one with the new port number. The PodMetrics is purely responsible for refreshing metrics and has no knowledge of the datastore.

Contributor


We need to balance that with complexity; storing the port in every entry and updating it for all entries on every change to the pool is redundant and error-prone. The port number is defined by the inference pool, so it is reasonable to call into the store to get it.

Contributor Author


I would prefer keeping the podMetrics as simple as possible, i.e., keeping the metrics of a pod fresh. It should not depend on the datastore. Someone needs to be responsible for properly managing the lifecycle of these refreshers, and IMO that "someone" should be the reconcilers, which manage the pool and pod lifecycles. Otherwise this PodMetrics becomes another "reconciler" to some extent; it needs to monitor pool updates and update itself accordingly.

Note that the PodMetrics doesn't depend on the Datastore currently. The Datastore interface in this package is only used by the logger to print debug logs. Also note that just letting the PodMetrics have access to GetPool won't fix the stale problem; it needs to monitor pool updates.

Contributor


As it is right now, this PR introduces a regression: if the port changes on the pool, it will not be reflected here. How do you suggest we address this?

Also note that just letting the PodMetrics have access to GetPool won't fix the stale problem, it needs to monitor pool updates.

It does fix the problem: when the pool gets updated, the port stored in the pool object will be updated too. The existing code gets the pool on every scrape, and so it always works with the most up-to-date version of the pool. In the new code, we would need to pass the Datastore interface to the NewPodMetrics func, and in refreshMetrics we get the pool at the beginning to obtain the port.

Contributor Author


if the port changes on the pool, it will not be reflected here, how do you suggest we address this?

Ah, I see what you mean. I wasn't aware that we currently don't resync when the port changes in the pool, so this would lead to a regression. Sorry for the confusion.

One option is what you suggested. This will lead to frequent mutex access to the pool, but that's probably OK, as writes to the pool should be very infrequent. The other option is to expose an interface on the PodMetrics to update the port, like we do with the UpdatePod interface today, but this requires careful control logic in the pool reconciler. And if there is a new property in the future we need to watch for, we have to remember to add that as well, unless we unconditionally resync all pods upon any change in the pool.

Upon weighing the tradeoffs, I think your suggestion makes sense.
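A hedged sketch of the approach agreed on here: the refresher asks the datastore for the pool at the start of every scrape, so the stored port can never go stale. The Datastore interface, the ds/pmc fields, GetPod, and the FetchMetrics call are assumptions for illustration, and the pool spec is assumed to expose the target port (e.g. TargetPortNumber):

// Fragment assumed to live in the backend/metrics package.

// Datastore is the minimal read-only view the refresher needs: just enough
// to learn the pool's current target port on each scrape.
type Datastore interface {
	PoolGet() (*v1alpha2.InferencePool, error)
}

func (pm *podMetrics) refreshMetrics(ctx context.Context) error {
	pool, err := pm.ds.PoolGet()
	if err != nil {
		// Pool not synced yet; skip this cycle rather than scraping a stale port.
		return err
	}
	port := pool.Spec.TargetPortNumber

	// Scrape with the freshly resolved port and publish the new snapshot.
	updated, err := pm.pmc.FetchMetrics(ctx, pm.GetPod(), pm.GetMetrics(), port)
	if err != nil {
		return err
	}
	pm.UpdateMetrics(updated)
	return nil
}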

PodGetAll() []backendmetrics.PodMetrics
// PodList lists pods matching the given predicate.
PodList(func(backendmetrics.PodMetrics) bool) []backendmetrics.PodMetrics
PodUpdateOrAddIfNotExist(pod *corev1.Pod, pool *v1alpha2.InferencePool) bool
Contributor


Passing in the pool as a parameter here and in PodResyncAll is odd; the store is defined on a specific pool, and so it must enforce that the pods and models it holds are the ones that belong to the pool the datastore is defined on.

Contributor Author


Agree it seems a bit odd at first glance, but this forces the caller to make sure the pool is valid. Otherwise, in PodUpdateOrAddIfNotExist and PodResyncAll, we need to check whether the pool exists. Currently the code just assumes the pool exists and ignores the error, which I think is a dangerous assumption (we should never ignore errors).

But I'm also open to not passing the pool and just adding validation logic in the PodXX functions.

Contributor


+1 to adding those error checks and having those functions return an error; I agree this is better. Passing the pool makes it look like any other pool can be used to refresh the pods, which the interface is better designed not to allow.

Contributor Author


Upon second thought, I actually think passing the pool in makes a lot of sense; it reads like "add these pods for this pool". This pushes the responsibility to the caller to make sure the pods and pool match. Otherwise the datastore needs to validate that (basically duplicating some logic in the pool reconciler). The datastore is already pretty busy; I would prefer it to be as thin as possible and not deal with too much business logic.

Contributor


Would we ever want to add pods to the store that don't belong to the stored pool? If the answer is no, then passing the pool as a parameter is redundant and complicates the interface.

In any case, I am not going to push back more here, but my preference is to not change the existing interface design.

Contributor Author


Would we ever want to add pods to the store that don't belong to the stored pool?

No. But IMO the pod and pool reconcilers are already doing this validation, so it seems unnecessary to validate it again in the datastore. Note that such validation doesn't exist in the datastore today, and we would need to add the logic and tests.
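For reference, a rough sketch of the error-returning alternative discussed above, where the datastore validates against its own stored pool instead of taking it as a parameter (the signature, PoolGet, and the body are hypothetical):

import "fmt"

// PodUpdateOrAddIfNotExist checks the datastore's own pool and surfaces the
// error instead of silently assuming the pool exists.
func (ds *datastore) PodUpdateOrAddIfNotExist(pod *corev1.Pod) (bool, error) {
	pool, err := ds.PoolGet()
	if err != nil {
		return false, fmt.Errorf("cannot add pod %s/%s, pool not available: %w",
			pod.Namespace, pod.Name, err)
	}
	// ... use pool (e.g. its target port) to build or update the PodMetrics entry ...
	_ = pool
	return true, nil
}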

@liu-cong liu-cong force-pushed the per-pod-refresh branch 2 times, most recently from d2591d2 to a67a8d6 Compare March 10, 2025 04:00
@ahg-g
Contributor

ahg-g commented Mar 10, 2025

Thanks Cong!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit a70d66e into kubernetes-sigs:main Mar 10, 2025
8 checks passed
Successfully merging this pull request may close these issues.

The metrics refresh time might be much larger than the refreshMetricsInterval