Each pod has independent loops to refresh metrics #460

Merged
merged 5 commits into kubernetes-sigs:main from per-pod-refresh
Mar 10, 2025

Conversation

liu-cong
Contributor

@liu-cong liu-cong commented Mar 6, 2025

Fixes #99
Close #223

This PR creates independent metrics refresh loops per pod, so that a slow or bad pod won't affect others. This PR is built on top of #223. Credits to @spacewander for the initial effort!

  • Added a new backend/metrics package, which is mainly responsible for scraping metrics from the backend servers.
  • Removed provider.go, whose responsibilities were overly broad.
    • Backend metrics scraping is now done by the PodMetrics implementation, see below.
    • Printing metrics debug logs is now moved to the backend/metrics package.
    • The logic for flushing EPP metrics is now moved to the metrics package.
  • Changed PodMetrics to an interface, and added an implementation that periodically updates metrics for a single pod, which was previously done in provider.go.
  • The DataStore now purely stores the PodMetrics; cleaned up Pod operations in the DataStore, making the interface much cleaner.
  • Used atomic pointer updates for high-performance metrics updates (see the sketch after this list).
  • Fixed various data races; some were in tests only.
  • Removed benchmark.go; we are not using it currently, and it's outdated. Created Add EPP micro benchmark #462 to follow up on doing it properly.
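A minimal sketch of the atomic-pointer pattern referenced above, based on the fields visible in the review diff below (the Metrics fields and method names here are illustrative, not necessarily the exact ones in the PR):

package metrics

import (
	"sync/atomic"
	"unsafe"
)

// Metrics is an illustrative stand-in for the scraped backend metrics.
type Metrics struct {
	WaitingQueueSize    int
	KVCacheUsagePercent float64
}

// podMetrics keeps the latest scrape behind an atomically swapped pointer,
// so readers never block on a refresh loop that is mid-update.
type podMetrics struct {
	metrics unsafe.Pointer // stores a *Metrics
}

// UpdateMetrics publishes a fresh snapshot by swapping the pointer; the
// stored *Metrics is treated as immutable, so no mutex is needed to read it.
func (pm *podMetrics) UpdateMetrics(m *Metrics) {
	atomic.StorePointer(&pm.metrics, unsafe.Pointer(m))
}

// GetMetrics returns the most recently published snapshot.
func (pm *podMetrics) GetMetrics() *Metrics {
	return (*Metrics)(atomic.LoadPointer(&pm.metrics))
}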

Benchmark Results

The benchmark results below show that the updated EPP has slightly better e2e latency than the latest v0.2 release, and is similarly effective at keeping kv-cache utilization lower compared to the baseline k8s Service. This confirms that no regression is introduced by this PR.

[image: benchmark comparison chart]

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 6, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and Jeffwan March 6, 2025 23:30
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 6, 2025

netlify bot commented Mar 6, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 1ef6a91
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67cf24f77a98d90008c85d49
😎 Deploy Preview: https://deploy-preview-460--gateway-api-inference-extension.netlify.app

@liu-cong liu-cong marked this pull request as ready for review March 7, 2025 00:31
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 7, 2025
@k8s-ci-robot k8s-ci-robot requested a review from danehans March 7, 2025 00:31
Contributor

@ahg-g ahg-g left a comment


I think the PodMetricRefresher is a detail of PodMetrics, not the other way around. If we do that, then we limit the exported types to one, PodMetrics. Here is the suggestion in more detail:

Move PodMetrics to the metrics pkg, and wrap it behind an interface such as this:

type PodMetrics interface {
  GetObj() *Pod
  SetObj(pod *Pod)
  GetMetrics() *Metrics
  SetMetrics(m *Metrics)
  Status() Status
  Start(m *Metrics)
  Stop(m *Metrics)
}

type Status int

const (
  Fresh Status = 0
  Stale Status = 1
)

type podMetrics struct {
  obj       *Pod
  metrics   *Metrics
  refresher *MetricsRefresher
}

type PodMetricsFactory struct {
  ParentCtx       context.Context
  MetricsClient   PodMetricsClient
  RefreshInterval time.Duration
}

func (f *PodMetricsFactory) New(pod *v1.Pod) PodMetrics {
  // ...
}

The datastore owns updating the Obj part (i.e., it is the only component that calls SetObj), while the refresher owns updating the Metrics part (SetMetrics). This clear separation of concerns (plus limiting updates to be done only by replacing the reference, not mutating its individual fields) makes it thread-safe and keeps the responsibilities aligned.

The datastore would only receive one parameter, a PodMetricsFactory. With this, the dependencies are now cleaner (Datastore -> Metrics, because it stores the PodMetrics objects), and this also frees the datastore from all parameters related to metrics scraping.
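A rough sketch of what that wiring could look like, building on the types sketched above (the datastore method names are hypothetical, and the mutex that would guard the map is omitted for brevity):

package datastore

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// NewDatastore takes only the factory; the metrics client, refresh interval,
// and parent context all stay encapsulated inside PodMetricsFactory.
func NewDatastore(pmf *PodMetricsFactory) *datastore {
	return &datastore{
		pmf:  pmf,
		pods: make(map[types.NamespacedName]PodMetrics),
	}
}

type datastore struct {
	pmf  *PodMetricsFactory
	pods map[types.NamespacedName]PodMetrics // guarded by a mutex in practice
}

// PodAddIfNotExist lets the factory build a self-refreshing PodMetrics; the
// datastore itself knows nothing about how scraping is done.
func (ds *datastore) PodAddIfNotExist(pod *corev1.Pod) {
	key := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
	if _, ok := ds.pods[key]; !ok {
		ds.pods[key] = ds.pmf.New(pod)
	}
}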

@liu-cong
Contributor Author

liu-cong commented Mar 7, 2025

RE: I think the PodMetricRefresher is a detail of PodMetrics

I agree. Ideally I would do this in incremental steps,

  1. Refactor PodMetrics to an interface
  2. Do this PR

I looked at this, but there are many places that directly use the PodMetrics object, so this would result in a massive PR. So I decided to do 2 first to minimize merge conflicts, as that requires minimal refactoring, and I can use the change in #223 directly. Now that the changes are encapsulated in the datastore package, I should be able to do 1 pretty easily.

@ahg-g
Contributor

ahg-g commented Mar 7, 2025

Can you do 1 in this PR as a separate commit? I am happy to review the resulting big PR; most of the changes will be updates to how the various places access PodMetrics, so it should be easy to review. What I was thinking is that the datastore itself should have very limited changes.

Contributor Author

@liu-cong liu-cong left a comment


This unfortunately becomes a massive PR. But I believe it cleans up a bunch of things and makes each package's responsibilities much clearer.

Contributor

@ahg-g ahg-g left a comment


This is great!

As a follow-up, maybe we should pick a different name for backend/metrics, because we have an epp/metrics pkg as well; each is concerned with something completely different but uses the same name, so when searching the code it gets confusing.

pod unsafe.Pointer // stores a *Pod
metrics unsafe.Pointer // stores a *Metrics
pmc PodMetricsClient
targetPort int32 // metrics endpoint port in the pod
Contributor


If the port on the pool changes, we will need to update this. To avoid that, I suggest adding a GetPool function to the Datastore interface you defined in this pkg to get the port directly, instead of storing it here and risking it going stale.

Contributor Author


We should separate the concerns. It's the reconciler's responsibility to terminate a stale PodMetrics and create a new one with the new port number. The PodMetrics is purely responsible for refreshing metrics and has no knowledge of the datastore.

Contributor


We need to balance that with complexity; storing the port in every entry and updating it for all entries on every change to the pool is redundant and error-prone. The port number is defined by the inference pool, so it is reasonable to call into the store to get it.

Contributor Author


I would prefer keeping the podMetrics as simple as possible, i.e., keeping the metrics of a pod fresh. It should not depend on the datastore. Someone needs to be responsible for properly managing the lifecycle of these refreshers, and IMO that "someone" should be the reconcilers, which manage the pool and pod lifecycles. Otherwise this PodMetrics becomes another "reconciler" to some extent; it needs to monitor pool updates and update itself accordingly.

Note that the PodMetrics doesn't depend on the Datastore currently. The Datastore interface in this package is only used by the logger to print debug logs. Also note that just letting the PodMetrics have access to GetPool won't fix the stale problem; it needs to monitor pool updates.

Contributor


As it is right now, this PR introduces a regression: if the port changes on the pool, it will not be reflected here. How do you suggest we address this?

Also note that just letting the PodMetrics have access to GetPool won't fix the stale problem, it needs to monitor pool updates.

It does fix the problem: when the pool gets updated, the port stored in the pool object will be updated too. The existing code gets the pool on every scrape, and so it always works with the most up-to-date version of the pool. In the new code, we would need to pass the Datastore interface to the NewPodMetrics func, and in refreshMetrics we get the pool at the beginning to obtain the port.

Contributor Author


if the port changes on the pool, it will not be reflected here, how do you suggest we address this?

Ah, I see what you mean. I wasn't aware that we currently don't resync when the port changes in the pool, so this would lead to a regression. Sorry for the confusion.

One option is what you suggested. This will lead to frequent mutex access to the pool, but that's probably OK, as writes to the pool should be very infrequent. The other option is to expose an interface on the PodMetrics to update the port, like we do with the UpdatePod interface today, but this requires careful control logic in the pool reconciler. And if there is a new property in the future we need to watch for, we have to remember to add that as well, unless we unconditionally resync all pods upon any change in the pool.

Upon weighing the tradeoffs, I think your suggestion makes sense.
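A hedged sketch of the approach agreed on here: the refresher asks the datastore for the pool at the start of every scrape, so the stored port can never go stale. The Datastore interface, the ds/pmc fields, GetPod, and the FetchMetrics call are assumptions for illustration, and the pool spec is assumed to expose the target port (e.g. TargetPortNumber):

// Fragment assumed to live in the backend/metrics package.

// Datastore is the minimal read-only view the refresher needs: just enough
// to learn the pool's current target port on each scrape.
type Datastore interface {
	PoolGet() (*v1alpha2.InferencePool, error)
}

func (pm *podMetrics) refreshMetrics(ctx context.Context) error {
	pool, err := pm.ds.PoolGet()
	if err != nil {
		// Pool not synced yet; skip this cycle rather than scraping a stale port.
		return err
	}
	port := pool.Spec.TargetPortNumber

	// Scrape with the freshly resolved port and publish the new snapshot.
	updated, err := pm.pmc.FetchMetrics(ctx, pm.GetPod(), pm.GetMetrics(), port)
	if err != nil {
		return err
	}
	pm.UpdateMetrics(updated)
	return nil
}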

PodGetAll() []backendmetrics.PodMetrics
// PodList lists pods matching the given predicate.
PodList(func(backendmetrics.PodMetrics) bool) []backendmetrics.PodMetrics
PodUpdateOrAddIfNotExist(pod *corev1.Pod, pool *v1alpha2.InferencePool) bool
Contributor


Passing in the pool as a parameter here and in PodResyncAll is odd; the store is defined on a specific pool, and so it must enforce that the pods and models it holds are the ones that belong to the pool the datastore is defined on.

Contributor Author


Agree it seems a bit odd at first glance, but this forces the caller to make sure the pool is valid. Otherwise, in PodUpdateOrAddIfNotExist and PodResyncAll, we need to check whether the pool exists. Currently the code just assumes the pool exists and ignores the error, which I think is a dangerous assumption (we should never ignore errors).

But I'm also open to not passing the pool and just adding validation logic in the PodXX functions.

Contributor


+1 to adding those error checks and having those functions return an error; I agree this is better. Passing the pool makes it look like any other pool can be used to refresh the pods, which the interface is better designed not to allow.

Contributor Author


Upon second thought, I actually think passing the pool in makes a lot of sense; it reads like "add these pods for this pool". This pushes the responsibility to the caller to make sure the pods and pool match. Otherwise the datastore needs to validate that (basically duplicating some logic in the pool reconciler). The datastore is already pretty busy; I would prefer it to be as thin as possible and not deal with too much business logic.

Contributor


Would we ever want to add pods to the store that don't belong to the stored pool? If the answer is no, then passing the pool as a parameter is redundant and complicates the interface.

In any case, I am not going to push back more here, but my preference is to not change the existing interface design.

Contributor Author


Would we ever want to add pods to the store that don't belong to the stored pool?

No. But IMO the pod and pool reconcilers are already doing this validation, so it seems unnecessary to validate it again in the datastore. Note that such validation doesn't exist in the datastore today, and we would need to add the logic and tests.
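For reference, a rough sketch of the error-returning alternative discussed above, where the datastore validates against its own stored pool instead of taking it as a parameter (the signature, PoolGet, and the body are hypothetical):

import "fmt"

// PodUpdateOrAddIfNotExist checks the datastore's own pool and surfaces the
// error instead of silently assuming the pool exists.
func (ds *datastore) PodUpdateOrAddIfNotExist(pod *corev1.Pod) (bool, error) {
	pool, err := ds.PoolGet()
	if err != nil {
		return false, fmt.Errorf("cannot add pod %s/%s, pool not available: %w",
			pod.Namespace, pod.Name, err)
	}
	// ... use pool (e.g. its target port) to build or update the PodMetrics entry ...
	_ = pool
	return true, nil
}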

@liu-cong liu-cong force-pushed the per-pod-refresh branch 2 times, most recently from d2591d2 to a67a8d6 Compare March 10, 2025 04:00
@ahg-g
Contributor

ahg-g commented Mar 10, 2025

Thanks Cong!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit a70d66e into kubernetes-sigs:main Mar 10, 2025
8 checks passed
Successfully merging this pull request may close these issues.

The metrics refresh time might be much larger than the refreshMetricsInterval