
KEP-5325: HPA - Improve pod selection accuracy across workload types #5331

Open
wants to merge 30 commits into base: master

Conversation

omerap12
Member

  • One-line PR description: HPA - Improve pod selection accuracy across workload types

/sig autoscaling

Signed-off-by: Omer Aplatony <[email protected]>
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 22, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 22, 2025
@omerap12
Member Author

This is a work in progress - looking for feedback and opinions.

@omerap12 omerap12 changed the title 5325-hpa-pod-selection-accuracy KEP-5325: HPA pod selection accuracy May 22, 2025
Member

@adrianmoisey adrianmoisey left a comment


Some low-hanging changes

omerap12 and others added 8 commits May 23, 2025 12:03
Co-authored-by: Adrian Moisey <[email protected]>
Co-authored-by: Adrian Moisey <[email protected]>
Co-authored-by: Adrian Moisey <[email protected]>
Co-authored-by: Adrian Moisey <[email protected]>
Co-authored-by: Adrian Moisey <[email protected]>
Co-authored-by: Adrian Moisey <[email protected]>
Co-authored-by: Adrian Moisey <[email protected]>
Signed-off-by: Omer Aplatony <[email protected]>
@omerap12 omerap12 changed the title KEP-5325: HPA pod selection accuracy KEP-5325: HPA - Improve pod selection accuracy across workload types May 23, 2025
Signed-off-by: Omer Aplatony <[email protected]>
Contributor

@soltysh soltysh left a comment


I focused mostly on the design details, including the API surface. But once this gets opted into the release I'll look carefully at the remaining bits, including the PRR portion of the doc.

@@ -0,0 +1,879 @@
<!--
Contributor


Nit: you're missing keps/prod-readiness/sig-autoscaling/5325.yaml based on https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/template/nnnn.yaml; feel free to add my name as the PRR approver for alpha.

Contributor


Also, make sure to get a lead opt-in for #5325 into the 1.34 release.


## Summary

The Horizontal Pod Autoscaler (HPA) has a critical limitation in its pod selection mechanism: it collects metrics from all pods that match the target workload's label selector, regardless of whether those pods are actually managed by the target workload. This can lead to incorrect scaling decisions when unrelated pods (such as Jobs, CronJobs, or other Deployments) happen to share the same labels.
Contributor


Nit: please make sure to break lines, since this makes the reviews easier to read. This applies to the entire document.

We propose adding a new field to the HPA specification called `strictPodSelection` that allows users to specify how pods should be selected for metric collection:
* If set to true, only pods that are actually owned by the target workload (through owner references) are selected.
* If not set or set to false, the default behavior applies.
Contributor


Based on https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md, specifically this part:

Think twice about bool fields. Many ideas start as boolean but eventually trend towards 
a small set of mutually exclusive options. Plan for future expansions by describing the 
policy options explicitly as a string type alias (e.g. TerminationMessagePolicy).

When we were introducing a new mechanism to PDBs we went with UnhealthyPodEvictionPolicy. You'll notice that pattern used heavily across the apps/v1 types as well.

In your case that'd mean introducing something like PodSelectionType with two initial values: Selector or Labels as the default, backward-compatible option, and OwnerRefs. This will ensure you can easily extend the mechanism in the future.
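A rough sketch of that shape (purely illustrative; the names here are placeholders, not a final API proposal):

```go
// Illustrative only: the string-type-alias pattern from the conventions quoted
// above, using the placeholder names suggested here.
package v2

// PodSelectionType selects how the HPA finds pods for metric collection.
type PodSelectionType string

const (
	// PodSelectionLabels keeps today's behavior: every pod matching the scale
	// target's label selector is considered (default, backward compatible).
	PodSelectionLabels PodSelectionType = "Labels"
	// PodSelectionOwnerRefs only considers pods owned, possibly transitively,
	// by the scale target, following owner references.
	PodSelectionOwnerRefs PodSelectionType = "OwnerRefs"
)
```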

Member Author


TIL, thank you for the reference! :)

- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
-->

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
Contributor


You'll want to list integration tests you're planning to add.

omerap12 added 3 commits May 29, 2025 06:25
Signed-off-by: Omer Aplatony <[email protected]>
Signed-off-by: Omer Aplatony <[email protected]>
Signed-off-by: Omer Aplatony <[email protected]>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: omerap12
Once this PR has been reviewed and has the lgtm label, please assign deads2k, towca for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 29, 2025
## Proposal
We propose adding a new field to the HPA specification called `podSelectionStrategy` that allows users to specify how pods should be selected for metric collection:
Member


Would selectionStrategy be a shorter, more concise name for what we're trying to achieve?

Member Author


Agree.

@omerap12 omerap12 marked this pull request as ready for review June 1, 2025 12:40
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 1, 2025
Signed-off-by: Omer Aplatony <[email protected]>
@omerap12
Member Author

omerap12 commented Jun 1, 2025

WIP here: kubernetes/kubernetes#132018

Comment on lines +44 to +45
metrics:
- my_feature_metric
Member


Suggested change
metrics:
- my_feature_metric
metrics: []

How necessary is it to make a metric? Taking a look at other similar features, I found an example that added a metric that would be the count of the number of times that the new API field is used. Is that necessary?

(I guess I'm asking @soltysh that question)

Co-authored-by: Adrian Moisey <[email protected]>
We propose adding a new field to the HPA specification called `selectionStrategy` that allows users to specify how pods should be selected for metric collection:
* If set to `LabelSelector` (default): Uses the current behavior of selecting all pods that match the target workload's label selector.
* If set to `OwnerReferences`: Only selects pods that are owned by the target workload through owner references.
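For illustration only, a sketch of where such a field might surface in the v2 spec (the field name and values follow the proposal above and are not final):

```go
// Illustrative sketch of the field placement in autoscaling/v2; these are not
// the actual types. SelectionStrategy is assumed to be a string type carrying
// the two values described above.
package v2

type SelectionStrategy string

type HorizontalPodAutoscalerSpec struct {
	// ...existing fields elided...

	// selectionStrategy falls back to the LabelSelector behavior when unset.
	// +optional
	SelectionStrategy *SelectionStrategy `json:"selectionStrategy,omitempty"`
}
```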


super nit - perhaps OwnerReference is better here since HPA would only use one reference?

Member Author


Yeah makes sense

Member


Nitty counter point: Sometimes there are multiple hops to get from the Pod to the targetRef. So even though Pod -> ReplicaSet and ReplicaSet -> Deployment is 1 OwnerRef each, in total it's 2 OwnerRefs.

I'm personally not strongly opinionated on this and can go either direction.

* `OwnerReferences`:
* Further filters the label-selected pods
* Only keeps pods that are owned by the target workload through owner references
* Follows the ownership chain (e.g., Deployment → ReplicaSet → Pods)


How high will it go? Will it stop when the owner matches the targetRef in the HPA?

Member Author


I think the correct direction is down (since we start from the targetRef and go down to the pods it owns), so we just get the pods by looking at their ownerReference.

`OwnerReferencesFilter`:
* Validates pod ownership through reference chain
* Only includes pods that are owned by the target workload
* Handles different workload types (Deployments, StatefulSets, etc.)


As it goes up the chain, the HPA controller will need to fetch controllers. Will we cache these too?

Otherwise you may be fetching the same controller objects every 15s for every pod :(

(In VPA we have a whole bunch of heuristics here too to reduce load on API Server: https://github.com/kubernetes/autoscaler/blob/9220470d9f5154eefc82f53f39de4efb76befe77/vertical-pod-autoscaler/pkg/target/controller_fetcher/controller_fetcher.go#L175)

Member Author


Yeah I forgot about this. You're right - I originally implemented the fetcher to work from the top down, starting at the Deployment and traversing down to the Pod. I agree we should reverse this approach: start from the Pod and cache intermediate controllers along the way to avoid repeatedly fetching the same objects every 15s.
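Roughly the shape of that direction, as a sketch (the fetch helper and the cache here are hypothetical, not the actual controller code):

```go
// Sketch: decide whether a pod belongs to the HPA's scaleTargetRef by walking
// controller owner references upward, caching intermediate owners so the same
// controller objects are not re-fetched on every reconcile.
package podselection

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ownerKey identifies an intermediate owner (e.g. a ReplicaSet) in the cache.
type ownerKey struct {
	Kind, Namespace, Name string
}

// fetchControllerOf is a hypothetical helper that returns the controller
// owner reference of the named object (nil if it has none), e.g. via an
// informer-backed lister or the API server.
type fetchControllerOf func(ctx context.Context, key ownerKey) (*metav1.OwnerReference, error)

// ownedByTarget follows the chain pod -> owner -> owner... and reports whether
// it reaches the HPA's scale target. podController is typically obtained with
// metav1.GetControllerOf(pod).
func ownedByTarget(ctx context.Context, podController *metav1.OwnerReference, namespace,
	targetKind, targetName string, fetch fetchControllerOf,
	cache map[ownerKey]*metav1.OwnerReference) (bool, error) {

	ref := podController
	for ref != nil {
		if ref.Kind == targetKind && ref.Name == targetName {
			return true, nil
		}
		key := ownerKey{Kind: ref.Kind, Namespace: namespace, Name: ref.Name}
		next, cached := cache[key]
		if !cached {
			var err error
			next, err = fetch(ctx, key) // e.g. the Deployment owning a ReplicaSet
			if err != nil {
				return false, fmt.Errorf("resolving owner %v: %w", key, err)
			}
			cache[key] = next // cache nil results too, to avoid repeated lookups
		}
		ref = next
	}
	return false, nil
}
```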


Member Author


Modified as follows: kubernetes/kubernetes@387fbbb

Member Author


In my opinion, the cache felt a bit excessive. It does reduce some API calls, but is it really worth it?

@raywainman

One additional comment here - do you know if there is other prior work anywhere else in Kubernetes for this problem?

If I understand correctly, this label overlapping use-case is actually quite prevalent everywhere throughout the ecosystem - Service is one that I can think about right off the top of my head.

If there is, it would be great to tackle that in a consistent way.

@adrianmoisey
Member

One additional comment here - do you know if there is other prior work anywhere else in Kubernetes for this problem?

If I understand correctly, this label overlapping use-case is actually quite prevalent everywhere throughout the ecosystem - Service is one that I can think about right off the top of my head.

If there is, it would be great to tackle that in a consistent way.

I don't believe this is a general problem.

The thing with the HPA is that it has a targetRef, and not a selector.

$ kubectl explain Horizontalpodautoscaler.spec.scaleTargetRef
GROUP:      autoscaling
KIND:       HorizontalPodAutoscaler
VERSION:    v2

FIELD: scaleTargetRef <CrossVersionObjectReference>


DESCRIPTION:
    scaleTargetRef points to the target resource to scale, and is used to the
    pods for which metrics should be collected, as well as to actually change
    the replica count.
    CrossVersionObjectReference contains enough information to let you identify
    the referred resource.

FIELDS:
  apiVersion	<string>
    apiVersion is the API version of the referent

  kind	<string> -required-
    kind is the kind of the referent; More info:
    https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds

  name	<string> -required-
    name is the name of the referent; More info:
    https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names

As a user I read this as "The HPA will use the target's Pods and Metrics to scale my workload". But it's actually saying "The HPA will scale this workload, and it will take the label selector from this workload to find Pods/Metrics".

A Service is different, it just takes in a selector:

$ kubectl explain Service.spec.selector
KIND:       Service
VERSION:    v1

FIELD: selector <map[string]string>


DESCRIPTION:
    Route service traffic to pods with label keys and values matching this
    selector. If empty or not present, the service is assumed to have an
    external process managing its endpoints, which Kubernetes will not modify.
    Only applies to types ClusterIP, NodePort, and LoadBalancer. Ignored if type
    is ExternalName. More info:
    https://kubernetes.io/docs/concepts/services-networking/service/

So as a user I'm more likely to worry about overlapping workloads with the Service than I am with an HPA.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants