
removing endpointsliceReconciler and replacing with pod #268

Closed · wants to merge 5 commits

Conversation

@kfswain (Collaborator) commented Jan 31, 2025

No description provided.

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes, approved, and needs-rebase labels on Jan 31, 2025

netlify bot commented Jan 31, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 2379e93
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/679d428dd8cfff0008fef01d
😎 Deploy Preview: https://deploy-preview-268--gateway-api-inference-extension.netlify.app

@k8s-ci-robot added the size/XL label on Jan 31, 2025
@k8s-ci-robot removed the needs-rebase label on Jan 31, 2025
@kfswain changed the title from "Pod recon" to "removing endpointsliceReconciler and replacing with pod" on Jan 31, 2025
```go
    Record record.EventRecorder
}

func (c *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
```
A Contributor commented on the code above:

I think a reconciler on pods is not necessary: the pods are already cached in the informer cache, and this reconciler seems to just replicate what the informer cache does.

Calling a list using the provided selector should achieve the same effect as getting the list of pods from the datastore.
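
For illustration, a minimal sketch of the pattern being suggested: listing pods straight from the controller-runtime cache-backed client with a label selector. The function and parameter names here are assumptions, not code from this PR.

```go
package example

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// listMatchingPods lists the pods in namespace that match selector. With the
// manager's default client this read is served from the informer-backed
// cache, so no separate pod reconciler or datastore copy of the pods is needed.
func listMatchingPods(ctx context.Context, c client.Client, namespace string, selector map[string]string) ([]corev1.Pod, error) {
    podList := &corev1.PodList{}
    if err := c.List(ctx, podList,
        client.InNamespace(namespace),
        client.MatchingLabels(selector),
    ); err != nil {
        return nil, err
    }
    return podList.Items, nil
}
```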

Another review comment from a Contributor:

Another issue we will face is missed pod events (e.g., due to delayed InferencePool creation, selector updates, or ext-proc restarts), which will result in a stale datastore.

```go
        Name:    k8sPod.Name,
        Address: k8sPod.Status.PodIP + ":" + strconv.Itoa(int(inferencePool.Spec.TargetPortNumber)),
    }
    if !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
```
A Contributor commented on the code above:

I think we should only consider pods in the same namespace as the pool. I believe this is the default k8s Service behavior.

The author (@kfswain) replied:

Fair point. The previous endpointSliceReconciler just utilized the Service that was given, and so did not need to scope to a namespace. I can update this should we choose to go down this path.
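
A minimal sketch of the namespace scoping being discussed, reusing the identifiers visible in this diff; the early-return structure is an assumption about the surrounding reconciler, not the PR's exact code.

```go
    // Treat pods outside the pool's namespace like non-matching pods,
    // mirroring namespace-local Service selector semantics.
    if k8sPod.Namespace != inferencePool.Namespace ||
        !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) ||
        !podIsReady(k8sPod) {
        c.Datastore.pods.Delete(pod)
        return ctrl.Result{}, nil
    }
    c.Datastore.pods.Store(pod, true)
    return ctrl.Result{}, nil
```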

```go
        Address: k8sPod.Status.PodIP + ":" + strconv.Itoa(int(inferencePool.Spec.TargetPortNumber)),
    }
    if !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
        c.Datastore.pods.Delete(pod)
```
A Contributor commented on the code above:

nit: I believe deleting a non-existent pod is just a no-op, so it's safe to just delete here. Consider adding a comment noting that.
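
For instance, the comment could read something like the following; the wording is a suggestion, not text from the PR.

```go
    // Delete is a no-op when the pod is not in the map, so it is safe to
    // call unconditionally for pods that no longer match or are not ready.
    c.Datastore.pods.Delete(pod)
```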

```go
        if condition.Status == corev1.ConditionTrue {
            return true
        }
        break
```
A Contributor commented on the code above:

Suggested change: remove the `break`.
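
For context, a readiness helper written with a direct return makes the `break` unnecessary. A minimal sketch (this assumes how podIsReady is structured and that corev1 is already imported in the file; it is not the PR's exact code):

```go
func podIsReady(pod *corev1.Pod) bool {
    for _, condition := range pod.Status.Conditions {
        if condition.Type == corev1.PodReady {
            // Returning here covers both the true and false cases,
            // so no break is needed.
            return condition.Status == corev1.ConditionTrue
        }
    }
    return false
}
```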

```go
    if !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
        c.Datastore.pods.Delete(pod)
    } else {
        c.Datastore.pods.Store(pod, true)
```
A Contributor commented on the code above:

nit: Add some logs here
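
A sketch of what such logging might look like with the controller-runtime logger; the log messages, keys, and verbosity level are assumptions.

```go
    logger := log.FromContext(ctx) // sigs.k8s.io/controller-runtime/pkg/log
    if !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
        logger.V(1).Info("Removing pod from datastore", "pod", req.NamespacedName)
        c.Datastore.pods.Delete(pod)
    } else {
        logger.V(1).Info("Storing pod in datastore", "pod", req.NamespacedName)
        c.Datastore.pods.Store(pod, true)
    }
```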

@ahg-g (Contributor) commented Feb 3, 2025

#271 supersedes this PR

@ahg-g (Contributor) commented Feb 3, 2025

/hold

@k8s-ci-robot added the do-not-merge/hold label on Feb 3, 2025
@danehans (Contributor) commented Feb 3, 2025

@kfswain can you confirm that this PR fixes #256?

@kfswain (Collaborator, author) commented Feb 3, 2025

> @kfswain can you confirm that this PR fixes #256?

Yeah it would. There is an open question on scalability, however.

@ahg-g (Contributor) commented Feb 3, 2025

There are a couple of issues with this approach: scalability and data staleness.

Regarding scalability, this approach lists all pods in the cluster and does the filtering client-side. This we can address as a follow up, and so it is not my biggest concern.

My main concern is data staleness, this approach will cause missed pod events and so will result in stale cache data. Here are some examples:

  1. The model server pods are created and ready, but the inferencePool object isn’t created yet;
  2. The selector on the InferencePool object is updated;
  3. Races between reconciling pods and the inference pool object on ext-proc pod restarts.

In such cases, our cache will not be fresh until the pods get an update event to trigger another reconciliation on the pods, which in most cases will not happen soon enough. There is no escape from listing to address such issues. For example, listing on InferencePool reconciliation events.

As suggested in the original issue #256, we need to list. Since time is of the essence for this release, I worked in parallel on a solution where we rely on the underlying informer cache and the provider lists from it directly: #271

@kfswain are you aligned with the above?
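
For illustration, a minimal sketch of the "list on InferencePool reconciliation" idea. The Datastore type and its ReplacePods method are hypothetical placeholders, not code from either PR.

```go
package example

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Datastore is a stand-in for the EPP's pod/metrics cache.
type Datastore struct{ /* ... */ }

// ReplacePods is a hypothetical method that swaps in the freshly listed set.
func (d *Datastore) ReplacePods(pods []corev1.Pod) { /* ... */ }

// refreshPods rebuilds the pod cache from a fresh, selector-scoped list on
// every InferencePool reconcile, so selector updates, late pool creation, and
// ext-proc restarts cannot leave stale entries behind.
func refreshPods(ctx context.Context, c client.Client, ds *Datastore, poolNamespace string, selector map[string]string) error {
    podList := &corev1.PodList{}
    if err := c.List(ctx, podList,
        client.InNamespace(poolNamespace),
        client.MatchingLabels(selector),
    ); err != nil {
        return err
    }
    ds.ReplacePods(podList.Items)
    return nil
}
```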

@kfswain (Collaborator, author) commented Feb 4, 2025

> There are a couple of issues with this approach: scalability and data staleness.
>
> Regarding scalability, this approach lists all pods in the cluster and does the filtering client-side. This we can address as a follow up, and so it is not my biggest concern.
>
> My main concern is data staleness, this approach will cause missed pod events and so will result in stale cache data. Here are some examples:
>
> 1. The model server pods are created and ready, but the inferencePool object isn’t created yet;
> 2. The selector on the InferencePool object is updated;
> 3. Races between reconciling pods and the inference pool object on ext-proc pod restarts.
>
> In such cases, our cache will not be fresh until the pods get an update event to trigger another reconciliation on the pods, which in most cases will not happen soon enough. There is no escape from listing to address such issues. For example, listing on InferencePool reconciliation events.
>
> As suggested in the original issue #256, we need to list. Since time is of the essence for this release, I worked in parallel on a solution where we rely on the underlying informer cache and the provider lists from it directly: #271
>
> @kfswain are you aligned with the above?

I don't think so.

Points 1 & 3 are essentially the same event. And both require the InferencePool data, as it has the selector. So we would just requeue.

Point 2 is a cache dumping event, for sure. But we won't know that happens till we reconcile on the inferencePool anyway. So we could just dump the cache in that same reconciliation loop, and list the pods using the new selector as a filter (so we aren't fetching all pods).

Sure, we still catch all events for all pods in the cluster, but that could be YAGNI'd until we hit that wall.

I won't die on this hill, but I do wonder why we didn't stick with 790d3a2, which would have done all of this (the label selector we use to queue pods is pulled from the freshest InferencePool data). With the added benefit that this would have all been done in October.

Will leave comments in the other PR.

@kfswain closed this on Feb 4, 2025
@ahg-g (Contributor) commented Feb 4, 2025

> Points 1 & 3 are essentially the same event. And both require the InferencePool data, as it has the selector. So we would just requeue.
>
> Point 2 is a cache dumping event, for sure. But we won't know that happens till we reconcile on the inferencePool anyway. So we could just dump the cache in that same reconciliation loop, and list the pods using the new selector as a filter (so we aren't fetching all pods).

I agree on points 1 & 3, they are addressed with requeuing (I missed it in the code), but that is also expensive since it constantly requeues every pod in the cluster until the inferencePool is synced. For 2, we still need to list the pods on InferencePool reconcile events. But in any case, all three cases can be handled by listing the pods on InferencePool reconcile events (i.e., we don't need to requeue if we have to list anyway on inferencePool reconciliation), and I mentioned that in the comment above: "There is no escape from listing to address such issues. For example, listing on InferencePool reconciliation events."

I understand that we will list only the relevant pods, but as I clarify below, my pushback is about having to do all this and worry about whether we are handling all the cases needed to keep an unnecessary intermediate cache fresh.

> I won't die on this hill, but I do wonder why we didn't stick with 790d3a2, which would have done all of this (the label selector we use to queue pods is pulled from the freshest InferencePool data). With the added benefit that this would have all been done in October.

The change in direction was motivated by the ease of use of controller-runtime over client-go for building reconcilers, since it abstracts away the complexities of managing worker queues and event registration (we had discussions on controller-runtime itself later, but that is beside the point). #271 doesn't reverse that; it argues for getting rid of the pod reconciler, which 790d3a2 also employs. The idea is to avoid worrying about handling pod events in favor of directly listing from the informer cache whenever we need to probe metrics; it gets rid of the pod cache in the datastore and hopefully will allow us to remove the other cache we currently have in the provider.

Basically my argument is that this PR and 790d3a2 build an intermediate cache that we need to worry about keeping fresh by handling pod and inferencePool events, while #271 completely avoids that and so should be more resilient.

My previous experience with kube-scheduler and a number of workload controllers tells me we better avoid reconciling on pods if not needed, for both scale and resiliency.

> With the added benefit that this would have all been done in October.

790d3a2 has the same issues as this PR.

@kfswain (Collaborator, author) commented Feb 4, 2025

> that is also expensive since it constantly requeues every pod in the cluster until the inferencePool is synced

The EPP is in a holding pattern until the inferencePool is synced anyway, though.

> There is no escape from listing to address such issues. For example, listing on InferencePool reconciliation events

The only time we can't escape the list is on a selector change, which I do not expect to happen often, and that listing would happen on inferencePool reconciliation, the event in which the change occurs.

> #271 doesn't reverse that; it argues for getting rid of the pod reconciler

It employs the same informer logic. Now we just have no separation of concerns and a substantial drop in readability. And we are using informers in addition to controller-runtime, so it's a strange hybrid of clients that, again, hurts our code health.

> we need to worry about keeping fresh by handling pod and inferencePool events

You still have to do this, and you do this in #271.

But yes, there was an intermediary cache in those PRs; that was a temporary solution because the provider.go file needed to be refactored anyway. I was making the smallest possible change without disturbing others' work, and then we could have had a PR to tie them all together (which is why I use our locally defined Pod struct as the key). I thought that was standard practice.

The reconciler manages the keys in the map, and the metrics logic would populate the values for each key it finds (you do the same thing). They are on different control loops and manage different things.

We have to have some sort of datastore (cache) because we need to keep the metrics; we can't scrape metrics for every request. No different than what is happening here: https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/271/files#diff-8776d043539aff8c6475caca3d5a29688d12345b9527a795bac1780b4e5dd158R54

@ahg-g (Contributor) commented Feb 4, 2025

I want to take a step back and summarize the design pattern I had in mind; I don't think I concretely stated it. I was thinking of the informer as a cache for the pods we care about: it is always recent, listing pods from it is efficient, and it is a drop-in replacement for the cache we had in the datastore. Also, since there is a metrics refresh loop, I thought that loop would be enough to refresh the metrics cache with the latest state of the pods (ready, removed, added, etc.), and so we wouldn't need to set up a pod reconciler at all or think about which events we need to update the cache on.
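
A minimal sketch of that pattern, assuming a cache-backed client and a hypothetical refreshMetrics helper; none of these names come from the PRs.

```go
package example

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// refreshMetrics is a hypothetical stand-in for probing metrics on each pod.
func refreshMetrics(ctx context.Context, pods []corev1.Pod) { /* ... */ }

// runRefreshLoop periodically lists the relevant pods from the cache-backed
// client and refreshes the metrics cache, so no pod reconciler is needed to
// track additions, removals, or readiness changes.
func runRefreshLoop(ctx context.Context, c client.Client, namespace string, selector map[string]string, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            podList := &corev1.PodList{}
            if err := c.List(ctx, podList,
                client.InNamespace(namespace),
                client.MatchingLabels(selector),
            ); err != nil {
                continue // assumed: log the error and retry on the next tick
            }
            refreshMetrics(ctx, podList.Items)
        }
    }
}
```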

I agree we can achieve the same result with reconciliation on pods plus a full refresh of the metrics cache on inferencePool reconciliation. I wasn't sure initially whether that would cover all data staleness issues, but after this discussion, I believe it does. So this PR only misses the refresh of the cache on inference pool events, and I think that will also allow removing the requeueing. Removing the requeueing makes it possible to support multiple inferencePools in the future if we choose to.

I also agree that mixing client-go/informer and controller-runtime is not a good idea. I actually thought that controller-runtime uses informers underneath, but it turns out that it doesn't. controller-runtime has a similar cache library that can be used instead, though, conveniently called "Cache".
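
For reference, a sketch of how controller-runtime's cache is typically wired in; nothing here is specific to this repository, and the function name is an assumption.

```go
package example

import (
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// newCachedClient builds a manager and returns its cache-backed client.
// Reads (Get/List) through this client are served from the manager's cache
// (sigs.k8s.io/controller-runtime/pkg/cache), which plays the same role as a
// client-go informer cache without mixing client libraries.
func newCachedClient() (client.Client, error) {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
    if err != nil {
        return nil, err
    }
    // The cache only starts syncing once mgr.Start(ctx) is called elsewhere.
    return mgr.GetClient(), nil
}
```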

I apologize for the back and forth, this has been a great learning experience!

@ahg-g (Contributor) commented Feb 4, 2025

@kfswain and I discussed offline, and proposed to re-open this and revert the other one since this one achieves the same behavior if we list and refresh the cache on inferencePool events. We can do that after the release.

One thing to be cognizant of for the future is that it will be tricky to set up a server-side selector, because we need to have the pool to know the selector at the time of initializing the controller.
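
To make that constraint concrete, a sketch of where a cache-level pod selector would have to be supplied in controller-runtime; the selector value is a hypothetical placeholder, and this is not code from the PR.

```go
package example

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/cache"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManagerWithPodSelector shows why a cache-level selector is awkward here:
// the label selector must be fixed at manager construction time, before the
// InferencePool (which carries the real selector) has been read.
func newManagerWithPodSelector() (manager.Manager, error) {
    return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Cache: cache.Options{
            ByObject: map[client.Object]cache.ByObject{
                &corev1.Pod{}: {
                    // Hypothetical placeholder selector.
                    Label: labels.SelectorFromSet(labels.Set{"app": "model-server"}),
                },
            },
        },
    })
}
```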

/reopen

@k8s-ci-robot reopened this on Feb 4, 2025
@k8s-ci-robot (Contributor) commented:

@ahg-g: Reopened this PR.

In response to this:

> @kfswain and I discussed offline, and proposed to re-open this and revert the other one since this one achieves the same behavior if we list and refresh the cache on inferencePool events.
>
> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor) commented:

PR needs rebase.


@k8s-ci-robot added the needs-rebase label on Feb 4, 2025
@k8s-ci-robot (Contributor) commented:

@kfswain: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
pull-gateway-api-inference-extension-verify-main | 2379e93 | link | true | /test pull-gateway-api-inference-extension-verify-main
pull-gateway-api-inference-extension-test-unit-main | 2379e93 | link | true | /test pull-gateway-api-inference-extension-test-unit-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@kfswain closed this on Feb 6, 2025
@kfswain deleted the pod-recon branch on February 10, 2025