
removing endpointsliceReconciler and replacing with pod #268

Closed · wants to merge 5 commits

Conversation

@kfswain (Collaborator) commented Jan 31, 2025

No description provided.

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes, approved, and needs-rebase labels on Jan 31, 2025

netlify bot commented Jan 31, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 2379e93
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/679d428dd8cfff0008fef01d
😎 Deploy Preview: https://deploy-preview-268--gateway-api-inference-extension.netlify.app

@k8s-ci-robot added the size/XL label on Jan 31, 2025
@k8s-ci-robot removed the needs-rebase label on Jan 31, 2025
@kfswain changed the title from "Pod recon" to "removing endpointsliceReconciler and replacing with pod" on Jan 31, 2025
```go
    Record record.EventRecorder
}

func (c *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
```
A Contributor commented on the code above:

I think a reconciler on pods is not necessary: the pods are already cached in the informer cache, and this reconciler seems to just replicate what the informer cache does.

Calling a list using the provided selector should achieve the same effect as getting the list of pods from the datastore.
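
For illustration, a minimal sketch of the pattern being suggested: listing pods straight from the controller-runtime cache-backed client with a label selector. The function and parameter names here are assumptions, not code from this PR.

```go
package example

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// listMatchingPods lists the pods in namespace that match selector. With the
// manager's default client this read is served from the informer-backed
// cache, so no separate pod reconciler or datastore copy of the pods is needed.
func listMatchingPods(ctx context.Context, c client.Client, namespace string, selector map[string]string) ([]corev1.Pod, error) {
    podList := &corev1.PodList{}
    if err := c.List(ctx, podList,
        client.InNamespace(namespace),
        client.MatchingLabels(selector),
    ); err != nil {
        return nil, err
    }
    return podList.Items, nil
}
```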

Another review comment from a Contributor:

Another issue we will face is missed pod events (e.g., due to delayed InferencePool creation, selector updates, or ext-proc restarts), which will result in a stale datastore.

```go
        Name:    k8sPod.Name,
        Address: k8sPod.Status.PodIP + ":" + strconv.Itoa(int(inferencePool.Spec.TargetPortNumber)),
    }
    if !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
```
A Contributor commented on the code above:

I think we should only consider pods in the same namespace as the pool. I believe this is the default k8s Service behavior.

The author (@kfswain) replied:

Fair point. The previous endpointSliceReconciler just utilized the Service that was given, and so did not need to scope to a namespace. I can update this should we choose to go down this path.
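
A minimal sketch of the namespace scoping being discussed, reusing the identifiers visible in this diff; the early-return structure is an assumption about the surrounding reconciler, not the PR's exact code.

```go
    // Treat pods outside the pool's namespace like non-matching pods,
    // mirroring namespace-local Service selector semantics.
    if k8sPod.Namespace != inferencePool.Namespace ||
        !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) ||
        !podIsReady(k8sPod) {
        c.Datastore.pods.Delete(pod)
        return ctrl.Result{}, nil
    }
    c.Datastore.pods.Store(pod, true)
    return ctrl.Result{}, nil
```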

```go
        Address: k8sPod.Status.PodIP + ":" + strconv.Itoa(int(inferencePool.Spec.TargetPortNumber)),
    }
    if !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
        c.Datastore.pods.Delete(pod)
```
A Contributor commented on the code above:

nit: I believe deleting a non-existent pod is just a no-op, so it's safe to just delete here. Consider adding a comment noting that.
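
For instance, the comment could read something like the following; the wording is a suggestion, not text from the PR.

```go
    // Delete is a no-op when the pod is not in the map, so it is safe to
    // call unconditionally for pods that no longer match or are not ready.
    c.Datastore.pods.Delete(pod)
```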

```go
        if condition.Status == corev1.ConditionTrue {
            return true
        }
        break
```
A Contributor commented on the code above:

Suggested change: remove the `break`.
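
For context, a readiness helper written with a direct return makes the `break` unnecessary. A minimal sketch (this assumes how podIsReady is structured and that corev1 is already imported in the file; it is not the PR's exact code):

```go
func podIsReady(pod *corev1.Pod) bool {
    for _, condition := range pod.Status.Conditions {
        if condition.Type == corev1.PodReady {
            // Returning here covers both the true and false cases,
            // so no break is needed.
            return condition.Status == corev1.ConditionTrue
        }
    }
    return false
}
```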

```go
    if !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
        c.Datastore.pods.Delete(pod)
    } else {
        c.Datastore.pods.Store(pod, true)
```
A Contributor commented on the code above:

nit: Add some logs here
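
A sketch of what such logging might look like with the controller-runtime logger; the log messages, keys, and verbosity level are assumptions.

```go
    logger := log.FromContext(ctx) // sigs.k8s.io/controller-runtime/pkg/log
    if !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
        logger.V(1).Info("Removing pod from datastore", "pod", req.NamespacedName)
        c.Datastore.pods.Delete(pod)
    } else {
        logger.V(1).Info("Storing pod in datastore", "pod", req.NamespacedName)
        c.Datastore.pods.Store(pod, true)
    }
```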

@ahg-g (Contributor) commented Feb 3, 2025

#271 supersedes this PR

@ahg-g (Contributor) commented Feb 3, 2025

/hold

@k8s-ci-robot added the do-not-merge/hold label on Feb 3, 2025
@danehans (Contributor) commented Feb 3, 2025

@kfswain can you confirm that this PR fixes #256?

@kfswain (Collaborator, author) commented Feb 3, 2025

> @kfswain can you confirm that this PR fixes #256?

Yeah it would. There is an open question on scalability, however.

@ahg-g (Contributor) commented Feb 3, 2025

There are a couple of issues with this approach: scalability and data staleness.

Regarding scalability, this approach lists all pods in the cluster and does the filtering client-side. This we can address as a follow up, and so it is not my biggest concern.

My main concern is data staleness, this approach will cause missed pod events and so will result in stale cache data. Here are some examples:

  1. The model server pods are created and ready, but the inferencePool object isn’t created yet;
  2. The selector on the InferencePool object is updated;
  3. Races between reconciling pods and the inference pool object on ext-proc pod restarts.

In such cases, our cache will not be fresh until the pods get an update event to trigger another reconciliation on the pods, which in most cases will not happen soon enough. There is no escape from listing to address such issues. For example, listing on InferencePool reconciliation events.

As suggested in the original issue #256, we need to list. Since time is of the essence for this release, I worked in parallel on a solution where we rely on the underlying informer cache and the provider lists from it directly: #271

@kfswain are you aligned with the above?
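
For illustration, a minimal sketch of the "list on InferencePool reconciliation" idea. The Datastore type and its ReplacePods method are hypothetical placeholders, not code from either PR.

```go
package example

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// Datastore is a stand-in for the EPP's pod/metrics cache.
type Datastore struct{ /* ... */ }

// ReplacePods is a hypothetical method that swaps in the freshly listed set.
func (d *Datastore) ReplacePods(pods []corev1.Pod) { /* ... */ }

// refreshPods rebuilds the pod cache from a fresh, selector-scoped list on
// every InferencePool reconcile, so selector updates, late pool creation, and
// ext-proc restarts cannot leave stale entries behind.
func refreshPods(ctx context.Context, c client.Client, ds *Datastore, poolNamespace string, selector map[string]string) error {
    podList := &corev1.PodList{}
    if err := c.List(ctx, podList,
        client.InNamespace(poolNamespace),
        client.MatchingLabels(selector),
    ); err != nil {
        return err
    }
    ds.ReplacePods(podList.Items)
    return nil
}
```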

@kfswain (Collaborator, author) commented Feb 4, 2025

> There are a couple of issues with this approach: scalability and data staleness.
>
> Regarding scalability, this approach lists all pods in the cluster and does the filtering client-side. This we can address as a follow up, and so it is not my biggest concern.
>
> My main concern is data staleness, this approach will cause missed pod events and so will result in stale cache data. Here are some examples:
>
> 1. The model server pods are created and ready, but the inferencePool object isn’t created yet;
> 2. The selector on the InferencePool object is updated;
> 3. Races between reconciling pods and the inference pool object on ext-proc pod restarts.
>
> In such cases, our cache will not be fresh until the pods get an update event to trigger another reconciliation on the pods, which in most cases will not happen soon enough. There is no escape from listing to address such issues. For example, listing on InferencePool reconciliation events.
>
> As suggested in the original issue #256, we need to list. Since time is of the essence for this release, I worked in parallel on a solution where we rely on the underlying informer cache and the provider lists from it directly: #271
>
> @kfswain are you aligned with the above?

I don't think so.

Points 1 & 3 are essentially the same event. And both require the InferencePool data, as it has the selector. So we would just requeue.

Point 2 is a cache dumping event, for sure. But we won't know that happens till we reconcile on the inferencePool anyway. So we could just dump the cache in that same reconciliation loop, and list the pods using the new selector as a filter (so we aren't fetching all pods).

Sure, we still catch all events for all pods in the cluster, but that could be YAGNI'd until we hit that wall.

I won't die on this hill, but I do wonder why we didn't stick with 790d3a2, which would have done all of this (the label selector we use to queue pods is pulled from the freshest InferencePool data). With the added benefit that this would have all been done in October.

Will leave comments in the other PR.

@kfswain closed this on Feb 4, 2025
@ahg-g (Contributor) commented Feb 4, 2025

> Points 1 & 3 are essentially the same event. And both require the InferencePool data, as it has the selector. So we would just requeue.
>
> Point 2 is a cache dumping event, for sure. But we won't know that happens till we reconcile on the inferencePool anyway. So we could just dump the cache in that same reconciliation loop, and list the pods using the new selector as a filter (so we aren't fetching all pods).

I agree on points 1 & 3, they are addressed with requeuing (I missed it in the code), but that is also expensive since it constantly requeues every pod in the cluster until the inferencePool is synced. For 2, we still need to list the pods on InferencePool reconcile events. But in any case, all three cases can be handled by listing the pods on InferencePool reconcile events (i.e., we don't need to requeue if we have to list anyway on inferencePool reconciliation), and I mentioned that in the comment above: "There is no escape from listing to address such issues. For example, listing on InferencePool reconciliation events."

I understand that we will list only the relevant pods, but as I clarify below, my pushback is about having to do all this and worry about whether we are handling all the cases needed to keep an unnecessary intermediate cache fresh.

> I won't die on this hill, but I do wonder why we didn't stick with 790d3a2, which would have done all of this (the label selector we use to queue pods is pulled from the freshest InferencePool data). With the added benefit that this would have all been done in October.

The change in direction was motivated by the ease of use of controller-runtime over client-go for building reconcilers, since it abstracts away the complexities of managing worker queues and event registration (we had discussions on controller-runtime itself later, but that is beside the point). #271 doesn't reverse that; it argues for getting rid of the pod reconciler, which 790d3a2 also employs. The idea is to avoid worrying about handling pod events in favor of directly listing from the informer cache whenever we need to probe metrics; it gets rid of the pod cache in the datastore and hopefully will allow us to remove the other cache we currently have in the provider.

Basically my argument is that this PR and 790d3a2 build an intermediate cache that we need to worry about keeping fresh by handling pod and inferencePool events, while #271 completely avoids that and so should be more resilient.

My previous experience with kube-scheduler and a number of workload controllers tells me we better avoid reconciling on pods if not needed, for both scale and resiliency.

> With the added benefit that this would have all been done in October.

790d3a2 has the same issues as this PR.

@kfswain (Collaborator, author) commented Feb 4, 2025

> that is also expensive since it constantly requeues every pod in the cluster until the inferencePool is synced

The EPP is in a holding pattern until the inferencePool is synced anyway, though.

> There is no escape from listing to address such issues. For example, listing on InferencePool reconciliation events

The only time we can't escape the list is on a selector change, which I do not expect to happen often, and that listing would happen on inferencePool reconciliation, the event in which the change occurs.

> #271 doesn't reverse that; it argues for getting rid of the pod reconciler

It employs the same informer logic. Now we just have no separation of concerns and a substantial drop in readability. And we are using informers in addition to controller-runtime, so it's a strange hybrid of clients that, again, hurts our code health.

> we need to worry about keeping fresh by handling pod and inferencePool events

You still have to do this, and you do this in #271.

But yes, there was an intermediary cache in those PRs; that was a temporary solution because the provider.go file needed to be refactored anyway. I was making the smallest possible change without disturbing others' work, and then we could have had a PR to tie them all together (which is why I use our locally defined Pod struct as the key). I thought that was standard practice.

The reconciler manages the keys in the map, and the metrics logic would populate the values for each key it finds (you do the same thing). They are on different control loops and manage different things.

We have to have some sort of datastore (cache) because we need to keep the metrics; we can't scrape metrics for every request. No different than what is happening here: https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/271/files#diff-8776d043539aff8c6475caca3d5a29688d12345b9527a795bac1780b4e5dd158R54

@ahg-g (Contributor) commented Feb 4, 2025

I want to take a step back and summarize the design pattern I had in mind; I don't think I concretely stated it. I was thinking of the informer as a cache for the pods we care about: it is always recent, listing pods from it is efficient, and it is a drop-in replacement for the cache we had in the datastore. Also, since there is a metrics refresh loop, I thought that loop would be enough to refresh the metrics cache with the latest state of the pods (ready, removed, added, etc.), and so we wouldn't need to set up a pod reconciler at all or think about which events we need to update the cache on.
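
A minimal sketch of that pattern, assuming a cache-backed client and a hypothetical refreshMetrics helper; none of these names come from the PRs.

```go
package example

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// refreshMetrics is a hypothetical stand-in for probing metrics on each pod.
func refreshMetrics(ctx context.Context, pods []corev1.Pod) { /* ... */ }

// runRefreshLoop periodically lists the relevant pods from the cache-backed
// client and refreshes the metrics cache, so no pod reconciler is needed to
// track additions, removals, or readiness changes.
func runRefreshLoop(ctx context.Context, c client.Client, namespace string, selector map[string]string, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            podList := &corev1.PodList{}
            if err := c.List(ctx, podList,
                client.InNamespace(namespace),
                client.MatchingLabels(selector),
            ); err != nil {
                continue // assumed: log the error and retry on the next tick
            }
            refreshMetrics(ctx, podList.Items)
        }
    }
}
```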

I agree we can achieve the same result with reconciliation on pods plus a full refresh of the metrics cache on inferencePool reconciliation. I wasn't sure initially whether that would cover all data staleness issues, but after this discussion, I believe it does. So this PR only misses the refresh of the cache on inference pool events, and I think that will also allow removing the requeueing. Removing the requeueing makes it possible to support multiple inferencePools in the future if we choose to.

I also agree that mixing client-go/informer and controller-runtime is not a good idea. I actually thought that controller-runtime uses informers underneath, but it turns out that it doesn't. controller-runtime has a similar cache library that can be used instead, though, conveniently called "Cache".
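
For reference, a sketch of how controller-runtime's cache is typically wired in; nothing here is specific to this repository, and the function name is an assumption.

```go
package example

import (
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// newCachedClient builds a manager and returns its cache-backed client.
// Reads (Get/List) through this client are served from the manager's cache
// (sigs.k8s.io/controller-runtime/pkg/cache), which plays the same role as a
// client-go informer cache without mixing client libraries.
func newCachedClient() (client.Client, error) {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
    if err != nil {
        return nil, err
    }
    // The cache only starts syncing once mgr.Start(ctx) is called elsewhere.
    return mgr.GetClient(), nil
}
```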

I apologize for the back and forth, this has been a great learning experience!

@ahg-g (Contributor) commented Feb 4, 2025

@kfswain and I discussed offline, and proposed to re-open this and revert the other one since this one achieves the same behavior if we list and refresh the cache on inferencePool events. We can do that after the release.

One thing to be cognizant of for the future is that it will be tricky to set up a server-side selector, because we need to have the pool to know the selector at the time of initializing the controller.
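
To make that constraint concrete, a sketch of where a cache-level pod selector would have to be supplied in controller-runtime; the selector value is a hypothetical placeholder, and this is not code from the PR.

```go
package example

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/cache"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManagerWithPodSelector shows why a cache-level selector is awkward here:
// the label selector must be fixed at manager construction time, before the
// InferencePool (which carries the real selector) has been read.
func newManagerWithPodSelector() (manager.Manager, error) {
    return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Cache: cache.Options{
            ByObject: map[client.Object]cache.ByObject{
                &corev1.Pod{}: {
                    // Hypothetical placeholder selector.
                    Label: labels.SelectorFromSet(labels.Set{"app": "model-server"}),
                },
            },
        },
    })
}
```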

/reopen

@k8s-ci-robot reopened this on Feb 4, 2025
@k8s-ci-robot (Contributor) commented:

@ahg-g: Reopened this PR.

In response to this:

> @kfswain and I discussed offline, and proposed to re-open this and revert the other one since this one achieves the same behavior if we list and refresh the cache on inferencePool events.
>
> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor) commented:

PR needs rebase.


@k8s-ci-robot added the needs-rebase label on Feb 4, 2025
@k8s-ci-robot (Contributor) commented:

@kfswain: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
pull-gateway-api-inference-extension-verify-main | 2379e93 | link | true | /test pull-gateway-api-inference-extension-verify-main
pull-gateway-api-inference-extension-test-unit-main | 2379e93 | link | true | /test pull-gateway-api-inference-extension-test-unit-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@kfswain closed this on Feb 6, 2025
@kfswain deleted the pod-recon branch on February 10, 2025