
fix: adds ErrorNotFound Handling for InferenceModel Reconciler #286

Merged 1 commit into kubernetes-sigs:main on Feb 10, 2025

Conversation

danehans (Contributor)

@danehans danehans commented Feb 4, 2025

  • Updates the InferenceModel reconciler to not return an error when the InferenceModel is not found.
  • Updates the InferenceModel reconciler to remove the InferenceModel from the datastore when it's not found.
  • Adds unit tests.

Fixes #280

Signed-off-by: Daneyon Hansen <[email protected]>
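For context, the standard controller-runtime idiom for tolerating a deleted object is `client.IgnoreNotFound`, which maps NotFound errors to nil so the request is not requeued. A generic sketch of that idiom (not this repo's exact code; `ExampleReconciler` and `ExampleObject` are illustrative names):

```go
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Generic sketch of the controller-runtime idiom for deleted objects;
// ExampleReconciler and ExampleObject are illustrative, not from this PR.
func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &ExampleObject{}
	if err := r.Client.Get(ctx, req.NamespacedName, obj); err != nil {
		// IgnoreNotFound returns nil for NotFound errors, so a deleted
		// object ends the reconcile cleanly instead of being requeued.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ... normal reconciliation continues here ...
	return ctrl.Result{}, nil
}
```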
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danehans

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from ahg-g February 4, 2025 21:01
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 4, 2025
@k8s-ci-robot k8s-ci-robot requested a review from kfswain February 4, 2025 21:01
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2025
netlify bot commented Feb 4, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 847343e
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67a2803b252a9100085d5458
😎 Deploy Preview https://deploy-preview-286--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 4, 2025
@danehans danehans changed the title Adds ErrorNotFound Handling for InferenceModel Reconciler fix: adds ErrorNotFound Handling for InferenceModel Reconciler Feb 5, 2025
@ahg-g (Contributor) left a comment


a couple of nits to use debugging levels.

service := &v1alpha1.InferenceModel{}
if err := c.Get(ctx, req.NamespacedName, service); err != nil {
klog.Error(err, "unable to get InferencePool")
klog.V(1).Infof("Reconciling InferenceModel %v", req.NamespacedName)
Contributor

use logging levels pls, DEFAULT in this case

Collaborator

++, setting this standard makes it easier for contributors to follow, and makes it easier to reason about what the author's intent was w.r.t. log level.

Edit: I see some areas in this file using V(1) already. We can kick this to a follow-up PR. Made: #307

Contributor

sg, let's merge this and follow up with a PR for the logging levels.

infModel := &v1alpha1.InferenceModel{}
if err := c.Get(ctx, req.NamespacedName, infModel); err != nil {
if errors.IsNotFound(err) {
klog.V(1).Infof("InferenceModel %v not found. Removing from datastore since object must be deleted", req.NamespacedName)
Contributor

ditto to using debugging levels

@kfswain (Collaborator) left a comment

Mostly LGTM; I have a comment w.r.t. the approach. Would like to close the loop there before we move forward. Thanks!!


if err := c.Get(ctx, req.NamespacedName, infModel); err != nil {
if errors.IsNotFound(err) {
klog.V(1).Infof("InferenceModel %v not found. Removing from datastore since object must be deleted", req.NamespacedName)
c.Datastore.InferenceModels.Delete(infModel.Spec.ModelName)
Collaborator

I'm a little worried about doing this, but maybe it's unfounded.

My concern is that if for some reason there is a small lapse in communication with the API server, we could remove the InferenceModel from the list of available models until the next reconciliation event (I believe the max interval is 10 min, assuming there is no event that triggers reconciliation). So in the worst case, we could knock down a user's service for 10 min, and then it pops back up unexpectedly, making for a rather nasty heisenbug. It's unfortunate that controller-runtime doesn't let you separate by delete/create/update events; this would be much more easily remedied.

WDYT?

Collaborator

Should we instead just not return the err? As I believe returning it causes controller-runtime to requeue the reconciled object. But that could lead to the issue of leaving up an IM that doesn't actually exist (granted, we currently have that issue).

For the long term we could (not this PR):

Contributor

We also need to check for deletionTimestamp and remove the InferenceModel if set.

Taking a step back, do we actually need an inferenceModel reconciler? do we need to store the inferenceModel objects ourselves? If not, I think we can drop it and just Get it (controller-runtime has an informer cache underneath).
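The deletionTimestamp check suggested above could look roughly like the following (a sketch only; `c.Datastore` and the field names mirror the diff quoted earlier in this thread, and are otherwise illustrative):

```go
// If the object carries a deletion timestamp (deletion pending on
// finalizers), treat it as gone from the datastore's point of view.
// Sketch only; names follow the diff quoted in this thread.
if !infModel.DeletionTimestamp.IsZero() {
	c.Datastore.InferenceModels.Delete(infModel.Spec.ModelName)
	return ctrl.Result{}, nil
}
```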

Contributor

@kfswain we can set up handlers for delete/create/update with controller-runtime, see for example kueue: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/core/resourceflavor_controller.go#L145
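The kueue pattern referenced above registers a custom event handler so deletes are observed directly from the informer rather than inferred from a failed Get. A rough sketch of that shape (all names illustrative; the workqueue parameter type in handler methods changed across controller-runtime versions, so treat the signature as approximate):

```go
// Sketch of a kueue-style event handler whose Delete hook evicts the
// model from the in-memory store as soon as the informer observes the
// deletion. Names (inferenceModelEventHandler, Datastore) are illustrative.
type inferenceModelEventHandler struct {
	datastore *Datastore
}

func (h *inferenceModelEventHandler) Delete(ctx context.Context, e event.DeleteEvent, q workqueue.RateLimitingInterface) {
	if im, ok := e.Object.(*v1alpha1.InferenceModel); ok {
		h.datastore.InferenceModels.Delete(im.Spec.ModelName)
	}
}

// Create, Update, and Generic would enqueue reconcile.Requests as usual.
```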

Collaborator

> We also need to check for deletionTimestamp and remove the InferenceModel if set.
>
> Taking a step back, do we actually need an inferenceModel reconciler? do we need to store the inferenceModel objects ourselves? If not, I think we can drop it and just Get it (controller-runtime has an informer cache underneath).

I was thinking about that also; were we ever to set up fairness, we would need to hold on to these objects, or at least their names as the key value for a cache of traffic data. Otherwise we could fetch as needed, as long as it's using a cache and not blasting the api-server. But is that any better than what we do here?

Contributor

Yeah, we need that for fairness, so we should keep it.

I think we need to set up a Delete handler to reliably delete the object from the store, just like we do in Kueue.

Collaborator

Definitely agreed there. Made: #310.

I think for this PR we can reduce the logging noise, and then have a separate PR addressing deletion events.

Is that fair?

Contributor

sg


ahg-g commented Feb 10, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025
@ahg-g ahg-g mentioned this pull request Feb 10, 2025

ahg-g commented Feb 10, 2025

Sent #317 which should allow the failed test to pass


ahg-g commented Feb 10, 2025

/retest

@k8s-ci-robot k8s-ci-robot merged commit 6c22d92 into kubernetes-sigs:main Feb 10, 2025
7 of 8 checks passed
Labels
approved · cncf-cla: yes · lgtm · size/L

Successfully merging this pull request may close these issues:

EPP Logs InferenceModel Not Found (#280)

4 participants