
fix: adds ErrorNotFound Handling for InferenceModel Reconciler #286

Merged 1 commit into kubernetes-sigs:main on Feb 10, 2025

Conversation

danehans (Contributor)

@danehans danehans commented Feb 4, 2025

  • Updates the InferenceModel reconciler to not return an error when the InferenceModel is not found.
  • Updates the InferenceModel reconciler to remove the InferenceModel from the datastore when it's not found.
  • Adds unit tests.

Fixes #280

Signed-off-by: Daneyon Hansen <[email protected]>
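For context, the standard controller-runtime idiom for tolerating a deleted object is `client.IgnoreNotFound`, which maps NotFound errors to nil so the request is not requeued. A generic sketch of that idiom (not this repo's exact code; `ExampleReconciler` and `ExampleObject` are illustrative names):

```go
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Generic sketch of the controller-runtime idiom for deleted objects;
// ExampleReconciler and ExampleObject are illustrative, not from this PR.
func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &ExampleObject{}
	if err := r.Client.Get(ctx, req.NamespacedName, obj); err != nil {
		// IgnoreNotFound returns nil for NotFound errors, so a deleted
		// object ends the reconcile cleanly instead of being requeued.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ... normal reconciliation continues here ...
	return ctrl.Result{}, nil
}
```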
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danehans

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from ahg-g February 4, 2025 21:01
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 4, 2025
@k8s-ci-robot k8s-ci-robot requested a review from kfswain February 4, 2025 21:01
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2025
netlify bot commented Feb 4, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 847343e
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67a2803b252a9100085d5458
😎 Deploy Preview https://deploy-preview-286--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 4, 2025
@danehans danehans changed the title Adds ErrorNotFound Handling for InferenceModel Reconciler fix: adds ErrorNotFound Handling for InferenceModel Reconciler Feb 5, 2025
@ahg-g (Contributor) left a comment


a couple of nits to use debugging levels.

service := &v1alpha1.InferenceModel{}
if err := c.Get(ctx, req.NamespacedName, service); err != nil {
klog.Error(err, "unable to get InferencePool")
klog.V(1).Infof("Reconciling InferenceModel %v", req.NamespacedName)
Contributor

use logging levels pls, DEFAULT in this case

Collaborator

++, setting this standard makes it easier for contributors to follow, and makes it easier to reason about what the author's intent was w.r.t. log level.

Edit: I see some areas in this file using V(1) already. We can kick this to a follow-up PR. Made: #307

Contributor

sg, let's merge this and follow up with a PR for the logging levels.

infModel := &v1alpha1.InferenceModel{}
if err := c.Get(ctx, req.NamespacedName, infModel); err != nil {
if errors.IsNotFound(err) {
klog.V(1).Infof("InferenceModel %v not found. Removing from datastore since object must be deleted", req.NamespacedName)
Contributor

ditto to using debugging levels

@kfswain (Collaborator) left a comment

Mostly LGTM; I have a comment w.r.t. the approach. Would like to close the loop there before we move forward. Thanks!!


if err := c.Get(ctx, req.NamespacedName, infModel); err != nil {
if errors.IsNotFound(err) {
klog.V(1).Infof("InferenceModel %v not found. Removing from datastore since object must be deleted", req.NamespacedName)
c.Datastore.InferenceModels.Delete(infModel.Spec.ModelName)
Collaborator

I'm a little worried about doing this, but maybe it's unfounded.

My concern is that if for some reason there is a small lapse in communication with the API server, we could remove the InferenceModel from the list of available models until the next reconciliation event (I believe the max interval is 10 min, assuming there is no event that triggers reconciliation). So in the worst case, we could knock down a user's service for 10 min, and then it pops back up unexpectedly, making for a rather nasty heisenbug. It's unfortunate that controller-runtime doesn't let you separate by delete/create/update events; this would be much more easily remedied.

WDYT?

Collaborator

Should we instead just not return the err? As I believe returning it causes controller-runtime to requeue the reconciled object. But that could lead to the issue of leaving up an IM that doesn't actually exist (granted, we currently have that issue).

For the long term we could (not this PR):

Contributor

We also need to check for deletionTimestamp and remove the InferenceModel if set.

Taking a step back, do we actually need an inferenceModel reconciler? do we need to store the inferenceModel objects ourselves? If not, I think we can drop it and just Get it (controller-runtime has an informer cache underneath).
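The deletionTimestamp check suggested above could look roughly like the following (a sketch only; `c.Datastore` and the field names mirror the diff quoted earlier in this thread, and are otherwise illustrative):

```go
// If the object carries a deletion timestamp (deletion pending on
// finalizers), treat it as gone from the datastore's point of view.
// Sketch only; names follow the diff quoted in this thread.
if !infModel.DeletionTimestamp.IsZero() {
	c.Datastore.InferenceModels.Delete(infModel.Spec.ModelName)
	return ctrl.Result{}, nil
}
```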

Contributor

@kfswain we can set up handlers for delete/create/update with controller-runtime, see for example kueue: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/core/resourceflavor_controller.go#L145
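The kueue pattern referenced above registers a custom event handler so deletes are observed directly from the informer rather than inferred from a failed Get. A rough sketch of that shape (all names illustrative; the workqueue parameter type in handler methods changed across controller-runtime versions, so treat the signature as approximate):

```go
// Sketch of a kueue-style event handler whose Delete hook evicts the
// model from the in-memory store as soon as the informer observes the
// deletion. Names (inferenceModelEventHandler, Datastore) are illustrative.
type inferenceModelEventHandler struct {
	datastore *Datastore
}

func (h *inferenceModelEventHandler) Delete(ctx context.Context, e event.DeleteEvent, q workqueue.RateLimitingInterface) {
	if im, ok := e.Object.(*v1alpha1.InferenceModel); ok {
		h.datastore.InferenceModels.Delete(im.Spec.ModelName)
	}
}

// Create, Update, and Generic would enqueue reconcile.Requests as usual.
```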

Collaborator

> We also need to check for deletionTimestamp and remove the InferenceModel if set.
>
> Taking a step back, do we actually need an inferenceModel reconciler? do we need to store the inferenceModel objects ourselves? If not, I think we can drop it and just Get it (controller-runtime has an informer cache underneath).

I was thinking about that also; were we ever to set up fairness, we would need to hold on to these objects, or at least their names as the key value for a cache of traffic data. Otherwise we could fetch as needed, as long as it's using a cache and not blasting the api-server. But is that any better than what we do here?

Contributor

Yeah, we need that for fairness, so we should keep it.

I think we need to set up a Delete handler to reliably delete the object from the store, just like we do in Kueue.

Collaborator

Definitely agreed there. Made: #310.

I think for this PR we can reduce the logging noise, and then have a separate PR addressing deletion events.

Is that fair?

Contributor

sg


ahg-g commented Feb 10, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025
@ahg-g ahg-g mentioned this pull request Feb 10, 2025

ahg-g commented Feb 10, 2025

Sent #317 which should allow the failed test to pass


ahg-g commented Feb 10, 2025

/retest

@k8s-ci-robot k8s-ci-robot merged commit 6c22d92 into kubernetes-sigs:main Feb 10, 2025
7 of 8 checks passed
Labels
approved · cncf-cla: yes · lgtm · size/L

Successfully merging this pull request may close these issues:

EPP Logs InferenceModel Not Found (#280)

4 participants