fix: adds ErrorNotFound Handling for InferenceModel Reconciler #286
Conversation
Signed-off-by: Daneyon Hansen <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: danehans The full list of commands accepted by this bot can be found here. The pull request process is described here
✅ Deploy Preview for gateway-api-inference-extension ready!
a couple of nits to use debugging levels.
```go
service := &v1alpha1.InferenceModel{}
if err := c.Get(ctx, req.NamespacedName, service); err != nil {
	klog.Error(err, "unable to get InferencePool")
klog.V(1).Infof("Reconciling InferenceModel %v", req.NamespacedName)
```
use logging levels pls, DEFAULT in this case
++, setting this standard makes it easier for contributors to follow, and makes it easier to reason about what the author's intent was wrt log level.
Edit: I see some areas in this file using V(1) already. We can kick this to a follow-up PR. Made: #307
sg, let's merge this and follow up with a PR for the logging levels.
```go
infModel := &v1alpha1.InferenceModel{}
if err := c.Get(ctx, req.NamespacedName, infModel); err != nil {
	if errors.IsNotFound(err) {
		klog.V(1).Infof("InferenceModel %v not found. Removing from datastore since object must be deleted", req.NamespacedName)
```
ditto to using debugging levels
Mostly LGTM, have a comment wrt the approach. Would like to close the loop there before we move forward. Thanks!!
```go
if err := c.Get(ctx, req.NamespacedName, infModel); err != nil {
	if errors.IsNotFound(err) {
		klog.V(1).Infof("InferenceModel %v not found. Removing from datastore since object must be deleted", req.NamespacedName)
		c.Datastore.InferenceModels.Delete(infModel.Spec.ModelName)
```
I'm a little worried about doing this, but maybe it's unfounded.
My concern is that if, for some reason, there is a small lapse in communication with the API server, we could remove the InferenceModel from the list of available models until the next reconciliation event (I believe the max time is 10 min, assuming there is no event that triggers reconciliation). So in the worst case, we could knock down a user's service for 10 min, and then it pops back up unexpectedly, making for a rather nasty heisenbug. It's unfortunate that controller-runtime doesn't let you separate delete/create/update events; this would be much more easily remedied otherwise.
WDYT?
Should we instead just not return the err? I believe returning it causes controller-runtime to requeue the reconciliation obj. But that could lead to leaving up an InferenceModel that doesn't actually exist (granted, we currently have that issue).
For the long term we could (not this PR):
- Use https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/builder#TypedBuilder.WithEventFilter; I think we could do something messy by making some sort of `cleanupReconciler`.
- Switch to listing the infModels per request and using the informer cache to do so.

But also it's handled well in the non-controller-runtime paradigm: https://github.com/kubernetes/sample-controller/blob/master/controller.go#L152
We also need to check for deletionTimestamp and remove the InferenceModel if set.
Taking a step back, do we actually need an InferenceModel reconciler? Do we need to store the InferenceModel objects ourselves? If not, I think we can drop it and just Get it (controller-runtime has an informer cache underneath).
@kfswain we can set up handlers for delete/create/update with controller-runtime; see for example Kueue: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/core/resourceflavor_controller.go#L145
> We also need to check for deletionTimestamp and remove the InferenceModel if set.
> Taking a step back, do we actually need an InferenceModel reconciler? Do we need to store the InferenceModel objects ourselves? If not, I think we can drop it and just Get it (controller-runtime has an informer cache underneath).

I was thinking about that also. Were we ever to set up fairness, we would need to hold on to these objects, or at least their names as the key values for a cache of traffic data. Otherwise we could fetch as needed, as long as it's using a cache and not blasting the api-server. But is that any better than what we would do here?
yeah, we need that for fairness, we should keep it.
I think we need to set up a Delete handler to reliably delete the object from the store, just like we do with Kueue.
Definitely agreed there. Made: #310.
I think for this PR we can reduce the logging noise, and then have a separate PR addressing deletion events.
Is that fair?
sg
/lgtm
Sent #317, which should allow the failed test to pass
/retest
Fixes #280