Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added inferencemodel predicate + minor changes in logging #397

Merged

Conversation

nirrozenbaum
Copy link
Contributor

added event predicate to inferencemodel reconciler.
the predicate verifies the inferencemodel ns equals the configured inferencepool ns and that it references the name of the configured inferencepool.
to avoid tricky bugs the predicate runs both on old object and new object in update func predicate.

fix #149

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 24, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @nirrozenbaum. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 24, 2025
Copy link

netlify bot commented Feb 24, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 5cc5fcb
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67bdc200039fb80008df414d
😎 Deploy Preview https://deploy-preview-397--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@nirrozenbaum
Copy link
Contributor Author

@kfswain @danehans

logger := log.FromContext(ctx)
loggerDefault := logger.V(logutil.DEFAULT)
loggerDefault.Info("Reconciling InferenceModel", "name", req.NamespacedName)
logger := log.FromContext(ctx).WithValues("Request.Namespace", req.Namespace, "Request.Name", req.Name).V(logutil.DEFAULT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Nir, I am updating the logging in #393, do you mind reverting the changes in the Reconcile func just so we don't cause lots of merge conflicts?

c.Datastore.ModelDelete(infModel.Spec.ModelName)
}

func (c *InferenceModelReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&v1alpha1.InferenceModel{}).
WithEventFilter(predicate.Funcs{
Copy link
Contributor

@ahg-g ahg-g Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am on the fence with this change. This filter is done client side, so in practice we are not significantly moving the needle by doing this at the scale of number of InferenceModels we expect in a cluster since Reconcile will be doing it anyway, and so this is repeating the same logic. But my biggest issue is that we don't have proper integration test coverage yet to test such changes, and we can't test this with unit tests.

Copy link
Contributor Author

@nirrozenbaum nirrozenbaum Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am on this fence with this change. This filter is done client side, so in practice we are not significantly moving the needle by doing this at the scale of number of InferenceModels we expect in a cluster since Reconcile will be doing it anyway, and so this is repeating the same logic. But my biggest issue is that we don't have proper integration test coverage yet to test such changes, and we can't test this with unit tests.

@ahg-g that is not so accurate. it might be the same logic, but it is done in a different place in the chain of calls. there are many good reasons to use the predicate.

when using a predicate the event is filtered before its object is being inserted into the controller workingqueue. this is the earliest stage we can filter event if we know we're not interested in it. on the other hand, when NOT using the predicate and keeping the code as is, the object is pushed to the controller wokringqueue, which leads to unnecessary CPU and memory utilization as well as unnecessary API Server calls.

using predicates can reduce the load on the API server and the controller's event watcher, which can be crucial in large clusters with many resources.

in addition it might harm situations when using customized configurations of the controller, like using rate limiting. if the rate limiting is configured to have 'X' events per sec, the redundant events will reduce the performance significantly in case we filter in the Reconcile function, while if we use the predicate the object is not inserted into the workingqueue and therefore not taken into consideration when calculating rate limiting.

another very good reason to use predicate is the separation of concerns. Using WithEventFilter is beneficial for keeping the Reconcile function focused on the actual business logic of handling resource state, rather than being concerned with filtering irrelevant events.

there are more good reasons but to summarize -
best practices recommend using predicate when knowing upfront if an event is irrelevant. you're correct that without unit test it's hard to quantify the difference, but others have done it before us so you can easily check in your favorite search engine and see.
I'll stop here. hope that convinced you.

as for the other comment - of course I don't mind reverting other changes.
I assume that can be pushed later in a different PR.
as in the other PR, this needs to be rebased due to #398.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked for documentation on this but couldnt find any... And in order to corroborate any statement, can someone point to a source that verifies if WithEventFilter runs server side, or client side?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WithEventFilter runs client-side (this logic has access to the epp objects and state!). This filtering has no impact on api-server load.

The filter that one can setup to run server side should be set via the ByObject cache option: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/cache/cache.go#L240

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, then there isn't an optimization, and we are actually using more lines of code to add complexity to understanding the reconciliation logic(A future reader could no longer read just the Reconcile func to understand the full context), right?

Copy link
Contributor

@ahg-g ahg-g Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and adding another comment - of course this filtering has impact on api server load!
for each such event the processing includes calling api server to fetch the object only to later understand it's irrelevant.

In the current code, all Get calls in Reconcile happen on the local cache, there are no api-server calls for those, the api-server load with and without this PR is the same.

Yes, with WithEventFilter, the filtering happens before the workqueue, but my point is this happens client-side, meaning all updates will be sent over the wire from the api-server to the controller, and in all cases it will be cached and indexed (the filter this PR adds is the one executed as resource event handler, between steps 6 and 7 in the diagram above).

I agree it saves reconcile calls and lighten up the workqueue, but my question on whether this will actually move the needle is because how many updates/adds/deletes do we expect InferenceModel objects to have for this to make a difference (vs for example objects that get updated/created/deleted frequently, like Pods).

In any case, I am not against this change, anywhere we can be more efficient and allow us to scale better is a good thing, but my main issue as I mentioned in the first comment was that we don't have proper integration test coverage yet to test such a change (specifically the delete and update paths), and we can't test it with unit tests. I know it is a small change, but I am starting to feel anxious about our low test coverage overall in this project, hence the concern. In #393 I tried to improve unit test coverage for the controllers and datastore pkgs for example. I think our integration tests need work to allow us to more easily add tests to cover paths that can't be covered with unit tests such as this one. Some of the work that I think we can do to improve integration tests is to allow in each test case to create/delete infModel and infPool objects, also Pods should be created on the api-server instead of injecting them in to the store directly.

Bu I am ok to move forward with this if others don't feel strongly about the test coverage concern above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree about the tests coverage, but I think it’s not related to this PR which obviously introduces improvement.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the the time element has been consistently undervalued wrt controller logic.
We have been having these conversations since October .

My argument is not against this PR specifically. It's that we have limited contribution resources, and dedicating them to discuss micro-optimizations I think does not meaningfully improve the project. Yes, this is better, but by how much? Based on all the time invested, will we actually see an RoI in this? How often will the EPP's performance be dictated by this bottleneck? Could we instead have been focused elsewhere and generate more value?

@nirrozenbaum , you're contributions are greatly appreciated, and I'm happy to approve this PR, I just want to see us move towards more efforts that I think are a little more worthy of all you smart folks' time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kfswain right. I’ve done some small PRs to get familiar with the code. I definitely agree and plan to contribute in more meaningful places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I think it’s not related to this PR which obviously introduces improvement.

it kinda is in the sense it adds code that is not covered in any test, I hope we can pull together here and improve the integration tests because this will allow us to do more changes with confidence :)

I will leave the approval to @kfswain
/lgtm

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 24, 2025
@nirrozenbaum nirrozenbaum force-pushed the inferencemodel-predicate branch from 681a67e to f90e309 Compare February 25, 2025 12:31
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 25, 2025
@nirrozenbaum nirrozenbaum force-pushed the inferencemodel-predicate branch from f90e309 to 5cc5fcb Compare February 25, 2025 13:12
@@ -177,7 +177,7 @@ func run() error {

// Register ext-proc server.
if err := mgr.Add(serverRunner.AsRunnable(ctrl.Log.WithName("ext-proc"))); err != nil {
setupLog.Error(err, "Failed to register ext-proc server")
setupLog.Error(err, "Failed to register ext-proc gRPC server")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the articulation here, tyty.

@ahg-g
Copy link
Contributor

ahg-g commented Feb 25, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 25, 2025
@kfswain
Copy link
Collaborator

kfswain commented Feb 25, 2025

/lgtm

thanks for the contributions!!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 25, 2025
@ahg-g
Copy link
Contributor

ahg-g commented Feb 25, 2025

/approve

since @kfswain lgtmed

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, nirrozenbaum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2025
@k8s-ci-robot k8s-ci-robot merged commit 2ad70e3 into kubernetes-sigs:main Feb 25, 2025
8 checks passed
@nirrozenbaum nirrozenbaum deleted the inferencemodel-predicate branch February 26, 2025 08:08
kaushikmitr pushed a commit to kaushikmitr/llm-instance-gateway that referenced this pull request Feb 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filter InferenceModels from Reconciliation
4 participants