Remove fatal log calls in executable code #265

Merged
merged 1 commit into kubernetes-sigs:main from drop-fatal-logging
Feb 13, 2025

Conversation

@tchap (Contributor) commented Jan 30, 2025

This is related to #251

K8s logging guidelines don't allow fatal logging calls. They could be replaced by the documented alternatives, but returning errors properly is a cleaner solution, and that is what this PR implements.

klog.ErrorS is used instead of fatal calls.

errgroup.Group is used to manage goroutines and terminate on the first error.
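
Roughly, the pattern looks like this (a simplified sketch, not the actual code in this PR; runExtProcServer is a hypothetical stand-in):

package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"os/signal"
	"syscall"

	"golang.org/x/sync/errgroup"
	"k8s.io/klog/v2"
)

func runExtProcServer(ctx context.Context, port int) error {
	lis, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
	if err != nil {
		// Previously a fatal log here would exit the process immediately;
		// now the error is logged with ErrorS and returned to the caller.
		klog.ErrorS(err, "Ext-proc server failed to listen", "port", port)
		return err
	}
	defer lis.Close()
	<-ctx.Done() // serve until the shared context is canceled
	return nil
}

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return runExtProcServer(ctx, 9002) })
	// Additional servers (health, metrics) would be started the same way.

	if err := g.Wait(); err != nil {
		os.Exit(1) // non-zero exit on failure
	}
}

On the first error (or a signal), the shared context is canceled, the remaining goroutines wind down, and the process exits non-zero, which is what the failing-to-bind run below shows.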

Testing

Regarding testing, I don't have the infrastructure to run more than unit tests right now, but I built the executable and tested it manually in the console, both for the signal-received scenario and for failing to bind to a port. If there is a better way to test this, let me know.

Signal

$ ./ext-proc -poolName vllm-llama2-7b-pool -v 3 -serviceName vllm-llama2-7b-pool -grpcPort 9002 -grpcHealthPort 9003
79487 main.go:110] Flags: add_dir_header=false; alsologtostderr=false; grpcHealthPort=9003; grpcPort=9002; kubeconfig=; log_backtrace_at=:0; log_dir=; log_file=; log_file_max_size=1800; logtostderr=true; metricsPort=9090; one_output=false; poolName=vllm-llama2-7b-pool; poolNamespace=default; refreshMetricsInterval=50ms; refreshPodsInterval=10s; serviceName=vllm-llama2-7b-pool; skip_headers=false; skip_log_headers=false; stderrthreshold=2; targetPodHeader=target-pod; v=3; vmodule=; zone=;
I0131 00:27:28.161098   79487 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0131 00:27:28.161112   79487 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
I0131 00:27:28.161117   79487 envvar.go:172] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
I0131 00:27:28.161121   79487 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I0131 00:27:28.163076   79487 main.go:159] "Health server starting..."
I0131 00:27:28.163480   79487 runserver.go:120] "Ext-proc server starting..."
I0131 00:27:28.163497   79487 runserver.go:178] "Controller manager starting..."
I0131 00:27:28.163661   79487 main.go:195] "Metrics HTTP handler starting..."
I0131 00:27:28.163931   79487 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"
I0131 00:27:28.164167   79487 runserver.go:133] "Ext-proc server listening" port=9002
I0131 00:27:28.164194   79487 main.go:168] "Health server listening" port=9003
I0131 00:27:28.164299   79487 main.go:204] "Metrics HTTP handler listening" port=9090
I0131 00:27:28.165215   79487 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8080" secure=false
I0131 00:27:28.165704   79487 provider.go:68] Initialized pods and metrics: []
I0131 00:27:28.165868   79487 controller.go:198] "Starting EventSource" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice" source="kind source: *v1.EndpointSlice"
I0131 00:27:28.166035   79487 controller.go:198] "Starting EventSource" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel" source="kind source: *v1alpha1.InferenceModel"
I0131 00:27:28.166051   79487 controller.go:198] "Starting EventSource" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool" source="kind source: *v1alpha1.InferencePool"
I0131 00:27:28.185116   79487 reflector.go:313] Starting reflector *v1.EndpointSlice (10h55m52.073023241s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
I0131 00:27:28.185136   79487 reflector.go:349] Listing and watching *v1.EndpointSlice from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
E0131 00:27:28.186356   79487 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"InferencePool\" in version \"inference.networking.x-k8s.io/v1alpha1\"" logger="controller-runtime.source.EventHandler" kind="InferencePool.inference.networking.x-k8s.io"
E0131 00:27:28.189261   79487 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"InferenceModel\" in version \"inference.networking.x-k8s.io/v1alpha1\"" logger="controller-runtime.source.EventHandler" kind="InferenceModel.inference.networking.x-k8s.io"
I0131 00:27:28.190927   79487 reflector.go:376] Caches populated for *v1.EndpointSlice from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
I0131 00:27:28.285202   79487 controller.go:233] "Starting Controller" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:27:28.285251   79487 controller.go:242] "Starting workers" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice" worker count=1
^CI0131 00:27:32.013831   79487 runserver.go:155] "Ext-proc server shutting down..."
I0131 00:27:32.013838   79487 main.go:176] "Health server shutting down..."
I0131 00:27:32.013849   79487 internal.go:538] "Stopping and waiting for non leader election runnables"
I0131 00:27:32.013866   79487 main.go:223] "Metrics HTTP handler shutting down..."
I0131 00:27:32.013884   79487 internal.go:542] "Stopping and waiting for leader election runnables"
I0131 00:27:32.013909   79487 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:27:32.013951   79487 controller.go:233] "Starting Controller" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:27:32.013957   79487 main.go:233] "Metrics HTTP handler terminated"
I0131 00:27:32.013972   79487 controller.go:233] "Starting Controller" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:27:32.013979   79487 controller.go:242] "Starting workers" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool" worker count=1
I0131 00:27:32.013999   79487 controller.go:264] "All workers finished" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:27:32.014003   79487 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:27:32.014024   79487 runserver.go:166] "Ext-proc server terminated"
I0131 00:27:32.014054   79487 controller.go:264] "All workers finished" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:27:32.014006   79487 controller.go:242] "Starting workers" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel" worker count=1
I0131 00:27:32.014116   79487 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:27:32.014073   79487 main.go:186] "Health server terminated"
I0131 00:27:32.014169   79487 controller.go:264] "All workers finished" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:27:32.014203   79487 internal.go:550] "Stopping and waiting for caches"
I0131 00:27:32.014353   79487 reflector.go:319] Stopping reflector *v1.EndpointSlice (10h55m52.073023241s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
I0131 00:27:32.014425   79487 internal.go:554] "Stopping and waiting for webhooks"
I0131 00:27:32.014457   79487 internal.go:557] "Stopping and waiting for HTTP servers"
I0131 00:27:32.014486   79487 server.go:254] "Shutting down metrics server with timeout of 1 minute" logger="controller-runtime.metrics"
I0131 00:27:32.014551   79487 internal.go:561] "Wait completed, proceeding to shutdown the manager"
I0131 00:27:32.014570   79487 runserver.go:184] "Controller manager terminated"
I0131 00:27:32.014585   79487 main.go:152] "All components terminated"

$ echo $?
0

Failing to bind

$ ./ext-proc -poolName vllm-llama2-7b-pool -v 3 -serviceName vllm-llama2-7b-pool -grpcPort 9002 -grpcHealthPort 9002
78423 main.go:110] Flags: add_dir_header=false; alsologtostderr=false; grpcHealthPort=9002; grpcPort=9002; kubeconfig=; log_backtrace_at=:0; log_dir=; log_file=; log_file_max_size=1800; logtostderr=true; metricsPort=9090; one_output=false; poolName=vllm-llama2-7b-pool; poolNamespace=default; refreshMetricsInterval=50ms; refreshPodsInterval=10s; serviceName=vllm-llama2-7b-pool; skip_headers=false; skip_log_headers=false; stderrthreshold=2; targetPodHeader=target-pod; v=3; vmodule=; zone=;
I0131 00:24:40.837170   78423 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
I0131 00:24:40.837184   78423 envvar.go:172] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
I0131 00:24:40.837190   78423 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I0131 00:24:40.837195   78423 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0131 00:24:40.837500   78423 main.go:159] "Health server starting..."
I0131 00:24:40.837507   78423 runserver.go:178] "Controller manager starting..."
I0131 00:24:40.837526   78423 runserver.go:120] "Ext-proc server starting..."
I0131 00:24:40.837599   78423 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"
I0131 00:24:40.837640   78423 main.go:195] "Metrics HTTP handler starting..."
I0131 00:24:40.838002   78423 main.go:168] "Health server listening" port=9002
E0131 00:24:40.838056   78423 runserver.go:126] "Ext-proc server failed to listen" err="listen tcp :9002: bind: address already in use" port=9002
I0131 00:24:40.838100   78423 main.go:204] "Metrics HTTP handler listening" port=9090
I0131 00:24:40.838172   78423 main.go:176] "Health server shutting down..."
I0131 00:24:40.838113   78423 internal.go:538] "Stopping and waiting for non leader election runnables"
I0131 00:24:40.838190   78423 controller.go:198] "Starting EventSource" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool" source="kind source: *v1alpha1.InferencePool"
I0131 00:24:40.838195   78423 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8080" secure=false
I0131 00:24:40.838241   78423 main.go:186] "Health server terminated"
I0131 00:24:40.838377   78423 internal.go:542] "Stopping and waiting for leader election runnables"
I0131 00:24:40.838549   78423 controller.go:198] "Starting EventSource" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice" source="kind source: *v1.EndpointSlice"
I0131 00:24:40.838574   78423 controller.go:233] "Starting Controller" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:24:40.838606   78423 controller.go:242] "Starting workers" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool" worker count=1
I0131 00:24:40.838622   78423 controller.go:233] "Starting Controller" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:24:40.838609   78423 controller.go:198] "Starting EventSource" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel" source="kind source: *v1alpha1.InferenceModel"
I0131 00:24:40.838645   78423 controller.go:242] "Starting workers" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel" worker count=1
I0131 00:24:40.838626   78423 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:24:40.838609   78423 controller.go:233] "Starting Controller" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:24:40.838699   78423 controller.go:242] "Starting workers" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice" worker count=1
I0131 00:24:40.838671   78423 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:24:40.838742   78423 controller.go:264] "All workers finished" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:24:40.838792   78423 controller.go:264] "All workers finished" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:24:40.838762   78423 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:24:40.838814   78423 controller.go:264] "All workers finished" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:24:40.838825   78423 internal.go:550] "Stopping and waiting for caches"
I0131 00:24:40.839869   78423 main.go:223] "Metrics HTTP handler shutting down..."
I0131 00:24:40.839906   78423 main.go:233] "Metrics HTTP handler terminated"
E0131 00:24:40.852622   78423 kind.go:76] "failed to get informer from cache" err="unable to retrieve the complete list of server APIs: inference.networking.x-k8s.io/v1alpha1: no matches for inference.networking.x-k8s.io/v1alpha1, Resource=" logger="controller-runtime.source.EventHandler"
E0131 00:24:40.852710   78423 kind.go:76] "failed to get informer from cache" err="Timeout: failed waiting for *v1.EndpointSlice Informer to sync" logger="controller-runtime.source.EventHandler"
I0131 00:24:40.852796   78423 reflector.go:313] Starting reflector *v1.EndpointSlice (9h18m22.86490117s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
I0131 00:24:40.852813   78423 reflector.go:319] Stopping reflector *v1.EndpointSlice (9h18m22.86490117s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
E0131 00:24:40.855061   78423 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"InferenceModel\" in version \"inference.networking.x-k8s.io/v1alpha1\"" logger="controller-runtime.source.EventHandler" kind="InferenceModel.inference.networking.x-k8s.io"
I0131 00:24:40.855079   78423 internal.go:554] "Stopping and waiting for webhooks"
I0131 00:24:40.855097   78423 internal.go:557] "Stopping and waiting for HTTP servers"
I0131 00:24:40.855423   78423 server.go:254] "Shutting down metrics server with timeout of 1 minute" logger="controller-runtime.metrics"
I0131 00:24:40.855772   78423 internal.go:561] "Wait completed, proceeding to shutdown the manager"
I0131 00:24:40.855794   78423 runserver.go:184] "Controller manager terminated"
I0131 00:24:40.855826   78423 main.go:152] "All components terminated"

$ echo $?
1

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 30, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and kfswain January 30, 2025 23:31
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 30, 2025
@k8s-ci-robot (Contributor) commented:

Hi @tchap. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 30, 2025
netlify bot commented Jan 30, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: faa2c5e
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67ad3c2abae2fc0009ecf96e
😎 Deploy Preview: https://deploy-preview-265--gateway-api-inference-extension.netlify.app

@ahg-g (Contributor) commented Jan 31, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 31, 2025
@danehans (Contributor) commented Feb 3, 2025

@tchap please excuse our delay as we are preparing to release v0.1.0-rc1.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 3, 2025
@tchap tchap force-pushed the drop-fatal-logging branch from e7ac50e to 3471446 Compare February 3, 2025 21:37
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 3, 2025
@tchap (Contributor, Author) commented Feb 5, 2025

Is there actually any reason why all the "runnables" are not being managed by controller-runtime? It would be nice to just wrap everything in the Start(ctx context.Context) error runnable interface and pass it all to the manager. Or is there anything I am not seeing right now?
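
Roughly what I have in mind (a simplified sketch, not actual code from this repo; healthServer is a hypothetical stand-in):

package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// healthServer is a hypothetical stand-in for one of the extra servers.
type healthServer struct{}

// Start blocks until ctx is canceled, which is exactly the contract
// controller-runtime's manager.Runnable interface expects.
func (s *healthServer) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// Anything with Start(ctx context.Context) error can be handed to the manager.
	if err := mgr.Add(&healthServer{}); err != nil {
		panic(err)
	}
	// A plain function works too, wrapped in manager.RunnableFunc.
	if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		<-ctx.Done()
		return nil
	})); err != nil {
		panic(err)
	}

	// The manager owns the lifecycle of every registered runnable and stops
	// them all when the signal handler cancels the context.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}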

@danehans (Contributor) commented Feb 6, 2025

d7fdc8f added the runnables.

@benjamin do you have any insight that you can share?

I'm in favor of wrapping the other servers in controller-runtime's Start() as long as all goroutines are cleaned up properly.
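
One way to keep those goroutines accounted for would be something like this (just a sketch under that assumption; grpcRunnable is a hypothetical name, not code from this PR):

package main

import (
	"context"
	"net"
	"time"

	"google.golang.org/grpc"
)

// grpcRunnable is a hypothetical wrapper; Start owns the serving goroutine
// and drains it before returning, so nothing leaks on shutdown.
type grpcRunnable struct {
	addr string
	srv  *grpc.Server
}

func (r *grpcRunnable) Start(ctx context.Context) error {
	lis, err := net.Listen("tcp", r.addr)
	if err != nil {
		return err // surfaces through the caller instead of a fatal log
	}

	errCh := make(chan error, 1)
	go func() { errCh <- r.srv.Serve(lis) }()

	select {
	case err := <-errCh:
		return err // Serve failed on its own
	case <-ctx.Done():
		r.srv.GracefulStop() // stop accepting, let in-flight RPCs finish
		return <-errCh       // wait for the serving goroutine to exit
	}
}

func main() {
	// Run the server briefly and then shut it down cleanly.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	r := &grpcRunnable{addr: ":0", srv: grpc.NewServer()}
	if err := r.Start(ctx); err != nil {
		panic(err)
	}
}

When ctx is canceled, GracefulStop causes Serve to return, so Start only returns once the serving goroutine has actually exited.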

@tchap tchap force-pushed the drop-fatal-logging branch from 3471446 to bed112a Compare February 6, 2025 10:20
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2025
@tchap tchap force-pushed the drop-fatal-logging branch 3 times, most recently from 050c6ae to 78b367f Compare February 6, 2025 12:01
@tchap (Contributor, Author) commented Feb 6, 2025

Ok, everything is being managed by controller-runtime now.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2025
@tchap tchap force-pushed the drop-fatal-logging branch from db1aafa to 03f317a Compare February 7, 2025 07:16
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 7, 2025
@tchap tchap force-pushed the drop-fatal-logging branch from 03f317a to 4666c76 Compare February 12, 2025 23:33
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 12, 2025
@tchap (Contributor, Author) commented Feb 12, 2025

I would really like to get this merged so that I can continue refactoring logging. I've been rebasing this quite a lot 😅

@tchap (Contributor, Author) commented Feb 12, 2025

I have to say I am not entirely in love with how the manager is created in runserver and then used in main. I would like to try to improve that in a subsequent PR, but I didn't want to rewrite everything here since the PR already targets multiple things.

All Fatal log calls are removed. Also, all runnable components
are now managed by controller-runtime, implementing the
manager.Runnable interface.
@tchap tchap force-pushed the drop-fatal-logging branch from 4666c76 to faa2c5e Compare February 13, 2025 00:26
@ahg-g (Contributor) commented Feb 13, 2025

Thanks for doing this, very neat!

I ran an e2e test on this PR and also checked the metrics.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, tchap

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 0662f1f into kubernetes-sigs:main Feb 13, 2025
8 checks passed
@tchap tchap deleted the drop-fatal-logging branch February 13, 2025 08:44
coolkp pushed a commit to coolkp/llm-instance-gateway that referenced this pull request Feb 13, 2025
All Fatal log calls are removed. Also, all runnable components
are now managed by controller-runtime, implementing the
manager.Runnable interface.