Remove fatal log calls in executable code #265

Merged
merged 1 commit into kubernetes-sigs:main from drop-fatal-logging
Feb 13, 2025

Conversation

@tchap (Contributor) commented Jan 30, 2025

This is related to #251

K8s logging guidelines don't allow fatal logging calls. They could be replaced by the documented alternatives, but returning errors properly is a cleaner solution, and that is what this PR implements.

klog.ErrorS is used instead of fatal calls.

errgroup.Group is used to manage goroutines and terminate on the first error.
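
Roughly, the pattern looks like this (a simplified sketch, not the actual code in this PR; runExtProcServer is a hypothetical stand-in):

package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"os/signal"
	"syscall"

	"golang.org/x/sync/errgroup"
	"k8s.io/klog/v2"
)

func runExtProcServer(ctx context.Context, port int) error {
	lis, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
	if err != nil {
		// Previously a fatal log here would exit the process immediately;
		// now the error is logged with ErrorS and returned to the caller.
		klog.ErrorS(err, "Ext-proc server failed to listen", "port", port)
		return err
	}
	defer lis.Close()
	<-ctx.Done() // serve until the shared context is canceled
	return nil
}

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return runExtProcServer(ctx, 9002) })
	// Additional servers (health, metrics) would be started the same way.

	if err := g.Wait(); err != nil {
		os.Exit(1) // non-zero exit on failure
	}
}

On the first error (or a signal), the shared context is canceled, the remaining goroutines wind down, and the process exits non-zero, which is what the failing-to-bind run below shows.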

Testing

Regarding testing, I don't have the infrastructure to run more than unit tests right now, but I built the executable and tested it manually in the console, both for the signal-received scenario and for failing to bind to a port. If there is a better way to test this, let me know.

Signal

$ ./ext-proc -poolName vllm-llama2-7b-pool -v 3 -serviceName vllm-llama2-7b-pool -grpcPort 9002 -grpcHealthPort 9003
79487 main.go:110] Flags: add_dir_header=false; alsologtostderr=false; grpcHealthPort=9003; grpcPort=9002; kubeconfig=; log_backtrace_at=:0; log_dir=; log_file=; log_file_max_size=1800; logtostderr=true; metricsPort=9090; one_output=false; poolName=vllm-llama2-7b-pool; poolNamespace=default; refreshMetricsInterval=50ms; refreshPodsInterval=10s; serviceName=vllm-llama2-7b-pool; skip_headers=false; skip_log_headers=false; stderrthreshold=2; targetPodHeader=target-pod; v=3; vmodule=; zone=;
I0131 00:27:28.161098   79487 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0131 00:27:28.161112   79487 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
I0131 00:27:28.161117   79487 envvar.go:172] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
I0131 00:27:28.161121   79487 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I0131 00:27:28.163076   79487 main.go:159] "Health server starting..."
I0131 00:27:28.163480   79487 runserver.go:120] "Ext-proc server starting..."
I0131 00:27:28.163497   79487 runserver.go:178] "Controller manager starting..."
I0131 00:27:28.163661   79487 main.go:195] "Metrics HTTP handler starting..."
I0131 00:27:28.163931   79487 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"
I0131 00:27:28.164167   79487 runserver.go:133] "Ext-proc server listening" port=9002
I0131 00:27:28.164194   79487 main.go:168] "Health server listening" port=9003
I0131 00:27:28.164299   79487 main.go:204] "Metrics HTTP handler listening" port=9090
I0131 00:27:28.165215   79487 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8080" secure=false
I0131 00:27:28.165704   79487 provider.go:68] Initialized pods and metrics: []
I0131 00:27:28.165868   79487 controller.go:198] "Starting EventSource" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice" source="kind source: *v1.EndpointSlice"
I0131 00:27:28.166035   79487 controller.go:198] "Starting EventSource" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel" source="kind source: *v1alpha1.InferenceModel"
I0131 00:27:28.166051   79487 controller.go:198] "Starting EventSource" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool" source="kind source: *v1alpha1.InferencePool"
I0131 00:27:28.185116   79487 reflector.go:313] Starting reflector *v1.EndpointSlice (10h55m52.073023241s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
I0131 00:27:28.185136   79487 reflector.go:349] Listing and watching *v1.EndpointSlice from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
E0131 00:27:28.186356   79487 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"InferencePool\" in version \"inference.networking.x-k8s.io/v1alpha1\"" logger="controller-runtime.source.EventHandler" kind="InferencePool.inference.networking.x-k8s.io"
E0131 00:27:28.189261   79487 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"InferenceModel\" in version \"inference.networking.x-k8s.io/v1alpha1\"" logger="controller-runtime.source.EventHandler" kind="InferenceModel.inference.networking.x-k8s.io"
I0131 00:27:28.190927   79487 reflector.go:376] Caches populated for *v1.EndpointSlice from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
I0131 00:27:28.285202   79487 controller.go:233] "Starting Controller" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:27:28.285251   79487 controller.go:242] "Starting workers" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice" worker count=1
^CI0131 00:27:32.013831   79487 runserver.go:155] "Ext-proc server shutting down..."
I0131 00:27:32.013838   79487 main.go:176] "Health server shutting down..."
I0131 00:27:32.013849   79487 internal.go:538] "Stopping and waiting for non leader election runnables"
I0131 00:27:32.013866   79487 main.go:223] "Metrics HTTP handler shutting down..."
I0131 00:27:32.013884   79487 internal.go:542] "Stopping and waiting for leader election runnables"
I0131 00:27:32.013909   79487 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:27:32.013951   79487 controller.go:233] "Starting Controller" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:27:32.013957   79487 main.go:233] "Metrics HTTP handler terminated"
I0131 00:27:32.013972   79487 controller.go:233] "Starting Controller" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:27:32.013979   79487 controller.go:242] "Starting workers" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool" worker count=1
I0131 00:27:32.013999   79487 controller.go:264] "All workers finished" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:27:32.014003   79487 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:27:32.014024   79487 runserver.go:166] "Ext-proc server terminated"
I0131 00:27:32.014054   79487 controller.go:264] "All workers finished" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:27:32.014006   79487 controller.go:242] "Starting workers" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel" worker count=1
I0131 00:27:32.014116   79487 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:27:32.014073   79487 main.go:186] "Health server terminated"
I0131 00:27:32.014169   79487 controller.go:264] "All workers finished" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:27:32.014203   79487 internal.go:550] "Stopping and waiting for caches"
I0131 00:27:32.014353   79487 reflector.go:319] Stopping reflector *v1.EndpointSlice (10h55m52.073023241s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
I0131 00:27:32.014425   79487 internal.go:554] "Stopping and waiting for webhooks"
I0131 00:27:32.014457   79487 internal.go:557] "Stopping and waiting for HTTP servers"
I0131 00:27:32.014486   79487 server.go:254] "Shutting down metrics server with timeout of 1 minute" logger="controller-runtime.metrics"
I0131 00:27:32.014551   79487 internal.go:561] "Wait completed, proceeding to shutdown the manager"
I0131 00:27:32.014570   79487 runserver.go:184] "Controller manager terminated"
I0131 00:27:32.014585   79487 main.go:152] "All components terminated"

$ echo $?
0

Failing to bind

$ ./ext-proc -poolName vllm-llama2-7b-pool -v 3 -serviceName vllm-llama2-7b-pool -grpcPort 9002 -grpcHealthPort 9002
78423 main.go:110] Flags: add_dir_header=false; alsologtostderr=false; grpcHealthPort=9002; grpcPort=9002; kubeconfig=; log_backtrace_at=:0; log_dir=; log_file=; log_file_max_size=1800; logtostderr=true; metricsPort=9090; one_output=false; poolName=vllm-llama2-7b-pool; poolNamespace=default; refreshMetricsInterval=50ms; refreshPodsInterval=10s; serviceName=vllm-llama2-7b-pool; skip_headers=false; skip_log_headers=false; stderrthreshold=2; targetPodHeader=target-pod; v=3; vmodule=; zone=;
I0131 00:24:40.837170   78423 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
I0131 00:24:40.837184   78423 envvar.go:172] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
I0131 00:24:40.837190   78423 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I0131 00:24:40.837195   78423 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0131 00:24:40.837500   78423 main.go:159] "Health server starting..."
I0131 00:24:40.837507   78423 runserver.go:178] "Controller manager starting..."
I0131 00:24:40.837526   78423 runserver.go:120] "Ext-proc server starting..."
I0131 00:24:40.837599   78423 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"
I0131 00:24:40.837640   78423 main.go:195] "Metrics HTTP handler starting..."
I0131 00:24:40.838002   78423 main.go:168] "Health server listening" port=9002
E0131 00:24:40.838056   78423 runserver.go:126] "Ext-proc server failed to listen" err="listen tcp :9002: bind: address already in use" port=9002
I0131 00:24:40.838100   78423 main.go:204] "Metrics HTTP handler listening" port=9090
I0131 00:24:40.838172   78423 main.go:176] "Health server shutting down..."
I0131 00:24:40.838113   78423 internal.go:538] "Stopping and waiting for non leader election runnables"
I0131 00:24:40.838190   78423 controller.go:198] "Starting EventSource" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool" source="kind source: *v1alpha1.InferencePool"
I0131 00:24:40.838195   78423 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8080" secure=false
I0131 00:24:40.838241   78423 main.go:186] "Health server terminated"
I0131 00:24:40.838377   78423 internal.go:542] "Stopping and waiting for leader election runnables"
I0131 00:24:40.838549   78423 controller.go:198] "Starting EventSource" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice" source="kind source: *v1.EndpointSlice"
I0131 00:24:40.838574   78423 controller.go:233] "Starting Controller" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:24:40.838606   78423 controller.go:242] "Starting workers" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool" worker count=1
I0131 00:24:40.838622   78423 controller.go:233] "Starting Controller" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:24:40.838609   78423 controller.go:198] "Starting EventSource" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel" source="kind source: *v1alpha1.InferenceModel"
I0131 00:24:40.838645   78423 controller.go:242] "Starting workers" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel" worker count=1
I0131 00:24:40.838626   78423 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:24:40.838609   78423 controller.go:233] "Starting Controller" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:24:40.838699   78423 controller.go:242] "Starting workers" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice" worker count=1
I0131 00:24:40.838671   78423 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:24:40.838742   78423 controller.go:264] "All workers finished" controller="inferencepool" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferencePool"
I0131 00:24:40.838792   78423 controller.go:264] "All workers finished" controller="inferencemodel" controllerGroup="inference.networking.x-k8s.io" controllerKind="InferenceModel"
I0131 00:24:40.838762   78423 controller.go:262] "Shutdown signal received, waiting for all workers to finish" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:24:40.838814   78423 controller.go:264] "All workers finished" controller="endpointslice" controllerGroup="discovery.k8s.io" controllerKind="EndpointSlice"
I0131 00:24:40.838825   78423 internal.go:550] "Stopping and waiting for caches"
I0131 00:24:40.839869   78423 main.go:223] "Metrics HTTP handler shutting down..."
I0131 00:24:40.839906   78423 main.go:233] "Metrics HTTP handler terminated"
E0131 00:24:40.852622   78423 kind.go:76] "failed to get informer from cache" err="unable to retrieve the complete list of server APIs: inference.networking.x-k8s.io/v1alpha1: no matches for inference.networking.x-k8s.io/v1alpha1, Resource=" logger="controller-runtime.source.EventHandler"
E0131 00:24:40.852710   78423 kind.go:76] "failed to get informer from cache" err="Timeout: failed waiting for *v1.EndpointSlice Informer to sync" logger="controller-runtime.source.EventHandler"
I0131 00:24:40.852796   78423 reflector.go:313] Starting reflector *v1.EndpointSlice (9h18m22.86490117s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
I0131 00:24:40.852813   78423 reflector.go:319] Stopping reflector *v1.EndpointSlice (9h18m22.86490117s) from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
E0131 00:24:40.855061   78423 kind.go:71] "if kind is a CRD, it should be installed before calling Start" err="no matches for kind \"InferenceModel\" in version \"inference.networking.x-k8s.io/v1alpha1\"" logger="controller-runtime.source.EventHandler" kind="InferenceModel.inference.networking.x-k8s.io"
I0131 00:24:40.855079   78423 internal.go:554] "Stopping and waiting for webhooks"
I0131 00:24:40.855097   78423 internal.go:557] "Stopping and waiting for HTTP servers"
I0131 00:24:40.855423   78423 server.go:254] "Shutting down metrics server with timeout of 1 minute" logger="controller-runtime.metrics"
I0131 00:24:40.855772   78423 internal.go:561] "Wait completed, proceeding to shutdown the manager"
I0131 00:24:40.855794   78423 runserver.go:184] "Controller manager terminated"
I0131 00:24:40.855826   78423 main.go:152] "All components terminated"

$ echo $?
1

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 30, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and kfswain January 30, 2025 23:31
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 30, 2025
@k8s-ci-robot (Contributor) commented:

Hi @tchap. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 30, 2025
netlify bot commented Jan 30, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: faa2c5e
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67ad3c2abae2fc0009ecf96e
😎 Deploy Preview: https://deploy-preview-265--gateway-api-inference-extension.netlify.app

@ahg-g (Contributor) commented Jan 31, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 31, 2025
@danehans (Contributor) commented Feb 3, 2025

@tchap please excuse our delay as we are preparing to release v0.1.0-rc1.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 3, 2025
@tchap tchap force-pushed the drop-fatal-logging branch from e7ac50e to 3471446 Compare February 3, 2025 21:37
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 3, 2025
@tchap (Contributor, Author) commented Feb 5, 2025

Is there actually any reason why all the "runnables" are not being managed by controller-runtime? It would be nice to just wrap everything in the Start(ctx context.Context) error runnable interface and pass it all to the manager. Or is there anything I am not seeing right now?
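
Roughly what I have in mind (a simplified sketch, not actual code from this repo; healthServer is a hypothetical stand-in):

package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// healthServer is a hypothetical stand-in for one of the extra servers.
type healthServer struct{}

// Start blocks until ctx is canceled, which is exactly the contract
// controller-runtime's manager.Runnable interface expects.
func (s *healthServer) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// Anything with Start(ctx context.Context) error can be handed to the manager.
	if err := mgr.Add(&healthServer{}); err != nil {
		panic(err)
	}
	// A plain function works too, wrapped in manager.RunnableFunc.
	if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		<-ctx.Done()
		return nil
	})); err != nil {
		panic(err)
	}

	// The manager owns the lifecycle of every registered runnable and stops
	// them all when the signal handler cancels the context.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}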

@danehans (Contributor) commented Feb 6, 2025

d7fdc8f added the runnables.

@benjamin do you have any insight that you can share?

I'm in favor of wrapping the other servers in controller-runtime's Start() as long as all goroutines are cleaned up properly.
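
One way to keep those goroutines accounted for would be something like this (just a sketch under that assumption; grpcRunnable is a hypothetical name, not code from this PR):

package main

import (
	"context"
	"net"
	"time"

	"google.golang.org/grpc"
)

// grpcRunnable is a hypothetical wrapper; Start owns the serving goroutine
// and drains it before returning, so nothing leaks on shutdown.
type grpcRunnable struct {
	addr string
	srv  *grpc.Server
}

func (r *grpcRunnable) Start(ctx context.Context) error {
	lis, err := net.Listen("tcp", r.addr)
	if err != nil {
		return err // surfaces through the caller instead of a fatal log
	}

	errCh := make(chan error, 1)
	go func() { errCh <- r.srv.Serve(lis) }()

	select {
	case err := <-errCh:
		return err // Serve failed on its own
	case <-ctx.Done():
		r.srv.GracefulStop() // stop accepting, let in-flight RPCs finish
		return <-errCh       // wait for the serving goroutine to exit
	}
}

func main() {
	// Run the server briefly and then shut it down cleanly.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	r := &grpcRunnable{addr: ":0", srv: grpc.NewServer()}
	if err := r.Start(ctx); err != nil {
		panic(err)
	}
}

When ctx is canceled, GracefulStop causes Serve to return, so Start only returns once the serving goroutine has actually exited.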

@tchap tchap force-pushed the drop-fatal-logging branch from 3471446 to bed112a Compare February 6, 2025 10:20
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2025
@tchap tchap force-pushed the drop-fatal-logging branch 3 times, most recently from 050c6ae to 78b367f Compare February 6, 2025 12:01
@tchap (Contributor, Author) commented Feb 6, 2025

Ok, everything is being managed by controller-runtime now.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2025
@tchap tchap force-pushed the drop-fatal-logging branch from db1aafa to 03f317a Compare February 7, 2025 07:16
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 7, 2025
@tchap tchap force-pushed the drop-fatal-logging branch from 03f317a to 4666c76 Compare February 12, 2025 23:33
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 12, 2025
@tchap (Contributor, Author) commented Feb 12, 2025

I would really like to get this merged so that I can continue refactoring logging. I've been rebasing this quite a lot 😅

@tchap (Contributor, Author) commented Feb 12, 2025

I have to say I am not entirely in love with how the manager is created in runserver and then used in main. I would like to try to improve that in a subsequent PR, but I didn't want to rewrite everything here since the PR already targets multiple things.

All Fatal log calls are removed. Also, all runnable components
are now managed by controller-runtime, implementing the
manager.Runnable interface.
@tchap tchap force-pushed the drop-fatal-logging branch from 4666c76 to faa2c5e Compare February 13, 2025 00:26
@ahg-g (Contributor) commented Feb 13, 2025

Thanks for doing this, very neat!

I ran an e2e test on this PR and also checked the metrics.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, tchap

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 0662f1f into kubernetes-sigs:main Feb 13, 2025
8 checks passed
@tchap tchap deleted the drop-fatal-logging branch February 13, 2025 08:44
coolkp pushed a commit to coolkp/llm-instance-gateway that referenced this pull request Feb 13, 2025
All Fatal log calls are removed. Also, all runnable components
are now managed by controller-runtime, implementing the
manager.Runnable interface.