
output kube-system logs from workload clusters #1121

Merged: 1 commit merged into kubernetes-sigs:master from the cp-logs branch on Jan 20, 2021

Conversation

@devigned (Contributor) commented Jan 13, 2021

What type of PR is this?
/kind feature

What this PR does / why we need it:
It's tough to diagnose what's happening in some flaky e2e tests. By collecting kube-system logs we can get a better idea of why tests are failing.

This PR intercepts the call to CollectWorkloadClusterLogs so that it can inject code to pull all of the logs for the kube-system pods.

It also raises the controller manager log verbosity slightly (v=4).
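
For readers following along, here is a minimal sketch of the interception pattern described above, assuming the CAPI test framework's ClusterProxy interface; AzureClusterProxy and collectKubeSystemLogs are illustrative names, not necessarily the PR's exact code:

```go
package e2e

import (
	"context"

	"sigs.k8s.io/cluster-api/test/framework"
)

// AzureClusterProxy embeds the framework's ClusterProxy and overrides
// CollectWorkloadClusterLogs so every call also captures kube-system pod logs.
type AzureClusterProxy struct {
	framework.ClusterProxy
}

func (acp *AzureClusterProxy) CollectWorkloadClusterLogs(ctx context.Context, namespace, name, outputPath string) {
	// Keep the framework's default per-machine log collection...
	acp.ClusterProxy.CollectWorkloadClusterLogs(ctx, namespace, name, outputPath)
	// ...then additionally dump logs for every pod in kube-system.
	// collectKubeSystemLogs is a hypothetical helper; a sketch of its body
	// appears later in this thread.
	acp.collectKubeSystemLogs(ctx, namespace, name, outputPath)
}
```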

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Collect kube-system logs for workload clusters in e2e tests

@k8s-ci-robot added the release-note, kind/feature, cncf-cla: yes, and area/provider/azure labels on Jan 13, 2021
@k8s-ci-robot added the sig/cluster-lifecycle and size/M labels on Jan 13, 2021
@CecileRobertMichon (Contributor) commented Jan 13, 2021

why not use the AzureLogCollector for this instead of replacing the CollectWorkloadClusterLogs func?

edit: this might not be possible, just wondering if it is...

```go
aboveMachinesPath := strings.Replace(outputPath, "/machines", "", 1)
workload := acp.GetWorkloadCluster(ctx, namespace, name)
pods := &corev1.PodList{}
Expect(workload.GetClient().List(ctx, pods, client.InNamespace(kubesystem))).To(Succeed())
```
Contributor:
none of this is CAPZ specific, and it would be useful to have kube-system pod logs for all CAPI clusters; would it make sense to make this change directly in the framework's log collector instead?

@devigned (Contributor Author):

It would, but I want it today, not in the next CAPI release :).

I will gladly open a similar PR in CAPI.
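
To make the quoted snippet above concrete, here is a sketch of how the listed kube-system pods' logs could then be written to disk. It assumes the proxy exposes a client-go Clientset via GetClientSet plus the usual corev1, io, os, and path/filepath imports; error handling is deliberately best-effort, and this is illustrative rather than the PR's exact code:

```go
clientSet := workload.GetClientSet()
for _, pod := range pods.Items {
	for _, container := range pod.Spec.Containers {
		// Mirror the machine-log layout: <aboveMachinesPath>/kube-system/<pod>/<container>.log
		logFile := filepath.Join(aboveMachinesPath, kubesystem, pod.Name, container.Name+".log")
		if err := os.MkdirAll(filepath.Dir(logFile), 0755); err != nil {
			continue // best effort: skip containers whose output directory cannot be created
		}
		req := clientSet.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{Container: container.Name})
		stream, err := req.Stream(ctx)
		if err != nil {
			continue // best effort: skip containers whose logs cannot be fetched
		}
		if out, err := os.Create(logFile); err == nil {
			_, _ = io.Copy(out, stream) // copy the container's log stream to disk
			out.Close()
		}
		stream.Close()
	}
}
```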

@devigned (Contributor Author) commented Jan 14, 2021

> why not use the AzureLogCollector for this instead of replacing the CollectWorkloadClusterLogs func?

@CecileRobertMichon, great idea. That was the first thing I tried, since it seemed like the right extension point for this functionality. Unfortunately, the ClusterLogCollector interface doesn't provide a func that fits log collection at the cluster level; it is targeted at machine log collection, as evidenced by its only func, CollectMachineLog.

https://github.com/kubernetes-sigs/cluster-api/blob/daba8fea8d536b896c2ac28d97d97385cb3f3e71/test/framework/cluster_proxy.go#L80-L85

I'd like to add another hook to this collector for collecting cluster-level logs. Possibly also a hook for setting up namespace watches when the cluster first starts, so that the collector could begin listening to logs (or do any other logging setup) before waiting for the workload cluster to be built and all of the workloads to reach a ready state, as happens today with clusterctl.ApplyClusterTemplateAndWait.
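
A rough sketch of what those extra hooks could look like, assuming the usual clusterv1 and controller-runtime client imports; only CollectMachineLog exists in the framework today, and the two added methods and their signatures are hypothetical:

```go
type ClusterLogCollector interface {
	// CollectMachineLog is the hook the framework provides today.
	CollectMachineLog(ctx context.Context, managementClusterClient client.Client, m *clusterv1.Machine, outputPath string) error
	// CollectClusterLog (hypothetical) would gather cluster-scoped output,
	// e.g. logs for every pod in kube-system.
	CollectClusterLog(ctx context.Context, workloadClusterClient client.Client, cluster *clusterv1.Cluster, outputPath string) error
	// WatchNamespaces (hypothetical) would start following logs in the given
	// namespaces as soon as the cluster comes up, before
	// clusterctl.ApplyClusterTemplateAndWait returns.
	WatchNamespaces(ctx context.Context, workloadClusterClient client.Client, namespaces []string) error
}
```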

wdyt?

@CecileRobertMichon (Contributor):

Sounds great. Let's do both: this PR can help us short term to debug the test failures, and we can add the hook to the CAPI framework in parallel.

Might also be cool to get/describe the pods in kube-system so we get a quick view of whether anything was in a crashloop or error state.
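
A quick sketch of that idea (not part of this PR): list the kube-system pods and flag any container that is crashlooping or exited abnormally, reusing the workload proxy and Ginkgo-style assertions from the snippet earlier in the thread:

```go
pods := &corev1.PodList{}
Expect(workload.GetClient().List(ctx, pods, client.InNamespace(metav1.NamespaceSystem))).To(Succeed())
for _, pod := range pods.Items {
	for _, cs := range pod.Status.ContainerStatuses {
		// A container stuck in CrashLoopBackOff shows up in its waiting state.
		if w := cs.State.Waiting; w != nil && w.Reason == "CrashLoopBackOff" {
			fmt.Printf("pod %s container %s is in CrashLoopBackOff: %s\n", pod.Name, cs.Name, w.Message)
		}
		// A container that exited abnormally shows up as terminated with reason "Error".
		if t := cs.State.Terminated; t != nil && t.Reason == "Error" {
			fmt.Printf("pod %s container %s terminated with exit code %d\n", pod.Name, cs.Name, t.ExitCode)
		}
	}
}
```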

@k8s-ci-robot added the size/L label and removed the size/M label on Jan 14, 2021
@devigned (Contributor Author):

/retest

@devigned (Contributor Author):

/test pull-cluster-api-provider-azure-e2e

@k8s-ci-robot added the size/M label and removed the size/L label on Jan 14, 2021
@devigned (Contributor Author):

pull-cluster-api-provider-azure-e2e failed due to GatewayTimeout and 504 errors. Seems really odd. Retrying.

```
I0114 20:24:21.774423       1 node_lifecycle_controller.go:1429] Initializing eviction metric for zone: eastus2:�:2
I0114 20:24:21.774484       1 node_lifecycle_controller.go:1429] Initializing eviction metric for zone: eastus2:�:3
I0114 20:24:21.774815       1 event.go:291] "Event occurred" object="capz-e2e-9m4soa-mp-0000003" kind="Node" apiVersion="v1" type="Normal" reason="RegisteredNode" message="Node capz-e2e-9m4soa-mp-0000003 event: Registered Node capz-e2e-9m4soa-mp-0000003 in Controller"
W0114 20:24:21.774956       1 node_lifecycle_controller.go:1044] Missing timestamp for Node capz-e2e-9m4soa-mp-0000002. Assuming now as a timestamp.
W0114 20:24:21.775031       1 node_lifecycle_controller.go:1044] Missing timestamp for Node capz-e2e-9m4soa-mp-0000003. Assuming now as a timestamp.
I0114 20:24:21.775098       1 node_lifecycle_controller.go:1245] Controller detected that zone eastus2:�:3 is now in state FullDisruption.
I0114 20:24:21.775114       1 node_lifecycle_controller.go:1245] Controller detected that zone eastus2:�:2 is now in state FullDisruption.
I0114 20:24:41.801199       1 node_lifecycle_controller.go:1245] Controller detected that zone eastus2:�:3 is now in state Normal.
I0114 20:24:41.801309       1 node_lifecycle_controller.go:1245] Controller detected that zone eastus2:�:2 is now in state Normal.
I0114 20:25:53.762842       1 event.go:291] "Event occurred" object="default/web" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set web-5d45b7f96d to 1"
I0114 20:25:53.785611       1 event.go:291] "Event occurred" object="default/web-5d45b7f96d" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: web-5d45b7f96d-9zlhf"
I0114 20:26:13.914021       1 event.go:291] "Event occurred" object="default/web-ilb" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0114 20:30:47.555451       1 azure_backoff.go:422] CreateOrUpdateVMSS: error CreateOrUpdate vmss(capz-e2e-9m4soa-mp-0): &{true 504 0001-01-01 00:00:00 +0000 UTC Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: {"error":{"code":"ResourceReadFailed","target":"capz-e2e-9m4soa-mp-0","message":"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/===REDACTED===/resourceGroups/capz-e2e-9m4soa/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-9m4soa-mp-0?api-version=2019-07-01' failed with status 'GatewayTimeout'."}}}
E0114 20:30:47.555499       1 azure_vmss.go:1179] ensureVMSSInPool CreateOrUpdateVMSS(capz-e2e-9m4soa-mp-0) with new backendPoolID /subscriptions/===REDACTED===/resourceGroups/capz-e2e-9m4soa/providers/Microsoft.Network/loadBalancers/capz-e2e-9m4soa-internal/backendAddressPools/capz-e2e-9m4soa, err: <nil>
E0114 20:30:47.555553       1 azure_loadbalancer.go:162] reconcileLoadBalancer(default/web-ilb) failed: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: {"error":{"code":"ResourceReadFailed","target":"capz-e2e-9m4soa-mp-0","message":"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/===REDACTED===/resourceGroups/capz-e2e-9m4soa/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-9m4soa-mp-0?api-version=2019-07-01' failed with status 'GatewayTimeout'."}}
E0114 20:30:47.555600       1 controller.go:275] error processing service default/web-ilb (will retry): failed to ensure load balancer: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: {"error":{"code":"ResourceReadFailed","target":"capz-e2e-9m4soa-mp-0","message":"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/===REDACTED===/resourceGroups/capz-e2e-9m4soa/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-9m4soa-mp-0?api-version=2019-07-01' failed with status 'GatewayTimeout'."}}
I0114 20:30:47.556263       1 event.go:291] "Event occurred" object="default/web-ilb" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: {\"error\":{\"code\":\"ResourceReadFailed\",\"target\":\"capz-e2e-9m4soa-mp-0\",\"message\":\"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/===REDACTED===/resourceGroups/capz-e2e-9m4soa/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-9m4soa-mp-0?api-version=2019-07-01' failed with status 'GatewayTimeout'.\"}}"
I0114 20:30:52.556306       1 event.go:291] "Event occurred" object="default/web-ilb" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
```

@CecileRobertMichon (Contributor):

/retest

last failure was unrelated to LB flake

@devigned (Contributor Author):

/retest

@k8s-ci-robot added the size/L label and removed the size/M label on Jan 15, 2021
@devigned force-pushed the cp-logs branch 3 times, most recently from bdd0685 to 2cd905e on January 15, 2021 at 01:12
@devigned (Contributor Author):

/retest

@CecileRobertMichon (Contributor):

updated k8s to 1.19.7 in #1126

@devigned (Contributor Author):

I'm going to trim out all of the testing related stuff that was added to this PR.

/hold

@k8s-ci-robot added the do-not-merge/hold and needs-rebase labels on Jan 15, 2021
@k8s-ci-robot removed the needs-rebase label on Jan 19, 2021
@devigned (Contributor Author):

/hold cancel

Trimmed down the PR so that it outputs verbose cloud provider logs and gathers all of kube-system for the workload clusters.

/assign @CecileRobertMichon @nader-ziada

@k8s-ci-robot removed the do-not-merge/hold label on Jan 19, 2021
@nader-ziada (Contributor):

/lgtm

@k8s-ci-robot added the lgtm label on Jan 19, 2021
@devigned (Contributor Author):

/retest

@CecileRobertMichon (Contributor):

/lgtm
/approve

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jan 20, 2021
@devigned
Copy link
Contributor Author

/retest

2 similar comments
@devigned
Copy link
Contributor Author

/retest

@devigned
Copy link
Contributor Author

/retest

@k8s-ci-robot merged commit a90bcc0 into kubernetes-sigs:master on Jan 20, 2021
@k8s-ci-robot added this to the v0.4.11 milestone on Jan 20, 2021
@devigned deleted the cp-logs branch on January 20, 2021 at 13:14
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/provider/azure: Issues or PRs related to azure provider.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.