
output kube-system logs from workload clusters #1121

Merged: 1 commit merged into kubernetes-sigs:master from the cp-logs branch on Jan 20, 2021

Conversation

@devigned (Contributor) commented Jan 13, 2021

What type of PR is this?
/kind feature

What this PR does / why we need it:
It's tough to diagnose what's happening in some flaky e2e tests. By collecting kube-system logs we can get a better idea of why tests are failing.

This PR intercepts the call to CollectWorkloadClusterLogs so that it can inject code to pull all of the logs for the kube-system pods.

It also raises the controller manager log verbosity slightly (v=4).
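
For readers following along, here is a minimal sketch of the interception pattern described above, assuming the CAPI test framework's ClusterProxy interface; AzureClusterProxy and collectKubeSystemLogs are illustrative names, not necessarily the PR's exact code:

```go
package e2e

import (
	"context"

	"sigs.k8s.io/cluster-api/test/framework"
)

// AzureClusterProxy embeds the framework's ClusterProxy and overrides
// CollectWorkloadClusterLogs so every call also captures kube-system pod logs.
type AzureClusterProxy struct {
	framework.ClusterProxy
}

func (acp *AzureClusterProxy) CollectWorkloadClusterLogs(ctx context.Context, namespace, name, outputPath string) {
	// Keep the framework's default per-machine log collection...
	acp.ClusterProxy.CollectWorkloadClusterLogs(ctx, namespace, name, outputPath)
	// ...then additionally dump logs for every pod in kube-system.
	// collectKubeSystemLogs is a hypothetical helper; a sketch of its body
	// appears later in this thread.
	acp.collectKubeSystemLogs(ctx, namespace, name, outputPath)
}
```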

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Collect kube-system logs for workload clusters in e2e tests

@k8s-ci-robot added the release-note, kind/feature, cncf-cla: yes, and area/provider/azure labels on Jan 13, 2021
@k8s-ci-robot added the sig/cluster-lifecycle and size/M labels on Jan 13, 2021
@CecileRobertMichon (Contributor) commented Jan 13, 2021

why not use the AzureLogCollector for this instead of replacing the CollectWorkloadClusterLogs func?

edit: this might not be possible, just wondering if it is...

```go
aboveMachinesPath := strings.Replace(outputPath, "/machines", "", 1)
workload := acp.GetWorkloadCluster(ctx, namespace, name)
pods := &corev1.PodList{}
Expect(workload.GetClient().List(ctx, pods, client.InNamespace(kubesystem))).To(Succeed())
```
Contributor:
none of this is CAPZ specific, and it would be useful to have kube-system pod logs for all CAPI clusters; would it make sense to make this change directly in the framework's log collector instead?

@devigned (Contributor Author):

It would, but I want it today, not in the next CAPI release :).

I will gladly open a similar PR in CAPI.
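
To make the quoted snippet above concrete, here is a sketch of how the listed kube-system pods' logs could then be written to disk. It assumes the proxy exposes a client-go Clientset via GetClientSet plus the usual corev1, io, os, and path/filepath imports; error handling is deliberately best-effort, and this is illustrative rather than the PR's exact code:

```go
clientSet := workload.GetClientSet()
for _, pod := range pods.Items {
	for _, container := range pod.Spec.Containers {
		// Mirror the machine-log layout: <aboveMachinesPath>/kube-system/<pod>/<container>.log
		logFile := filepath.Join(aboveMachinesPath, kubesystem, pod.Name, container.Name+".log")
		if err := os.MkdirAll(filepath.Dir(logFile), 0755); err != nil {
			continue // best effort: skip containers whose output directory cannot be created
		}
		req := clientSet.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{Container: container.Name})
		stream, err := req.Stream(ctx)
		if err != nil {
			continue // best effort: skip containers whose logs cannot be fetched
		}
		if out, err := os.Create(logFile); err == nil {
			_, _ = io.Copy(out, stream) // copy the container's log stream to disk
			out.Close()
		}
		stream.Close()
	}
}
```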

@devigned (Contributor Author) commented Jan 14, 2021

> why not use the AzureLogCollector for this instead of replacing the CollectWorkloadClusterLogs func?

@CecileRobertMichon, great idea. That was the first thing I tried, since it seemed like the right extension point for this functionality. Unfortunately, the ClusterLogCollector interface doesn't provide a func that fits log collection at the cluster level; it is targeted at machine log collection, as evidenced by its only func, CollectMachineLog.

https://github.com/kubernetes-sigs/cluster-api/blob/daba8fea8d536b896c2ac28d97d97385cb3f3e71/test/framework/cluster_proxy.go#L80-L85

I'd like to add another hook to this collector for collecting cluster-level logs. Possibly also a hook for setting up namespace watches when the cluster first starts, so that the collector could begin listening to logs (or do any other logging setup) before waiting for the workload cluster to be built and all of the workloads to reach a ready state, as happens today with clusterctl.ApplyClusterTemplateAndWait.
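
A rough sketch of what those extra hooks could look like, assuming the usual clusterv1 and controller-runtime client imports; only CollectMachineLog exists in the framework today, and the two added methods and their signatures are hypothetical:

```go
type ClusterLogCollector interface {
	// CollectMachineLog is the hook the framework provides today.
	CollectMachineLog(ctx context.Context, managementClusterClient client.Client, m *clusterv1.Machine, outputPath string) error
	// CollectClusterLog (hypothetical) would gather cluster-scoped output,
	// e.g. logs for every pod in kube-system.
	CollectClusterLog(ctx context.Context, workloadClusterClient client.Client, cluster *clusterv1.Cluster, outputPath string) error
	// WatchNamespaces (hypothetical) would start following logs in the given
	// namespaces as soon as the cluster comes up, before
	// clusterctl.ApplyClusterTemplateAndWait returns.
	WatchNamespaces(ctx context.Context, workloadClusterClient client.Client, namespaces []string) error
}
```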

wdyt?

@CecileRobertMichon (Contributor):

Sounds great. Let's do both: this PR can help us short term to debug the test failures, and we can add the hook to the CAPI framework in parallel.

Might also be cool to get/describe the pods in kube-system so we get a quick view of whether anything was in a crashloop or error state.
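
A quick sketch of that idea (not part of this PR): list the kube-system pods and flag any container that is crashlooping or exited abnormally, reusing the workload proxy and Ginkgo-style assertions from the snippet earlier in the thread:

```go
pods := &corev1.PodList{}
Expect(workload.GetClient().List(ctx, pods, client.InNamespace(metav1.NamespaceSystem))).To(Succeed())
for _, pod := range pods.Items {
	for _, cs := range pod.Status.ContainerStatuses {
		// A container stuck in CrashLoopBackOff shows up in its waiting state.
		if w := cs.State.Waiting; w != nil && w.Reason == "CrashLoopBackOff" {
			fmt.Printf("pod %s container %s is in CrashLoopBackOff: %s\n", pod.Name, cs.Name, w.Message)
		}
		// A container that exited abnormally shows up as terminated with reason "Error".
		if t := cs.State.Terminated; t != nil && t.Reason == "Error" {
			fmt.Printf("pod %s container %s terminated with exit code %d\n", pod.Name, cs.Name, t.ExitCode)
		}
	}
}
```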

@k8s-ci-robot added the size/L label and removed the size/M label on Jan 14, 2021
@devigned (Contributor Author):

/retest

@devigned (Contributor Author):

/test pull-cluster-api-provider-azure-e2e

@k8s-ci-robot added the size/M label and removed the size/L label on Jan 14, 2021
@devigned (Contributor Author):

pull-cluster-api-provider-azure-e2e failed due to GatewayTimeout and 504 errors. Seems really odd. Retrying.

```
I0114 20:24:21.774423       1 node_lifecycle_controller.go:1429] Initializing eviction metric for zone: eastus2:�:2
I0114 20:24:21.774484       1 node_lifecycle_controller.go:1429] Initializing eviction metric for zone: eastus2:�:3
I0114 20:24:21.774815       1 event.go:291] "Event occurred" object="capz-e2e-9m4soa-mp-0000003" kind="Node" apiVersion="v1" type="Normal" reason="RegisteredNode" message="Node capz-e2e-9m4soa-mp-0000003 event: Registered Node capz-e2e-9m4soa-mp-0000003 in Controller"
W0114 20:24:21.774956       1 node_lifecycle_controller.go:1044] Missing timestamp for Node capz-e2e-9m4soa-mp-0000002. Assuming now as a timestamp.
W0114 20:24:21.775031       1 node_lifecycle_controller.go:1044] Missing timestamp for Node capz-e2e-9m4soa-mp-0000003. Assuming now as a timestamp.
I0114 20:24:21.775098       1 node_lifecycle_controller.go:1245] Controller detected that zone eastus2:�:3 is now in state FullDisruption.
I0114 20:24:21.775114       1 node_lifecycle_controller.go:1245] Controller detected that zone eastus2:�:2 is now in state FullDisruption.
I0114 20:24:41.801199       1 node_lifecycle_controller.go:1245] Controller detected that zone eastus2:�:3 is now in state Normal.
I0114 20:24:41.801309       1 node_lifecycle_controller.go:1245] Controller detected that zone eastus2:�:2 is now in state Normal.
I0114 20:25:53.762842       1 event.go:291] "Event occurred" object="default/web" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set web-5d45b7f96d to 1"
I0114 20:25:53.785611       1 event.go:291] "Event occurred" object="default/web-5d45b7f96d" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: web-5d45b7f96d-9zlhf"
I0114 20:26:13.914021       1 event.go:291] "Event occurred" object="default/web-ilb" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0114 20:30:47.555451       1 azure_backoff.go:422] CreateOrUpdateVMSS: error CreateOrUpdate vmss(capz-e2e-9m4soa-mp-0): &{true 504 0001-01-01 00:00:00 +0000 UTC Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: {"error":{"code":"ResourceReadFailed","target":"capz-e2e-9m4soa-mp-0","message":"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/===REDACTED===/resourceGroups/capz-e2e-9m4soa/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-9m4soa-mp-0?api-version=2019-07-01' failed with status 'GatewayTimeout'."}}}
E0114 20:30:47.555499       1 azure_vmss.go:1179] ensureVMSSInPool CreateOrUpdateVMSS(capz-e2e-9m4soa-mp-0) with new backendPoolID /subscriptions/===REDACTED===/resourceGroups/capz-e2e-9m4soa/providers/Microsoft.Network/loadBalancers/capz-e2e-9m4soa-internal/backendAddressPools/capz-e2e-9m4soa, err: <nil>
E0114 20:30:47.555553       1 azure_loadbalancer.go:162] reconcileLoadBalancer(default/web-ilb) failed: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: {"error":{"code":"ResourceReadFailed","target":"capz-e2e-9m4soa-mp-0","message":"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/===REDACTED===/resourceGroups/capz-e2e-9m4soa/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-9m4soa-mp-0?api-version=2019-07-01' failed with status 'GatewayTimeout'."}}
E0114 20:30:47.555600       1 controller.go:275] error processing service default/web-ilb (will retry): failed to ensure load balancer: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: {"error":{"code":"ResourceReadFailed","target":"capz-e2e-9m4soa-mp-0","message":"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/===REDACTED===/resourceGroups/capz-e2e-9m4soa/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-9m4soa-mp-0?api-version=2019-07-01' failed with status 'GatewayTimeout'."}}
I0114 20:30:47.556263       1 event.go:291] "Event occurred" object="default/web-ilb" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: Retriable: true, RetryAfter: 0s, HTTPStatusCode: 504, RawError: {\"error\":{\"code\":\"ResourceReadFailed\",\"target\":\"capz-e2e-9m4soa-mp-0\",\"message\":\"Policy required full resource content to evaluate the request. The request to GET resource 'https://management.azure.com/subscriptions/===REDACTED===/resourceGroups/capz-e2e-9m4soa/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-9m4soa-mp-0?api-version=2019-07-01' failed with status 'GatewayTimeout'.\"}}"
I0114 20:30:52.556306       1 event.go:291] "Event occurred" object="default/web-ilb" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
```

@CecileRobertMichon (Contributor):

/retest

last failure was unrelated to LB flake

@devigned (Contributor Author):

/retest

@k8s-ci-robot added the size/L label and removed the size/M label on Jan 15, 2021
@devigned force-pushed the cp-logs branch 3 times, most recently from bdd0685 to 2cd905e on January 15, 2021 at 01:12
@devigned (Contributor Author):

/retest

@CecileRobertMichon (Contributor):

updated k8s to 1.19.7 in #1126

@devigned (Contributor Author):

I'm going to trim out all of the testing related stuff that was added to this PR.

/hold

@k8s-ci-robot added the do-not-merge/hold and needs-rebase labels on Jan 15, 2021
@k8s-ci-robot removed the needs-rebase label on Jan 19, 2021
@devigned (Contributor Author):

/hold cancel

Trimmed down the PR so that it outputs verbose cloud provider logs and gathers all of kube-system for the workload clusters.

/assign @CecileRobertMichon @nader-ziada

@k8s-ci-robot removed the do-not-merge/hold label on Jan 19, 2021
@nader-ziada (Contributor):

/lgtm

@k8s-ci-robot added the lgtm label on Jan 19, 2021
@devigned (Contributor Author):

/retest

@CecileRobertMichon (Contributor):

/lgtm
/approve

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jan 20, 2021
@devigned
Copy link
Contributor Author

/retest

2 similar comments
@devigned
Copy link
Contributor Author

/retest

@devigned
Copy link
Contributor Author

/retest

@k8s-ci-robot merged commit a90bcc0 into kubernetes-sigs:master on Jan 20, 2021
@k8s-ci-robot added this to the v0.4.11 milestone on Jan 20, 2021
@devigned deleted the cp-logs branch on January 20, 2021 at 13:14
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/provider/azure: Issues or PRs related to azure provider.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.