manifests/0000_90_kube-controller-manager-operator_05_alerts: Template console links in alert descriptions #837
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: wking. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
4a9d9e4 to daae216
@@ -25,7 +25,8 @@ spec:
- alert: PodDisruptionBudgetAtLimit
annotations:
summary: The pod disruption budget is preventing further disruption to pods.
description: The pod disruption budget is at the minimum disruptions allowed level. The number of current healthy pods is equal to the desired healthy pods.
description: |-
The {{ $labels.poddisruptionbudget }} pod disruption budget in the {{ $labels.namespace}} namespace is at the maximum allowed disruption. The number of current healthy pods is equal to the desired healthy pods.{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/poddisruptionbudgets/{{ $labels.poddisruptionbudget }}{{ end }}{{ end }}
Hmm, not sure what to do with the unit failure:
the test-case's stdout includes:
=== RUN TestYamlCorrectness
assets_test.go:20: Unexpected error reading manifests from ../../manifests/: failed to render "0000_90_kube-controller-manager-operator_05_alerts.yaml": template: 0000_90_kube-controller-manager-operator_05_alerts.yaml:29: undefined variable "$labels"
I guess that's this assets.New call, through assetFromTemplate, through renderFile, to this template.New. I'm not clear on why this operator feels like these manifests should be Go templates. Maybe we can pivot to using ManifestsFromFiles.
That is one option.
But we could also try to define the variables and actually try to render it in a test, similar to what we do with other templates
type TemplateData struct {
It might be useful to have a test specifically for rendering the alerts to see that it resolves correctly.
@@ -25,7 +25,8 @@ spec:
- alert: PodDisruptionBudgetAtLimit
New thread for Cluster Bot testing. As of daae216, with a launch 4.19,openshift/cluster-kube-controller-manager-operator#837 aws cluster, make a PDB mad:
$ oc adm cordon -l node-role.kubernetes.io/worker=
$ oc -n openshift-monitoring delete pod prometheus-k8s-0
$ oc -n openshift-monitoring get poddisruptionbudget prometheus-k8s
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
prometheus-k8s 1 N/A 0 42m
I didn't wait for the alert to kick over into firing, but checking on pending, this looks... almost good to me: the issue is the <span class="co-resource-item monitoring__resource-item--monitoring-alert co-resource-item--inline"> bit for the NS injected into my attempt at constructing a console link.
To trip PodDisruptionBudgetLimit, I'll look to a different workload, since I don't want to completely break Prometheus (it would make it hard to test alert behavior):
$ oc adm cordon -l node-role.kubernetes.io/master=
$ oc -n openshift-console delete -l component=downloads pods
$ oc -n openshift-console get poddisruptionbudget downloads
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
downloads N/A 1 0 53m
In that case, the rendering looks great, although I'm not clear on why it's not seeing the NS rendering issue. I'm also not clear on how to trigger GarbageCollectorSyncFailed to test its rendering.
daae216 -> 9331433 added some whitespace before a }} to try to get closer to what the working PodDisruptionBudgetLimit description is doing:
$ git diff --word-diff daae2166273f0a..9331433c28f8eb3 -U0
diff --git a/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml b/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml
index 8135439..5f299b9 100644
--- a/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml
+++ b/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml
@@ -29 +29 @@ spec:
The {{ $labels.poddisruptionbudget }} pod disruption budget in the {{ [-$labels.namespace}}-]{+$labels.namespace }}+} namespace is at the maximum allowed disruption. The number of current healthy pods is equal to the desired healthy pods.{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/poddisruptionbudgets/{{ $labels.poddisruptionbudget }}{{ end }}{{ end }}
But sadly the NS markup injected into the middle of the console PDB link is still there.
Yeah, this is not really optimal. As far as I can see, it should be pretty simple by adding the poddisruptionbudget resource here: https://github.com/openshift/monitoring-plugin/blob/6f948e4323bdf7c68e6b625ce3020116b5b4571a/web/src/components/alerting/AlertsDetailPage.tsx#L450
…e console links in alert descriptions

Prometheus alerts support Go templating [1], and this commit uses that to provide more context like "which namespace?", "which PodDisruptionBudget?", "where can I find that PDB in the in-cluster web console?", and "what 'oc' command would I run to see garbage-collection sync logs?". This should make understanding the context of the alert more straightforward, without the responder having to dip into labels and guess.

Using |- for trimmed, block-style strings avoids YAML parsers choking on the "for more details: ..." colon with "mapping values are not allowed in this context" and similar.

[1]: https://prometheus.io/docs/prometheus/latest/configuration/template_reference/
daae216 to 9331433
@wking: The following test failed.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/cc
LGTM from a consumer perspective 👍
There are a few rough edges, but the general idea is very nice!
@@ -25,7 +25,8 @@ spec:
- alert: PodDisruptionBudgetAtLimit
annotations:
summary: The pod disruption budget is preventing further disruption to pods.
description: The pod disruption budget is at the minimum disruptions allowed level. The number of current healthy pods is equal to the desired healthy pods.
description: |-
The {{ $labels.poddisruptionbudget }} pod disruption budget in the {{ $labels.namespace }} namespace is at the maximum allowed disruption. The number of current healthy pods is equal to the desired healthy pods.{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/poddisruptionbudgets/{{ $labels.poddisruptionbudget }}{{ end }}{{ end }}
Then we would not need the console url and let the console handle all the link rendering for the poddisruptionbudget. It would also make the description nicer when looking at the alert detail in /monitoring/alertrules/1234.
for: 15m
labels:
severity: critical
- alert: GarbageCollectorSyncFailed
annotations:
summary: There was a problem with syncing the resources for garbage collection.
description: Garbage Collector had a problem with syncing and monitoring the available resources. Please see KubeControllerManager logs for more details.
description: |-
Garbage Collector had a problem with syncing and monitoring the available resources. Please see KubeControllerManager logs for more details: 'oc -n {{ $labels.namespace }} logs -c {{ $labels.container }} {{ $labels.pod }}'{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/pods/{{ $labels.pod }}/logs?container={{ $labels.container }} {{ end }}{{ end }}.
This seems too verbose to me. How to invoke the logs should be a responsibility of the runbook IMO.
But similar to the PDB case, we could link directly to the pod in the console: https://github.com/openshift/monitoring-plugin/blob/6f948e4323bdf7c68e6b625ce3020116b5b4571a/web/src/components/alerting/AlertsDetailPage.tsx#L450 without too much extra markup.
Prometheus alerts support Go templating, and this pull uses that to provide more context like "which namespace?", "which PodDisruptionBudget?", "where can I find that PDB in the in-cluster web console?", and "what oc command would I run to see garbage-collection sync logs?". This should make understanding the context of the alert more straightforward, without the responder having to dip into labels and guess.