manifests/0000_90_kube-controller-manager-operator_05_alerts: Template console links in alert descriptions #837

@@ -25,7 +25,8 @@ spec:
- alert: PodDisruptionBudgetAtLimit
Member Author

New thread for Cluster Bot testing. With a Cluster Bot cluster built from daae216 (launch 4.19,openshift/cluster-kube-controller-manager-operator#837 aws), push a PDB to its limit:

$ oc adm cordon -l node-role.kubernetes.io/worker=
$ oc -n openshift-monitoring delete pod prometheus-k8s-0
$ oc -n openshift-monitoring get poddisruptionbudget prometheus-k8s
NAME             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
prometheus-k8s   1               N/A               0                     42m

I didn't wait for the alert to kick over into firing, but checking on pending, this looks... almost good to me:

image

The issue is the <span class="co-resource-item monitoring__resource-item--monitoring-alert co-resource-item--inline"> markup for the namespace, which gets injected into my attempt at constructing a console link.

To trip PodDisruptionBudgetLimit I'll look to a different workload, since I don't want to completely break Prometheus (it would make it hard to test alert behavior):

$ oc adm cordon -l node-role.kubernetes.io/master=
$ oc -n openshift-console delete -l component=downloads pods
$ oc -n openshift-console get poddisruptionbudget downloads
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
downloads   N/A             1                 0                     53m

In that case, the rendering looks great, although I'm not clear on why it's not seeing the NS rendering issue:

image

I'm also not clear on how to trigger GarbageCollectorSyncFailed to test its rendering.
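
For reference, a minimal sketch of how the two PDB states exercised above map onto the alert expressions in this PR. The function name pdb_alert_state is hypothetical, and the sample counters are illustrative assumptions rather than values read from the test cluster. On a live cluster the inputs would come from the PDB's .status.currentHealthy, .status.desiredHealthy and .status.expectedPods fields.

```shell
# Hypothetical helper: classify a PDB from its status counters,
# mirroring the two alert expressions in the manifest.
# Arguments: currentHealthy desiredHealthy expectedPods
pdb_alert_state() {
  pdb_current=$1
  pdb_desired=$2
  pdb_expected=$3
  if [ "$pdb_current" -lt "$pdb_desired" ]; then
    # New PodDisruptionBudgetLimit expr: (desired - current) > 0,
    # so the alert's $value is the size of the shortfall.
    echo "PodDisruptionBudgetLimit deficit=$((pdb_desired - pdb_current))"
  elif [ "$pdb_current" -eq "$pdb_desired" ] && [ "$pdb_expected" -gt 0 ]; then
    # PodDisruptionBudgetAtLimit expr: current == desired and expected > 0.
    echo "PodDisruptionBudgetAtLimit"
  else
    echo "ok"
  fi
}

# Assumed counters for a minAvailable=1 PDB with one healthy pod left.
pdb_alert_state 1 1 2   # prints: PodDisruptionBudgetAtLimit
# Assumed counters for a maxUnavailable=1 PDB with no healthy pods.
pdb_alert_state 0 1 2   # prints: PodDisruptionBudgetLimit deficit=1
```

The deficit branch is why the reworked PodDisruptionBudgetLimit description can say "{{ $value }} less than the desired healthy pods": with the subtraction in the expression, the alert value is the shortfall rather than a raw healthy-pod count.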

Member Author

daae216 -> 9331433 added some whitespace before a }} to try to get closer to what the working PodDisruptionBudgetLimit description is doing:

$ git diff --word-diff daae2166273f0a..9331433c28f8eb3 -U0
diff --git a/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml b/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml
index 8135439..5f299b9 100644
--- a/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml
+++ b/manifests/0000_90_kube-controller-manager-operator_05_alerts.yaml
@@ -29 +29 @@ spec:
              The {{ $labels.poddisruptionbudget }} pod disruption budget in the {{ [-$labels.namespace}}-]{+$labels.namespace }}+} namespace is at the maximum allowed disruption. The number of current healthy pods is equal to the desired healthy pods.{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/poddisruptionbudgets/{{ $labels.poddisruptionbudget }}{{ end }}{{ end }}

But sadly the NS markup injected into the middle of the console PDB link is still there:

image
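
To make the expected rendering concrete, here is a rough sketch of the plain-text link the description template is trying to produce once its expressions are expanded. The function name pdb_console_link and the host console.example.com are hypothetical. The path layout is copied from the template in this PR, and the empty-URL guard mirrors its "if ne (len ...) 0" check.

```shell
# Hypothetical helper: build the console link that the alert description
# template is aiming for.
# Arguments: console base URL, namespace, PodDisruptionBudget name
pdb_console_link() {
  link_base=$1
  link_ns=$2
  link_pdb=$3
  # Like the template, emit nothing when no console URL is known.
  [ -n "$link_base" ] || return 0
  printf '%s/k8s/ns/%s/poddisruptionbudgets/%s\n' \
    "$link_base" "$link_ns" "$link_pdb"
}

# console.example.com is a placeholder, not a real cluster.
pdb_console_link "https://console.example.com" openshift-monitoring prometheus-k8s
# prints: https://console.example.com/k8s/ns/openshift-monitoring/poddisruptionbudgets/prometheus-k8s
```

On a live cluster, "oc whoami --show-console" should supply a usable base URL for the first argument. Comparing that output with the rendered alert shows where the namespace markup lands in the middle of the path.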

Member

Yeah, this is not really optimal. As far as I can see, it should be pretty simple to fix by adding the poddisruptionbudget resource here: https://github.com/openshift/monitoring-plugin/blob/6f948e4323bdf7c68e6b625ce3020116b5b4571a/web/src/components/alerting/AlertsDetailPage.tsx#L450

annotations:
summary: The pod disruption budget is preventing further disruption to pods.
-description: The pod disruption budget is at the minimum disruptions allowed level. The number of current healthy pods is equal to the desired healthy pods.
+description: |-
+The {{ $labels.poddisruptionbudget }} pod disruption budget in the {{ $labels.namespace }} namespace is at the maximum allowed disruption. The number of current healthy pods is equal to the desired healthy pods.{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/poddisruptionbudgets/{{ $labels.poddisruptionbudget }}{{ end }}{{ end }}
Member

@atiratree Apr 25, 2025

Then we would not need the console URL and could let the console handle all the link rendering for the poddisruptionbudget. It would also make the description nicer when looking at the alert detail in /monitoring/alertrules/1234.

runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-controller-manager-operator/PodDisruptionBudgetAtLimit.md
expr: |
max by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_current_healthy == kube_poddisruptionbudget_status_desired_healthy and on (namespace, poddisruptionbudget) kube_poddisruptionbudget_status_expected_pods > 0)
@@ -35,17 +36,19 @@ spec:
- alert: PodDisruptionBudgetLimit
annotations:
summary: The pod disruption budget registers insufficient amount of pods.
-description: The pod disruption budget is below the minimum disruptions allowed level and is not satisfied. The number of current healthy pods is less than the desired healthy pods.
+description: |-
+The {{ $labels.poddisruptionbudget }} pod disruption budget in the {{ $labels.namespace }} namespace exceeds the maximum allowed disruption and is not satisfied. The number of current healthy pods is {{ $value }} less than the desired healthy pods.{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/poddisruptionbudgets/{{ $labels.poddisruptionbudget }}{{ end }}{{ end }}
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-controller-manager-operator/PodDisruptionBudgetLimit.md
expr: |
-max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_current_healthy < kube_poddisruptionbudget_status_desired_healthy)
+max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_desired_healthy - kube_poddisruptionbudget_status_current_healthy) > 0
for: 15m
labels:
severity: critical
- alert: GarbageCollectorSyncFailed
annotations:
summary: There was a problem with syncing the resources for garbage collection.
-description: Garbage Collector had a problem with syncing and monitoring the available resources. Please see KubeControllerManager logs for more details.
+description: |-
+Garbage Collector had a problem with syncing and monitoring the available resources. Please see KubeControllerManager logs for more details: 'oc -n {{ $labels.namespace }} logs -c {{ $labels.container }} {{ $labels.pod }}'{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url))) 0}} For more information refer to {{ label "url" (first $console_url) }}/k8s/ns/{{ $labels.namespace }}/pods/{{ $labels.pod }}/logs?container={{ $labels.container }} {{ end }}{{ end }}.
Member

This seems too verbose to me. How to fetch the logs should be the runbook's responsibility, IMO.

But similar to the PDB case, we could link directly to the pod in the console: https://github.com/openshift/monitoring-plugin/blob/6f948e4323bdf7c68e6b625ce3020116b5b4571a/web/src/components/alerting/AlertsDetailPage.tsx#L450 without too much extra markup.

runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-controller-manager-operator/GarbageCollectorSyncFailed.md
expr: |
rate(garbagecollector_controller_resources_sync_error_total{}[5m]) > 0