Adjust ownership of alerts related to the observability-stack and add alert for Prometheus restarting #14195

Merged
merged 2 commits on Oct 27, 2022
@@ -27,7 +27,7 @@ spec:
for: 10m
labels:
severity: critical
team: platform
team: delivery-operations-experience
- alert: AlertmanagerFailedToSendAlerts
annotations:
description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} failed to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration }}.
@@ -42,4 +42,4 @@ spec:
for: 5m
labels:
severity: warning
team: platform
team: delivery-operations-experience
@@ -28,4 +28,4 @@ spec:
for: 15m
labels:
severity: critical
team: platform
team: delivery-operations-experience
@@ -25,7 +25,7 @@ spec:
for: 15m
labels:
severity: warning
team: platform
team: delivery-operations-experience
- alert: PrometheusOperatorWatchErrors
annotations:
description: Errors while performing watch operations in controller {{$labels.controller}} in {{$labels.namespace}} namespace.
@@ -35,7 +35,7 @@ spec:
for: 15m
labels:
severity: warning
team: platform
team: delivery-operations-experience
- alert: PrometheusOperatorReconcileErrors
annotations:
description: '{{ $value | humanizePercentage }} of reconciling operations failed for {{ $labels.controller }} controller in {{ $labels.namespace }} namespace.'
@@ -45,7 +45,7 @@ spec:
for: 10m
labels:
severity: warning
team: platform
team: delivery-operations-experience
- alert: ConfigReloaderSidecarErrors
annotations:
description: |-
@@ -57,4 +57,4 @@ spec:
for: 10m
labels:
severity: warning
team: platform
team: delivery-operations-experience
@@ -27,7 +27,7 @@ spec:
for: 10m
labels:
severity: critical
team: platform
team: delivery-operations-experience
- alert: PrometheusRemoteStorageFailures
annotations:
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} failed to send {{ printf "%.1f" $value }}% of the samples to {{ $labels.remote_name}}:{{ $labels.url }}
@@ -47,7 +47,7 @@ spec:
for: 15m
labels:
severity: critical
team: platform
team: delivery-operations-experience
- alert: PrometheusRuleFailures
annotations:
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m.
@@ -57,4 +57,14 @@ spec:
for: 15m
labels:
severity: warning
team: platform
team: delivery-operations-experience
- alert: PrometheusCrashlooped
annotations:
description: Prometheus' container restarted in the last 5m. While this alert will resolve itself once Prometheus stops crashing, it is important to understand why it crashed in the first place.
summary: Prometheus has just crashlooped.
expr: |
increase(kube_pod_container_status_restarts_total{cluster=~"$cluster", pod="prometheus-k8s-0", container="prometheus"}[5m]) > 0
for: 15m
labels:
severity: info
team: delivery-operations-experience
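
For reviewers who want to sanity-check the new PrometheusCrashlooped rule, below is a minimal promtool unit-test sketch. Because the expression only looks at restarts over a 5m window while the rule uses for: 15m, a single isolated restart should not fire the alert; the container has to keep restarting for at least 15 minutes. The sketch assumes the rendered rule file is saved as prometheus-rules.yaml and that the $cluster placeholder resolves to a concrete value (here dev) before the rules are loaded; the file names, cluster value, and sample series are illustrative and not part of this PR.

# run with: promtool test rules prometheus-crashloop-test.yaml (file names are hypothetical)
rule_files:
  - prometheus-rules.yaml   # hypothetical path to the rendered rule file from this PR

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # restart counter keeps climbing, i.e. the prometheus container is crashlooping
      - series: 'kube_pod_container_status_restarts_total{cluster="dev", pod="prometheus-k8s-0", container="prometheus"}'
        values: '0+1x20'
    alert_rule_test:
      - eval_time: 20m   # the expression has been true for well over the 15m "for" window
        alertname: PrometheusCrashlooped
        exp_alerts:
          - exp_labels:
              cluster: dev
              container: prometheus
              pod: prometheus-k8s-0
              severity: info
              team: delivery-operations-experience

If the rule behaves as described, promtool should report the alert as firing at the 20 minute mark; shortening the input series to a single restart should make the same test fail, which is a quick way to confirm that the 15m for clause filters out one-off restarts.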