[ops] WebApp: More alerts to make rollouts safer #12514
Conversation
{
  alert: 'WebsocketConnectionRateHigh',
  // Reasoning: the values are taken from past data
  expr: 'sum(rate(server_websocket_connection_count[2m])) > 30',
Why 2 minutes? That's somewhat non-standard as normally metrics tend to use 1, 3, 5 or 10 minutes
Asking out of curiosity: What's the reason behind choosing 1, 3, 5, 10? 🤔
fwiw I took 2 after playing around with the numbers and chose the one where the graph looked "nice" (e.g., not too spiky, but responsive enough).
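For comparison only (not part of the PR): the same rule with a more conventional 5m window would look like the sketch below. Since rate() is already per-second, the threshold stays comparable; a longer window just smooths short spikes at the cost of slower detection.

```jsonnet
{
  alert: 'WebsocketConnectionRateHigh',
  // Hypothetical 5m variant for comparison; the PR uses a 2m window.
  expr: 'sum(rate(server_websocket_connection_count[5m])) > 30',
},
```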
 * TODO(gpl) This will be true for US all the time. Can we exclude that cluster somehow?
 * {
 *   alert: 'db-sync not running',
 *   expr: 'sum (kube_pod_status_phase{pod=~"db-sync.*"}) by (pod) < 1',
Metrics have the `cluster` label on them. See possible values here: https://grafana.gitpod.io/explore?orgId=1&left=%7B%22datasource%22:%22P4169E866C3094E38%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22expr%22:%22sum%28increase%28kube_pod_status_phase%5B1m%5D%29%29%20by%20%28cluster%29%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D
Contains prod-meta-us02.
Do you have a concrete query that solves the problem? Might just be lagging prometheus-foo, here. 🙃
Unfortunately, they don't have the `cluster` label while evaluating alerts. Alerts are evaluated by Prometheus in the 'leaf' clusters, so they're not aware of their location :/
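To make this concrete, the obvious label-matcher variant would look like the sketch below. Because the `cluster` label is missing at rule-evaluation time, the negative matcher matches every series, so the US cluster is not actually excluded:

```jsonnet
{
  alert: 'db-sync not running',
  // Hypothetical: exclude the US cluster via a label matcher. The leaf
  // Prometheus doesn't attach the cluster label during evaluation, so the
  // cluster!="..." matcher matches every series and excludes nothing.
  expr: 'sum(kube_pod_status_phase{pod=~"db-sync.*", cluster!="prod-meta-us02"}) by (pod) < 1',
},
```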
Normally, to detect if something is running, you'd use the `up` metric. That basically works on the basis that it can be scraped. For some reason, that's not available for `db-sync`. Will dig into it more.
> That basically works on the basis that it can be scraped. For some reason

💡 Ah, yes! `db-sync` has no metrics endpoint, yet.
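For reference, the conventional `up`-based check would be roughly the sketch below. It assumes db-sync exposed a scrape target (the job name is a guess), which it currently does not:

```jsonnet
{
  alert: 'db-sync down',
  // Hypothetical: requires db-sync to expose a /metrics endpoint so that
  // Prometheus records up{job="db-sync"} for it. Not possible today.
  expr: 'up{job="db-sync"} == 0',
  'for': '5m',
},
```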
BTW here is a good thread about this question.
    team: 'webapp'
  },
  annotations: {
    runbook_url: 'https://github.com/gitpod-io/runbooks/blob/main/runbooks/WebAppServicesHighCPUUsage.md',
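The diff only shows the labels and annotations of this rule; a complete alert in this style might look roughly like the sketch below. The expression, threshold, duration, and severity are assumptions for illustration, not necessarily what the PR contains:

```jsonnet
{
  alert: 'WebAppServicesHighCPUUsage',
  // Assumed expression: CPU usage of server containers relative to their
  // CPU requests (kube-state-metrics naming). The real rule may differ.
  expr: |||
    sum(rate(container_cpu_usage_seconds_total{container="server"}[5m])) by (pod)
      /
    sum(kube_pod_container_resource_requests{container="server", resource="cpu"}) by (pod)
      > 0.9
  |||,
  'for': '10m',
  labels: {
    severity: 'warning',
    team: 'webapp'
  },
  annotations: {
    runbook_url: 'https://github.com/gitpod-io/runbooks/blob/main/runbooks/WebAppServicesHighCPUUsage.md',
  },
},
```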
What would the runbook contain? In a way, services would ideally run at 100% utilization all the time, that way there's nothing wasted.
I'm asking because this tends to be notoriously difficult to make actionable. One could argue that we should instead set CPU/Memory limits on pods to force them to crash and restart (and alert on the restarts) rather than alerting for high utilization.
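For what it's worth, the restart-based alternative mentioned here could be sketched roughly as below (kube-state-metrics metric name; the container selector and threshold are guesses):

```jsonnet
{
  alert: 'WebAppServerPodRestartingTooOften',
  // Hypothetical alternative: fire when server containers restart repeatedly,
  // e.g. after being OOM-killed at their memory limit.
  expr: 'increase(kube_pod_container_status_restarts_total{container="server"}[15m]) > 2',
  labels: {
    team: 'webapp'
  },
},
```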
I understand that there are better ways to set up alerts if you go by the book. But looking at past incidents, I deem this a pretty good indicator that something is fishy, especially after a recent deployment.
The runbook will never be substantial. This can serve as a first step, and if we find better measures, we can replace this alert with something that makes more sense. But for now, at least we have something. 🙃
Sounds good as a first step!
I'm happy with this as a first pass. We can tweak if it's too noisy.
/hold in case you want to make any other changes
/unhold Let's move forward with this, and fine-tune as we go 👣
Description
This PR adds a list of alerts for checks we have so far done manually on deployments. All of these go to our team-internal Slack channel for now, so we can fine-tune them before we promote them to on-call L1 (if at all).
[ ] db-sync: I don't know how to make it not alert in regions where this pod is not running 🤷
[ ] DB CPU usage stays within limits: done in GCloud
[ ] log error rate (replaces "GCloud Error Reporting"): done in GCloud
Related Issue(s)
Context: #10722
How to test
Release Notes
Documentation
Werft options: