
[ops] WebApp: More alerts to make rollouts safer #12514


Merged: 8 commits merged into main from gpl/deployment-alerts on Aug 31, 2022
Conversation

@geropl (Member) commented Aug 30, 2022

Description

This PR adds a set of alerts for checks we have so far done manually on deployments. All of these go to our team-internal Slack channel for now, so we can fine-tune them before we promote them to on-call L1 (if at all). The sketch after the list below shows the general shape of these rules.

  • API error rate
  • websocket connection rate
  • critical services running:
    • messagebus
    • [ ] db-sync (I don't know how to make it not alert in regions where this pod is not running 🤷)
    • flag crashloopbackoff
  • services RAM/CPU usage trends
  • [ ] DB CPU usage stays within limits (done in GCloud)
  • [ ] log error rate, replacing "GCloud Error Reporting" (done in GCloud)
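As a rough illustration of the shape these rules take in the jsonnet-based monitoring config (the metric name, threshold, and duration below are placeholder assumptions, not the exact rules added in this PR):

// Hypothetical sketch only; metric, threshold, and labels are assumptions.
{
  alert: 'WebAppAPIErrorRateHigh',
  // Ratio of failed API requests to all API requests over the last 5 minutes.
  expr: 'sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) / sum(rate(grpc_server_handled_total[5m])) > 0.05',
  'for': '5m',
  labels: {
    severity: 'warning',
    team: 'webapp',
  },
  annotations: {
    summary: 'WebApp API error rate is above 5% for 5 minutes.',
  },
},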


Related Issue(s)

Context: #10722

How to test

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

@geropl force-pushed the gpl/deployment-alerts branch from 57f3ebf to eab6f6c on August 31, 2022 10:03
@geropl marked this pull request as ready for review on August 31, 2022 11:09
@geropl requested a review from a team on August 31, 2022 11:09
@github-actions bot added the "team: webapp" (Issue belongs to the WebApp team) label on Aug 31, 2022
@geropl force-pushed the gpl/deployment-alerts branch from eab6f6c to 8f29e34 on August 31, 2022 11:27
@geropl mentioned this pull request on Aug 31, 2022
{
  alert: 'WebsocketConnectionRateHigh',
  // Reasoning: the values are taken from past data
  expr: 'sum(rate(server_websocket_connection_count[2m])) > 30',
Member:
Why 2 minutes? That's somewhat non-standard, as metrics normally tend to use 1, 3, 5, or 10 minute windows.

Member Author:

Asking out of curiosity: What's the reason behind choosing 1, 3, 5, 10? 🤔

FWIW I took 2 after playing around with the numbers and chose the one where the graph looked "nice" (e.g., not too spiky, but responsive enough).
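For illustration only (not part of this PR), here is the same query with different rate() windows; a shorter window reacts faster but is spikier, while a longer one smooths short bursts at the cost of slower detection:

// Illustration only: same threshold, different rate() windows.
{
  // 1m window: reacts within a minute or two, but brief spikes can trip the alert.
  window_1m: 'sum(rate(server_websocket_connection_count[1m])) > 30',
  // 5m window: smoother and less prone to false positives, but slower to fire.
  window_5m: 'sum(rate(server_websocket_connection_count[5m])) > 30',
}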

 * TODO(gpl) This will be true for US all the time. Can we exclude that cluster somehow?
 * {
 *   alert: 'db-sync not running',
 *   expr: 'sum (kube_pod_status_phase{pod=~"db-sync.*"}) by (pod) < 1',
Member Author:

Do you have a concrete query that solves the problem? Might just be lacking prometheus-foo here. 🙃

@ArthurSens (Contributor) commented on Aug 31, 2022:

Unfortunately, the metrics don't have the cluster label while alerts are being evaluated. Alerts are evaluated by Prometheus in the 'leaf' clusters, so they're not aware of their own location :/

Member:

Normally, to detect whether something is running, you'd use the up metric. That basically works on the basis that the target can be scraped. For some reason, that's not available for db-sync. Will dig into it more.
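For context, a minimal sketch of what such a liveness alert could look like once db-sync exposes a metrics endpoint and is scraped; the job label value is an assumption:

// Hypothetical sketch: assumes a scrape job named "db-sync", which does not exist yet.
{
  alert: 'DbSyncNotRunning',
  // up == 0 catches failing scrapes; absent() fires when no db-sync target exists at all.
  expr: 'up{job="db-sync"} == 0 or absent(up{job="db-sync"})',
  'for': '10m',
  labels: {
    team: 'webapp',
  },
},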

Member Author:

> That basically works on the basis that it can be scraped. For some reason

💡 Ah, yes! db-sync has no metrics endpoint yet.

Member Author:

BTW here is a good thread about this question.

    team: 'webapp'
  },
  annotations: {
    runbook_url: 'https://github.com/gitpod-io/runbooks/blob/main/runbooks/WebAppServicesHighCPUUsage.md',
Member:

What would the runbook contain? In a way, services would ideally run at 100% utilization all the time; that way nothing is wasted.

I'm asking because this tends to be notoriously difficult to make actionable. One could argue that we should instead set CPU/Memory limits on pods to force them to crash and restart (and alert on the restarts) rather than alerting for high utilization.
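As a rough sketch of that restart-based alternative (not something this PR adds), an alert on container restarts could look like this; the metric comes from kube-state-metrics, and the container regex is an assumption:

// Hypothetical sketch of "alert on restarts instead of utilization"; not part of this PR.
{
  alert: 'WebAppContainerRestartingTooOften',
  // More than 2 restarts of a matching container within the last 15 minutes.
  expr: 'increase(kube_pod_container_status_restarts_total{container=~"server|proxy"}[15m]) > 2',
  'for': '5m',
  labels: {
    team: 'webapp',
  },
},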

Member Author:

I understand that there are better ways to set up alerts if you go by the book. But looking at past incidents, I deem this a pretty good indicator that something is fishy, especially after a recent deployment.

The runbook will never be substantial. This can serve as a first step, and if we find better measures, we can replace this alert with something that makes more sense. But for now we at least have something. 🙃

Member:
Sounds good as the first step!

@easyCZ (Member) left a comment:

I'm happy with this as a first pass. We can tweak if it's too noisy.

/hold in case you want to make any other changes

@geropl (Member Author) commented Aug 31, 2022:

/unhold

Let's move forward with this, and fine-tune as we go 👣

@roboquat merged commit 73cbd09 into main on Aug 31, 2022
@roboquat deleted the gpl/deployment-alerts branch on August 31, 2022 14:07
@roboquat added the "deployed: webapp" (Meta team change is running in production) label on Aug 31, 2022
@roboquat added the "deployed" (Change is completely running in production) label on Aug 31, 2022
Labels
  • deployed: webapp (Meta team change is running in production)
  • deployed (Change is completely running in production)
  • release-note-none
  • size/XXL
  • team: webapp (Issue belongs to the WebApp team)