
[ops] WebApp: More alerts to make rollouts safer #12514


Merged: 8 commits merged into main from gpl/deployment-alerts on Aug 31, 2022
Conversation

@geropl (Member) commented Aug 30, 2022

Description

This PR adds a set of alerts for checks we have so far done manually on deployments. All of these go to our team-internal Slack channel for now, so we can fine-tune them before we promote them to on-call L1 (if at all). The sketch after the list below shows the general shape of these rules.

  • API error rate
  • websocket connection rate
  • critical services running:
    • messagebus
    • [ ] db-sync (I don't know how to make it not alert in regions where this pod is not running 🤷)
    • flag crashloopbackoff
  • services RAM/CPU usage trends
  • [ ] DB CPU usage stays within limits (done in GCloud)
  • [ ] log error rate, replacing "GCloud Error Reporting" (done in GCloud)
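As a rough illustration of the shape these rules take in the jsonnet-based monitoring config (the metric name, threshold, and duration below are placeholder assumptions, not the exact rules added in this PR):

// Hypothetical sketch only; metric, threshold, and labels are assumptions.
{
  alert: 'WebAppAPIErrorRateHigh',
  // Ratio of failed API requests to all API requests over the last 5 minutes.
  expr: 'sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) / sum(rate(grpc_server_handled_total[5m])) > 0.05',
  'for': '5m',
  labels: {
    severity: 'warning',
    team: 'webapp',
  },
  annotations: {
    summary: 'WebApp API error rate is above 5% for 5 minutes.',
  },
},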


Related Issue(s)

Context: #10722

How to test

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

@geropl force-pushed the gpl/deployment-alerts branch from 57f3ebf to eab6f6c on August 31, 2022 10:03
@geropl marked this pull request as ready for review on August 31, 2022 11:09
@geropl requested a review from a team on August 31, 2022 11:09
@github-actions bot added the "team: webapp" (Issue belongs to the WebApp team) label on Aug 31, 2022
@geropl force-pushed the gpl/deployment-alerts branch from eab6f6c to 8f29e34 on August 31, 2022 11:27
@geropl mentioned this pull request on Aug 31, 2022
{
  alert: 'WebsocketConnectionRateHigh',
  // Reasoning: the values are taken from past data
  expr: 'sum(rate(server_websocket_connection_count[2m])) > 30',
Member:
Why 2 minutes? That's somewhat non-standard, as metrics normally tend to use 1, 3, 5, or 10 minute windows.

Member Author:

Asking out of curiosity: What's the reason behind choosing 1, 3, 5, 10? 🤔

FWIW I took 2 after playing around with the numbers and chose the one where the graph looked "nice" (e.g., not too spiky, but responsive enough).
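For illustration only (not part of this PR), here is the same query with different rate() windows; a shorter window reacts faster but is spikier, while a longer one smooths short bursts at the cost of slower detection:

// Illustration only: same threshold, different rate() windows.
{
  // 1m window: reacts within a minute or two, but brief spikes can trip the alert.
  window_1m: 'sum(rate(server_websocket_connection_count[1m])) > 30',
  // 5m window: smoother and less prone to false positives, but slower to fire.
  window_5m: 'sum(rate(server_websocket_connection_count[5m])) > 30',
}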

 * TODO(gpl) This will be true for US all the time. Can we exclude that cluster somehow?
 * {
 *   alert: 'db-sync not running',
 *   expr: 'sum (kube_pod_status_phase{pod=~"db-sync.*"}) by (pod) < 1',
Member Author:

Do you have a concrete query that solves the problem? Might just be lacking prometheus-foo here. 🙃

@ArthurSens (Contributor) commented on Aug 31, 2022:

Unfortunately, the metrics don't have the cluster label while alerts are being evaluated. Alerts are evaluated by Prometheus in the 'leaf' clusters, so they're not aware of their own location :/

Member:

Normally, to detect whether something is running, you'd use the up metric. That basically works on the basis that the target can be scraped. For some reason, that's not available for db-sync. Will dig into it more.
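For context, a minimal sketch of what such a liveness alert could look like once db-sync exposes a metrics endpoint and is scraped; the job label value is an assumption:

// Hypothetical sketch: assumes a scrape job named "db-sync", which does not exist yet.
{
  alert: 'DbSyncNotRunning',
  // up == 0 catches failing scrapes; absent() fires when no db-sync target exists at all.
  expr: 'up{job="db-sync"} == 0 or absent(up{job="db-sync"})',
  'for': '10m',
  labels: {
    team: 'webapp',
  },
},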

Member Author:

> That basically works on the basis that it can be scraped. For some reason

💡 Ah, yes! db-sync has no metrics endpoint yet.

Member Author:

BTW here is a good thread about this question.

    team: 'webapp'
  },
  annotations: {
    runbook_url: 'https://github.com/gitpod-io/runbooks/blob/main/runbooks/WebAppServicesHighCPUUsage.md',
Member:

What would the runbook contain? In a way, services would ideally run at 100% utilization all the time; that way nothing is wasted.

I'm asking because this tends to be notoriously difficult to make actionable. One could argue that we should instead set CPU/Memory limits on pods to force them to crash and restart (and alert on the restarts) rather than alerting for high utilization.
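As a rough sketch of that restart-based alternative (not something this PR adds), an alert on container restarts could look like this; the metric comes from kube-state-metrics, and the container regex is an assumption:

// Hypothetical sketch of "alert on restarts instead of utilization"; not part of this PR.
{
  alert: 'WebAppContainerRestartingTooOften',
  // More than 2 restarts of a matching container within the last 15 minutes.
  expr: 'increase(kube_pod_container_status_restarts_total{container=~"server|proxy"}[15m]) > 2',
  'for': '5m',
  labels: {
    team: 'webapp',
  },
},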

Member Author:

I understand that there are better ways to set up alerts if you go by the book. But looking at past incidents, I deem this a pretty good indicator that something is fishy, especially after a recent deployment.

The runbook will never be substantial. This can serve as a first step, and if we find better measures, we can replace this alert with something that makes more sense. But for now we at least have something. 🙃

Member:
Sounds good as the first step!

@easyCZ (Member) left a comment:

I'm happy with this as a first pass. We can tweak if it's too noisy.

/hold in case you want to make any other changes

@geropl (Member Author) commented Aug 31, 2022:

/unhold

Let's move forward with this, and fine-tune as we go 👣

@roboquat merged commit 73cbd09 into main on Aug 31, 2022
@roboquat deleted the gpl/deployment-alerts branch on August 31, 2022 14:07
@roboquat added the "deployed: webapp" (Meta team change is running in production) label on Aug 31, 2022
@roboquat added the "deployed" (Change is completely running in production) label on Aug 31, 2022
Labels
  • deployed: webapp (Meta team change is running in production)
  • deployed (Change is completely running in production)
  • release-note-none
  • size/XXL
  • team: webapp (Issue belongs to the WebApp team)