[usage] Alert on Usage and Invoice Reconciliations #12919
runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodUsageScheduledReconciliationFailures.md
summary: There are failed scheduled reconciliations in the usage component.
description: We have accumulated {{ printf "%.2f" $value }} failures. This affects how stale usage data is and/or updating invoices in Stripe.
runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodUsageReconcileUsageFailures.md
Is this the new name for this runbook?
Yeah, I'll need to update the runbooks correspondingly. For now these would only be Slack warnings in webapp, so we have some time to polish the runbooks before we get it to actually page.
expr: sum(increase(grpc_server_handled_total{grpc_service="usage.v1.BillingService", grpc_method="ReconcileInvoices", grpc_code!="OK"})) > 1
for: 30m
I'm relatively new to PromQL, but what does the `for: 30m` do here?
It says that the `expr` above needs to keep "having values", i.e. the condition has to stay true for 30m, before the alert actually fires. Until then the alert is only pending.
So does that mean we alert only when reconciliation has been failing for 30 minutes continuously, i.e. just one failure won't alert? (Not sure how often we run reconciliation in prod currently.)
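To make the `for:` behaviour discussed above concrete, here is a small Python sketch that models how an alert moves between inactive, pending and firing. It is a simplified simulation and not Prometheus code: the function name `alert_states` and the 10-minute evaluation interval in the usage lines are assumptions made purely for illustration.

```python
from datetime import timedelta


def alert_states(condition_results, eval_interval, for_duration):
    """Model the `for:` clause of an alerting rule.

    condition_results: one bool per rule evaluation, True when `expr`
        returned a result at that evaluation.
    eval_interval: time between two rule evaluations (assumed constant).
    for_duration: value of the `for:` field.
    Returns one of "inactive", "pending", "firing" per evaluation.
    """
    states = []
    active_since = None  # index of the evaluation where the condition became true
    for i, is_true in enumerate(condition_results):
        if not is_true:
            # A single evaluation without a result resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = i
        held_for = (i - active_since) * eval_interval
        states.append("firing" if held_for >= for_duration else "pending")
    return states


# Hypothetical 10-minute evaluation interval with `for: 30m`.
interval = timedelta(minutes=10)
hold = timedelta(minutes=30)

# Condition true at four consecutive evaluations: fires at the fourth.
print(alert_states([True, True, True, True], interval, hold))

# Condition true once, then false: never gets past pending.
print(alert_states([True, False, False], interval, hold))
```

Whether a single failed reconciliation keeps the condition true long enough depends on the threshold and on the range window passed to `increase`, which the quoted snippet does not show.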
Description
Alert on the two core RPCs used in usage reconciliation.
Related Issue(s)
How to test
Release Notes
Documentation
Werft options: