[usage] Add periodic job to detect invalid workspace instances in usage #12930

Merged
1 commit merged into main from mp/usage-detect-invalid-usage-instances on Sep 14, 2022

easyCZ (Member) commented Sep 13, 2022

Description

This PR does the following:

  1. Adds a scheduled job (running every 15m) that detects entries in the usage table whose workspace instances have stopped but do not have a stoppingTime set. Left undetected, such instances would cause runaway credit usage for our customers.
  2. When such an instance is detected, a metric is updated to allow us to alert on this.
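
The detection described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code from this PR: the types `UsageRecord` and `WorkspaceInstance`, the `"stopped"` phase value, the string representation of `StoppingTime`, and the gauge variable are all assumptions made for the example. A real job would query the database and export the value through the service's metrics library, and a scheduler would call `checkOnce` every 15 minutes.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// UsageRecord is a hypothetical, simplified row of the usage table.
type UsageRecord struct {
	WorkspaceInstanceID string
}

// WorkspaceInstance is a hypothetical, simplified workspace instance row.
type WorkspaceInstance struct {
	ID           string
	Phase        string // assumed values: "running", "stopped"
	StoppingTime string // empty when no stoppingTime was recorded (assumed representation)
}

// invalidInstancesGauge stands in for a real metrics gauge (assumption).
var invalidInstancesGauge int64

// gaugeValue returns the last value written to the stand-in gauge.
func gaugeValue() int64 {
	return atomic.LoadInt64(&invalidInstancesGauge)
}

// findInvalid joins usage records against workspace instances and returns
// the IDs of instances that are stopped but have no stoppingTime set.
func findInvalid(usage []UsageRecord, instances map[string]WorkspaceInstance) []string {
	var invalid []string
	for _, u := range usage {
		inst, ok := instances[u.WorkspaceInstanceID]
		if !ok {
			continue // no matching instance, nothing to check
		}
		if inst.Phase == "stopped" && inst.StoppingTime == "" {
			invalid = append(invalid, inst.ID)
		}
	}
	return invalid
}

// checkOnce runs one detection pass and updates the gauge so that an
// alert can fire when the value is non-zero.
func checkOnce(usage []UsageRecord, instances map[string]WorkspaceInstance) int {
	invalid := findInvalid(usage, instances)
	atomic.StoreInt64(&invalidInstancesGauge, int64(len(invalid)))
	return len(invalid)
}

func main() {
	usage := []UsageRecord{{WorkspaceInstanceID: "a"}, {WorkspaceInstanceID: "b"}}
	instances := map[string]WorkspaceInstance{
		"a": {ID: "a", Phase: "stopped"}, // invalid: stopped, no stoppingTime
		"b": {ID: "b", Phase: "stopped", StoppingTime: "2022-09-13T20:00:00Z"},
	}
	fmt.Println("invalid instances found:", checkOnce(usage, instances))
}
```

Setting the gauge to the current count on every pass, rather than incrementing a counter, lets an alert clear on its own once the underlying data is repaired.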

To avoid code duplication, the existing Ledger Job is refactored as follows:

  1. A wrapper is introduced that composes a Job and prevents concurrent executions of it
  2. Tests are updated to reflect this change
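
As a rough illustration of the wrapper idea, the sketch below composes any job with a guard that rejects a run while a previous one is still in progress. The `Job` interface, `JobFunc`, `WithoutConcurrentRuns`, and `ErrAlreadyRunning` are hypothetical names chosen for this example and are not taken from the PR. The guard here is a one-slot buffered channel; the actual change may use a different mechanism.

```go
package main

import (
	"errors"
	"fmt"
)

// Job is a hypothetical minimal job interface.
type Job interface {
	Run() error
}

// JobFunc adapts a plain function to the Job interface.
type JobFunc func() error

func (f JobFunc) Run() error { return f() }

// ErrAlreadyRunning is returned when a run is skipped because the
// previous run of the same job has not finished yet.
var ErrAlreadyRunning = errors.New("job is already running")

type exclusiveJob struct {
	job  Job
	busy chan struct{} // one-slot semaphore: full while a run is in progress
}

// WithoutConcurrentRuns wraps a Job so that overlapping runs are rejected
// instead of executing in parallel.
func WithoutConcurrentRuns(j Job) Job {
	return &exclusiveJob{job: j, busy: make(chan struct{}, 1)}
}

func (e *exclusiveJob) Run() error {
	select {
	case e.busy <- struct{}{}: // acquired the slot
		defer func() { <-e.busy }() // free the slot when the run ends
		return e.job.Run()
	default: // slot taken: another run is in progress
		return ErrAlreadyRunning
	}
}

func main() {
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan error, 1)

	job := WithoutConcurrentRuns(JobFunc(func() error {
		started <- struct{}{} // signal that the run has begun
		<-release             // simulate long-running work
		return nil
	}))

	go func() { done <- job.Run() }()
	<-started

	fmt.Println("second run while busy:", job.Run())
	close(release)
	fmt.Println("first run result:", <-done)
}
```

Because the guard lives in the wrapper, both the ledger job and the new detection job can share it without duplicating the locking logic.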

Related Issue(s)

Fixes #

How to test

Unit tests

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

easyCZ requested a review from a team on September 13, 2022 20:37
github-actions bot added the "team: webapp" label on Sep 13, 2022
easyCZ force-pushed the mp/usage-detect-invalid-usage-instances branch from 66fb185 to 5f7583f on September 13, 2022 20:51

svenefftinge (Member) commented Sep 14, 2022

Maintaining the consistency of workspace instance data is the responsibility of bridge. I really think we should fix any issues there and if needed add such metrics there as well.

easyCZ (Member, Author) commented Sep 14, 2022

I agree that the fix for this should go into bridge. However, the detection here joins against usage and gives us a signal in relation to the usage table. As such, I'd like this signal to live in usage, as that's the closest domain for it.

To put it differently, I see this as defensive from the usage side.

svenefftinge (Member) left a comment

We talked about this in a call and decided to go with this for now. In a couple of weeks we will check whether it provides value, and remove it again if it does not.

roboquat merged commit 0d757c4 into main on Sep 14, 2022
roboquat deleted the mp/usage-detect-invalid-usage-instances branch on September 14, 2022 08:08
roboquat added the "deployed: webapp" and "deployed" labels on Sep 14, 2022
Labels
"deployed: webapp", "deployed", "release-note-none", "size/L", "team: webapp"
3 participants