Extend cron job to handle VM-based preview envs #9691
Merged
Description
This PR extends our platform-delete-preview-environments-cron job to also handle VM-based preview environments.
It introduces a lightweight modelling of our preview environments as
type PreviewEnvironment = CoreDevPreviewEnvironment | HarvesterPreviewEnvironment
and moves the related code to each class. This should simplify removing CoreDevPreviewEnvironment in the future.
For now, this only checks for staleness of Harvester-based preview environments based on the git branch, not on DB activity in the preview environment.
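To make the modelling concrete, here is a minimal sketch of how such a union could look. Only the PreviewEnvironment alias and the two class names come from this PR; the fields, the isStale method, the StalenessInputs shape, and the core-dev rule (inactive branch or no recent DB activity) are assumptions made for illustration, not the code in this change.

```typescript
// Inputs for the staleness checks. Both fields are assumptions for this sketch.
interface StalenessInputs {
    // Branches that still count as active (assumed to be computed elsewhere).
    activeBranches: ReadonlySet<string>;
    // Hypothetical DB-activity lookup, keyed by preview environment name.
    hasRecentDbActivity: (name: string) => boolean;
}

class CoreDevPreviewEnvironment {
    readonly kind = "core-dev" as const;
    constructor(readonly name: string, readonly branch: string) {}

    // Assumed rule: stale if the branch is inactive or the DB shows no recent activity.
    isStale(inputs: StalenessInputs): boolean {
        return !inputs.activeBranches.has(this.branch) || !inputs.hasRecentDbActivity(this.name);
    }
}

class HarvesterPreviewEnvironment {
    readonly kind = "harvester" as const;
    constructor(readonly name: string, readonly branch: string) {}

    // Branch-only check, as described above: DB activity is not consulted.
    isStale(inputs: StalenessInputs): boolean {
        return !inputs.activeBranches.has(this.branch);
    }
}

type PreviewEnvironment = CoreDevPreviewEnvironment | HarvesterPreviewEnvironment;

// Callers work with the union and never branch on the concrete class.
function findStale(envs: PreviewEnvironment[], inputs: StalenessInputs): PreviewEnvironment[] {
    return envs.filter((env) => env.isStale(inputs));
}
```

Keeping the per-type logic inside each class means that dropping CoreDevPreviewEnvironment later would only touch that class and the union alias.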
Deleting preview environments sometimes fails. I changed the job so that a failed deletion fails only the slice for that specific preview environment, not the entire job. That means we're more likely to get through the job. Any deletion that fails might work on the 2nd or 3rd try (in my experience). In the future we can try to improve the reliability of preview environment deletion, but for now I think it's worth merging the PR as is. I created a quick Honeycomb board to help us visualise failure rates of preview environment deletions here, and I can set up a quick experimental trigger if the failure rate is above e.g. 15% so we can be proactive about this.
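The "fail the slice, not the job" behaviour can be sketched roughly as follows. SliceReporter is a hypothetical stand-in and not the real Werft API, and deleteOne, deleteAll, and failureRate are illustrative names. The sketch is synchronous for brevity; the actual job may well be async.

```typescript
// Hypothetical reporter; a stand-in for whatever marks a log slice as done or failed.
interface SliceReporter {
    done(slice: string): void;
    fail(slice: string, error: unknown): void;
}

interface DeletionSummary {
    deleted: string[];
    failed: string[];
}

// Attempts every deletion. A failure is recorded against its own slice
// and the loop continues, so one bad environment cannot abort the run.
function deleteAll(
    names: string[],
    deleteOne: (name: string) => void,
    reporter: SliceReporter,
): DeletionSummary {
    const summary: DeletionSummary = { deleted: [], failed: [] };
    for (const name of names) {
        const slice = `Deleting ${name}`;
        try {
            deleteOne(name);
        } catch (err) {
            summary.failed.push(name);
            reporter.fail(slice, err);
            continue;
        }
        summary.deleted.push(name);
        reporter.done(slice);
    }
    return summary;
}

// Fraction of attempted deletions that failed (0 when nothing was attempted).
function failureRate(summary: DeletionSummary): number {
    const total = summary.deleted.length + summary.failed.length;
    return total === 0 ? 0 : summary.failed.length / total;
}
```

A value like failureRate is the kind of number the 15% trigger mentioned above would be compared against, although in this PR that comparison is meant to happen in Honeycomb rather than in the job.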
I also added a few span attributes so we know the counts of preview environments in core-dev and Harvester. I thought this would be a fun way to get historic counts and also set up a simple threshold-based trigger in Honeycomb (example query) so we can get a Slack message if we're running, say, 35 Harvester preview environments ☺️
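As a rough illustration of the count attributes, the helper below turns a list of environment kinds into a flat attribute map that could be attached to a span. The attribute keys and the function name are assumptions for this sketch; the names actually used in the job may differ.

```typescript
type PreviewKind = "core-dev" | "harvester";

// Builds per-kind counts as a flat key/value map suitable for span attributes.
// The attribute keys below are hypothetical.
function previewCountAttributes(kinds: PreviewKind[]): Record<string, number> {
    const counts: Record<PreviewKind, number> = { "core-dev": 0, harvester: 0 };
    for (const kind of kinds) {
        counts[kind] += 1;
    }
    return {
        "preview_environments.core_dev.count": counts["core-dev"],
        "preview_environments.harvester.count": counts.harvester,
    };
}
```

Emitting the counts on every run gives a time series for free, which is what makes a simple threshold-based trigger possible.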
Lastly, I ended up making the job a bit less noisy slice-wise as it was becoming very hard to make sense of the Werft logs.
Sorry for doing so much in one PR.
Related Issue(s)
Part of https://github.com/gitpod-io/ops/issues/1713
How to test
During development I ran it in DRY_RUN mode by modifying the variable and side-loading the file
werft run github -j .werft/platform-delete-preview-environments-cron.yaml -s .werft/platform-delete-preview-environments-cron.ts
Once I had verified the preview environments were indeed stale I ran it without DRY_RUN using
werft run github -j .werft/platform-delete-preview-environments-cron.yaml
Release Notes
Documentation
N/A