
previews: Several improvements to clean up logic #12784


Merged · 4 commits · Sep 15, 2022

Conversation

@ArthurSens (Contributor) commented Sep 8, 2022

Description

Turns out this PR got a lot bigger than initially expected, because I ended up making several changes in a single PR.

  1. Introduce a job that cleans up individual previews.
  2. Update the preview clean-up cronjob to reuse that job for individual clean-ups.
  3. Update the job that creates an artificial preview to also delete the preview at the end.

Each change is done in a separate commit, so please use them to make the review process easier 🙂

Related Issue(s)

Fixes #

How to test

You can test each change with different werft commands:

  1. You can manually create a preview and delete it individually with the following commands:

     # Create new preview
     werft run github -a with-preview

     # Delete preview
     werft run github -j platform-delete-preview-environment.yaml -a preview="your preview name"

  2. When triggering the cron job, notice that it now triggers several new jobs, one for each stale preview:

     werft run github -j platform-delete-preview-environments-cron.yaml

  3. When triggering the job that creates an artificial preview, notice that it also triggers the deletion job at the end:

     werft run github -j platform-trigger-artificial-job.yaml

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

@ArthurSens (Contributor, Author) commented Sep 8, 2022

/hold because it wasn't tested enough

but getting some feedback regarding the approach would be nice 🙂

@vulkoingim (Contributor) commented Sep 9, 2022

I think this makes the cleanup script more complex, and we want the opposite.

How about we create a delete-preview-environment job that accepts a preview env name as input and just runs https://github.com/gitpod-io/gitpod/blob/main/.werft/platform-delete-preview-environments-cron.ts#L412-L423

We can call it at the end of the deploy-preview-env if a flag is set - we can use the type==artificial for that.
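
Roughly, a minimal sketch of such a job's entrypoint. The removeDNSRecords() and delete() calls mirror the ones quoted in the review comments below; the import paths and the HarvesterPreviewEnvironment class name are illustrative assumptions, not the PR's exact code:

import { Werft } from "./util/werft"; // assumed path
import { HarvesterPreviewEnvironment } from "./util/preview"; // assumed path and class

// Hypothetical entrypoint for a single-preview deletion job. Only
// removeDNSRecords() and delete() are grounded in this PR's diff;
// everything else here is illustrative.
async function deletePreviewEnvironment(previewName: string) {
    const werft = new Werft("delete-preview-environment");
    const sliceID = "delete preview environment";
    const previewEnvironment = new HarvesterPreviewEnvironment(werft, previewName); // assumed constructor
    await previewEnvironment.removeDNSRecords(sliceID);
    await previewEnvironment.delete();
}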

Also this would allow us to do a few more nice things:

  • Other jobs (e.g. the ws team's integration tests) can trigger it when they finish, so the preview env gets cleaned up
  • We can trigger the job on branch delete (i.e. when the PR gets merged)
  • We can have the delete cron run that job for every individual preview env, instead of looping through them; this would simplify the script, reduce the overall runtime, and surface issues when deleting a single preview env instead of burying them in the logs of the bigger job.

@ArthurSens (Contributor, Author) commented Sep 9, 2022

  • Other jobs (e.g. the ws team's integration tests) can trigger it when they finish, so the preview env gets cleaned up

Actually, I was expecting that we could add another preview type for integration tests, and those would be deleted under different premises.

  • We can trigger the job on branch delete (i.e. when PR gets merged)

That sounds like a good idea, but it also sounds like a significant change 😅.

  • We can have the delete cron run that job for every individual preview env, instead of looping through them; this would simplify the script, reduce the overall runtime, and surface issues when deleting a single preview env instead of burying them in the logs of the bigger job.

I do like the current strategy of deleting in batches though... We have a collection of premises, and we periodically purge previews that match them. Sounds pretty simple and effective.

I don't follow how having multiple different scripts simplifies anything here 🤔

@vulkoingim (Contributor) commented Sep 9, 2022

Actually, I was expecting that we could add another preview type for integration tests, and those would be deleted under different premises.

And you'll need more logic to determine whether any of those environments are stale, as they might have different requirements. By shifting responsibility for cleaning up those environments to the job that is responsible for them, you can be certain that whatever they were used for is done and they can be deleted.

E.g. in its current form this PR will try to delete any env that matches that type, but the job that is spinning up the env might not have finished yet. If you want to be sure it has finished, you now have to implement something else to check that. If it were an integration test, it might get interrupted in the middle and fail. (I imagine you would want to change the order of evaluation, since right now it won't have an effect and the env won't get cleaned up earlier.)

That sounds like a good idea, but it also sounds like a significant change 😅.

It's going to be an additional way to delete a preview env, not a replacement for the cron job. They serve different purposes.

I'm not saying we should implement this now, but it will allow us to add it later very easily.

I do like the current strategy of deleting in batches though... We have a collection of premises, and we periodically purge previews that match them. Sounds pretty simple and effective.

Nothing will change for the cron job except where the deletion logic is handled. The cron job will still loop through all preview envs and determine which ones need to be removed; it will just trigger a separate job for each env, which handles the deletion. And it will be faster, since multiple delete jobs will run in parallel.
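
For illustration, a minimal sketch of that fan-out, assuming the repo's exec shell helper and an illustrative previewsToDelete list (the names here are assumptions; the job file is the one used in this PR's "How to test" section):

import { exec } from "./util/shell"; // assumed path

// Hypothetical fan-out in the cron script: instead of deleting each stale
// preview inline, trigger the single-preview deletion job once per preview.
// `previewsToDelete` and the exec options are illustrative assumptions.
previewsToDelete.forEach((preview) => {
    exec(
        `werft run github -j platform-delete-preview-environment.yaml -a preview="${preview.name}"`,
        { slice: `Triggering deletion of ${preview.name}` },
    );
});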

I don't follow how having multiple different scripts simplifies anything here 🤔

The logic you introduce will no longer be needed, plus we'll get the rest of the benefits I described.

@ArthurSens (Contributor, Author) commented Sep 9, 2022

And you'll need more logic to determine whether any of those environments are stale, as they might have different requirements.

We'll just be adding another premise, right? I don't see how complex that can get.

By shifting responsibility for cleaning up those environments to the job that is responsible for them, you can be certain that whatever they were used for is done and they can be deleted.

It is not uncommon for jobs to fail mid-execution. In such cases, we won't clean up those previews. With a garbage-collection strategy, we're safe to fail and everything gets cleaned up eventually.

In a scenario where we have no room for any stale preview, I'd understand that periodic clean-ups wouldn't work, but that's not our case.

E.g. in its current form this PR will try to delete any env that matches that type, but the job that is spinning up the env might not have finished yet. If you want to be sure it has finished

That's a good point! I hadn't thought about it; I probably want to adjust my premise to account for preview age (30 mins should be totally fine).
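
A minimal sketch of that adjusted premise; the field names and helper are illustrative assumptions, not the PR's exact code:

// Hypothetical staleness premise: only GC artificial previews older than 30
// minutes, so a job that is still spinning up its preview isn't torn down.
// `type` and `createdAtMs` are assumed fields, not the repo's exact API.
const MAX_ARTIFICIAL_AGE_MINUTES = 30;

function isStaleArtificialPreview(preview: { type: string; createdAtMs: number }): boolean {
    const ageMinutes = (Date.now() - preview.createdAtMs) / (1000 * 60);
    return preview.type === "artificial" && ageMinutes > MAX_ARTIFICIAL_AGE_MINUTES;
}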

You now have to implement something else to check that. If it were an integration test, it might get interrupted in the middle and fail. (I imagine you would want to change the order of evaluation, since right now it won't have an effect and the env won't get cleaned up earlier.)

Yep, for different types of preview we'd have different premises. I don't think we have enough of them to say we're increasing complexity that much. This PR is introducing "regular" and "artificial", nothing more than that. It might make sense to add another for "integration-test", but that's it. Do you see us adding dozens of types here?

I do like the current strategy of deleting in batches though... We have a collection of premises, and we periodically purge previews that match them. Sounds pretty simple and effective.

Nothing will change for the cron job except where the deletion logic is handled. The cron job will still loop through all preview envs and determine which ones need to be removed; it will just trigger a separate job for each env, which handles the deletion. And it will be faster, since multiple delete jobs will run in parallel.

Ah ok, now I get it: the logic to delete a single preview will be reused by the cron. Sounds nice; we get parallelism and avoid duplicated code.

I don't follow how having multiple different scripts simplifies anything here 🤔

The logic you introduce will no longer be needed, plus we'll get the rest of the benefits I described.

What if the job fails midway and leaves dangling previews uncleaned? I really think this logic is still needed 😬

@ArthurSens marked this pull request as draft September 12, 2022 19:06
@ArthurSens force-pushed the as/artificial-clean-up branch from c5f5a2d to 3c71970 September 12, 2022 19:07
@ArthurSens force-pushed the as/artificial-clean-up branch from cb80f63 to e9edac2 September 13, 2022 14:45
@roboquat added size/XXL and removed size/L labels Sep 14, 2022
@ArthurSens force-pushed the as/artificial-clean-up branch 3 times, most recently from 2fb90d1 to a5b6b6d September 14, 2022 13:17
@ArthurSens force-pushed the as/artificial-clean-up branch from a5b6b6d to 837574a September 14, 2022 13:55
@ArthurSens force-pushed the as/artificial-clean-up branch from 837574a to 3920b16 September 14, 2022 15:01
@ArthurSens force-pushed the as/artificial-clean-up branch from 3920b16 to 5dcb250 September 14, 2022 17:09
@ArthurSens changed the title from "Introduce preview types and more aggressively clean up artificial previews" to "previews: Several improvements to clean up logic" Sep 14, 2022
@ArthurSens marked this pull request as ready for review September 14, 2022 17:17
@ArthurSens (Contributor, Author) commented Sep 14, 2022

/unhold

Finally tested and happy with the results 🙂

@mads-hartmann (Contributor) left a comment

I left a few comments - approving so you can decide if you want to fix them now or in a follow-up PR ☺️


Tracing.initialize()
    .then(() => {
        werft = new Werft("delete-preview-environment-cron");
@mads-hartmann (Contributor) commented:

Suggested change
werft = new Werft("delete-preview-environment-cron");
werft = new Werft("delete-preview-environment");

await previewEnvironment.removeDNSRecords(sliceID);
await previewEnvironment.delete();
exec(
`sudo chown -R gitpod:gitpod /workspace && git config --global user.name roboquat && git config --global user.email [email protected]`,
@mads-hartmann (Contributor) commented:

Do we need this on every execution of removePreviewEnvironment? Could we move this to the yaml file - that's where we have this "init" code in other jobs ☺️

werft log result -d "build job" url "$job_url"

while true; do
    job_phase=$(werft job get "${BUILD_ID}" -o json | jq --raw-output '.phase')
@mads-hartmann (Contributor) commented:

Werft is currently a bit unreliable (see #12628 (comment)) so there is a good chance that this will fail every once in a while.

You could temporarily disable "fail on error" and check the exit code manually. Something like this (not tested):

set +e
while true; do
    job_phase=$(werft job get "${BUILD_ID}" -o json | jq --raw-output '.phase')
    if [[ "$?" != "0" ]]; then
        echo "Failed to get job details. Sleeping and trying again"
        sleep 10
        continue
    fi
    # ... the rest of your code
done
set -e

This is how the workspace team worked around it for now.

@ArthurSens (Contributor, Author) replied:

Ah nice catch

@ArthurSens (Contributor, Author) replied:

Actually, I looked it up, and it seems set +e/-e doesn't make any difference in while loops 🤔

@mads-hartmann (Contributor) replied:

@ArthurSens I think that only applies to commands executed in the "condition" part of those expressions. E.g., try running this script:

#!/usr/bin/env bash

set -euo pipefail

function fail {
    if [[ $1 == "yes" ]]; then
        exit 1
    else
        exit 0
    fi
}

# A failing command in the while *condition* doesn't trip set -e;
# this loop's body simply never runs.
while $(fail "yes"); do
    echo "This shouldn't run"
done

# A failing command in the loop *body*, though, terminates the script.
while $(fail "no"); do
    echo "Running"
    sleep 1
    fail "yes"
done

You'll see:

gitpod /workspace/empty (main) $ ./test.sh 
Running

I.e., the fail "yes" in the loop body does cause the entire script to fail ☺️
