previews: Several improvements to clean up logic #12784
Conversation
/hold because it wasn't tested enough, but getting some feedback regarding the approach would be nice 🙂
I think this makes the cleanup script more complex, and we want the opposite. How about we create a delete-preview-environment job that accepts a preview env name as input and just runs https://github.com/gitpod-io/gitpod/blob/main/.werft/platform-delete-preview-environments-cron.ts#L412-L423? We can call it at the end of deploy-preview-env if a flag is set. This would also allow us to do a few more nice things.
I was expecting that we could add another type of preview created for integration tests actually, and they would be deleted with different premises
That sounds like a good idea, but it also sounds like a significant change 😅.
I do like the current strategy of deleting in batches, though... We have a collection of premises, and we periodically purge previews that match our premises. Sounds pretty simple and effective. I don't follow how having multiple different scripts simplifies anything here 🤔
And you'll require more logic to determine whether any of those environments are stale, as they might have different requirements. By shifting the responsibility for cleaning up those environments to the job that is responsible for them, you are certain that whatever they were used for is done and they can be deleted. E.g. in its current form this PR will try to delete any env that matches that type, but the job that is spinning up the env might not have finished yet. If you want to be sure that it is finished, now you have to implement something else to check that. If it were an integration test, for example, it might get interrupted in the middle and fail. (I imagine that you would like to change the order of evaluation, since right now it won't have an effect and the env won't get cleaned up earlier.)
It's going to be an additional way to delete a preview env, not a replacement for the cron job. They serve different purposes. I'm not saying to implement this now, but it will allow us to have this later very easily.
Nothing will change for the cron job, except where the logic for deletion is handled. The cron job will still loop through all preview envs and determine which ones need to be removed; it will just trigger a separate job for each env that handles the deletion. And it will be faster, since multiple delete jobs will then run in parallel.
The logic you introduce will not be needed anymore, plus we'll get the rest of the benefits that I described.
We'll just be adding another premise, right? I don't understand how complex this can be
It is not uncommon that jobs fail mid-execution. In such cases, we'll not be cleaning up those previews. With a garbage collection strategy, we're safe to fail and everything will be cleaned up eventually. In a scenario where we have no room for any stale preview, I'd understand that periodic clean-ups wouldn't work, but that's not our case.
That's a good point! I hadn't thought about it. I probably want to adjust my premise to account for preview age (30 mins should be totally fine).
Yep, for different types of preview we'd have different premises. I don't think we have that many to say that we're increasing complexity that much. This PR is introducing "regular" and "artificial", nothing more than that. It might make sense to add another for "integration-test", but that's it. Do you see us adding dozens of types here?
Ah ok, now I'm getting it, the logic to delete a single preview will be reused by the cron. Sounds nice, we'll get parallelism power and we won't have duplicated code.
What if the job fails midway and we have dangling previews uncleaned? I really think this logic is still needed 😬
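For concreteness, here is a minimal sketch of the two ideas discussed in this thread: a type-specific staleness premise that accounts for preview age, and a cron loop that delegates each deletion to a separate job so deletions run in parallel. The names, types, and the `triggerDeleteJob` helper are illustrative assumptions, not the actual werft implementation.

```ts
// Illustrative sketch only - names and types are assumptions, not the real werft code.
interface PreviewEnvironment {
    name: string;
    type: "regular" | "artificial";
    createdAt: Date;
}

// Premise: an "artificial" preview only counts as stale once it is old enough
// that the job which created it should have finished (30 minutes assumed here).
const MAX_ARTIFICIAL_AGE_MINUTES = 30;

function isStale(preview: PreviewEnvironment): boolean {
    const ageMinutes = (Date.now() - preview.createdAt.getTime()) / (60 * 1000);
    return preview.type === "artificial" && ageMinutes > MAX_ARTIFICIAL_AGE_MINUTES;
}

// The cron job keeps the garbage-collection loop, but instead of deleting inline
// it triggers one (hypothetical) delete-preview-environment job per stale preview,
// so the deletions run in parallel.
async function cleanupStalePreviews(
    previews: PreviewEnvironment[],
    triggerDeleteJob: (previewName: string) => Promise<void>,
): Promise<void> {
    await Promise.all(previews.filter(isStale).map((p) => triggerDeleteJob(p.name)));
}
```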
/unhold Finally tested and happy with the results 🙂
I left a few comments - approving so you can decide if you want to fix them now or in a follow-up PR.
```ts
Tracing.initialize()
    .then(() => {
        werft = new Werft("delete-preview-environment-cron");
```
Suggested change:
```diff
- werft = new Werft("delete-preview-environment-cron");
+ werft = new Werft("delete-preview-environment");
```
```ts
await previewEnvironment.removeDNSRecords(sliceID);
await previewEnvironment.delete();
exec(
    `sudo chown -R gitpod:gitpod /workspace && git config --global user.name roboquat && git config --global user.email [email protected]`,
```
Do we need this on every execution of removePreviewEnvironment? Could we move it to the yaml file? That's where we have this "init" code in other jobs.
```bash
werft log result -d "build job" url "$job_url"

while true; do
    job_phase=$(werft job get "${BUILD_ID}" -o json | jq --raw-output '.phase')
```
Werft is currently a bit unreliable (see #12628 (comment)), so there is a good chance that this will fail every once in a while.
You could temporarily disable "fail on error" and check the exit code manually. Something like this (not tested):
```bash
set +e
while true; do
    job_phase=$(werft job get "${BUILD_ID}" -o json | jq --raw-output '.phase')
    if [[ "$?" != "0" ]]; then
        echo "Failed to get job details. Sleeping and trying again"
        sleep 10
        # retry instead of falling through with an empty job_phase
        continue
    fi
    # ... the rest of your code
done
set -e
```
This is how workspace worked around it for now
Ah nice catch
@ArthurSens I think that only applies to commands executed in the "condition" part of those expressions. E.g., try running this script:
```bash
#!/usr/bin/env bash
set -euo pipefail

function fail {
    if [[ $1 == "yes" ]]; then
        exit 1
    else
        exit 0
    fi
}

while $(fail "yes"); do
    echo "This shouldn't run"
done

while $(fail "no"); do
    echo "Running"
    sleep 1
    fail "yes"
done
```
You'll see:
```
gitpod /workspace/empty (main) $ ./test.sh
Running
```
I.e. the `fail "yes"` inside the loop body does cause the entire script to fail.
Description
Turns out this PR got a lot bigger than initially expected, because I ended up making several changes in a single PR.
Each change is in a separate commit, so please use the commits to make the review process easier 🙂
Related Issue(s)
Fixes #
How to test
You can test each change with different werft commands:
Release Notes
Documentation
Werft options: