
previews: Several improvements to clean up logic #12784


Merged · 4 commits · Sep 15, 2022

Conversation

@ArthurSens (Contributor) commented Sep 8, 2022

Description

Turns out this PR got a lot bigger than initially expected, because I ended up making several changes in a single PR.

  1. Introduce a job that cleans up individual previews.
  2. Update the preview clean-up cronjob to reuse that job for individual clean-ups.
  3. Update the job that creates an artificial preview to also delete the preview at the end.

Each change is done in a separate commit, so please use them to make the review process easier 🙂

Related Issue(s)

Fixes #

How to test

You can test each change with different werft commands:

  1. You can manually create a preview and delete it individually with the following commands:

     # Create new preview
     werft run github -a with-preview

     # Delete preview
     werft run github -j platform-delete-preview-environment.yaml -a preview="your preview name"

  2. When triggering the cron job, notice that it now triggers several new jobs, one for each stale preview:

     werft run github -j platform-delete-preview-environments-cron.yaml

  3. When triggering the job that creates an artificial preview, notice that it also triggers the deletion job at the end:

     werft run github -j platform-trigger-artificial-job.yaml

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

@ArthurSens (Contributor, Author) commented Sep 8, 2022

/hold because it wasn't tested enough

but getting some feedback regarding the approach would be nice 🙂

@vulkoingim (Contributor) commented Sep 9, 2022

I think this makes the cleanup script more complex, and we want the opposite.

How about we create a delete-preview-environment job that accepts a preview env name as input and just runs https://github.com/gitpod-io/gitpod/blob/main/.werft/platform-delete-preview-environments-cron.ts#L412-L423

We can call it at the end of the deploy-preview-env if a flag is set - we can use the type==artificial for that.
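
Roughly, a minimal sketch of such a job's entrypoint. The removeDNSRecords() and delete() calls mirror the ones quoted in the review comments below; the import paths and the HarvesterPreviewEnvironment class name are illustrative assumptions, not the PR's exact code:

import { Werft } from "./util/werft"; // assumed path
import { HarvesterPreviewEnvironment } from "./util/preview"; // assumed path and class

// Hypothetical entrypoint for a single-preview deletion job. Only
// removeDNSRecords() and delete() are grounded in this PR's diff;
// everything else here is illustrative.
async function deletePreviewEnvironment(previewName: string) {
    const werft = new Werft("delete-preview-environment");
    const sliceID = "delete preview environment";
    const previewEnvironment = new HarvesterPreviewEnvironment(werft, previewName); // assumed constructor
    await previewEnvironment.removeDNSRecords(sliceID);
    await previewEnvironment.delete();
}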

Also this would allow us to do a few more nice things:

  • Other jobs (e.g. the ws team's integration tests) can trigger it when they finish, so the preview env gets cleaned up
  • We can trigger the job on branch delete (i.e. when the PR gets merged)
  • We can have the delete cron run that job for every individual preview env, instead of looping through them; this would simplify the script, reduce the overall runtime, and surface issues when deleting a single preview env instead of burying them in the logs of the bigger job.

@ArthurSens (Contributor, Author) commented Sep 9, 2022

  • Other jobs (e.g. the ws team's integration tests) can trigger it when they finish, so the preview env gets cleaned up

Actually, I was expecting that we could add another preview type for integration tests, and those would be deleted under different premises.

  • We can trigger the job on branch delete (i.e. when PR gets merged)

That sounds like a good idea, but it also sounds like a significant change 😅.

  • We can have the delete cron run that job for every individual preview env, instead of looping through them; this would simplify the script, reduce the overall runtime, and surface issues when deleting a single preview env instead of burying them in the logs of the bigger job.

I do like the current strategy of deleting in batches though... We have a collection of premises, and we periodically purge previews that match them. Sounds pretty simple and effective.

I don't follow how having multiple different scripts simplifies anything here 🤔

@vulkoingim (Contributor) commented Sep 9, 2022

Actually, I was expecting that we could add another preview type for integration tests, and those would be deleted under different premises.

And you'll need more logic to determine whether any of those environments are stale, as they might have different requirements. By shifting responsibility for cleaning up those environments to the job that is responsible for them, you can be certain that whatever they were used for is done and they can be deleted.

E.g. in its current form this PR will try to delete any env that matches that type, but the job that is spinning up the env might not have finished yet. If you want to be sure it has finished, you now have to implement something else to check that. If it were an integration test, it might get interrupted in the middle and fail. (I imagine you would want to change the order of evaluation, since right now it won't have an effect and the env won't get cleaned up earlier.)

That sounds like a good idea, but it also sounds like a significant change 😅.

It's going to be an additional way to delete a preview env, not a replacement for the cron job. They serve different purposes.

I'm not saying we should implement this now, but it will allow us to add it later very easily.

I do like the current strategy of deleting in batches though... We have a collection of premises, and we periodically purge previews that match them. Sounds pretty simple and effective.

Nothing will change for the cron job except where the deletion logic is handled. The cron job will still loop through all preview envs and determine which ones need to be removed; it will just trigger a separate job for each env, which handles the deletion. And it will be faster, since multiple delete jobs will run in parallel.
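
For illustration, a minimal sketch of that fan-out, assuming the repo's exec shell helper and an illustrative previewsToDelete list (the names here are assumptions; the job file is the one used in this PR's "How to test" section):

import { exec } from "./util/shell"; // assumed path

// Hypothetical fan-out in the cron script: instead of deleting each stale
// preview inline, trigger the single-preview deletion job once per preview.
// `previewsToDelete` and the exec options are illustrative assumptions.
previewsToDelete.forEach((preview) => {
    exec(
        `werft run github -j platform-delete-preview-environment.yaml -a preview="${preview.name}"`,
        { slice: `Triggering deletion of ${preview.name}` },
    );
});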

I don't follow how having multiple different scripts simplifies anything here 🤔

The logic you introduce will no longer be needed, plus we'll get the rest of the benefits I described.

@ArthurSens (Contributor, Author) commented Sep 9, 2022

And you'll need more logic to determine whether any of those environments are stale, as they might have different requirements.

We'll just be adding another premise, right? I don't see how complex that can get.

By shifting responsibility for cleaning up those environments to the job that is responsible for them, you can be certain that whatever they were used for is done and they can be deleted.

It is not uncommon for jobs to fail mid-execution. In such cases, we won't clean up those previews. With a garbage-collection strategy, we're safe to fail and everything gets cleaned up eventually.

In a scenario where we have no room for any stale preview, I'd understand that periodic clean-ups wouldn't work, but that's not our case.

E.g. in its current form this PR will try to delete any env that matches that type, but the job that is spinning up the env might not have finished yet. If you want to be sure it has finished

That's a good point! I hadn't thought about it; I probably want to adjust my premise to account for preview age (30 mins should be totally fine).
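
A minimal sketch of that adjusted premise; the field names and helper are illustrative assumptions, not the PR's exact code:

// Hypothetical staleness premise: only GC artificial previews older than 30
// minutes, so a job that is still spinning up its preview isn't torn down.
// `type` and `createdAtMs` are assumed fields, not the repo's exact API.
const MAX_ARTIFICIAL_AGE_MINUTES = 30;

function isStaleArtificialPreview(preview: { type: string; createdAtMs: number }): boolean {
    const ageMinutes = (Date.now() - preview.createdAtMs) / (1000 * 60);
    return preview.type === "artificial" && ageMinutes > MAX_ARTIFICIAL_AGE_MINUTES;
}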

You now have to implement something else to check that. If it were an integration test, it might get interrupted in the middle and fail. (I imagine you would want to change the order of evaluation, since right now it won't have an effect and the env won't get cleaned up earlier.)

Yep, for different types of preview we'd have different premises. I don't think we have enough of them to say we're increasing complexity that much. This PR is introducing "regular" and "artificial", nothing more than that. It might make sense to add another for "integration-test", but that's it. Do you see us adding dozens of types here?

I do like the current strategy of deleting in batches though... We have a collection of premises, and we periodically purge previews that match them. Sounds pretty simple and effective.

Nothing will change for the cron job except where the deletion logic is handled. The cron job will still loop through all preview envs and determine which ones need to be removed; it will just trigger a separate job for each env, which handles the deletion. And it will be faster, since multiple delete jobs will run in parallel.

Ah ok, now I get it: the logic to delete a single preview will be reused by the cron. Sounds nice; we get parallelism and avoid duplicated code.

I don't follow how having multiple different scripts simplifies anything here 🤔

The logic you introduce will no longer be needed, plus we'll get the rest of the benefits I described.

What if the job fails midway and leaves dangling previews uncleaned? I really think this logic is still needed 😬

@ArthurSens marked this pull request as draft September 12, 2022 19:06
@ArthurSens force-pushed the as/artificial-clean-up branch from c5f5a2d to 3c71970 September 12, 2022 19:07
@ArthurSens force-pushed the as/artificial-clean-up branch from cb80f63 to e9edac2 September 13, 2022 14:45
@roboquat added size/XXL and removed size/L labels Sep 14, 2022
@ArthurSens force-pushed the as/artificial-clean-up branch 3 times, most recently from 2fb90d1 to a5b6b6d September 14, 2022 13:17
@ArthurSens force-pushed the as/artificial-clean-up branch from a5b6b6d to 837574a September 14, 2022 13:55
@ArthurSens force-pushed the as/artificial-clean-up branch from 837574a to 3920b16 September 14, 2022 15:01
@ArthurSens force-pushed the as/artificial-clean-up branch from 3920b16 to 5dcb250 September 14, 2022 17:09
@ArthurSens changed the title from "Introduce preview types and more aggressively clean up artificial previews" to "previews: Several improvements to clean up logic" Sep 14, 2022
@ArthurSens marked this pull request as ready for review September 14, 2022 17:17
@ArthurSens (Contributor, Author) commented Sep 14, 2022

/unhold

Finally tested and happy with the results 🙂

@mads-hartmann (Contributor) left a comment

I left a few comments - approving so you can decide if you want to fix them now or in a follow-up PR ☺️


Tracing.initialize()
    .then(() => {
        werft = new Werft("delete-preview-environment-cron");
@mads-hartmann (Contributor) commented:

Suggested change
werft = new Werft("delete-preview-environment-cron");
werft = new Werft("delete-preview-environment");

await previewEnvironment.removeDNSRecords(sliceID);
await previewEnvironment.delete();
exec(
`sudo chown -R gitpod:gitpod /workspace && git config --global user.name roboquat && git config --global user.email [email protected]`,
@mads-hartmann (Contributor) commented:

Do we need this on every execution of removePreviewEnvironment? Could we move this to the yaml file - that's where we have this "init" code in other jobs ☺️

werft log result -d "build job" url "$job_url"

while true; do
    job_phase=$(werft job get "${BUILD_ID}" -o json | jq --raw-output '.phase')
@mads-hartmann (Contributor) commented:

Werft is currently a bit unreliable (see #12628 (comment)) so there is a good chance that this will fail every once in a while.

You could temporarily disable "fail on error" and check the exit code manually. Something like this (not tested):

set +e
while true; do
    job_phase=$(werft job get "${BUILD_ID}" -o json | jq --raw-output '.phase')
    if [[ "$?" != "0" ]]; then
        echo "Failed to get job details. Sleeping and trying again"
        sleep 10
        continue
    fi
    # ... the rest of your code
done
set -e

This is how the workspace team worked around it for now.

@ArthurSens (Contributor, Author) replied:

Ah nice catch

@ArthurSens (Contributor, Author) replied:

Actually, I looked it up, and it seems set +e/-e doesn't make any difference in while loops 🤔

@mads-hartmann (Contributor) replied:

@ArthurSens I think that only applies to commands executed in the "condition" part of those expressions. E.g., try running this script:

#!/usr/bin/env bash

set -euo pipefail

function fail {
    if [[ $1 == "yes" ]]; then
        exit 1
    else
        exit 0
    fi
}

# A failing command in the while *condition* doesn't trip set -e;
# this loop's body simply never runs.
while $(fail "yes"); do
    echo "This shouldn't run"
done

# A failing command in the loop *body*, though, terminates the script.
while $(fail "no"); do
    echo "Running"
    sleep 1
    fail "yes"
done

You'll see:

gitpod /workspace/empty (main) $ ./test.sh 
Running

I.e., the fail "yes" in the loop body does cause the entire script to fail ☺️
