[bridge] Marks stuck stopping and pending instances as stopped #13350

Merged 1 commit on Sep 29, 2022

Conversation

laushinka
Contributor

@laushinka laushinka commented Sep 27, 2022

Description

After two weeks of validation through logging (since 15.09.2022), this change marks instances stuck in the stopping or pending state as stopped once the timeout duration has passed, as well as running instances that ws-manager does not know about.
Findings from the logging:

  1. Most of the log entries were for the stopping state, and we could not clearly determine whether these were cases we want stopped. Instances can stay in the stopping state longer than we expected (the previous PR allowed 10 seconds after stoppingTime) because instances with a large backup take longer to stop.
  2. Many instances in the pending state were stuck there for days, and we do want those stopped.
  3. We found a few instances in the creating state that ended up running, and these we should not stop.

We are still unsure why these instances end up in these states. Therefore we will:

  1. Only stop instances stuck in the pending or stopping state after the timeout duration, plus instances in the running state. We know for sure we want these stopped if ws-manager does not know about them.
  2. Create an issue to understand the why.
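
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the type names, the `shouldMarkStopped` and `stoppedMessage` helpers, and the timeout fields are all hypothetical, and the handling of a missing `stoppingTime` is an assumption.

```typescript
// Hypothetical types for illustration only; not the actual ws-manager-bridge types.
type Phase = "pending" | "creating" | "running" | "stopping" | "stopped";

interface InstanceRecord {
    id: string;
    phase: Phase;
    creationTime: Date;
    stoppingTime?: Date;
}

// Assumed configuration shape: one timeout per phase, in milliseconds.
interface PhaseTimeouts {
    pendingMs: number;
    stoppingMs: number;
}

// Decides whether the bridge should mark an instance as stopped.
// Only instances that ws-manager does NOT know about are candidates.
function shouldMarkStopped(
    instance: InstanceRecord,
    knownToManager: boolean,
    now: Date,
    timeouts: PhaseTimeouts,
): boolean {
    if (knownToManager) {
        return false;
    }
    switch (instance.phase) {
        case "running":
            // Running but unknown to ws-manager: stop unconditionally.
            return true;
        case "pending":
            // Stuck in pending: stop once the timeout has elapsed since creation.
            return now.getTime() - instance.creationTime.getTime() > timeouts.pendingMs;
        case "stopping":
            // Assumption: without a stoppingTime we cannot measure the timeout, so we wait.
            if (!instance.stoppingTime) {
                return false;
            }
            return now.getTime() - instance.stoppingTime.getTime() > timeouts.stoppingMs;
        default:
            // creating (and anything else) is left alone: some of these end up running.
            return false;
    }
}

// Message format taken from the test steps for the pending phase;
// applying the same format to other phases is an assumption.
function stoppedMessage(previousPhase: Phase): string {
    return `Stopped by ws-manager-bridge. Previously in phase ${previousPhase}`;
}
```

Keeping the creating phase out of the switch reflects finding 3: those instances were seen to recover on their own.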

Related Issue(s)

Fixes #11397

How to test

  1. kubectl cordon the preview node.
  2. Run a workspace. It should be stuck in pending.
  3. Change the creationTime in d_b_workspace_instance to 1 hour earlier (or wait 1 hour). The instance should be marked as stopped with the message "Stopped by ws-manager-bridge. Previously in phase pending".
  4. Once done, kubectl uncordon the node, otherwise builds will fail because the node is unschedulable.

Release Notes

NONE

Documentation

Werft options:

  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-integration-tests=all
    Valid options are all, workspace, webapp, ide

@werft-gitpod-dev-com

started the job as gitpod-build-lau-pending-stopping-11397.4 because the annotations in the pull request description changed
(with .werft/ from main)

@laushinka laushinka force-pushed the lau/pending-stopping-11397 branch 2 times, most recently from 1ae156c to c54c7ca Compare September 27, 2022 10:04
@laushinka laushinka marked this pull request as ready for review September 27, 2022 10:36
@laushinka laushinka requested a review from a team September 27, 2022 10:36
@laushinka laushinka requested a review from geropl September 27, 2022 10:36
@github-actions github-actions bot added the team: webapp label Sep 27, 2022
@laushinka laushinka force-pushed the lau/pending-stopping-11397 branch from 8603bc6 to 9c6d7cc Compare September 27, 2022 14:53
@laushinka laushinka requested a review from geropl September 28, 2022 09:45
continue;
}

log.info(
Member

@laushinka Sorry for the long turnaround. I like the new layout of the loop! But it needs to cover this case: if an instance in runningInstancesIdx is running, we still need to stop it unconditionally, as we do at the moment.

Contributor Author

Makes sense that we should handle the running state as we do it now. Thanks, will add it!

@laushinka laushinka force-pushed the lau/pending-stopping-11397 branch from 09aa893 to 062a78c Compare September 29, 2022 09:19
@laushinka laushinka requested a review from geropl September 29, 2022 10:37
@geropl
Member

geropl commented Sep 29, 2022

Code LGTM, thx @laushinka ! 🙏
Will test now...

@laushinka
Contributor Author

laushinka commented Sep 29, 2022

> Code LGTM, thx @laushinka ! 🙏 Will test now...

@geropl Cool! Let me know if the steps I described were helpful, or if you test in a different way 🙏🏽

Member

@geropl geropl left a comment

Code LGTM, tested and works as expected! 🥇

> Let me know if the steps I described were helpful, or if you test in a different way

Did as you described, just had two meetings in between 😉

@roboquat roboquat merged commit e00bffa into main Sep 29, 2022
@roboquat roboquat deleted the lau/pending-stopping-11397 branch September 29, 2022 13:16
@roboquat roboquat added the deployed: webapp and deployed labels Sep 30, 2022
Labels
deployed: webapp, deployed, release-note-none, size/M, team: webapp
Development

Successfully merging this pull request may close these issues.

workspaces which cannot be scheduled to a cluster get stuck in Pending
3 participants