ws-manager: refactor volume snapshot watcher #13269

Merged
merged 1 commit into from
Sep 26, 2022

@jenting jenting commented Sep 25, 2022

Description

Refactor volume snapshot watcher.

Using a Go channel to notify the different event workers that a volume snapshot is ready seems risky.
On the gen67 cluster we observed that some workspaces with PVC can't be terminated. The symptom is that the workspace pod starts the disposal process and the volume snapshot becomes ready to use, but for some reason the ready event is either never sent by the sender or never received by the receiver.

We rewrote the watcher to use the volume snapshot client's watch together with a retry watcher, so that the watch runs inside each event worker. This minimizes the chance that the event is lost between sender and receiver.
In addition, if the retry watcher can't be created, we fall back to fetching the volume snapshot object with exponential-backoff retries.

Related Issue(s)

Fixes #13280

How to test

Create an ephemeral cluster and run the loadgen test with 100 regular workspaces + 100 regular workspaces with PVC.
Verify that no workspace pod stays in the terminating state for a long time.

Release Notes

None

Documentation

None

Werft options:

  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-integration-tests=workspace
    Valid options are all, workspace, webapp, ide

@werft-gitpod-dev-com
started the job as gitpod-build-jenting-pvc-stuck-on-stopping.12 because the annotations in the pull request description changed
(with .werft/ from main)

@werft-gitpod-dev-com
started the job as gitpod-build-jenting-pvc-stuck-on-stopping.13 because the annotations in the pull request description changed
(with .werft/ from main)

@jenting jenting force-pushed the jenting/pvc-stuck-on-stopping branch 2 times, most recently from 53c892d to d123c2f Compare September 26, 2022 03:36
@werft-gitpod-dev-com
started the job as gitpod-build-jenting-pvc-stuck-on-stopping.17 because the annotations in the pull request description changed
(with .werft/ from main)

@jenting jenting force-pushed the jenting/pvc-stuck-on-stopping branch from d123c2f to 210f1fd Compare September 26, 2022 07:03

jenting commented Sep 26, 2022

I used the us68 cluster to run loadgen with 100 regular workspaces + 100 regular workspaces with PVC, and I could reproduce workspaces stuck in the terminating state.

After I patched ws-manager with this PR and reran the same test (100 regular workspaces + 100 regular workspaces with PVC), I could no longer reproduce workspaces stuck in the terminating state.

Therefore, I think it's ready for review.

@jenting jenting marked this pull request as ready for review September 26, 2022 07:05
@jenting jenting requested a review from a team September 26, 2022 07:05
@github-actions github-actions bot added the team: workspace label Sep 26, 2022
@jenting jenting force-pushed the jenting/pvc-stuck-on-stopping branch from 210f1fd to ce5afda Compare September 26, 2022 09:38
@sagor999 sagor999 left a comment

/hold

LGTM, but hold for minor nit.

@jenting jenting force-pushed the jenting/pvc-stuck-on-stopping branch from ce5afda to 4e6268b Compare September 26, 2022 22:21

jenting commented Sep 26, 2022

nit addressed.
/unhold

@roboquat roboquat merged commit 01fda65 into main Sep 26, 2022
@roboquat roboquat deleted the jenting/pvc-stuck-on-stopping branch September 26, 2022 22:31
@roboquat roboquat added the deployed: workspace and deployed labels Sep 28, 2022
Labels
deployed: workspace, deployed, release-note-none, size/XL, team: workspace
Development

Successfully merging this pull request may close these issues.

[PVC] The workspace with PVC keeps in terminating state
3 participants