ws-manager: refactor volume snapshot watcher #13269
Conversation
I used the us68 cluster to run a loadgen test with 100 regular workspaces + 100 regular workspaces with PVC, and I could reproduce workspaces stuck in the Terminating state. After patching ws-manager with this PR and rerunning the same test (100 regular workspaces + 100 regular workspaces with PVC), I could no longer reproduce the issue. Therefore, I think it's ready for review.
/hold
LGTM, but hold for minor nit.
Signed-off-by: JenTing Hsiao <[email protected]>
Nit addressed.
Description
Refactor volume snapshot watcher.
Using a Go channel to notify the different event workers that a volume snapshot is ready has proven risky.
On the gen67 cluster we observed that some workspaces with PVC could not be terminated. The symptom: the workspace pod starts the disposal process and the volume snapshot becomes ready to use, but somehow the readiness event is either never sent by the sender or never received by the receiver.
We rewrote the watcher to use the volume snapshot client's watch API together with a retry watcher, so that each event worker runs its own watch. This minimizes the chance that a readiness event is lost between sender and receiver.
Moreover, if the retry watcher cannot be created, we fall back to fetching the volume snapshot object with exponential backoff. A sketch of this approach follows.
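The snippet below is a minimal sketch of the approach described above, not the actual ws-manager implementation. It assumes the external-snapshotter clientset (`github.com/kubernetes-csi/external-snapshotter/client/v6`) and client-go's `watchtools.NewRetryWatcher`; the function names (`waitForVolumeSnapshotReady`, `fallbackPoll`, `isReady`) are illustrative.

```go
// Package snapshotwatch sketches a per-worker VolumeSnapshot readiness watch
// with an exponential-backoff polling fallback. Names are illustrative.
package snapshotwatch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"

	volumesnapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	volumesnapshotclientset "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
)

// waitForVolumeSnapshotReady blocks until the named VolumeSnapshot reports
// ReadyToUse. It runs its own retry watcher so the readiness signal stays
// local to the calling event worker; if the watcher cannot be created it
// falls back to polling with exponential backoff.
func waitForVolumeSnapshotReady(ctx context.Context, client volumesnapshotclientset.Interface, namespace, name string) error {
	// List once to obtain an initial resourceVersion for the retry watcher.
	list, err := client.SnapshotV1().VolumeSnapshots(namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return fallbackPoll(ctx, client, namespace, name)
	}

	rw, err := watchtools.NewRetryWatcher(list.ResourceVersion, &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			options.FieldSelector = "metadata.name=" + name
			return client.SnapshotV1().VolumeSnapshots(namespace).Watch(ctx, options)
		},
	})
	if err != nil {
		// Could not create the retry watcher: fall back to backoff polling.
		return fallbackPoll(ctx, client, namespace, name)
	}
	defer rw.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case evt, ok := <-rw.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before snapshot %s/%s became ready", namespace, name)
			}
			snapshot, ok := evt.Object.(*volumesnapshotv1.VolumeSnapshot)
			if !ok {
				continue
			}
			if isReady(snapshot) {
				return nil
			}
		}
	}
}

// fallbackPoll fetches the VolumeSnapshot with exponential backoff until it
// reports ready or the retries are exhausted.
func fallbackPoll(ctx context.Context, client volumesnapshotclientset.Interface, namespace, name string) error {
	backoff := wait.Backoff{Duration: time.Second, Factor: 2, Steps: 10, Cap: time.Minute}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := ctx.Err(); err != nil {
			return false, err
		}
		snapshot, err := client.SnapshotV1().VolumeSnapshots(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, nil // treat errors as transient and keep retrying
		}
		return isReady(snapshot), nil
	})
}

func isReady(s *volumesnapshotv1.VolumeSnapshot) bool {
	return s.Status != nil && s.Status.ReadyToUse != nil && *s.Status.ReadyToUse
}
```

Running the watch inside each event worker, instead of fanning a single watcher's results out over a channel, keeps the readiness signal local to the worker that needs it, and the backoff fallback covers the case where the watch itself cannot be established.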
Related Issue(s)
Fixes #13280
How to test
Create an ephemeral cluster and run a loadgen test with 100 regular workspaces + 100 regular workspaces with PVC.
Ensure no workspace pod stays in the Terminating state for a long time (a check sketch follows below).
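As a hedged illustration of that last check, the snippet below lists workspace pods and flags any that have carried a deletion timestamp for longer than a threshold. The `component=workspace` label selector, namespace, and threshold are assumptions for illustration, not values taken from this PR.

```go
// Hypothetical verification helper: report workspace pods that have been
// Terminating (deletionTimestamp set) for longer than the given threshold.
package verify

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func reportStuckTerminatingPods(ctx context.Context, client kubernetes.Interface, namespace string, threshold time.Duration) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "component=workspace", // assumed label for workspace pods
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.DeletionTimestamp == nil {
			continue // not terminating
		}
		if age := time.Since(pod.DeletionTimestamp.Time); age > threshold {
			fmt.Printf("pod %s has been terminating for %s\n", pod.Name, age.Round(time.Second))
		}
	}
	return nil
}
```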
Release Notes
Documentation
None
Werft options:
- If enabled this will build install/preview
- Valid options are all, workspace, webapp, ide