ws-manager: refactor volume snapshot watcher #13269
Conversation
I used the us68 cluster to run a loadgen test with 100 regular workspaces + 100 regular workspaces with PVC, and I could reproduce workspaces stuck in the Terminating state. After patching ws-manager with this PR and rerunning the same test (100 regular workspaces + 100 regular workspaces with PVC), I could no longer reproduce the issue. Therefore, I think it's ready for review.
/hold
LGTM, but hold for minor nit.
Signed-off-by: JenTing Hsiao <[email protected]>
Nit addressed.
Description
Refactor volume snapshot watcher.
Using a Go channel to notify the different event workers that a volume snapshot is ready has proven risky.
On the gen67 cluster we observed that some workspaces with PVC could not be terminated. The symptom: the workspace pod starts the disposal process and the volume snapshot becomes ready to use, but somehow the readiness event is either never sent by the sender or never received by the receiver.
We rewrote the watcher to use the volume snapshot client's watch API together with a retry watcher, so that each event worker runs its own watch. This minimizes the chance that a readiness event is lost between sender and receiver.
Moreover, if the retry watcher cannot be created, we fall back to fetching the volume snapshot object with exponential backoff. A sketch of this approach follows.
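The snippet below is a minimal sketch of the approach described above, not the actual ws-manager implementation. It assumes the external-snapshotter clientset (`github.com/kubernetes-csi/external-snapshotter/client/v6`) and client-go's `watchtools.NewRetryWatcher`; the function names (`waitForVolumeSnapshotReady`, `fallbackPoll`, `isReady`) are illustrative.

```go
// Package snapshotwatch sketches a per-worker VolumeSnapshot readiness watch
// with an exponential-backoff polling fallback. Names are illustrative.
package snapshotwatch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"

	volumesnapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
	volumesnapshotclientset "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
)

// waitForVolumeSnapshotReady blocks until the named VolumeSnapshot reports
// ReadyToUse. It runs its own retry watcher so the readiness signal stays
// local to the calling event worker; if the watcher cannot be created it
// falls back to polling with exponential backoff.
func waitForVolumeSnapshotReady(ctx context.Context, client volumesnapshotclientset.Interface, namespace, name string) error {
	// List once to obtain an initial resourceVersion for the retry watcher.
	list, err := client.SnapshotV1().VolumeSnapshots(namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return fallbackPoll(ctx, client, namespace, name)
	}

	rw, err := watchtools.NewRetryWatcher(list.ResourceVersion, &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			options.FieldSelector = "metadata.name=" + name
			return client.SnapshotV1().VolumeSnapshots(namespace).Watch(ctx, options)
		},
	})
	if err != nil {
		// Could not create the retry watcher: fall back to backoff polling.
		return fallbackPoll(ctx, client, namespace, name)
	}
	defer rw.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case evt, ok := <-rw.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before snapshot %s/%s became ready", namespace, name)
			}
			snapshot, ok := evt.Object.(*volumesnapshotv1.VolumeSnapshot)
			if !ok {
				continue
			}
			if isReady(snapshot) {
				return nil
			}
		}
	}
}

// fallbackPoll fetches the VolumeSnapshot with exponential backoff until it
// reports ready or the retries are exhausted.
func fallbackPoll(ctx context.Context, client volumesnapshotclientset.Interface, namespace, name string) error {
	backoff := wait.Backoff{Duration: time.Second, Factor: 2, Steps: 10, Cap: time.Minute}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := ctx.Err(); err != nil {
			return false, err
		}
		snapshot, err := client.SnapshotV1().VolumeSnapshots(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, nil // treat errors as transient and keep retrying
		}
		return isReady(snapshot), nil
	})
}

func isReady(s *volumesnapshotv1.VolumeSnapshot) bool {
	return s.Status != nil && s.Status.ReadyToUse != nil && *s.Status.ReadyToUse
}
```

Running the watch inside each event worker, instead of fanning a single watcher's results out over a channel, keeps the readiness signal local to the worker that needs it, and the backoff fallback covers the case where the watch itself cannot be established.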
Related Issue(s)
Fixes #13280
How to test
Create an ephemeral cluster and run a loadgen test with 100 regular workspaces + 100 regular workspaces with PVC.
Ensure no workspace pod stays in the Terminating state for a long time (a check sketch follows below).
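As a hedged illustration of that last check, the snippet below lists workspace pods and flags any that have carried a deletion timestamp for longer than a threshold. The `component=workspace` label selector, namespace, and threshold are assumptions for illustration, not values taken from this PR.

```go
// Hypothetical verification helper: report workspace pods that have been
// Terminating (deletionTimestamp set) for longer than the given threshold.
package verify

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func reportStuckTerminatingPods(ctx context.Context, client kubernetes.Interface, namespace string, threshold time.Duration) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "component=workspace", // assumed label for workspace pods
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.DeletionTimestamp == nil {
			continue // not terminating
		}
		if age := time.Since(pod.DeletionTimestamp.Time); age > threshold {
			fmt.Printf("pod %s has been terminating for %s\n", pod.Name, age.Round(time.Second))
		}
	}
	return nil
}
```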
Release Notes
Documentation
None
Werft options:
- If enabled this will build install/preview
- Valid options are all, workspace, webapp, ide