
Add persistent volume support for workspaces #9242


Merged: 4 commits merged into main on May 3, 2022

Conversation

sagor999 (Contributor) commented Apr 12, 2022

Description

Adds support for using a persistent volume claim (PVC) for workspace storage instead of the node's local storage.
Usage is gated behind the "persistent volume claim" feature flag.
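
Below is a minimal, hypothetical sketch (not the actual ws-manager code) of what the flag toggles conceptually: the workspace content volume is backed by a dedicated PersistentVolumeClaim instead of a node-local hostPath directory. All names and paths are illustrative.

package sketch

import corev1 "k8s.io/api/core/v1"

// workspaceContentVolume returns the /workspace content volume for a pod.
// With the feature flag off, content lives in a node-local hostPath
// directory; with it on, a per-workspace PVC is used instead.
func workspaceContentVolume(usePVC bool, workspaceID string) corev1.Volume {
	if usePVC {
		return corev1.Volume{
			Name: "workspace-content",
			VolumeSource: corev1.VolumeSource{
				PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
					// one claim per workspace; deleted together with the workspace
					ClaimName: workspaceID,
				},
			},
		}
	}
	return corev1.Volume{
		Name: "workspace-content",
		VolumeSource: corev1.VolumeSource{
			HostPath: &corev1.HostPathVolumeSource{
				// illustrative node-local path used by the old code path
				Path: "/mnt/workspaces/" + workspaceID,
			},
		},
	}
}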

Related Issue(s)

Fixes #9142
Fixes #9442

How to test

Spin up a new workspace preview environment.
Go into the admin panel and enable the feature flag.
Start a new workspace; it will use a dedicated persistent volume.
Observe the new PVC being created (kubectl get pvc -A).
Delete the workspace; the PVC will be deleted as well.

Image builds use the old code path and are not affected by this change.
Prebuilds use the old code path and are not affected by this change.

Limitations:
The current implementation doesn't support restoring from a previous backup. That will be added separately.

Release Notes

[experimental] Add persistent volume support for workspaces

Documentation

@sagor999 sagor999 changed the title from "Pavel/9142 pvc2" to "Add persistent volume support for workspaces" Apr 13, 2022
@sagor999 sagor999 force-pushed the pavel/9142-pvc2 branch 3 times, most recently from 1af7aa8 to 8589a81 on April 14, 2022 18:14
sagor999 (Contributor, Author) commented Apr 14, 2022

/werft run

👍 started the job as gitpod-build-pavel-9142-pvc2.13

@sagor999 sagor999 force-pushed the pavel/9142-pvc2 branch 3 times, most recently from b987d4d to 294bfbc on April 19, 2022 22:05
@roboquat roboquat added size/XXL and removed size/XL labels Apr 19, 2022
sagor999 (Contributor, Author) commented Apr 19, 2022

/werft run with-clean-slate-deployment

👍 started the job as gitpod-build-pavel-9142-pvc2.31

@sagor999 sagor999 marked this pull request as ready for review April 20, 2022 00:17
@sagor999 sagor999 requested a review from a team April 20, 2022 00:17
@sagor999 sagor999 requested review from aledbf and Furisto as code owners April 20, 2022 00:17
@sagor999 sagor999 requested review from a team April 20, 2022 00:17
@github-actions github-actions bot added the team: IDE and team: delivery (Issue belongs to the self-hosted team) labels Apr 20, 2022
sagor999 (Contributor, Author) commented:
@csweichel @aledbf @Furisto @gitpod-io/engineering-self-hosted @gitpod-io/engineering-ide
Please review. This PR has been ready for review for over 10 days now, and keeping it up to date with all the incoming changes takes a lot of my time, so I would greatly appreciate 👀 on it.
If there are any issues, please raise them, and I will take care of them right away.

@kylos101 for visibility.

utam0k (Contributor) commented May 2, 2022

/werft run with-clean-slate-deployment

👍 started the job as gitpod-build-pavel-9142-pvc2.44
(with .werft/ from main)

utam0k (Contributor) commented May 2, 2022

I tried it in the preview environment, but it stopped at "Preparing workspace" and did not proceed.
The PVC stayed Pending the whole time:

staging-pavel-9142-pvc2        ws-db9ba9eb-2670-4f1e-9d68-a17e45816063   Pending                                                                                       10m

akosyakov (Member) left a comment

I did not try it, but I looked at the supervisor changes and they look backward compatible with how we check for content readiness.

csweichel (Contributor) left a comment

LGTM, thank you for incorporating the feedback.

In a follow-up PR we need to add CDWP and status tests for the PVC workspaces.

// persistent_volume_claim means that we use PVC instead of HostPath to mount /workspace folder and content-init
// happens inside workspacekit instead of in ws-daemon. We also use k8s Snapshots to store/restore workspace content
// instead of GCS/tar.
bool persistent_volume_claim = 9;
utam0k (Contributor) commented May 2, 2022

How about using an enumeration instead of a bool? e.g. VolumeType, StorageType, etc.
https://developers.google.com/protocol-buffers/docs/proto3#enum

sagor999 (Contributor, Author) replied:

Not sure this is applicable here, since this bool is used as a feature flag. Once the PVC feature is stable and working, we will remove the old code along with the feature flag, and the new code will become the default code path.

@@ -1221,6 +1221,12 @@ func startContentInit(ctx context.Context, cfg *Config, wg *sync.WaitGroup, cst
}()

fn := "/workspace/.gitpod/content.json"
fnReady := "/workspace/.gitpod/ready"
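// When the PVC feature flag is active, workspace content is mounted under
// /.workspace rather than /workspace, so prefer that location when present.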
if _, err := os.Stat("/.workspace/.gitpod/content.json"); !os.IsNotExist(err) {
fn = "/.workspace/.gitpod/content.json"
Contributor commented:

To make it explicit that these paths come from a PVC mount, why not refer to a variable shared with the following location? That would help readers of the code.
https://github.com/gitpod-io/gitpod/pull/9242/files#diff-46a025104f556cc7faed4319cb1c9d2bd0b5097231b42707825aaf0df24d7337R477
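
A minimal sketch of that suggestion, assuming a hypothetical shared constant (the names below are illustrative and not from the PR):

package sketch

import (
	"os"
	"path/filepath"
)

// pvcContentMount is where workspace content is mounted when the PVC
// feature flag is active; the non-PVC code path keeps using /workspace.
const pvcContentMount = "/.workspace"

// contentJSONPath returns the location of content.json, preferring the
// PVC mount when it exists.
func contentJSONPath() string {
	p := filepath.Join(pvcContentMount, ".gitpod", "content.json")
	if _, err := os.Stat(p); err == nil {
		return p
	}
	return "/workspace/.gitpod/content.json"
}

Referring to one shared constant from both startContentInit and the linked location would make the PVC origin of the path obvious.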

sagor999 (Contributor, Author) replied:

I think we will switch the default to always use the /.workspace folder, so that we don't need these checks in the future.

Furisto (Member) commented May 2, 2022

I tried it here, but the workspace never gets ready: "timed out waiting for the condition".

sagor999 (Contributor, Author) commented May 2, 2022

@Furisto @utam0k thank you for the reviews! For the preview environment, please use this branch:
https://github.com/gitpod-io/workspace-preview/pull/33
It sets up the storage class to use for PVCs (without it, k8s doesn't know which storage class to use, which is why the PVC is stuck in Pending).
You also cannot use the PVC feature in our dev environment just yet, as the platform team would need to set up the StorageClass/snapshot CRDs there first. I will reach out to them once PVC support is a bit more feature complete.
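
For context, here is a minimal client-go sketch of the kind of claim involved; the storage-class name and the 30Gi size are illustrative assumptions, not values taken from this PR (field types follow the k8s.io/api versions current at the time, where PersistentVolumeClaimSpec.Resources is a ResourceRequirements). Without a resolvable storage class, explicit or cluster default, such a claim stays Pending, which matches the symptom reported above.

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createWorkspacePVC creates a per-workspace claim. The referenced storage
// class must exist (or a cluster default must be set); otherwise the claim
// stays Pending and the workspace hangs in "Preparing workspace".
func createWorkspacePVC(ctx context.Context, c kubernetes.Interface, ns, name, storageClass string) (*corev1.PersistentVolumeClaim, error) {
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &storageClass,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					// illustrative size; workspaces elsewhere in this thread
					// are described as having a 30GB storage quota
					corev1.ResourceStorage: resource.MustParse("30Gi"),
				},
			},
		},
	}
	return c.CoreV1().PersistentVolumeClaims(ns).Create(ctx, pvc, metav1.CreateOptions{})
}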

sagor999 (Contributor, Author) commented May 2, 2022

/werft run with-clean-slate-deployment

👍 started the job as gitpod-build-pavel-9142-pvc2.46
(with .werft/ from main)

jenting (Contributor) commented May 3, 2022

You also cannot use the PVC feature in our dev environment just yet, as the platform team would need to set up the StorageClass/snapshot CRDs there first. I will reach out to them once PVC support is a bit more feature complete.

I thought core-dev already has the CRDs pre-installed.
[screenshot]

jenting (Contributor) left a comment

Questions:

  • What if the PVC storage is full, will the workspace hang? Or will it auto-increase the PVC size? (I guess not?)
  • If I understand correctly, once the workspace stops, it would still back up the content to remote storage like S3, correct?

@@ -227,6 +227,103 @@ func (s *Provider) GetContentLayer(ctx context.Context, owner, workspaceID strin
return nil, nil, xerrors.Errorf("no backup or valid initializer present")
}

// GetContentLayerPVC provides the content layer for a workspace that uses PVC feature
func (s *Provider) GetContentLayerPVC(ctx context.Context, owner, workspaceID string, initializer *csapi.WorkspaceInitializer) (l []Layer, manifest *csapi.WorkspaceContentManifest, err error) {
Contributor commented:

NOTE: we could refactor these functions GetContentLayer and GetContentLayerPVC into one in the future.

sagor999 (Contributor, Author) replied:

Indeed. Once PVC is stable and we have switched over to it, the old code will be removed (so that we will have only one version of that function).

sagor999 (Contributor, Author) commented May 3, 2022

@jenting

* What if the PVC storage is full, will the workspace hang? Or will it auto-increase the PVC size? (I guess not?)

It depends on the workspace, but it might experience issues. It is the same behavior as before PVC though (workspaces have the same 30GB storage quota set on them). We would not auto-increase the PVC size.

* If I understand correctly, once the workspace stops, it would still back up the content to remote storage like S3, correct?

When a workspace stops, we will create a volume snapshot from the PVC and that will be our backup (part of a different PR). We are moving away from S3/object storage.
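
As a rough illustration of that flow (the backup-via-snapshot part lands in a separate PR, so this is only a sketch of the general CSI VolumeSnapshot API; the snapshot-class name and client import path are assumptions and may differ by version):

package sketch

import (
	"context"

	snapv1 "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
	snapclient "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

// snapshotWorkspacePVC asks the CSI snapshot controller to snapshot the
// workspace's claim; on GCP this results in a disk snapshot that outlives
// the cluster.
func snapshotWorkspacePVC(ctx context.Context, cfg *rest.Config, ns, pvcName, snapshotName string) error {
	cs, err := snapclient.NewForConfig(cfg)
	if err != nil {
		return err
	}
	class := "csi-gce-pd-snapshot-class" // assumed VolumeSnapshotClass name
	snap := &snapv1.VolumeSnapshot{
		ObjectMeta: metav1.ObjectMeta{Name: snapshotName, Namespace: ns},
		Spec: snapv1.VolumeSnapshotSpec{
			VolumeSnapshotClassName: &class,
			Source: snapv1.VolumeSnapshotSource{
				PersistentVolumeClaimName: &pvcName,
			},
		},
	}
	_, err = cs.SnapshotV1().VolumeSnapshots(ns).Create(ctx, snap, metav1.CreateOptions{})
	return err
}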

sagor999 (Contributor, Author) commented May 3, 2022

@aledbf just need your approval.

@roboquat roboquat merged commit 91548a6 into main May 3, 2022
@roboquat roboquat deleted the pavel/9142-pvc2 branch May 3, 2022 21:38
jenting (Contributor) commented May 3, 2022

When a workspace stops, we will create a volume snapshot from the PVC and that will be our backup (part of a different PR). We are moving away from S3/object storage.

If I understand correctly, creating a volume snapshot leverages the Kubernetes snapshot controller plus the volumesnapshot/volumesnapshotcontent/volumesnapshotclass CRDs. However, these are in-cluster CRDs and the volume snapshot could be stored on an external storage disk, so how would we handle the case where we launch a new cluster and shift traffic to it? That would mean the old and new cluster must use the same external storage, which in turn means the external storage needs to support mounts from multiple nodes/zones/...

sagor999 (Contributor, Author) commented May 4, 2022

However, these are in-cluster CRDs and the volume snapshot could be stored on an external storage disk

Not quite. We are using GCP's snapshot feature for this. When we create a VolumeSnapshot, it actually creates a GCP snapshot of the disk the PVC was provisioned from.
So snapshots live outside of the cluster lifecycle.
When we bring up a new cluster, we create a volume snapshot object referencing the snapshot in GCP.
Because of that, we can also mount a snapshot from one region in another region (though that incurs cross-region cost).
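
A complementary sketch of the restore side described here: bringing content into a new cluster by pre-populating a fresh claim from a VolumeSnapshot via the standard dataSource field. Names and the 30Gi size are illustrative, and the same Resources field-type caveat as the earlier sketch applies.

package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pvcFromSnapshot builds a claim whose contents are restored from an
// existing VolumeSnapshot object (which in turn references the GCP disk
// snapshot), so a workspace can start in a brand-new cluster.
func pvcFromSnapshot(name, snapshotName, storageClass string) *corev1.PersistentVolumeClaim {
	apiGroup := "snapshot.storage.k8s.io"
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &storageClass,
			DataSource: &corev1.TypedLocalObjectReference{
				APIGroup: &apiGroup,
				Kind:     "VolumeSnapshot",
				Name:     snapshotName,
			},
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("30Gi")},
			},
		},
	}
}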

jenting (Contributor) commented May 4, 2022


Awesome 💯 , then we must make sure the self-hosted works as well 🙏

@roboquat roboquat added the deployed: IDE (IDE change is running in production) and deployed: workspace (Workspace team change is running in production) labels May 4, 2022
Labels: deployed: IDE, deployed: workspace, release-note, size/XXL, team: delivery, team: IDE, team: workspace
Projects: None yet
9 participants