Skip to content

Replace k8s client get requests with volume cache #83

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 30, 2020

Conversation

RaunakShah
Copy link
Contributor

@RaunakShah RaunakShah commented Jun 30, 2020

This change replaces API server Get() requests with access to the volume cache.
Addresses #82

Testing:
~2x increase in throughput of external-provisioner. "Throttling ..." messages due to lib-external-provisioner no longer seen due to Get() requests to API server.
With a dummy CSI controller that returns success for every CreateVolume and DeleteVolume call
Use 64-threads to create and delete PVCs in a loop for 5 minutes. Tracking the number of operations successfully executed within this time helps calculate the throughput.

  1. Without this change:
    Operations completed - 728
    Ops/sec - 2.435

  2. With this change:
    Operations completed - 1280
    Ops/sec - 4.281

$ make test 
go mod tidy
go test ./controller
ok  	sigs.k8s.io/sig-storage-lib-external-provisioner/v5/controller	14.030s
go test ./allocator
ok  	sigs.k8s.io/sig-storage-lib-external-provisioner/v5/allocator	0.134s

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 30, 2020
@k8s-ci-robot k8s-ci-robot requested review from jsafrane and wongma7 June 30, 2020 16:25
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 30, 2020
@wongma7
Copy link

wongma7 commented Jun 30, 2020

These Gets intentionally bypass the cache to avoid duplicating subsequent Create/Delete calls. In terms of apiserver load/throttling, it seems obvious that they should be removed.

Keep the Get call: the controller will always call 1 Get + 1 Create/Delete.
Remove the Get call: the controller will usually call 1 Create/Delete;
and if the cache is stale and the controller has been waiting for the lock will sometimes call 2 Create/Delete (the second of which will no-op/return error anyway)

The reason these Gets are here is to mimic the in-tree controller: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/persistentvolume/pv_controller.go#L1214

@jsafrane do you think we still need these Get calls? IMO no.

@wongma7
Copy link

wongma7 commented Jun 30, 2020

Also, I don't think the 'waiting for locks' scenario is even possible anymore ever since we moved to a workqueue model

@jsafrane
Copy link
Contributor

jsafrane commented Jul 1, 2020

The comment about locks is probably obsolete.

Still, direct GET is useful. Consider this scenario:

  1. The provisioner provisions a PVC, saves the provisioned PV to API server
  2. The provisioner gets "PVC changed" event for the same PVC (e.g. something adds a label or an annotation)
  3. To the provisioner it looks like an unprovisioned PVC, so it tries to provision it again. It may not have the PV in its informer cache yet, so ti GETs it from the API server to be 100% sure it does not exist yet.

If you make sure that the provisioner has cache composed both from the informer / API server watch and list of recently provisioned volumes, then yes, you can remove the GET call. I am not sure how safe it is to adds items directly to informer's Store (volumes.Add()), maybe it's worth exploring.

@@ -187,6 +188,8 @@ type ProvisionController struct {
claimsInProgress sync.Map

volumeStore VolumeStore

volumeLister corelisters.PersistentVolumeLister
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why volumeLister? There is volumeInformer and volumes already available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like i totally missed volumes. I've only tested this with an explicit lister so will try that out and update the commit.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 1, 2020
@RaunakShah
Copy link
Contributor Author

@jsafrane @wongma7 does CSI spec dictating that CreateVolume needs to be idempotent have any bearing here? In the situation that @jsafrane described - if the PV is not found in the cache, external-provisioner will try to create the volume again. Since that's supposed to be idempotent, the CSI driver should respond with a successful message. The subsequent attempt to Create() the PV on the API server will return a success since it already exists (https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/volume_store.go#L155)

@jsafrane
Copy link
Contributor

jsafrane commented Jul 3, 2020

Yes, CSI should be idempotent, however, we should not exercise it too often as calling CreateVolume may be an expensive operation for some storage backends (far more expensive than Kubernetes API call).

Did you check if it's possible to add newly provisioned PV into volumes directly? That could be a pretty cheap solution.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 7, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 7, 2020
@@ -1449,20 +1454,7 @@ func (ctrl *ProvisionController) deleteVolumeOperation(ctx context.Context, volu
operation := fmt.Sprintf("delete %q", volume.Name)
glog.Info(logOperation(operation, "started"))

// This method may have been waiting for a volume lock for some time.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this due to the earlier discussion on locks. deleteVolumeOperation takes the volume object as an input parameter

@RaunakShah
Copy link
Contributor Author

@jsafrane i've changed the PR to use the volumes cache instead of lister, please take a look!

@@ -705,7 +705,13 @@ func NewProvisionController(
// PersistentVolumes

volumeHandler := cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { controller.enqueueVolume(obj) },
AddFunc: func(obj interface{}) {
if err = controller.volumes.Add(obj); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The volume coming from informer already is in the informer cache, you don't need to add it again.

What you need to do is to add the volume to the store after the provisioner saved the volume to the API server here

klog.V(5).Infof("Volume %s saved", volume.Name)

and here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RaunakShah RaunakShah force-pushed the add_pvlister branch 2 times, most recently from 4a08db9 to 6fd1e9f Compare July 9, 2020 18:01
@RaunakShah
Copy link
Contributor Author

That's weird, the tests are passing locally:

go test ./controller
ok      sigs.k8s.io/sig-storage-lib-external-provisioner/v6/controller  14.219s
go test ./allocator
ok      sigs.k8s.io/sig-storage-lib-external-provisioner/v6/allocator   (cached)

@RaunakShah RaunakShah changed the title Replace k8s client get requests with volume lister Replace k8s client get requests with volume cache Jul 15, 2020
@RaunakShah
Copy link
Contributor Author

/retest

@jsafrane
Copy link
Contributor

I'm afraid the unit test error is genuine - please check what did you change in the provisioner metrics.

@RaunakShah
Copy link
Contributor Author

@jsafrane actually i think it may just be a flaky test!
I'm able to reproduce the failure on ToT as well. 4/10 attempts fail with the same error:

--- FAIL: TestController (6.61s)
    --- FAIL: TestController/try_to_provision_for_claim-1_but_fail_to_save_the_pv_object (0.21s)
        controller_test.go:684: expected metrics:
             {provisioned:map[class-1:{success:0 failed:1}] deleted:map[]}
             but got:
             {provisioned:map[] deleted:map[]}

@jsafrane
Copy link
Contributor

/retest

@@ -1437,6 +1437,9 @@ func (ctrl *ProvisionController) provisionClaimOperation(ctx context.Context, cl
glog.Info(logOperation(operation, "succeeded"))

if err := ctrl.volumeStore.StoreVolume(claim, volume); err != nil {
if volumAddErr := ctrl.volumes.Add(volume.Name); volumAddErr != nil {
Copy link
Contributor

@jsafrane jsafrane Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add the volume to cache on StoreVolume success, not after failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! I have no idea how it got in there

@RaunakShah
Copy link
Contributor Author

/test pull-sig-storage-lib-external-provisioner-unit

@jsafrane
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane, RaunakShah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 30, 2020
@k8s-ci-robot k8s-ci-robot merged commit bb73692 into kubernetes-sigs:master Jul 30, 2020
@@ -1439,6 +1439,9 @@ func (ctrl *ProvisionController) provisionClaimOperation(ctx context.Context, cl
if err := ctrl.volumeStore.StoreVolume(claim, volume); err != nil {
return ProvisioningFinished, err
}
if err = ctrl.volumes.Add(volume.Name); err != nil {
utilruntime.HandleError(err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Add fails all the time with:

couldn't create key for object pvc-17e4c2d0-114d-4260-b587-c39a86c8dfe5: object has no meta: object does not implement the Object interfaces

You need to add the volume object, not its name...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix is in #92

@@ -1523,6 +1517,9 @@ func (ctrl *ProvisionController) deleteVolumeOperation(ctx context.Context, volu

glog.Info(logOperation(operation, "persistentvolume deleted"))

if err = ctrl.volumes.Delete(volume.Name); err != nil {
utilruntime.HandleError(err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants