-
Notifications
You must be signed in to change notification settings - Fork 143
We need to ensure that we are not reading stale PV objects #96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
We need to ensure that we are not reading stale PV objects #96
Conversation
/assign @msau42 |
pkg/controller/controller.go
Outdated
@@ -341,6 +341,15 @@ func (ctrl *resizeController) pvNeedResize(pvc *v1.PersistentVolumeClaim, pv *v1 | |||
return false | |||
} | |||
|
|||
// lets make sure that we are reading latest PV object so as any PV updates that might have happened |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you call resize twice on the same PV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general once kubelet finishes resizing then all resizing related conditions are removed from PVC. But what is happening in e2e is - we take down the pod and then we expand the PVC and we wait for "FileSystemResizePending" condition on PVC, but if resize is called again in resize controller then that condition might get replaced by "Resizing" condition and hence e2e may flake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean we should keep an internal cache and update it after we make an API update, similar to what we do in pv controller? That way, we don't need to do a GET request on every loop, which will limit our scalibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the reason I think this will eliminate this kind of flakes is because - the previous operation on PVC can't finish without writing the object to etcd. The only reason we would return true from here is if informers haven't been updated yet, but if we did read from apiserver directly, this problem should go away.
Alternatively we will have to maintain local cache like provisioner/snapshotter etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about local caching but I am not sure if it is worth doing in this case. The PV size comparison is always done after PVC size comparsion, so in this case PVC does have a volume expansion pending (like we know pvc.Status.Capacity < pvc.Spec.Capacity
). So this API call should only be made twice per PVC size change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spawning 10 threads to resize PVCs in a loop for 120s. Each thread moves onto the next resize operation once the PVC enters FileSystemResizePending
condition. Dummy vSphere CSI driver that returns success for ControllerExpandVolume
calls.
- Without this change, using image
quay.io/k8scsi/csi-resizer:v1.0.0-rc2
:
Ran 134 operations with throughput of 1.127 ops/sec - With this change:
Ran 88 operations with a throughput of 0.79 ops/sec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understand correctly, this PR overall reduced number of API requests from external-resizer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite the opposite unfortunately :) With the rc image, i can get 134 calls to ControllerExpandVolume
but that drops to 88 with this PR, the reason being that each ControllerExpandVolume
request takes longer to process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsafrane pointed out that we could do same thing @RaunakShah did in provisioner - kubernetes-sigs/sig-storage-lib-external-provisioner#83 . lets see if this helps.
ccf4ef5
to
97eda0d
Compare
97eda0d
to
a5c04c5
Compare
/retest |
1 similar comment
/retest |
@msau42 can you PTAL. I have moved the PR to use |
pkg/controller/controller.go
Outdated
|
||
pvc, ok := pvcObject.(*v1.PersistentVolumeClaim) | ||
if !ok { | ||
klog.Errorf("expected PVC got: %v", pvcObject) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we let the caller log the error? Ditto throughout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
pkg/controller/controller.go
Outdated
@@ -407,7 +434,7 @@ func (ctrl *resizeController) resizePVC(pvc *v1.PersistentVolumeClaim, pv *v1.Pe | |||
ctrl.eventRecorder.Eventf(pvc, v1.EventTypeWarning, util.VolumeResizeFailed, err.Error()) | |||
} | |||
|
|||
return err | |||
return ctrl.claims.Update(updatedPVC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already update it in resizeVolume
. Do we need to update it here too? Would it be simpler if we just always update it after immediately doing a successful api update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did not update PVC in resizeVolume
function. We updated PV in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment applies. Can we update the cache immediately after the update/patch operation, instead of the callers needing to realize that the function updated the api?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved code that updates the cache to be in same place where PV or PVC object is updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! I think this looks cleaner
a5c04c5
to
ce0524f
Compare
/retest |
pkg/controller/controller.go
Outdated
@@ -407,7 +434,7 @@ func (ctrl *resizeController) resizePVC(pvc *v1.PersistentVolumeClaim, pv *v1.Pe | |||
ctrl.eventRecorder.Eventf(pvc, v1.EventTypeWarning, util.VolumeResizeFailed, err.Error()) | |||
} | |||
|
|||
return err | |||
return ctrl.claims.Update(updatedPVC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment applies. Can we update the cache immediately after the update/patch operation, instead of the callers needing to realize that the function updated the api?
pkg/controller/controller.go
Outdated
@@ -375,6 +387,12 @@ func (ctrl *resizeController) resizePVC(pvc *v1.PersistentVolumeClaim, pv *v1.Pe | |||
pvc = updatedPVC | |||
} | |||
|
|||
err := ctrl.claims.Update(pvc) | |||
if err != nil { | |||
klog.Errorf("error updating PVC %s in cache: %v", util.PVCKey(pvc), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we return this error and let the caller log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/controller/controller.go
Outdated
klog.Errorf("Update capacity of PV %q to %s failed: %v", pv.Name, newSize.String(), err) | ||
return newSize, fsResizeRequired, err | ||
} | ||
err = ctrl.volumes.Update(updatedPV) | ||
if err != nil { | ||
klog.Errorf("error updating PV %s: %v", updatedPV.Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we return the error and let the caller log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
pkg/controller/controller.go
Outdated
newPVC := pvc.DeepCopy() | ||
newPVC.Status.Capacity[v1.ResourceStorage] = newSize | ||
newPVC.Status.Conditions = util.MergeResizeConditionsOfPVC(pvc.Status.Conditions, []v1.PersistentVolumeClaimCondition{}) | ||
if _, err := util.PatchPVCStatus(pvc, newPVC, ctrl.kubeClient); err != nil { | ||
updatedPVC, err := util.PatchPVCStatus(pvc, newPVC, ctrl.kubeClient) | ||
if err != nil { | ||
klog.Errorf("Mark PVC %q as resize finished failed: %v", util.PVCKey(pvc), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
ce0524f
to
0325630
Compare
pkg/controller/controller.go
Outdated
// if this error was a in-use error then it must be tracked so as we don't retry without | ||
// first verifying if volume is in-use | ||
if inUseError(err) { | ||
ctrl.usedPVCs.addPVCWithInUseError(pvc) | ||
} | ||
return newSize, fsResizeRequired, fmt.Errorf("resize volume %s failed: %v", pv.Name, err) | ||
return newSize, fsResizeRequired, fmt.Errorf("Resize volume %q by resizer %q failed: %v", pv.Name, ctrl.name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the convention is for error messages to be lower case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pkg/controller/controller.go
Outdated
@@ -406,8 +417,7 @@ func (ctrl *resizeController) resizePVC(pvc *v1.PersistentVolumeClaim, pv *v1.Pe | |||
// Record an event to indicate that resize operation is failed. | |||
ctrl.eventRecorder.Eventf(pvc, v1.EventTypeWarning, util.VolumeResizeFailed, err.Error()) | |||
} | |||
|
|||
return err | |||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this changed to nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure how this happeend. Added a unit test to ensure this does not happen. good catch.
pkg/controller/controller.go
Outdated
if err != nil { | ||
return newSize, fsResizeRequired, fmt.Errorf("Update capacity of PV %q to %s failed: %v", pv.Name, newSize.String(), err) | ||
} | ||
err = ctrl.volumes.Update(updatedPV) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking to do the cache update in the util method itself, but I guess we don't want to pass ctrl
in. Can we create another helper function here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made a new helper function that updates the PV with api-server and updates local cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to be picky, can we also do the same for the other patch + updates? The main goal I'm trying to achieve is that developers can use the util methods to make api updates and don't have to remember to also update the cache afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have now helpers functions on controller type which do the same. They patch the object and then they update the cache. Functions - updatePVCapacity
, markPVCAsFSResizeRequired
and markPVCResizeFinished
both update the object with apiserver and then update the object in the cache.
Did you wanted both patching and updating in cache to be done within util functions which aren't part of controller type? In that case - we will have to pass controller instance to those functions and in which case these functions may as well be on controller type (which is what I have done).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking even one more level down, like patchStatus
so that we only do the patch + update in one place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about doing that but that means PatchPVCStatus
function can't really belong to util package because it has to also update the controller's cache. I hear what you are saying but I am not sure if it is worth doing given the size of resize controller which has pretty small surface area of PV and PVC updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I moved patchClaim
and patchPersistentVolume
functions from util into controller, so as it can patch the object and update the cache. PTAL
So as we don't end up resizing same object more than once
0325630
to
e1ec0ac
Compare
pkg/controller/controller.go
Outdated
} | ||
err = ctrl.claims.Update(updatedPVC) | ||
if err != nil { | ||
return nil, fmt.Errorf("error updating PVC %s in local cache: %v", util.PVCKey(newPVC), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to fail the entire operation if the local cache fails to update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the worst case when updating local cache failed and the informer cache did not update itself (from PVC events) while operation was retried - the test will flake.
So, it should be fine I think.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gnufied, msau42 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
So as we don't end up resizing same object more than once
Fixes flakes associated with resizing.
/kind bug