PVC used by a job doesn't get resized after the pod of the job completes #175


Closed
PhanLe1010 opened this issue Oct 14, 2021 · 15 comments · Fixed by #178
Labels
sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@PhanLe1010
Contributor

PhanLe1010 commented Oct 14, 2021

Summary:
We have a setup in which the external-resizer is used with a storage provider that only supports offline expansion (i.e., it advertises only PluginCapability_VolumeExpansion_OFFLINE). We deployed a job that uses a PVC provisioned by the storage provider. While the job pod is running, we resize the PVC by modifying spec.resources.requests.storage. As expected, the PVC cannot be resized while the pod is running. However, after the job pod completes, the PVC still doesn't get resized: external-resizer doesn't send a resize gRPC call to the storage provider. The PVC is stuck in this state forever until we manually delete the job pod.
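
For reference, an offline-only plugin advertises expansion support like this in its GetPluginCapabilities response. This is a minimal sketch using the CSI Go bindings; the helper name is ours and the surrounding gRPC server code is omitted:

    package driver

    import (
        csi "github.com/container-storage-interface/spec/lib/go/csi"
    )

    // offlineExpansionCapabilities returns the capabilities a plugin
    // advertises when it supports volume expansion only while the volume
    // is not published to a node (offline expansion).
    func offlineExpansionCapabilities() []*csi.PluginCapability {
        return []*csi.PluginCapability{
            {
                Type: &csi.PluginCapability_VolumeExpansion_{
                    VolumeExpansion: &csi.PluginCapability_VolumeExpansion{
                        Type: csi.PluginCapability_VolumeExpansion_OFFLINE,
                    },
                },
            },
        }
    }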

Reproduce steps:

  1. Deploy external-resizer together with a storage provider (we use Longhorn)

  2. Don't set the --handle-volume-inuse-error flag for the external-resizer. This means that, by default, external-resizer handles volume-in-use errors in the resize controller, link

  3. Deploy a job that uses a PVC as below. The job creates a pod that sleeps for 2 minutes and then completes.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-job-pvc
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: longhorn
      resources:
        requests:
          storage: 1Gi
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: test-job
      namespace: default
    spec:
      backoffLimit: 1
      template:
        metadata:
          name: test-job
        spec:
          containers:
            - name: test-job
              image: ubuntu:latest
              imagePullPolicy: IfNotPresent
              securityContext:
                privileged: true
              command: ["/bin/sh"]
              args: ["-c", "echo 'sleep for 120s then exit'; sleep 120"]
              volumeMounts:
                - mountPath: /data
                  name: vol
          restartPolicy: OnFailure
          volumes:
            - name: vol
              persistentVolumeClaim:
                claimName: test-job-pvc
    
  4. While the job pod is running, try to expand the PVC by editing spec.resources.requests.storage (see the sketch after this list)

  5. Observe that the resize fails

  6. Wait for the job pod to complete.

  7. Observe that the PVC stays stuck in its current state forever. It doesn't get resized because external-resizer never attempts the expand gRPC call to the storage provider.
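
For step 4, a minimal client-go sketch of the expansion request (assuming a kubeconfig at the default location; kubectl edit works just as well):

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Load the default kubeconfig (~/.kube/config).
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset, err := kubernetes.NewForConfig(config)
        if err != nil {
            panic(err)
        }

        // Grow the claim from 1Gi to 2Gi. While the job pod is running, an
        // offline-only plugin is expected to fail this with an in-use error.
        patch := []byte(`{"spec":{"resources":{"requests":{"storage":"2Gi"}}}}`)
        _, err = clientset.CoreV1().PersistentVolumeClaims("default").Patch(
            context.TODO(), "test-job-pvc", types.StrategicMergePatchType,
            patch, metav1.PatchOptions{})
        if err != nil {
            panic(err)
        }
        fmt.Println("expansion requested for test-job-pvc")
    }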

Expected Behavior:

Once the job pod has completed, the PVC should no longer be considered in use. Therefore, external-resizer should attempt the expand gRPC call to the storage provider.

Proposal:
We dug into the source code and found that:

  1. This checker prevents the external-resizer from retrying if the PVC hit an InUseError before AND it is still in the ctrl.usedPVCs map
  2. The problem is that the PVC is never removed from the ctrl.usedPVCs map when a pod moves to the Completed phase. The PVC is only removed when the pod is deleted, link
  3. We think the logic over here should be changed to also handle the case where the pod becomes completed, i.e.:
    func (ctrl *resizeController) updatePod(oldObj, newObj interface{}) {
        pod := parsePod(newObj)
        if pod == nil {
            return
        }

        // A terminated pod can no longer use its volumes, so stop tracking
        // its PVCs as in-use; otherwise keep (or start) tracking them.
        if isPodTerminated(pod) {
            ctrl.usedPVCs.removePod(pod)
        } else {
            ctrl.usedPVCs.addPod(pod)
        }
    }
    
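For completeness, a minimal sketch of the isPodTerminated helper the snippet above assumes; this is our reading of the intended behavior, and the actual helper in external-resizer may differ:

    package controller

    import (
        v1 "k8s.io/api/core/v1"
    )

    // isPodTerminated reports whether the pod has reached a terminal
    // phase, meaning it can no longer be using its volumes.
    func isPodTerminated(pod *v1.Pod) bool {
        return pod.Status.Phase == v1.PodSucceeded ||
            pod.Status.Phase == v1.PodFailed
    }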

Environment:

  • external-resizer v1.2.0
  • Longhorn v1.2.2
@PhanLe1010
Contributor Author

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Oct 15, 2021
@PhanLe1010
Contributor Author

We can open a PR if you agree with our proposal.

@yasker

yasker commented Oct 19, 2021

ping @gnufied @mauriciopoppe

@mauriciopoppe
Member

I'm not familiar with the external-resizer. @msau42, do you know who can help?

@gnufied
Contributor

gnufied commented Oct 19, 2021

Hey, sorry, I was on PTO last week. The proposal looks good. Please open a PR.

@PhanLe1010
Contributor Author

PhanLe1010 commented Oct 20, 2021

@gnufied
The PR is opened. Please allow the bot to test the PR.

Do I also need to create backport PRs against release-1.2 and release-1.3?

@PhanLe1010
Contributor Author

@gnufied

I see that the fix is tagged for the v1.4.0 release. However, v1.4.0 requires CSI spec 1.5.0, and not all storage providers are on the new CSI spec 1.5.0 yet. On the other hand, the bug also exists in external-resizer v1.2.0 and v1.3.0. Therefore, can we backport the fix to release-1.2 and release-1.3 and release new versions v1.2.1 and v1.3.1?

@gnufied
Contributor

gnufied commented Mar 29, 2022

The CSI spec is backward compatible, and I don't recall any core changes to features affecting volume expansion in the CSI spec for a while. So a driver that does not implement 1.5.0 should still be able to run with external-resizer v1.4.0; I do not expect any issues.

We could backport the fix and open it in older releases too if that makes it easier.

@PhanLe1010
Contributor Author

PhanLe1010 commented Mar 29, 2022

We could backport the fix and open it in older releases too if that makes it easier

@gnufied It would be great if we could backport the fix to release-1.2 and release-1.3 and release new versions v1.2.1 and v1.3.1.

In our case, we are using external-resizer v1.2.0 with CSI spec v1.2.0. I understand that it could work with external-resizer v1.4.0, which uses CSI spec 1.5.0, but that could also surface unforeseen bugs we are not aware of yet. I remember we had some accidents with csi-provisioner before, which makes us wary.

@gnufied
Contributor

gnufied commented Mar 30, 2022

Feel free to create the backport requests. I will make a release.

@PhanLe1010
Contributor Author

Thanks @gnufied
How do I create a backport request?

@gnufied
Contributor

gnufied commented Mar 30, 2022

You will find release branches for those versions and you can create PRs against those branches.

@PhanLe1010
Contributor Author

@gnufied Created the PRs. Please take a look. Thank you

@PhanLe1010
Contributor Author

@gnufied Now that the backport PRs are merged, do you have an ETA for when the patch releases v1.2.1 and v1.3.1 will be out?

@PhanLe1010
Contributor Author

@gnufied Now that the backport PRs are merged, do you have an ETA for when the patch releases v1.2.1 and v1.3.1 will be out?

ping @gnufied
