Wrap all GRPC errors in status, fix semantics of NotFound errors #368
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidz627. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
```diff
@@ -267,7 +267,7 @@ func (gceCS *GCEControllerServer) ControllerPublishVolume(ctx context.Context, r
 	volKey, err := common.VolumeIDToKey(volumeID)
 	if err != nil {
-		return nil, status.Error(codes.NotFound, fmt.Sprintf("Could not find volume with ID %v: %v", volumeID, err))
+		return nil, status.Error(codes.InvalidArgument, fmt.Sprintf("ControllerPublishVolume volume ID is invalid: %v", err))
 	}

 	volKey, err = gceCS.CloudProvider.RepairUnderspecifiedVolumeKey(ctx, volKey)
```
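For context, here is a minimal caller-side sketch (not part of this PR) of why the distinction between InvalidArgument and NotFound matters to callers; `publishAndClassify` is a hypothetical helper, and only the gRPC `status`/`codes` handling is the point:

```go
package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// publishAndClassify is a hypothetical caller-side helper showing how a CO or
// test can branch on the gRPC codes this PR standardizes on.
func publishAndClassify(ctx context.Context, client csi.ControllerClient, req *csi.ControllerPublishVolumeRequest) (retryable bool, err error) {
	_, err = client.ControllerPublishVolume(ctx, req)
	if err == nil {
		return false, nil
	}
	switch status.Code(err) {
	case codes.InvalidArgument, codes.NotFound:
		// Malformed volume ID or missing volume/node: retrying will not help.
		return false, err
	default:
		// Anything else (e.g. a transient cloud provider failure) may be retried.
		return true, err
	}
}
```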
Do we want to check GCE error codes here? What if we had temporary issues with the cloud provider, rather than the disk not actually existing?
there's no real way to distinguish between "the disk is not found in a zone because it is temporarily not showing up (it may or may not exist in another zone)" and "the disk does not exist in any zone because it just doesn't exist", since RepairUnderspecifiedVolumeKey queries all the zones in the region for the disk
you could check whether the zone check fails with a "not found" error code
right, but it may be NotFound in some subset of zones, and some other error in other zones. What do we do in that case?
If we loop through all zones and find it, return success. Otherwise, if all the zones return NotFound, return NotFound. Otherwise, return an error?
we can't be sure, but I believe this is consistent with the behavior of the in-tree plugin. This code path is only hit for migration - any disks managed natively through the CSI Driver have zone/region information encoded in their volume ID
To be on the safe side, can we search all zones and return an error if we find multiple matches?
In this issue kubernetes/kubernetes#65198, @verult mentioned that the CSI driver solves this issue. How is it solved?
When using the CSI Driver natively, the region/zone information is encoded in the volume ID. This unspecified/repair case is only for CSI Migration, in which case we will continue to have the same issue as before.
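For context, a natively provisioned volume ID already names its location, which is why no zone repair is needed on that path. Below is a simplified sketch; the ID shapes and `zoneFromVolumeID` are illustrative only, and `common.VolumeIDToKey` in the driver is the authoritative parser (it also handles regional disks):

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative volume ID shapes (see common.VolumeIDToKey for the real rules):
//
//	projects/<project>/zones/<zone>/disks/<disk-name>       (zonal PD)
//	projects/<project>/regions/<region>/disks/<disk-name>   (regional PD)
//
// zoneFromVolumeID extracts the zone from a zonal volume ID.
func zoneFromVolumeID(volumeID string) (string, error) {
	parts := strings.Split(volumeID, "/")
	if len(parts) != 6 || parts[0] != "projects" || parts[2] != "zones" || parts[4] != "disks" {
		return "", fmt.Errorf("volume ID %q is not a zonal volume ID", volumeID)
	}
	return parts[3], nil
}
```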
alright, I've done the original repair suggestion in a separate commit, PTAL
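For readers following along, here is a rough sketch of the resulting lookup behavior discussed above. `zoneDiskGetter` and `findDiskZone` are hypothetical names; the real change lives in RepairUnderspecifiedVolumeKey and the driver's cloud provider layer, with different types and signatures:

```go
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// zoneDiskGetter is a hypothetical abstraction over the per-zone cloud lookup.
type zoneDiskGetter interface {
	GetDiskInZone(ctx context.Context, zone, diskName string) error
}

// findDiskZone looks for the disk in every zone. It returns the zone on a
// single match, NotFound only if every zone reported NotFound, and a
// non-NotFound error when the disk appears in multiple zones or any lookup
// failed for another reason, so transient cloud issues are not misreported
// as a missing disk.
func findDiskZone(ctx context.Context, getter zoneDiskGetter, zones []string, diskName string) (string, error) {
	var foundZones []string
	var lastErr error
	for _, zone := range zones {
		err := getter.GetDiskInZone(ctx, zone, diskName)
		switch {
		case err == nil:
			foundZones = append(foundZones, zone)
		case status.Code(err) != codes.NotFound:
			lastErr = err
		}
	}
	switch {
	case len(foundZones) == 1:
		return foundZones[0], nil
	case len(foundZones) > 1:
		return "", status.Error(codes.Internal, fmt.Sprintf("disk %s found in multiple zones %v", diskName, foundZones))
	case lastErr != nil:
		return "", status.Error(codes.Internal, fmt.Sprintf("error verifying disk %s: %v", diskName, lastErr))
	default:
		return "", status.Error(codes.NotFound, fmt.Sprintf("disk %s was not found in any zone", diskName))
	}
}
```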
/hold
pkg/gce-pd-csi-driver/controller.go (outdated)

```go
// This is a success according to the spec
// Cannot find volume associated with this ID because VolumeID is not in
// correct format, this is a success according to the Spec
klog.Warningf("Treating volume as deleted because volume id %s is invalid: %v", volumeID, err)
```
Should it return success? Can we get into a case where the user pre-provisions a PV but gets the volume ID wrong? Then when they go to delete it, they think the delete succeeded when it actually didn't, and we leak a volume.
I don't agree with the return code in this case for many reasons, but this is what it says in the Spec as well as CSI Sanity.
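For reference, the behavior being defended here looks roughly like the sketch below. It is simplified; `parseVolumeID` stands in for `common.VolumeIDToKey`, and the real handler in pkg/gce-pd-csi-driver/controller.go does considerably more:

```go
package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"k8s.io/klog"
)

// deleteVolumeSketch mirrors the spec-mandated behavior: an unparseable volume
// ID is treated as "volume does not exist", so DeleteVolume reports success.
func deleteVolumeSketch(ctx context.Context, req *csi.DeleteVolumeRequest, parseVolumeID func(string) (interface{}, error)) (*csi.DeleteVolumeResponse, error) {
	volumeID := req.GetVolumeId()
	if _, err := parseVolumeID(volumeID); err != nil {
		klog.Warningf("Treating volume as deleted because volume id %s is invalid: %v", volumeID, err)
		return &csi.DeleteVolumeResponse{}, nil
	}
	// Otherwise look up and delete the disk, wrapping any cloud provider
	// failure in a gRPC status (e.g. codes.Internal), per this PR.
	return &csi.DeleteVolumeResponse{}, nil
}
```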
…face for better VolumeID errors
…t for errors conforming to spec
…found in any zone, other error when it's found in multiple zones or there is an error getting the disk
/hold cancel
/lgtm
Fixes: #367
/assign @msau42 @jsafrane
/cc @jingxu97
/kind bug
/kind cleanup