Wrap all GRPC errors in status, fix semantics of NotFound errors #368

Merged 4 commits on Aug 15, 2019

Contributor

@davidz627 davidz627 commented Aug 6, 2019

Fixes: #367
/assign @msau42 @jsafrane
/cc @jingxu97

/kind bug
/kind cleanup

ControllerUnpublishVolume now returns success when the Node is GCE API NotFound.
Invalid format VolumeID is now GRPC InvalidArgument error instead of GRPC NotFound.
Underspecified disks not found in any zone now return GRPC NotFound.
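
The first rule above (ControllerUnpublishVolume succeeding when the node no longer exists) can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the driver's code: `unpublish`, `getNode`, `detach`, and both sentinel errors are hypothetical names, and plain Go errors stand in for GCE API responses so the sketch has no dependencies.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for GCE API responses.
var (
	errNodeNotFound = errors.New("instance not found")  // models a GCE API 404
	errBackend      = errors.New("backend unavailable") // models any other failure
)

// unpublish models the control flow only. getNode and detach are hypothetical
// callbacks, not functions from the driver.
func unpublish(node, disk string, getNode func(string) error, detach func(node, disk string) error) error {
	if err := getNode(node); err != nil {
		if errors.Is(err, errNodeNotFound) {
			// The node is gone, so the disk cannot still be attached to it.
			return nil
		}
		// Any other failure is surfaced to the caller.
		return fmt.Errorf("getting node %q: %w", node, err)
	}
	return detach(node, disk)
}

func main() {
	gone := func(string) error { return errNodeNotFound }
	noDetach := func(node, disk string) error { return errors.New("detach should not be called") }
	fmt.Println(unpublish("node-a", "disk-1", gone, noDetach) == nil) // prints true
}
```

In a real gRPC handler the non-NotFound branch would presumably be wrapped in a status error (for example `codes.Internal`), which is the wrapping this PR is about.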

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Aug 6, 2019
@k8s-ci-robot k8s-ci-robot requested a review from jingxu97 August 6, 2019 18:53
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 6, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidz627

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 6, 2019
@@ -267,7 +267,7 @@ func (gceCS *GCEControllerServer) ControllerPublishVolume(ctx context.Context, r

 	volKey, err := common.VolumeIDToKey(volumeID)
 	if err != nil {
-		return nil, status.Error(codes.NotFound, fmt.Sprintf("Could not find volume with ID %v: %v", volumeID, err))
+		return nil, status.Error(codes.InvalidArgument, fmt.Sprintf("ControllerPublishVolume volume ID is invalid: %v", err))
 	}

 	volKey, err = gceCS.CloudProvider.RepairUnderspecifiedVolumeKey(ctx, volKey)
Contributor

Do we want to check GCE error codes here? What if we had temporary issues with the cloud provider, rather than the disk actually not existing?

Contributor Author

There's no real way to distinguish between "the disk does not exist in this zone because it may or may not exist in another zone but is temporarily not showing up" and "the disk does not exist in any zone because it just doesn't exist", since RepairUnderspecifiedVolumeKey queries all the zones in the region for the disk.

Contributor

you could check if the zone check fails because of "not found" error code

Contributor Author

Right, but it may be NotFound in some subset of zones, and some other error in other zones. What do we do in that case?

Contributor

If we looped through all zones and found it, then return success. Otherwise, if all the zones returned notFound, then return notFound. Otherwise return error?

Contributor Author

we can't be sure, but I believe this is consistent with behavior of in-tree plugin. This codepath is only hit for migration - any disks managed natively through the CSI Driver have zone/region information encoded in their volume ID

Contributor

@msau42 msau42 Aug 7, 2019

To be on the safe side, can we search all zones and if we have multiple matches then return error?

Contributor

In issue kubernetes/kubernetes#65198, @verult mentioned that the CSI driver solves this problem. How is it solved?

Contributor Author

When using the CSI Driver natively the region/zone information is encoded in the volume ID. This unspecified/repair case is only for CSI Migration, in which case we will continue to have the same issue as before.
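
To illustrate how a volume ID can carry its own location, here is a small parsing sketch. The layout `projects/<project>/zones|regions/<location>/disks/<name>` is an assumption made for this example and is not confirmed anywhere in this thread; `parseVolumeID` and `volumeKey` are hypothetical names, not the driver's `common.VolumeIDToKey`. A caller would presumably map the returned error to `codes.InvalidArgument`, as the diff above does.

```go
package main

import (
	"fmt"
	"strings"
)

// volumeKey is a hypothetical parsed form of a volume ID.
type volumeKey struct {
	Project  string
	Location string // zone or region name
	Name     string
	Regional bool
}

// parseVolumeID parses the ASSUMED layout
// projects/<project>/zones|regions/<location>/disks/<name>.
func parseVolumeID(id string) (volumeKey, error) {
	parts := strings.Split(id, "/")
	if len(parts) != 6 || parts[0] != "projects" || parts[4] != "disks" {
		return volumeKey{}, fmt.Errorf("volume ID %q has unexpected format", id)
	}
	if parts[1] == "" || parts[3] == "" || parts[5] == "" {
		return volumeKey{}, fmt.Errorf("volume ID %q has an empty component", id)
	}
	key := volumeKey{Project: parts[1], Location: parts[3], Name: parts[5]}
	switch parts[2] {
	case "zones":
		key.Regional = false
	case "regions":
		key.Regional = true
	default:
		return volumeKey{}, fmt.Errorf("volume ID %q has unknown location type %q", id, parts[2])
	}
	return key, nil
}

func main() {
	key, err := parseVolumeID("projects/my-project/zones/us-central1-a/disks/my-disk")
	fmt.Println(key.Location, key.Regional, err) // prints: us-central1-a false <nil>
}
```

Because the location is part of the ID, no cross-zone search is needed for such volumes; only IDs without a concrete zone (the migration case described above) need the repair step.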

Contributor Author

Alright, I've done the original repair suggestion in a separate commit, PTAL

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 6, 2019
@davidz627
Contributor Author

davidz627 commented Aug 6, 2019

/hold
Using a private branch version of csi-sanity based on kubernetes-csi/csi-test#212.
Wait until the above PR is merged, then pull in the real csi-test dependency.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 6, 2019
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 7, 2019
// This is a success according to the spec
// Cannot find volume associated with this ID because VolumeID is not in
// correct format, this is a success according to the Spec
klog.Warningf("Treating volume as deleted because volume id %s is invalid: %v", volumeID, err)
Contributor

Should it return success? Can we get into a case where user pre-provisions a PV, but gets the volume id wrong. Then when they go and delete it they think it's successful, when it's actually not, and then we leak a volume.

Contributor Author

I don't agree with the return code in this case for many reasons, but this is what it says in the Spec as well as CSI Sanity.
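
The behavior under discussion, where a delete request with an unparseable volume ID is logged and reported as success, might look like the sketch below. It is an illustration only: `deleteVolume`, `parse`, `del`, and `errInvalidID` are hypothetical names, and the standard `log` package stands in for klog.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errInvalidID is a hypothetical parse failure.
var errInvalidID = errors.New("invalid volume ID format")

// deleteVolume treats an unparseable ID as "already deleted" and returns
// success; parse and del are hypothetical callbacks, not driver functions.
func deleteVolume(id string, parse func(string) error, del func(string) error) error {
	if err := parse(id); err != nil {
		// No volume can correspond to this ID, so there is nothing to delete.
		log.Printf("Treating volume as deleted because volume id %s is invalid: %v", id, err)
		return nil
	}
	return del(id)
}

func main() {
	reject := func(string) error { return errInvalidID }
	del := func(string) error { return errors.New("delete should not be called") }
	fmt.Println(deleteVolume("garbage", reject, del) == nil) // prints true
}
```

The warning log is the only trace of the mistyped ID, which is why the reviewer's concern about silently leaking a pre-provisioned volume still applies.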

…found in any zone, other error when it's found in multiple zones or error getting disk
@davidz627
Contributor Author

/hold cancel
/retest
@msau42 ready for review

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 15, 2019
@msau42
Contributor

msau42 commented Aug 15, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 15, 2019
@k8s-ci-robot k8s-ci-robot merged commit 01b0034 into kubernetes-sigs:master Aug 15, 2019
@davidz627 davidz627 deleted the fix/err branch August 15, 2019 16:56
Successfully merging this pull request may close these issues.

Make sure all errors returned in CSI response is wrapped
5 participants