Wrap all GRPC errors in status, fix semantics of NotFound errors #368
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidz627. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
```diff
@@ -267,7 +267,7 @@ func (gceCS *GCEControllerServer) ControllerPublishVolume(ctx context.Context, r
 	volKey, err := common.VolumeIDToKey(volumeID)
 	if err != nil {
-		return nil, status.Error(codes.NotFound, fmt.Sprintf("Could not find volume with ID %v: %v", volumeID, err))
+		return nil, status.Error(codes.InvalidArgument, fmt.Sprintf("ControllerPublishVolume volume ID is invalid: %v", err))
 	}

 	volKey, err = gceCS.CloudProvider.RepairUnderspecifiedVolumeKey(ctx, volKey)
```
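For context, here is a minimal caller-side sketch (not part of this PR) of why the distinction between InvalidArgument and NotFound matters to callers; `publishAndClassify` is a hypothetical helper, and only the gRPC `status`/`codes` handling is the point:

```go
package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// publishAndClassify is a hypothetical caller-side helper showing how a CO or
// test can branch on the gRPC codes this PR standardizes on.
func publishAndClassify(ctx context.Context, client csi.ControllerClient, req *csi.ControllerPublishVolumeRequest) (retryable bool, err error) {
	_, err = client.ControllerPublishVolume(ctx, req)
	if err == nil {
		return false, nil
	}
	switch status.Code(err) {
	case codes.InvalidArgument, codes.NotFound:
		// Malformed volume ID or missing volume/node: retrying will not help.
		return false, err
	default:
		// Anything else (e.g. a transient cloud provider failure) may be retried.
		return true, err
	}
}
```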
Do we want to check GCE error codes here? What if we had temporary issues with the cloud provider, rather than the disk not actually existing?
there's no real way to distinguish between "the disk is not found in a zone because it is temporarily not showing up (it may or may not exist in another zone)" and "the disk does not exist in any zone because it just doesn't exist", since RepairUnderspecifiedVolumeKey queries all the zones in the region for the disk
you could check whether the zone check fails with a "not found" error code
right, but it may be NotFound in some subset of zones, and some other error in other zones. What do we do in that case?
If we loop through all zones and find it, return success. Otherwise, if all the zones return NotFound, return NotFound. Otherwise, return an error?
we can't be sure, but I believe this is consistent with the behavior of the in-tree plugin. This code path is only hit for migration - any disks managed natively through the CSI Driver have zone/region information encoded in their volume ID
To be on the safe side, can we search all zones and return an error if we find multiple matches?
In this issue kubernetes/kubernetes#65198, @verult mentioned that the CSI driver solves this issue. How is it solved?
When using the CSI Driver natively, the region/zone information is encoded in the volume ID. This unspecified/repair case is only for CSI Migration, in which case we will continue to have the same issue as before.
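For context, a natively provisioned volume ID already names its location, which is why no zone repair is needed on that path. Below is a simplified sketch; the ID shapes and `zoneFromVolumeID` are illustrative only, and `common.VolumeIDToKey` in the driver is the authoritative parser (it also handles regional disks):

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative volume ID shapes (see common.VolumeIDToKey for the real rules):
//
//	projects/<project>/zones/<zone>/disks/<disk-name>       (zonal PD)
//	projects/<project>/regions/<region>/disks/<disk-name>   (regional PD)
//
// zoneFromVolumeID extracts the zone from a zonal volume ID.
func zoneFromVolumeID(volumeID string) (string, error) {
	parts := strings.Split(volumeID, "/")
	if len(parts) != 6 || parts[0] != "projects" || parts[2] != "zones" || parts[4] != "disks" {
		return "", fmt.Errorf("volume ID %q is not a zonal volume ID", volumeID)
	}
	return parts[3], nil
}
```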
alright, I've done the original repair suggestion in a separate commit, PTAL
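For readers following along, here is a rough sketch of the resulting lookup behavior discussed above. `zoneDiskGetter` and `findDiskZone` are hypothetical names; the real change lives in RepairUnderspecifiedVolumeKey and the driver's cloud provider layer, with different types and signatures:

```go
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// zoneDiskGetter is a hypothetical abstraction over the per-zone cloud lookup.
type zoneDiskGetter interface {
	GetDiskInZone(ctx context.Context, zone, diskName string) error
}

// findDiskZone looks for the disk in every zone. It returns the zone on a
// single match, NotFound only if every zone reported NotFound, and a
// non-NotFound error when the disk appears in multiple zones or any lookup
// failed for another reason, so transient cloud issues are not misreported
// as a missing disk.
func findDiskZone(ctx context.Context, getter zoneDiskGetter, zones []string, diskName string) (string, error) {
	var foundZones []string
	var lastErr error
	for _, zone := range zones {
		err := getter.GetDiskInZone(ctx, zone, diskName)
		switch {
		case err == nil:
			foundZones = append(foundZones, zone)
		case status.Code(err) != codes.NotFound:
			lastErr = err
		}
	}
	switch {
	case len(foundZones) == 1:
		return foundZones[0], nil
	case len(foundZones) > 1:
		return "", status.Error(codes.Internal, fmt.Sprintf("disk %s found in multiple zones %v", diskName, foundZones))
	case lastErr != nil:
		return "", status.Error(codes.Internal, fmt.Sprintf("error verifying disk %s: %v", diskName, lastErr))
	default:
		return "", status.Error(codes.NotFound, fmt.Sprintf("disk %s was not found in any zone", diskName))
	}
}
```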
/hold
pkg/gce-pd-csi-driver/controller.go (outdated)

```go
// This is a success according to the spec
// Cannot find volume associated with this ID because VolumeID is not in
// correct format, this is a success according to the Spec
klog.Warningf("Treating volume as deleted because volume id %s is invalid: %v", volumeID, err)
```
Should it return success? Can we get into a case where the user pre-provisions a PV but gets the volume ID wrong? Then when they go to delete it, they think the delete succeeded when it actually didn't, and we leak a volume.
I don't agree with the return code in this case for many reasons, but this is what it says in the Spec as well as CSI Sanity.
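For reference, the behavior being defended here looks roughly like the sketch below. It is simplified; `parseVolumeID` stands in for `common.VolumeIDToKey`, and the real handler in pkg/gce-pd-csi-driver/controller.go does considerably more:

```go
package main

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"k8s.io/klog"
)

// deleteVolumeSketch mirrors the spec-mandated behavior: an unparseable volume
// ID is treated as "volume does not exist", so DeleteVolume reports success.
func deleteVolumeSketch(ctx context.Context, req *csi.DeleteVolumeRequest, parseVolumeID func(string) (interface{}, error)) (*csi.DeleteVolumeResponse, error) {
	volumeID := req.GetVolumeId()
	if _, err := parseVolumeID(volumeID); err != nil {
		klog.Warningf("Treating volume as deleted because volume id %s is invalid: %v", volumeID, err)
		return &csi.DeleteVolumeResponse{}, nil
	}
	// Otherwise look up and delete the disk, wrapping any cloud provider
	// failure in a gRPC status (e.g. codes.Internal), per this PR.
	return &csi.DeleteVolumeResponse{}, nil
}
```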
…face for better VolumeID errors
…t for errors conforming to spec
…found in any zone, other error when it's found in multiple zones or there is an error getting the disk
/hold cancel
/lgtm
Fixes: #367
/assign @msau42 @jsafrane
/cc @jingxu97
/kind bug
/kind cleanup