Runtime checks around providerID #228

Merged: 2 commits, Aug 15, 2018

Conversation

jbornemann
Contributor

This provides a bit more error detection around problems that could arise with providerID.

I am still trying to find out why providerID is not being set on my cluster:

E0809 20:19:15.385243       1 node_controller.go:327] NodeAddress: Error fetching by providerID: GetPrimaryVNICForInstance: Service error:InvalidParameter. Invalid OCID: . http status code: 400 Error fetching by NodeName: GetInstanceByNodeName: Service error:NotAuthorizedOrNotFound. Authorization failed or requested resource not found.. http status code: 404
I0809 20:19:15.385273       1 node_controller.go:392] Successfully initialized node master37-01-usashburn1 with cloud provider
E0809 20:19:20.516397       1 node_controller.go:321] failed to set node provider id: failed to get instance ID from cloud provider: instance not found

My instance names are not the same as my node names. However, the lookup should fall back to my private subnet hostname label, which does correspond to the node name. Ideas?

@jbornemann
Contributor Author

FYI @tjfontaine

@prydie @jhorwit2

@jhorwit2
Member

jhorwit2 commented Aug 9, 2018

I am still trying to find out the source of why providerID is not being set on my cluster

@jbornemann is this an OKE cluster or something else? The provider ID is normally set via the --provider-id flag on the kubelet.

@jhorwit2
Member

jhorwit2 commented Aug 9, 2018

however it should properly fall back to and correspond with my private subnet hostname label. Ideas?

The check for the hostname label is as follows

*vnic.HostnameLabel != "" && strings.HasPrefix(nodeName, *vnic.HostnameLabel)

Does your node name match that check? @prydie should be able to give context on that check, as I can't remember why it exists (I think for OKE).

If possible, you should never rely on this fallback, though: as your cluster grows you are likely to run into OCI API rate limiting, which is less likely when the provider ID is set explicitly.
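For reference, the hostname label check quoted above can be sketched as a small standalone function. The function name and the nil guard are illustrative assumptions and are not taken from the CCM source; the example node names other than master37-01-usashburn1 are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesHostnameLabel mirrors the quoted check: the VNIC's hostname label
// must be non-empty and must be a prefix of the Kubernetes node name.
// The name and the nil guard are assumptions made for this sketch.
func matchesHostnameLabel(nodeName string, hostnameLabel *string) bool {
	if hostnameLabel == nil || *hostnameLabel == "" {
		return false
	}
	return strings.HasPrefix(nodeName, *hostnameLabel)
}

func main() {
	label := "master37-01-usashburn1"

	// Node name identical to the hostname label.
	fmt.Println(matchesHostnameLabel("master37-01-usashburn1", &label))
	// Node registered under a longer name that starts with the label
	// (the domain here is hypothetical).
	fmt.Println(matchesHostnameLabel("master37-01-usashburn1.example.internal", &label))
	// Node name that is shorter than the label.
	fmt.Println(matchesHostnameLabel("master37", &label))
}
```

Note the argument order: the node name may be longer than the label, but not the other way around.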

@jbornemann
Contributor Author

This is not OKE.

It should pass that check, which is why I am confused.

The provider-id flag, is that something I need to set on kubelets? provider-type is set to external per instructions.

@jbornemann
Contributor Author

I had assumed it would get that value at runtime via ExternalName() or similar. Sorry, I can't recall the exact function call at the moment.

@jhorwit2
Member

jhorwit2 commented Aug 9, 2018

Sorry, you're right: the node controller will check with the cloud provider if the provider ID is an empty string. The same check above still applies, though, so if the node name doesn't pass the hostname prefix check then it won't work :)

@jhorwit2
Member

jhorwit2 commented Aug 9, 2018

Mind pasting the output of kubectl describe node master37-01-usashburn1?

@jhorwit2
Member

jhorwit2 commented Aug 9, 2018

Also, what version of the CCM are you running? Is the node in the same compartment as the networking components (vcn/subnet)?

@jbornemann
Contributor Author

jbornemann commented Aug 9, 2018

HostnameLabel is just the hostname right? Not the FQDN? If it was the FQDN, I suppose that would require it to be

strings.HasPrefix(*vnic.HostnameLabel, nodeName)

To work :/

@jhorwit2
Member

jhorwit2 commented Aug 9, 2018

Correct, the hostname label is not the FQDN. To get the FQDN you have to combine the hostname label, the subnet DNS label and the VCN DNS label.
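As a rough illustration of how those three labels combine, here is a minimal sketch. It assumes the usual OCI VCN resolver layout of hostname label, subnet DNS label, VCN DNS label and the oraclevcn.com suffix; the function name and the example subnet and VCN labels are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// vcnDomain is the suffix assumed for VCN-internal hostnames in this sketch.
const vcnDomain = "oraclevcn.com"

// instanceFQDN joins the hostname label, subnet DNS label and VCN DNS label
// into a fully qualified name. The function name is illustrative only.
func instanceFQDN(hostnameLabel, subnetDNSLabel, vcnDNSLabel string) string {
	return strings.Join([]string{hostnameLabel, subnetDNSLabel, vcnDNSLabel, vcnDomain}, ".")
}

func main() {
	// "private" and "k8svcn" are made-up subnet and VCN DNS labels.
	fqdn := instanceFQDN("master37-01-usashburn1", "private", "k8svcn")
	fmt.Println(fqdn)

	// A node registered under its FQDN still passes a prefix check against
	// the bare hostname label, because the label comes first in the FQDN.
	fmt.Println(strings.HasPrefix(fqdn, "master37-01-usashburn1"))
}
```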

@jhorwit2
Member

jhorwit2 commented Aug 9, 2018

The provider-id flag, is that something I need to set on kubelets? provider-type is set to external per instructions.

Setting it helps in various ways. If you set that value, the CCM doesn't have to do any guesswork. It's also more efficient and will ultimately lead to fewer rate limiting problems. You can do that with something like the following on the kubelet.

--provider-id=$(curl -s http://169.254.169.254/opc/v1/instance/id)

We should update the docs to mention setting this explicitly. We kinda do right now but in a not great way.
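For tooling that prefers not to shell out to curl, the same lookup can be sketched in Go. This is only an illustration under stated assumptions: fetchInstanceID and newFakeMetadataServer are hypothetical helpers, the example OCID is made up, and main talks to a local stand-in server so the sketch runs outside OCI. On an instance you would pass metadataURL instead.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// metadataURL is the instance ID endpoint used in the curl command above.
const metadataURL = "http://169.254.169.254/opc/v1/instance/id"

// fetchInstanceID GETs url and returns the trimmed response body.
// It rejects non-200 responses and empty IDs so that a broken lookup
// does not silently produce an empty --provider-id value.
func fetchInstanceID(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", fmt.Errorf("querying instance metadata: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("instance metadata returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", fmt.Errorf("reading instance metadata: %w", err)
	}
	id := strings.TrimSpace(string(body))
	if id == "" {
		return "", fmt.Errorf("instance metadata returned an empty ID")
	}
	return id, nil
}

// newFakeMetadataServer is a local stand-in for the metadata service.
func newFakeMetadataServer(body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, body)
	}))
}

func main() {
	// The OCID below is a made-up placeholder.
	srv := newFakeMetadataServer("ocid1.instance.oc1..exampleuniqueid\n")
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	id, err := fetchInstanceID(client, srv.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("--provider-id=%s\n", id)
}
```

Failing on an empty ID matters here because, as the log at the top of this thread shows, an empty OCID only surfaces later as an InvalidParameter error from the API.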

 }
-return providerID
+return providerID, errors.Errorf("(%q) is not a valid provider ID", providerID)
Member

This would be a breaking change for existing deployments, since the prefix is optional, which is how other clouds work as well. You can't actually run multiple CCMs across clouds in a single cluster.

I think a better check would be to return an error if the provider ID is an empty string.

Member

Also, once the provider ID is set you cannot change it, so every node without the prefix would need to be deleted and registered again with the cluster to fix it.

Contributor Author

Oh, wasn't aware of that. Thanks. Will change
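One way to read the suggestion in this review thread is sketched below: reject only an empty provider ID, and treat the prefix as optional by stripping it when present. This is an assumption-laden illustration, not the code that was merged; the "oci://" prefix value, the lower-case function name and the example OCID are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// providerPrefix is assumed to be "oci://" for this sketch.
const providerPrefix = "oci://"

// mapProviderIDToInstanceID returns the instance ID for a provider ID.
// Only the empty string is an error; IDs with and without the prefix are
// both accepted, so nodes registered without the prefix keep working.
func mapProviderIDToInstanceID(providerID string) (string, error) {
	if providerID == "" {
		return "", errors.New("provider ID is empty")
	}
	return strings.TrimPrefix(providerID, providerPrefix), nil
}

func main() {
	// The OCID is a made-up placeholder.
	for _, id := range []string{
		"oci://ocid1.instance.oc1..exampleuniqueid",
		"ocid1.instance.oc1..exampleuniqueid",
		"",
	} {
		instanceID, err := mapProviderIDToInstanceID(id)
		fmt.Printf("%q -> %q, err=%v\n", id, instanceID, err)
	}
}
```

Accepting both forms avoids the problem raised above, where existing nodes without the prefix would otherwise have to be deleted and re-registered.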

@jbornemann
Contributor Author

Name:               master37-01-usashburn1
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=master37-01-usashburn1
                    node-role.kubernetes.io/master=
Annotations:        alpha.kubernetes.io/provided-node-ip=100.77.1.85
                    node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:             node-role.kubernetes.io/master:NoSchedule
                    node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
CreationTimestamp:  Tue, 24 Jul 2018 12:36:17 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Thu, 09 Aug 2018 23:07:47 +0000   Tue, 24 Jul 2018 12:36:12 +0000   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Thu, 09 Aug 2018 23:07:47 +0000   Tue, 24 Jul 2018 12:36:12 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 09 Aug 2018 23:07:47 +0000   Tue, 24 Jul 2018 12:36:12 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            True    Thu, 09 Aug 2018 23:07:47 +0000   Wed, 25 Jul 2018 21:36:36 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  100.77.1.85
Capacity:
 cpu:     4
 memory:  30879896Ki
 pods:    110
Allocatable:
 cpu:     3800m
 memory:  30277496Ki
 pods:    110
System Info:
 Machine ID:                 aa623f893a1d42bda8afeaebecbdcb4e
 System UUID:                3444EF13-D4F6-4067-A023-29A197EC3352
 Boot ID:                    a0a5b9b6-87ae-420b-a862-747dba67412c
 Kernel Version:             4.4.0-108-generic
 OS Image:                   Ubuntu 16.04.3 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.1
 Kubelet Version:            v1.9.3+coreos.0
 Kube-Proxy Version:         v1.9.3+coreos.0
PodCIDR:                     10.233.64.0/24
ExternalID:                  master37-01-usashburn1
Also, what version of the CCM are you running?

0.5.0

Is the node in the same compartment as the networking components (vcn/subnet)?

No, those are managed in a separate compartment

Setting it helps in various ways. If you set that value, the CCM doesn't have to do any guesswork. It's also more efficient and will ultimately lead to fewer rate limiting problems. You can do that with something like the following on the kubelet.

--provider-id=$(curl -s http://169.254.169.254/opc/v1/instance/id)

We should update the docs to mention setting this explicitly. We kinda do right now but in a not great way.

I recently added OCI as a provider to kubespray. The branch is out but not merged. There are no smarts around provider-id yet, for any provider. I may do it at some point, but for now I'm just trying to get stable, automated clusters.

@jbornemann
Contributor Author

Some more details:

Using principal auth. Authorizations for the dynamic group are as follows:

  "Allow dynamic-group k8s_dynamic_group to manage load-balancers in compartment Kubernetes", 
  "Allow dynamic-group k8s_dynamic_group to manage security-lists in compartment Networking",
  "Allow dynamic-group k8s_dynamic_group to read instances in compartment Kubernetes",
  "Allow dynamic-group k8s_dynamic_group to use virtual-network-family in compartment Networking"

@jbornemann force-pushed the jeff_changes branch 2 times, most recently from a2d7d43 to 89b3fba on August 10, 2018 13:44
@jhorwit2
Member

@prydie @owainlewis something I noticed while trying to help debug this: we're adding the call stack to every error, but we're not logging the errors verbosely, so it's basically useless :/
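To illustrate the point about verbose logging: errors created with a package such as github.com/pkg/errors carry a stack trace, but it is only printed with the %+v verb, so logging with %v or %s drops it. The sketch below reproduces that behaviour with a stdlib-only stand-in type. stackError, newStackError, terse and verbose are hypothetical names for this example, not part of the CCM or of pkg/errors.

```go
package main

import (
	"fmt"
	"io"
	"runtime"
)

// stackError is a stdlib-only stand-in for an error that records the call
// stack at the point where it was created.
type stackError struct {
	msg string
	pcs []uintptr
}

func newStackError(msg string) *stackError {
	pcs := make([]uintptr, 16)
	// Skip runtime.Callers itself and newStackError.
	n := runtime.Callers(2, pcs)
	return &stackError{msg: msg, pcs: pcs[:n]}
}

func (e *stackError) Error() string { return e.msg }

// Format writes the bare message for every verb and appends the recorded
// stack frames only when the error is formatted with %+v.
func (e *stackError) Format(s fmt.State, verb rune) {
	io.WriteString(s, e.msg)
	if verb != 'v' || !s.Flag('+') || len(e.pcs) == 0 {
		return
	}
	frames := runtime.CallersFrames(e.pcs)
	for {
		f, more := frames.Next()
		fmt.Fprintf(s, "\n\t%s (%s:%d)", f.Function, f.File, f.Line)
		if !more {
			break
		}
	}
}

// terse formats the error the way a plain log call typically would.
func terse(err error) string { return fmt.Sprintf("%v", err) }

// verbose formats the error including its stack trace.
func verbose(err error) string { return fmt.Sprintf("%+v", err) }

func main() {
	err := newStackError("instance not found")
	fmt.Println("terse:  ", terse(err))
	fmt.Println("verbose:", verbose(err))
}
```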

@jbornemann
Contributor Author

Changes made btw @jhorwit2

Also got the load balancers up and running!

Contributor

@prydie prydie left a comment

LGTM (w/ one minor suggestion).

@@ -148,7 +148,15 @@ func getSubnetsForNodes(ctx context.Context, nodes []*v1.Node, client client.Int
 continue
 }

-id := util.MapProviderIDToInstanceID(node.Spec.ProviderID)
+if node.Spec.ProviderID == "" {
+return nil, errors.Errorf("ProviderID was not present on node %q", node.Name)
Contributor

Perhaps use the json path of ProviderID as the error is user facing:

return nil, errors.Errorf(".spec.providerID was not present on node %q", node.Name)

@jbornemann
Contributor Author

@prydie Done. Thanks
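The check discussed in this thread can be sketched in isolation as follows. The node struct is a minimal stand-in for *v1.Node, and instanceIDsForNodes is a hypothetical helper rather than the real getSubnetsForNodes; only the error wording follows the suggestion above. The OCID in the example is made up.

```go
package main

import (
	"fmt"
)

// node is a minimal stand-in for *v1.Node, keeping only the fields the
// check needs. ProviderID corresponds to .spec.providerID.
type node struct {
	Name       string
	ProviderID string
}

// instanceIDsForNodes fails fast, with a user-facing message, as soon as a
// node without a provider ID is encountered, instead of passing an empty ID
// on to later API calls.
func instanceIDsForNodes(nodes []node) ([]string, error) {
	ids := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if n.ProviderID == "" {
			return nil, fmt.Errorf(".spec.providerID was not present on node %q", n.Name)
		}
		ids = append(ids, n.ProviderID)
	}
	return ids, nil
}

func main() {
	nodes := []node{
		{Name: "worker-01", ProviderID: "ocid1.instance.oc1..exampleuniqueid"},
		{Name: "master37-01-usashburn1"}, // provider ID never set
	}
	ids, err := instanceIDsForNodes(nodes)
	fmt.Println(ids, err)
}
```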

Contributor

@prydie prydie left a comment

@jhorwit2 can you give it another quick glance and accept the changes if you're happy?

@prydie
Contributor

prydie commented Aug 15, 2018

@jbornemann Do you mind rebasing your branch as it's now conflicting?

@jbornemann
Contributor Author

@prydie rebased

@prydie prydie merged commit 60400e6 into oracle:master Aug 15, 2018
ayushverma14 pushed a commit to ayushverma14/oci-cloud-controller-manager that referenced this pull request Jul 18, 2022
* Runtime checks around providerID

* Added additional required authorization for security lists to example