Stuck on replacing controlplane nodes #921


Closed
farodin91 opened this issue May 29, 2020 · 9 comments
Assignees
yastij
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.
Milestone
v0.7.0

Comments

@farodin91
Contributor

/kind bug

What steps did you take and what happened:

  • Create a Cluster
  • Wait for the cluster to be running
  • Change the configuration of the controlplane
  • A new controlplane node is spawned
  • The new controlplane node gets stuck in the phase Provisioning with the event: Waiting for control plane to pass control plane health check to continue reconciliation: control plane machine namespace/node-cp-v2-s8p8w has no status.nodeRef

What did you expect to happen:

  • The new controlplane node gets the nodeRef and switches to the phase Running
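
When reconciliation succeeds, the nodeRef shows up in the Machine's status roughly like this (an illustrative sketch of a v1alpha3 Machine status, not output from this cluster):

status:
  nodeRef:
    apiVersion: v1
    kind: Node
    name: node-cp-v2-s8p8w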

Anything else you would like to add:
Logs of the capi-controller-manager:

E0529 12:06:21.142555       1 machine_controller_noderef.go:98] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="node-cp-v2-s8p8w"
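
The error above comes from the Machine controller failing to parse the Node's spec.providerID, so the Machine is never matched to the Node and status.nodeRef stays empty. Both sides can be inspected directly; a minimal sketch, with the namespace placeholder taken verbatim from the event text:

# Should print the providerID set by the vSphere cloud provider; here it comes back empty
kubectl get node node-cp-v2-s8p8w -o jsonpath='{.spec.providerID}'
# Stays unset until the Node and Machine providerIDs can be matched
kubectl get machine node-cp-v2-s8p8w -n namespace -o jsonpath='{.status.nodeRef}'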

Environment:

  • Cluster-api version: v0.3.6
  • Cluster-api-provider-vsphere version: v0.6.4
  • Kubernetes version: (use kubectl version): 1.17.3
  • OS (e.g. from /etc/os-release): Ubuntu 18.04

I would like to help.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 29, 2020
@farodin91
Contributor Author

It seems to be related to kubernetes/cloud-provider-vsphere#326.

@yastij
Member

yastij commented Jun 2, 2020

@farodin91 - what is the output when you try to get the new node on the target cluster?

@yastij yastij added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Jun 2, 2020
@farodin91
Contributor Author

farodin91 commented Jun 2, 2020

kubectl get node node-cp-v2-92xc5 -o yaml

apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"csi.vsphere.vmware.com":"node-cp-v2-92xc5"}'
    kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 10.25.8.63/24
    projectcalico.org/IPv4IPIPTunnelAddr: 10.15.247.128
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-06-02T12:18:01Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: vsphere-vm.cpu-8.mem-8gb.os-linux
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: node-cp-v2-92xc5
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
  name: node-cp-v2-92xc5
  resourceVersion: "17595"
  selfLink: /api/v1/nodes/node-cp-v2-92xc5
  uid: cccd5cc7-a932-405f-9f38-3d05d143ba33
spec:
  podCIDR: 10.15.232.0/24
  podCIDRs:
  - 10.15.232.0/24
  providerID: vsphere://421241af-45c7-fbbf-2b7d-73d156d3c120
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
status:
  addresses:
  - address: 10.25.8.63
    type: InternalIP
  - address: node-cp-v2-92xc5
    type: Hostname
  allocatable:
    cpu: "8"
    ephemeral-storage: "18901337672"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8064980Ki
    pods: "110"
  capacity:
    cpu: "8"
    ephemeral-storage: 20509264Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8167380Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2020-06-02T12:18:51Z"
    lastTransitionTime: "2020-06-02T12:18:51Z"
    message: Calico is running on this node
    reason: CalicoIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2020-06-02T13:20:52Z"
    lastTransitionTime: "2020-06-02T12:18:00Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2020-06-02T13:20:52Z"
    lastTransitionTime: "2020-06-02T12:18:00Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2020-06-02T13:20:52Z"
    lastTransitionTime: "2020-06-02T12:18:00Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2020-06-02T13:20:52Z"
    lastTransitionTime: "2020-06-02T12:18:41Z"
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - k8s.gcr.io/etcd@sha256:4afb99b4690b418ffc2ceb67e1a17376457e441c1f09ab55447f0aaf992fa646
    - k8s.gcr.io/etcd:3.4.3-0
    sizeBytes: 100947667
  - names:
    - docker.io/calico/node@sha256:dbebe7e01ae85af68673a8e0ce51200ab8ae2a1c69d48dff5b95969b17eca7c2
    - docker.io/calico/node:v3.14.1
    sizeBytes: 90581056
  - names:
    - docker.io/calico/cni@sha256:84113c174b979e686de32094e552933e35d8fc7e2d532efcb9ace5310b65088c
    - docker.io/calico/cni:v3.14.1
    sizeBytes: 77638089
  - names:
    - gcr.io/cloud-provider-vsphere/csi/release/driver@sha256:149e87faaacda614ee95ec271b54c8bfdbd2bf5825abc12d45c654036b798229
    - gcr.io/cloud-provider-vsphere/csi/release/driver:v1.0.2
    sizeBytes: 75130938
  - names:
    - k8s.gcr.io/kube-apiserver@sha256:33400ea29255bd20714b6b8092b22ebb045ae134030d6bf476bddfed9d33e900
    - k8s.gcr.io/kube-apiserver:v1.17.3
    sizeBytes: 50633771
  - names:
    - k8s.gcr.io/kube-controller-manager@sha256:2f0bf4d08e72a1fd6327c8eca3a72ad21af3a608283423bb3c10c98e68759844
    - k8s.gcr.io/kube-controller-manager:v1.17.3
    sizeBytes: 48808424
  - names:
    - k8s.gcr.io/kube-proxy@sha256:3a70e2ab8d1d623680191a1a1f1dcb0bdbfd388784b1f153d5630a7397a63fd4
    - k8s.gcr.io/kube-proxy:v1.17.3
    sizeBytes: 48700427
  - names:
    - docker.io/calico/pod2daemon-flexvol@sha256:d125b9f3c24133bdaf90eaf2bee1d506240d39a77bda712eda3991b6b5d443f0
    - docker.io/calico/pod2daemon-flexvol:v3.14.1
    sizeBytes: 37526807
  - names:
    - k8s.gcr.io/kube-scheduler@sha256:b091f0db3bc61a3339fd3ba7ebb06c984c4ded32e1f2b1ef0fbdfab638e88462
    - k8s.gcr.io/kube-scheduler:v1.17.3
    sizeBytes: 33820167
  - names:
    - gcr.io/cloud-provider-vsphere/cpi/release/manager@sha256:64de5c7f10e55703142383fade40886091528ca505f00c98d57e27f10f04fc03
    - gcr.io/cloud-provider-vsphere/cpi/release/manager:v1.1.0
    sizeBytes: 16201394
  - names:
    - k8s.gcr.io/coredns@sha256:7ec975f167d815311a7136c32e70735f0d00b73781365df1befd46ed35bd4fe7
    - k8s.gcr.io/coredns:1.6.5
    sizeBytes: 13239960
  - names:
    - quay.io/k8scsi/csi-node-driver-registrar:v1.1.0
    sizeBytes: 6939423
  - names:
    - quay.io/k8scsi/livenessprobe:v1.1.0
    sizeBytes: 6690548
  - names:
    - k8s.gcr.io/pause@sha256:f78411e19d84a252e53bff71a4407a5686c46983a2c2eeed83929b888179acea
    - k8s.gcr.io/pause:3.1
    sizeBytes: 317164
  nodeInfo:
    architecture: amd64
    bootID: 4cbff854-25c5-4829-b758-5575f35b12fe
    containerRuntimeVersion: containerd://1.3.3
    kernelVersion: 4.15.0-88-generic
    kubeProxyVersion: v1.17.3
    kubeletVersion: v1.17.3
    machineID: db2b18283b014781b8f967f4f8566437
    operatingSystem: linux
    osImage: Ubuntu 18.04.4 LTS
    systemUUID: AF411242-C745-BFFB-2B7D-73D156D3C120

I tried two things on this node: first setting the providerID manually, and second tainting the node as uninitialized.
After this the node is marked as Running in the CAPI cluster. No new nodes are spawned after the node is marked as Running.

A difference between an old node and a new node is that no ExternalIP is set on the new one.

Creating the cluster initially worked without the manual changes.
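
The manual changes were along these lines (a sketch, not the exact commands; the UUID is the one from the node YAML above, and the taint is the standard cloud-provider uninitialized taint):

# Set the providerID that the cloud provider failed to populate (allowed while it is still empty)
kubectl patch node node-cp-v2-92xc5 -p '{"spec":{"providerID":"vsphere://421241af-45c7-fbbf-2b7d-73d156d3c120"}}'
# Mark the node uninitialized so the cloud-controller-manager picks it up again
kubectl taint node node-cp-v2-92xc5 node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule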

@yastij
Member

yastij commented Jun 2, 2020

@farodin91 - can you also grab the logs from the KCP and CAPI controllers? Also, are you able to reproduce it?

@farodin91
Contributor Author

It is reproducible, though I can only confirm that for the steps in the issue, not for the manual changes.
What is KCP?

@vincepri
Member

vincepri commented Jun 2, 2020

KCP: kubeadm-control-plane controller logs

@farodin91
Contributor Author

capi-controller-manager.log
capi-kubeadm-control-plane-controller-manager.log
capv-controller-manager.log

I created a new management cluster using kind and then ran all steps as described above.
Controlplane naming: test-cl1-cp-v1 for the first iteration, test-cl1-cp-v2 after the upgrade.
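
The logs were collected along these lines (a sketch assuming the default clusterctl provider namespaces):

kubectl logs -n capi-system deployment/capi-controller-manager -c manager > capi-controller-manager.log
kubectl logs -n capi-kubeadm-control-plane-system deployment/capi-kubeadm-control-plane-controller-manager -c manager > capi-kubeadm-control-plane-controller-manager.log
kubectl logs -n capv-system deployment/capv-controller-manager -c manager > capv-controller-manager.log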

@farodin91
Contributor Author

vsphere-cloud-controller-manager.log

I just redacted some internal details.

@yastij yastij added this to the v0.7.0 milestone Jun 9, 2020
@yastij yastij self-assigned this Jun 9, 2020
@farodin91
Contributor Author

The issue seems to be resolved with cluster-api-provider-vsphere 0.6.6 and cluster-api 0.3.7-rc1.
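
For anyone hitting the same thing: assuming a clusterctl-managed installation, the available provider upgrades can be listed with:

# Shows which provider versions the management cluster can be upgraded to
clusterctl upgrade plan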
