Stuck on replacing controlplane nodes #921


Closed
farodin91 opened this issue May 29, 2020 · 9 comments
Assignees
yastij
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.
Milestone
v0.7.0

Comments

@farodin91
Contributor

/kind bug

What steps did you take and what happened:

  • Create a Cluster
  • Wait for the cluster to be running
  • Change the configuration of the controlplane
  • A new controlplane node is spawned
  • The new controlplane node gets stuck in the phase Provisioning with the event: Waiting for control plane to pass control plane health check to continue reconciliation: control plane machine namespace/node-cp-v2-s8p8w has no status.nodeRef

What did you expect to happen:

  • The new controlplane node gets the nodeRef and switches to the phase Running
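
When reconciliation succeeds, the nodeRef shows up in the Machine's status roughly like this (an illustrative sketch of a v1alpha3 Machine status, not output from this cluster):

status:
  nodeRef:
    apiVersion: v1
    kind: Node
    name: node-cp-v2-s8p8w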

Anything else you would like to add:
Logs of the capi-controller-manager:

E0529 12:06:21.142555       1 machine_controller_noderef.go:98] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="node-cp-v2-s8p8w"
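
The error above comes from the Machine controller failing to parse the Node's spec.providerID, so the Machine is never matched to the Node and status.nodeRef stays empty. Both sides can be inspected directly; a minimal sketch, with the namespace placeholder taken verbatim from the event text:

# Should print the providerID set by the vSphere cloud provider; here it comes back empty
kubectl get node node-cp-v2-s8p8w -o jsonpath='{.spec.providerID}'
# Stays unset until the Node and Machine providerIDs can be matched
kubectl get machine node-cp-v2-s8p8w -n namespace -o jsonpath='{.status.nodeRef}'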

Environment:

  • Cluster-api version: v0.3.6
  • Cluster-api-provider-vsphere version: v0.6.4
  • Kubernetes version: (use kubectl version): 1.17.3
  • OS (e.g. from /etc/os-release): Ubuntu 18.04

I would like to help.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 29, 2020
@farodin91
Contributor Author

It seems to be related to kubernetes/cloud-provider-vsphere#326.

@yastij
Member

yastij commented Jun 2, 2020

@farodin91 - what is the output when you try to get the new node on the target cluster?

@yastij yastij added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Jun 2, 2020
@farodin91
Contributor Author

farodin91 commented Jun 2, 2020

kubectl get node node-cp-v2-92xc5 -o yaml

apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"csi.vsphere.vmware.com":"node-cp-v2-92xc5"}'
    kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 10.25.8.63/24
    projectcalico.org/IPv4IPIPTunnelAddr: 10.15.247.128
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2020-06-02T12:18:01Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: vsphere-vm.cpu-8.mem-8gb.os-linux
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: node-cp-v2-92xc5
    kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
  name: node-cp-v2-92xc5
  resourceVersion: "17595"
  selfLink: /api/v1/nodes/node-cp-v2-92xc5
  uid: cccd5cc7-a932-405f-9f38-3d05d143ba33
spec:
  podCIDR: 10.15.232.0/24
  podCIDRs:
  - 10.15.232.0/24
  providerID: vsphere://421241af-45c7-fbbf-2b7d-73d156d3c120
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
status:
  addresses:
  - address: 10.25.8.63
    type: InternalIP
  - address: node-cp-v2-92xc5
    type: Hostname
  allocatable:
    cpu: "8"
    ephemeral-storage: "18901337672"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8064980Ki
    pods: "110"
  capacity:
    cpu: "8"
    ephemeral-storage: 20509264Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8167380Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2020-06-02T12:18:51Z"
    lastTransitionTime: "2020-06-02T12:18:51Z"
    message: Calico is running on this node
    reason: CalicoIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2020-06-02T13:20:52Z"
    lastTransitionTime: "2020-06-02T12:18:00Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2020-06-02T13:20:52Z"
    lastTransitionTime: "2020-06-02T12:18:00Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2020-06-02T13:20:52Z"
    lastTransitionTime: "2020-06-02T12:18:00Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2020-06-02T13:20:52Z"
    lastTransitionTime: "2020-06-02T12:18:41Z"
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - k8s.gcr.io/etcd@sha256:4afb99b4690b418ffc2ceb67e1a17376457e441c1f09ab55447f0aaf992fa646
    - k8s.gcr.io/etcd:3.4.3-0
    sizeBytes: 100947667
  - names:
    - docker.io/calico/node@sha256:dbebe7e01ae85af68673a8e0ce51200ab8ae2a1c69d48dff5b95969b17eca7c2
    - docker.io/calico/node:v3.14.1
    sizeBytes: 90581056
  - names:
    - docker.io/calico/cni@sha256:84113c174b979e686de32094e552933e35d8fc7e2d532efcb9ace5310b65088c
    - docker.io/calico/cni:v3.14.1
    sizeBytes: 77638089
  - names:
    - gcr.io/cloud-provider-vsphere/csi/release/driver@sha256:149e87faaacda614ee95ec271b54c8bfdbd2bf5825abc12d45c654036b798229
    - gcr.io/cloud-provider-vsphere/csi/release/driver:v1.0.2
    sizeBytes: 75130938
  - names:
    - k8s.gcr.io/kube-apiserver@sha256:33400ea29255bd20714b6b8092b22ebb045ae134030d6bf476bddfed9d33e900
    - k8s.gcr.io/kube-apiserver:v1.17.3
    sizeBytes: 50633771
  - names:
    - k8s.gcr.io/kube-controller-manager@sha256:2f0bf4d08e72a1fd6327c8eca3a72ad21af3a608283423bb3c10c98e68759844
    - k8s.gcr.io/kube-controller-manager:v1.17.3
    sizeBytes: 48808424
  - names:
    - k8s.gcr.io/kube-proxy@sha256:3a70e2ab8d1d623680191a1a1f1dcb0bdbfd388784b1f153d5630a7397a63fd4
    - k8s.gcr.io/kube-proxy:v1.17.3
    sizeBytes: 48700427
  - names:
    - docker.io/calico/pod2daemon-flexvol@sha256:d125b9f3c24133bdaf90eaf2bee1d506240d39a77bda712eda3991b6b5d443f0
    - docker.io/calico/pod2daemon-flexvol:v3.14.1
    sizeBytes: 37526807
  - names:
    - k8s.gcr.io/kube-scheduler@sha256:b091f0db3bc61a3339fd3ba7ebb06c984c4ded32e1f2b1ef0fbdfab638e88462
    - k8s.gcr.io/kube-scheduler:v1.17.3
    sizeBytes: 33820167
  - names:
    - gcr.io/cloud-provider-vsphere/cpi/release/manager@sha256:64de5c7f10e55703142383fade40886091528ca505f00c98d57e27f10f04fc03
    - gcr.io/cloud-provider-vsphere/cpi/release/manager:v1.1.0
    sizeBytes: 16201394
  - names:
    - k8s.gcr.io/coredns@sha256:7ec975f167d815311a7136c32e70735f0d00b73781365df1befd46ed35bd4fe7
    - k8s.gcr.io/coredns:1.6.5
    sizeBytes: 13239960
  - names:
    - quay.io/k8scsi/csi-node-driver-registrar:v1.1.0
    sizeBytes: 6939423
  - names:
    - quay.io/k8scsi/livenessprobe:v1.1.0
    sizeBytes: 6690548
  - names:
    - k8s.gcr.io/pause@sha256:f78411e19d84a252e53bff71a4407a5686c46983a2c2eeed83929b888179acea
    - k8s.gcr.io/pause:3.1
    sizeBytes: 317164
  nodeInfo:
    architecture: amd64
    bootID: 4cbff854-25c5-4829-b758-5575f35b12fe
    containerRuntimeVersion: containerd://1.3.3
    kernelVersion: 4.15.0-88-generic
    kubeProxyVersion: v1.17.3
    kubeletVersion: v1.17.3
    machineID: db2b18283b014781b8f967f4f8566437
    operatingSystem: linux
    osImage: Ubuntu 18.04.4 LTS
    systemUUID: AF411242-C745-BFFB-2B7D-73D156D3C120

I tried two things on this node: first setting the providerID manually, and second tainting the node as uninitialized.
After this the node is marked as Running in the CAPI cluster. No new nodes are spawned after the node is marked as Running.

A difference between an old node and a new node is that no ExternalIP is set on the new one.

Creating the cluster initially worked without the manual changes.
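
The manual changes were along these lines (a sketch, not the exact commands; the UUID is the one from the node YAML above, and the taint is the standard cloud-provider uninitialized taint):

# Set the providerID that the cloud provider failed to populate (allowed while it is still empty)
kubectl patch node node-cp-v2-92xc5 -p '{"spec":{"providerID":"vsphere://421241af-45c7-fbbf-2b7d-73d156d3c120"}}'
# Mark the node uninitialized so the cloud-controller-manager picks it up again
kubectl taint node node-cp-v2-92xc5 node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule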

@yastij
Member

yastij commented Jun 2, 2020

@farodin91 - can you also grab the logs from the KCP and CAPI controllers? Also, are you able to reproduce it?

@farodin91
Contributor Author

It is reproducible, though I can only confirm that for the steps in the issue, not for the manual changes.
What is KCP?

@vincepri
Member

vincepri commented Jun 2, 2020

KCP: kubeadm-control-plane controller logs

@farodin91
Contributor Author

capi-controller-manager.log
capi-kubeadm-control-plane-controller-manager.log
capv-controller-manager.log

I created a new management cluster using kind and then ran all steps as described above.
Controlplane naming: test-cl1-cp-v1 for the first iteration, test-cl1-cp-v2 after the upgrade.
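
The logs were collected along these lines (a sketch assuming the default clusterctl provider namespaces):

kubectl logs -n capi-system deployment/capi-controller-manager -c manager > capi-controller-manager.log
kubectl logs -n capi-kubeadm-control-plane-system deployment/capi-kubeadm-control-plane-controller-manager -c manager > capi-kubeadm-control-plane-controller-manager.log
kubectl logs -n capv-system deployment/capv-controller-manager -c manager > capv-controller-manager.log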

@farodin91
Contributor Author

vsphere-cloud-controller-manager.log

I just redacted some internal details.

@yastij yastij added this to the v0.7.0 milestone Jun 9, 2020
@yastij yastij self-assigned this Jun 9, 2020
@farodin91
Contributor Author

The issue seems to be resolved with cluster-api-provider-vsphere 0.6.6 and cluster-api 0.3.7-rc1.
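
For anyone hitting the same thing: assuming a clusterctl-managed installation, the available provider upgrades can be listed with:

# Shows which provider versions the management cluster can be upgraded to
clusterctl upgrade plan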
