MachinePool ready state leading to not processing providerIDs in CAPI #4982

mweibel · 2024-07-12T10:45:56Z

/kind bug

What steps did you take and what happened:
The following code determines ready state for a AzureMachinePool:

cluster-api-provider-azure/azure/scope/machinepool.go

Lines 571 to 603 in 9079793

    
           func (m *MachinePoolScope) setProvisioningStateAndConditions(v infrav1.ProvisioningState) { 
        
           	m.AzureMachinePool.Status.ProvisioningState = &v 
        
           	switch { 
        
           	case v == infrav1.Succeeded && *m.MachinePool.Spec.Replicas == m.AzureMachinePool.Status.Replicas: 
        
           		// vmss is provisioned with enough ready replicas 
        
           		conditions.MarkTrue(m.AzureMachinePool, infrav1.ScaleSetRunningCondition) 
        
           		conditions.MarkTrue(m.AzureMachinePool, infrav1.ScaleSetModelUpdatedCondition) 
        
           		conditions.MarkTrue(m.AzureMachinePool, infrav1.ScaleSetDesiredReplicasCondition) 
        
           		m.SetReady() 
        
           	case v == infrav1.Succeeded && *m.MachinePool.Spec.Replicas != m.AzureMachinePool.Status.Replicas: 
        
           		// not enough ready or too many ready replicas we must still be scaling up or down 
        
           		updatingState := infrav1.Updating 
        
           		m.AzureMachinePool.Status.ProvisioningState = &updatingState 
        
           		if *m.MachinePool.Spec.Replicas > m.AzureMachinePool.Status.Replicas { 
        
           			conditions.MarkFalse(m.AzureMachinePool, infrav1.ScaleSetDesiredReplicasCondition, infrav1.ScaleSetScaleUpReason, clusterv1.ConditionSeverityInfo, "") 
        
           		} else { 
        
           			conditions.MarkFalse(m.AzureMachinePool, infrav1.ScaleSetDesiredReplicasCondition, infrav1.ScaleSetScaleDownReason, clusterv1.ConditionSeverityInfo, "") 
        
           		} 
        
           		m.SetNotReady() 
        
           	case v == infrav1.Updating: 
        
           		conditions.MarkFalse(m.AzureMachinePool, infrav1.ScaleSetModelUpdatedCondition, infrav1.ScaleSetModelOutOfDateReason, clusterv1.ConditionSeverityInfo, "") 
        
           		m.SetNotReady() 
        
           	case v == infrav1.Creating: 
        
           		conditions.MarkFalse(m.AzureMachinePool, infrav1.ScaleSetRunningCondition, infrav1.ScaleSetCreatingReason, clusterv1.ConditionSeverityInfo, "") 
        
           		m.SetNotReady() 
        
           	case v == infrav1.Deleting: 
        
           		conditions.MarkFalse(m.AzureMachinePool, infrav1.ScaleSetRunningCondition, infrav1.ScaleSetDeletingReason, clusterv1.ConditionSeverityInfo, "") 
        
           		m.SetNotReady() 
        
           	default: 
        
           		conditions.MarkFalse(m.AzureMachinePool, infrav1.ScaleSetRunningCondition, string(v), clusterv1.ConditionSeverityInfo, "") 
        
           		m.SetNotReady() 
        
           	} 
        
           }

The following CAPI code is not run if AzureMachinePool is not ready:
https://github.com/kubernetes-sigs/cluster-api/blob/8d639f1fad564eecf5bda0a2ee03c8a38896a184/exp/internal/controllers/machinepool_controller_phases.go#L290-L319

If I'm right, this logic together has the following effect:

if AzureMachinePool is scaling up or down, or is having an issue (e.g. one VM failed due to bootstrapping and resulting in provisioningState: Failed), the MachinePool does not get reconciled anymore until the ready status changes back again.
this means e.g. that the providerIDList of the MachinePool is not updated anymore
cleanup or addition of new Machines is not processed anymore
label sync can take longer

This is a bug which can lead to issues with the known machines in a cluster. E.g. cluster-autoscaler with clusterapi provider doesn't know about certain machines.

I'm not sure whether the bug is in CAPZ or in CAPI:

the VMSS is still able to scale up or down and function even if the AzureMachinePool is marked not ready. I feel this is a bug in CAPZ
the providerID list should still be updated in CAPI, even if it's not marked ready. I feel this is a bug in CAPI, but as discussed in cluster-api#9858 that depends on the contract

What did you expect to happen:
Scaling up/down works without issues and also a single VM doesn't impact the functioning of the full VMSS.

Anything else you would like to add:
I guess this is initially more of a discussion point because there could be multiple facets of this issue.

Environment:

cluster-api-provider-azure version: latest master
Kubernetes version: (use kubectl version): 1.28.5
OS (e.g. from /etc/os-release): linux/windows

The text was updated successfully, but these errors were encountered:

willie-yao · 2024-08-15T16:42:22Z

/priority backlog

k8s-triage-robot · 2024-11-13T16:58:04Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

willie-yao · 2024-11-13T18:05:26Z

/remove-lifecycle stale

k8s-triage-robot · 2025-02-11T18:58:08Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

willie-yao · 2025-02-12T18:16:57Z

/remove-lifecycle stale

mweibel · 2025-04-01T14:09:50Z

FTR this is generating a lot of issues for us and I'll work on proposing a PR to fix this behavior.

willie-yao · 2025-04-01T22:45:47Z

Thank you for working on this @mweibel! Let us know if you need help with reviews or anything else.

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 12, 2024

github-project-automation bot added this to CAPZ Planning Jul 12, 2024

dtzar added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 16, 2024

k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Aug 15, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 13, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 13, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 11, 2025

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 12, 2025

mweibel linked a pull request Apr 2, 2025 that will close this issue

MachinePool: avoid SetNotReady during normal processing #5537

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MachinePool ready state leading to not processing providerIDs in CAPI #4982

MachinePool ready state leading to not processing providerIDs in CAPI #4982

mweibel commented Jul 12, 2024 •

edited

Loading

willie-yao commented Aug 15, 2024

k8s-triage-robot commented Nov 13, 2024

willie-yao commented Nov 13, 2024

k8s-triage-robot commented Feb 11, 2025

willie-yao commented Feb 12, 2025

mweibel commented Apr 1, 2025

willie-yao commented Apr 1, 2025

MachinePool ready state leading to not processing providerIDs in CAPI #4982

MachinePool ready state leading to not processing providerIDs in CAPI #4982

Comments

mweibel commented Jul 12, 2024 • edited Loading

willie-yao commented Aug 15, 2024

k8s-triage-robot commented Nov 13, 2024

willie-yao commented Nov 13, 2024

k8s-triage-robot commented Feb 11, 2025

willie-yao commented Feb 12, 2025

mweibel commented Apr 1, 2025

willie-yao commented Apr 1, 2025

mweibel commented Jul 12, 2024 •

edited

Loading