[KCP]: Error in machine selection logic during scale down #2760

Closed
sedefsavas opened this issue Mar 23, 2020 · 6 comments · Fixed by #2768
Labels: area/control-plane (issues or PRs related to control-plane lifecycle management), kind/bug (categorizes issue or PR as related to a bug)

@sedefsavas

What steps did you take and what happened:

  1. make test-capd-e2e-full
  2. A 3-node cluster came up, and after the KCP upgrade step in the e2e test, a replacement machine was created. scaleDownControlPlane() then failed to find a machine to delete, with this error:
    failed to pick control plane Machine to mark for deletion

What did you expect to happen:
I would expect it to pick one of the machines that has the old version and remove it.

Anything else you would like to add:
This happens when there are multiple failure domains:
KCP's scaleDownControlPlane() finds the failure domain with the most machines, then looks among the machines that have UpgradeReplacementCreatedAnnotation for one that is in that failure domain. Such a machine does not always exist.

fd := controlPlane.FailureDomainWithMostMachines()
machinesInFailureDomain := selectedMachines.Filter(machinefilters.InFailureDomains(fd))

Say we have machines with the following failure domains: {Machine1: fd1, Machine2: fd2, Machine3: fd2}.
upgradeControlPlane() calls scaleDownControlPlane() with ownedMachines={Machine1: fd1, Machine2: fd2, Machine3: fd2} and selectedMachines={Machine1: fd1}

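In scaleDownControlPlane(), fd will be calculated as fd2, and it will then try to find a machine in fd2 among selectedMachines. None of the selected machines is in fd2, so the filtered set is empty and no machine can be picked for deletion.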
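
The mismatch can be reproduced in isolation. The following is a minimal sketch, not the KCP code: the `machine` type and the helpers `failureDomainWithMostMachines` and `inFailureDomain` are hypothetical stand-ins for `controlPlane.FailureDomainWithMostMachines()` and `machinefilters.InFailureDomains`, and the lexical tie-break is an assumption made only to keep the result deterministic.

```go
package main

import "fmt"

// machine is a hypothetical, simplified stand-in for a control plane Machine.
type machine struct {
	name          string
	failureDomain string
}

// failureDomainWithMostMachines returns the failure domain that holds the most
// machines. Ties are broken lexically (an assumption of this sketch).
func failureDomainWithMostMachines(machines []machine) string {
	counts := map[string]int{}
	for _, m := range machines {
		counts[m.failureDomain]++
	}
	best, bestCount := "", -1
	for fd, n := range counts {
		if n > bestCount || (n == bestCount && fd < best) {
			best, bestCount = fd, n
		}
	}
	return best
}

// inFailureDomain keeps only the machines located in fd.
func inFailureDomain(machines []machine, fd string) []machine {
	var out []machine
	for _, m := range machines {
		if m.failureDomain == fd {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	owned := []machine{{"Machine1", "fd1"}, {"Machine2", "fd2"}, {"Machine3", "fd2"}}
	selected := []machine{{"Machine1", "fd1"}}

	// The failure domain is computed from all owned machines...
	fd := failureDomainWithMostMachines(owned)
	// ...but the filter is applied to the selected machines only.
	candidates := inFailureDomain(selected, fd)

	fmt.Printf("failure domain: %s, candidates: %d\n", fd, len(candidates))
}
```

With the inputs from the example, the sketch reports fd2 and zero candidates, which corresponds to the "failed to pick control plane Machine to mark for deletion" error.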

Example unit test for this: sedefsavas#5
This is the reason #2758 is failing.

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 23, 2020
@vincepri
Member

@detiber @chuckha Are failure domains implemented in KCP as a required feature? If they are, we should document this somewhere or return an error to users. Otherwise, it seems like we need to put checks in place to make sure that it is fine to proceed with an empty failure domain.

@vincepri
Member

/milestone v0.3.x

@k8s-ci-robot k8s-ci-robot added this to the v0.3.x milestone Mar 24, 2020
@vincepri
Member

vincepri commented Mar 24, 2020

/area control-plane

@detiber
Member

detiber commented Mar 24, 2020

@vincepri it should work with or without failure domains. It's been a while since I've reviewed, but for scale down the logic used to prioritize Machines in failure domains that no longer exist and Machines with no failure domain.

It also previously used the group of "selected" machines to choose a failure domain for scale down, since the selected machine would then already belong to the failure domain we had chosen: one that is no longer present, one that is not defined on the machine, or the most highly populated one. I suspect one of the more recent changes may have made those computations diverge.
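
The prioritization described above could look roughly like the sketch below. It is an illustration of the recollection in this comment, not the KCP implementation: the `machine` type, `scaleDownCandidates`, and the `validDomains` set are hypothetical, and the relative order of the first two tiers (stale failure domain before unset failure domain) is an assumption.

```go
package main

import "fmt"

// machine is a hypothetical, simplified stand-in for a control plane Machine.
type machine struct {
	name          string
	failureDomain string // empty means no failure domain is set
}

// scaleDownCandidates returns the machines preferred for deletion:
//  1. machines whose failure domain no longer exists,
//  2. machines with no failure domain,
//  3. machines in the most populated failure domain.
func scaleDownCandidates(machines []machine, validDomains map[string]bool) []machine {
	var stale, unset []machine
	counts := map[string]int{}
	for _, m := range machines {
		switch {
		case m.failureDomain == "":
			unset = append(unset, m)
		case !validDomains[m.failureDomain]:
			stale = append(stale, m)
		default:
			counts[m.failureDomain]++
		}
	}
	if len(stale) > 0 {
		return stale
	}
	if len(unset) > 0 {
		return unset
	}

	// Find the highest population, then the first domain (in machine order)
	// that reaches it, so ties resolve deterministically.
	best := 0
	for _, n := range counts {
		if n > best {
			best = n
		}
	}
	bestFD := ""
	for _, m := range machines {
		if counts[m.failureDomain] == best {
			bestFD = m.failureDomain
			break
		}
	}
	var out []machine
	for _, m := range machines {
		if m.failureDomain == bestFD {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	valid := map[string]bool{"fd1": true, "fd2": true}
	machines := []machine{{"Machine1", "fd1"}, {"Machine2", "fd2"}, {"Machine3", "fd2"}}
	for _, m := range scaleDownCandidates(machines, valid) {
		fmt.Println(m.name)
	}
}
```

Because the candidates are derived from the same machine set that determines the failure domain, the two computations cannot diverge in this sketch.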

@vincepri vincepri added the area/control-plane Issues or PRs related to control-plane lifecycle management label Mar 24, 2020
@sedefsavas
Author

The current logic works for scaling down when the same set of machines is used for both ownedMachines and selectedMachines, which is the case for the regular scale down in reconcile().

I am changing the logic to pick a failureDomain from the selectedMachines.
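
A minimal sketch of that idea, assuming simplified types: the `machine` type and `pickMachineToDelete` are hypothetical names, and this is not the change that was eventually merged. The failure domain counts are computed from the selected machines themselves, so the chosen domain always contains at least one eligible machine.

```go
package main

import (
	"errors"
	"fmt"
)

// machine is a hypothetical, simplified stand-in for a control plane Machine.
type machine struct {
	name          string
	failureDomain string
}

// pickMachineToDelete returns the first selected machine that sits in the most
// populated failure domain, counting only the selected machines.
func pickMachineToDelete(selected []machine) (machine, error) {
	if len(selected) == 0 {
		return machine{}, errors.New("no machines eligible for deletion")
	}
	counts := map[string]int{}
	for _, m := range selected {
		counts[m.failureDomain]++
	}
	// Iterate in slice order so ties resolve to the earliest machine.
	best := selected[0]
	for _, m := range selected[1:] {
		if counts[m.failureDomain] > counts[best.failureDomain] {
			best = m
		}
	}
	return best, nil
}

func main() {
	// Only Machine1 is eligible, as in the upgrade scenario from this issue.
	selected := []machine{{"Machine1", "fd1"}}
	m, err := pickMachineToDelete(selected)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("delete %s in %s\n", m.name, m.failureDomain)
}
```

For the scenario in this issue, the sketch selects Machine1 in fd1 instead of failing on fd2.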

@sedefsavas
Author

/assign

4 participants