Separate Machine APIs from Cluster APIs #490

Closed · 6 tasks
enxebre opened this issue Sep 5, 2018 · 19 comments

Labels
  • area/api: Issues or PRs related to the APIs
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@enxebre (Member) commented Sep 5, 2018

I'd like to bring up a conversation about evolving the Machine API at a separate cadence from the Cluster API. This ticket aims to facilitate and track a discussion around beta requirements for the machine API, distinct from the beta requirements for the Cluster API project as a whole. A beta API for machines should enable provisioning and deprovisioning machines via a Kubernetes-style API, support basic health management, and define clear policies for handling scale up and down of sets.

Proposed changes:

  • Evolve the Machine API at a separate cadence from Cluster API
  • Split the machine types into a separate API group (machines.k8s.io) from the cluster types (cluster.k8s.io) in order to support serving only the types in that group and to enable gradual adoption of those types in more deployment environments.
  • Machine health checking
  • Machine deletion strategy
  • Make machine.spec.versions and machine.status.versions optional fields
  • Remove the cluster assumption from actuators: move from a (Cluster, Machine) tuple to something like (Context, Machine); see the sketch below
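
For illustration, a minimal sketch of what a cluster-agnostic (Context, Machine) actuator interface could look like. Every name in it (MachineContext, the Machine stub, the field names) is hypothetical, not a released Cluster API type:

```go
// Package actuator sketches the proposed cluster-agnostic interface.
package actuator

import "context"

// Machine is a stand-in for the Machine type that would live in the
// proposed machines.k8s.io API group.
type Machine struct {
	Name         string
	ProviderSpec []byte // provider-specific configuration, opaque to the controller
}

// MachineContext carries whatever environment information an actuator needs
// without assuming a Cluster object exists, e.g. provider-level config that
// today comes from the Cluster's ProviderConfig/ProviderStatus.
type MachineContext struct {
	ProviderConfig []byte
	ProviderStatus []byte
}

// Actuator is the proposed (Context, Machine) shape replacing today's
// (Cluster, Machine) tuple.
type Actuator interface {
	Create(ctx context.Context, mctx *MachineContext, machine *Machine) error
	Delete(ctx context.Context, mctx *MachineContext, machine *Machine) error
	Update(ctx context.Context, mctx *MachineContext, machine *Machine) error
	Exists(ctx context.Context, mctx *MachineContext, machine *Machine) (bool, error)
}
```

Passing an explicit context struct instead of the full Cluster object would let deployments that have no Cluster at all still satisfy the interface.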

@dlipovetsky (Contributor) commented Sep 5, 2018

I'd like to understand what the Context object would be. If it does not have all the fields from the Cluster object, will it be able to provide the information needed by all the different machine actuators?

@dlipovetsky (Contributor) commented Sep 5, 2018

If we want to remove the cluster assumption from the machine actuator, there is an alternative to introducing a Context object: We can change the machine controller so that it passes only the Machine object to the actuator.

The actuator would need to identify and get the Cluster object from the API (both are done by the machine controller today). To identify it, the actuator could follow a reference to the Cluster object (adding this reference to the Machine object has been discussed in #41 and #252).
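
A minimal sketch of that alternative, assuming the cluster.k8s.io/v1alpha1 types of that era and a controller-runtime client; the cluster-name label used as the link is an assumed convention here, since the actual reference field was still under discussion in #41 and #252:

```go
package actuator

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/types"
	clusterv1 "sigs.k8s.io/cluster-api/pkg/apis/cluster/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// resolveCluster looks up the Cluster object a Machine references, if any.
// Returning (nil, nil) keeps cluster-less deployments working.
func resolveCluster(ctx context.Context, c client.Client, m *clusterv1.Machine) (*clusterv1.Cluster, error) {
	name := m.Labels["cluster.k8s.io/cluster-name"] // assumed linking convention
	if name == "" {
		return nil, nil // no cluster reference; the actuator must cope without one
	}
	cluster := &clusterv1.Cluster{}
	key := types.NamespacedName{Namespace: m.Namespace, Name: name}
	if err := c.Get(ctx, key, cluster); err != nil {
		return nil, fmt.Errorf("machine %q references cluster %q: %v", m.Name, name, err)
	}
	return cluster, nil
}
```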

@detiber (Member) commented Sep 5, 2018

> If we want to remove the cluster assumption from the machine actuator, there is an alternative to introducing a Context object: We can change the machine controller so that it passes only the Machine object to the actuator.

I also wonder how much this actually accomplishes. Considering that the provider implementations are still very much in flux, it seems like this Context object would still need to pass in the full ProviderConfig and ProviderStatus from the Cluster object to ensure that actuator implementations are able to retrieve whatever config/status they require.

@enxebre (Member Author) commented Sep 10, 2018

@dlipovetsky that sounds reasonable.

The Machine API seems close to feature complete. Having a dedicated API group and dropping the controller coupling will likely allow us to evolve it more quickly and make it consumable by tools that need beta status, which will also favour adoption of the project.

cc @dgoodwin @csrwng wdyt?

@csrwng (Contributor) commented Sep 10, 2018

> Remove the cluster assumption from the actuator interface
>
> Separately keep discussion on how this fetch/link would happen.

@enxebre I agree, this would allow us to separate concerns.

Given that the general agreement on referencing the cluster from the machine resource was to make it an optional reference, it makes sense that the actuator implementation be where the link is made if necessary.

@hardikdr (Member) commented Sep 14, 2018

IMHO we should try to avoid referencing the cluster object from the actuator interface in the machine controller: ref
I can see the following supporting points:

  • Users might actually want to use the machine-api stack (the machine-* controllers) independent of the cluster controller.

    • I recall this being discussed when someone asked about running the machine controller against an existing GKE cluster without worrying about a cluster object.
  • From a design standpoint, it would be nice to expect each controller to only use objects from the same or a lower layer: downward dependency only.

    • For instance, the machineset controller should have visibility of only Machine and MachineSet objects.
    • This would also let us fine-grain the responsibilities across controller layers and give users the freedom to choose a very specific subset of controllers if they don't intend to use the higher-level ones.

Regarding the overall flow, we could expect the cluster controller to generate the necessary MachineDeployment and MachineClass objects with a fully-contained providerConfig. The MachineDeployment controller would then process the MachineDeployment and generate MachineSets, which in turn generate Machines.
The actuator interface would only require the MachineClass/ProviderConfig and the Machine object as parameters, which should be sufficient for the actual machine lifecycle (see the sketch below).
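
For illustration, a sketch of that lowest-layer actuator shape, parameterized only by the MachineClass (carrying the providerConfig) and the Machine. The interface itself is hypothetical; the type names assume the cluster.k8s.io/v1alpha1 API of that era:

```go
package actuator

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/pkg/apis/cluster/v1alpha1"
)

// ClassBasedActuator is a hypothetical lowest-layer interface: enough to
// drive the machine lifecycle without any Cluster object in scope.
type ClassBasedActuator interface {
	Create(ctx context.Context, class *clusterv1.MachineClass, machine *clusterv1.Machine) error
	Delete(ctx context.Context, class *clusterv1.MachineClass, machine *clusterv1.Machine) error
	Update(ctx context.Context, class *clusterv1.MachineClass, machine *clusterv1.Machine) error
	Exists(ctx context.Context, class *clusterv1.MachineClass, machine *clusterv1.Machine) (bool, error)
}
```

A MachineSet controller built this way only ever reads MachineSets and Machines, matching the downward-dependency rule above.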

@derekwaynecarr (Contributor) commented:

@enxebre we also need to determine how we coordinate draining nodes prior to removing the machine. Not having a consistent drain will cause problems with stateful applications.

@roberthbailey commented:

@derekwaynecarr - I believe that the drain logic is intended to be part of the common code in the machine controller, with the goal of making it consistent. I don't recall if it's in there yet or not.

@enxebre (Member Author) commented Oct 1, 2018

@roberthbailey @derekwaynecarr we are currently patching the core machine controller to drain the node on a delete event. For this to work we also run a custom controller that links nodes with machines and populates machine.Status.NodeRef so the right node can be drained, hence the need for #497. A sketch of that drain step follows below.
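
For illustration, a minimal sketch of that drain-on-delete step, assuming machine.Status.NodeRef is already populated (the reason for #497). It uses the k8s.io/kubectl/pkg/drain helpers rather than the patched controller code, and field names vary across client versions:

```go
package machinecontroller

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
	clusterv1 "sigs.k8s.io/cluster-api/pkg/apis/cluster/v1alpha1"
)

// drainNodeForMachine cordons and drains the node linked to the machine
// before the actuator deletes the backing instance.
func drainNodeForMachine(ctx context.Context, kubeClient kubernetes.Interface, machine *clusterv1.Machine) error {
	if machine.Status.NodeRef == nil {
		return nil // nothing to drain; no node was ever linked to this machine
	}
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              kubeClient,
		Force:               true, // also evict pods not owned by a controller
		IgnoreAllDaemonSets: true,
		GracePeriodSeconds:  -1, // respect each pod's own grace period
		Timeout:             5 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	node, err := kubeClient.CoreV1().Nodes().Get(ctx, machine.Status.NodeRef.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Cordon first so no new pods land on the node, then evict what is there.
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return err
	}
	return drain.RunNodeDrain(helper, node.Name)
}
```

Running a step like this before calling the actuator's Delete keeps eviction behaviour consistent regardless of provider.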

@roberthbailey commented:

The drain logic definitely isn't in the machine controller yet. We should add it. I might be able to put together a PR for this in the next week or so, since we have drain logic floating around in a number of places already.

@enxebre (Member Author) commented Nov 2, 2018

@roberthbailey we've been patching the core controller to include the logic here: https://github.com/openshift/cluster-api-provider-aws/pull/60/files
I'm happy to create a similar PR here if there's interest. We should probably also consider consuming this logic from a common place, e.g. something similar to https://github.com/openshift/kubernetes-drain

@roberthbailey commented:

Ideally we will get server-side drain at some point... I'll take a look at the OS drain code; it's probably similar to what we have and is already an easily consumable package.

enxebre mentioned this issue Dec 18, 2018
roberthbailey changed the title from "machine api beta - exit criteria" to "Separate Machine APIs from Cluster APIs" Jan 11, 2019
roberthbailey added this to the Next milestone Jan 11, 2019
roberthbailey added the priority/important-longterm label Jan 11, 2019
@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Apr 28, 2019
@vincepri (Member) commented:

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Apr 28, 2019
@vincepri (Member) commented:

/area api

k8s-ci-robot added the area/api label Jun 10, 2019
@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Sep 8, 2019
@vincepri (Member) commented:

Closing this issue given that the project's scope and objectives specify that this is out of scope.

/close

@k8s-ci-robot (Contributor) commented:

@vincepri: Closing this issue.

In response to this:

> Closing this issue given that the project's scope and objectives specify that this is out of scope.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
