Commit c904d5f

CAEP: Make Cluster Infra Resource Optional

1 parent 7ee0cd4 commit c904d5f

2 files changed: +221 -9 lines changed

docs/proposals/20220725-managed-kubernetes.md (+14 -9)

@@ -65,13 +65,13 @@ superseded-by:
 - [Add-ons management](#add-ons-management)
 - [Upgrade Strategy](#upgrade-strategy)
 - [Implementation History](#implementation-history)
-
+
 ## Glossary

 - **Managed Kubernetes** - a Kubernetes service offered/hosted by a service provider where the control plane is run & managed by the service provider. As a cluster service consumer, you don’t have to worry about managing/operating the control plane machines. Additionally, the managed Kubernetes service may extend to cover running managed worker nodes. Examples are EKS in AWS and AKS in Azure. This is different from a traditional implementation in Cluster API, where the control plane and worker nodes are deployed and managed by the cluster admin.
 - **Unmanaged Kubernetes** - a Kubernetes cluster where a cluster admin is responsible for provisioning and operating the control plane and worker nodes. In Cluster API this traditionally means a Kubeadm bootstrapped cluster on infrastructure machines (virtual or physical).
 - **Managed Worker Node** - an individual Kubernetes worker node where the underlying compute (vm or bare-metal) is provisioned and managed by the service provider. This usually includes the joining of the newly provisioned node into a Managed Kubernetes cluster. The lifecycle is normally controlled via a higher level construct such as a Managed Node Group.
-- **Managed Node Group** - is a service that a service provider offers that automates the provisioning of managed worker nodes. Depending on the service provider this group of nodes could contain a fixed number of replicas or it might contain a dynamic pool of replicas that auto-scales up and down. Examples are Node Pools in GCP and EKS managed node groups.
+- **Managed Node Group** - is a service that a service provider offers that automates the provisioning of managed worker nodes. Depending on the service provider this group of nodes could contain a fixed number of replicas or it might contain a dynamic pool of replicas that auto-scales up and down. Examples are Node Pools in GCP and EKS managed node groups.
 - **Cluster Infrastructure Provider (Infrastructure)** - an Infrastructure provider supplies whatever prerequisites are necessary for creating & running clusters such as networking, load balancers, firewall rules, and so on. ([docs](../book/src/developer/providers/cluster-infrastructure.md))
 - **ControlPlane Provider (ControlPlane)** - a control plane provider instantiates a Kubernetes control plane consisting of k8s control plane components such as kube-apiserver, etcd, kube-scheduler and kube-controller-manager. ([docs](../book/src/developer/architecture/controllers/control-plane.md#control-plane-provider))
 - **MachineDeployment** - a MachineDeployment orchestrates deployments over a fleet of MachineSets, which is an immutable abstraction over Machines. ([docs](../book/src/developer/architecture/controllers/machine-deployment.md))

@@ -87,7 +87,9 @@ Cluster API was originally designed with unmanaged Kubernetes clusters in mind a

 Some Cluster API Providers (i.e. Azure with AKS first and then AWS with EKS) have implemented support for their managed Kubernetes services. These implementations have followed the existing documentation & contracts (that were designed for unmanaged Kubernetes) and have ended up with 2 different implementations.

-While working on supporting ClusterClass for EKS in Cluster API Provider AWS (CAPA), it was discovered that the current implementation of EKS within CAPA, where a single resource kind (AWSManagedControlPlane) is used for both ControlPlane and Infrastructure, is incompatible with ClusterClass (See the [issue](https://github.com/kubernetes-sigs/cluster-api/issues/6126)). Separation of ControlPlane and Infrastructure is expected for the ClusterClass implementation to work correctly.
+> _While working on supporting ClusterClass for EKS in Cluster API Provider AWS (CAPA), it was discovered that the current implementation of EKS within CAPA, where a single resource kind (AWSManagedControlPlane) is used for both ControlPlane and Infrastructure, is incompatible with ClusterClass (See the [issue](https://github.com/kubernetes-sigs/cluster-api/issues/6126)). Separation of ControlPlane and Infrastructure is expected for the ClusterClass implementation to work correctly._
+
+(Note: the above quoted, italicized text is no longer relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented.)

 The responsibilities between the CAPI control plane and infrastructure are blurred with a managed Kubernetes service like AKS or EKS. For example, when you create a EKS control plane in AWS it also creates infrastructure that CAPI would traditionally view as the responsibility of the cluster “infrastructure provider”.

@@ -235,7 +237,7 @@ type GCPManagedControlPlaneSpec struct {
 	// +optional
 	Network NetworkSpec `json:"network"`

-	// AddonsConfig defines the addons to enable with the GKE cluster.
+	// AddonsConfig defines the addons to enable with the GKE cluster.
 	// +optional
 	AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`

@@ -262,7 +264,9 @@ CAPA decided to represent an EKS cluster as a CAPI control-plane. This meant tha

 Initially CAPA had an infrastructure cluster kind that reported back the control plane endpoint. This required less than ideal code in its controller to watch the control plane and use its value of the control plane endpoint.

-As the infrastructure cluster kind only acted as a passthrough (to satisfy the contract with CAPI) it was decided that it would be removed and the control-plane kind (AWSManagedControlPlane) could be used to satisfy both the “infrastructure” and “control-plane” contracts. This worked well until ClusterClass arrived with its expectation that the “infrastructure” and “control-plane” are 2 different resource kinds.
+As the infrastructure cluster kind only acted as a passthrough (to satisfy the contract with CAPI) it was decided that it would be removed and the control-plane kind (AWSManagedControlPlane) could be used to satisfy both the “infrastructure” and “control-plane” contracts. _This worked well until ClusterClass arrived with its expectation that the “infrastructure” and “control-plane” are 2 different resource kinds._
+
+(Note: the above italicized text is no longer relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented.)

 Note that CAPZ had a similar discussion and an [issue](https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/1396) to remove AzureManagedCluster: AzureManagedCluster is useless; let's remove it (and keep AzureManagedControlPlane)
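
To illustrate the "passthrough" arrangement described above, here is a minimal, self-contained Go sketch (ours; the type names are illustrative and this is not CAPA's actual code): the infra cluster controller creates nothing and only mirrors the endpoint reported by the managed control plane, plus a ready flag, to satisfy the CAPI contract.

```go
package main

import "fmt"

// Illustrative types only. APIEndpoint mirrors the host/port shape used by
// the Cluster API contract.
type APIEndpoint struct {
	Host string
	Port int32
}

// ManagedControlPlane stands in for a resource like AWSManagedControlPlane,
// whose endpoint comes from the managed service (e.g. EKS).
type ManagedControlPlane struct {
	ControlPlaneEndpoint APIEndpoint
}

// ManagedCluster stands in for the passthrough infra cluster kind.
type ManagedCluster struct {
	ControlPlaneEndpoint APIEndpoint // required by the cluster-infrastructure contract
	Ready                bool
}

// reconcile is the entire "passthrough": copy the endpoint from the control
// plane and mark the infra cluster ready. No infrastructure is created.
func reconcile(cp *ManagedControlPlane, ic *ManagedCluster) {
	ic.ControlPlaneEndpoint = cp.ControlPlaneEndpoint
	ic.Ready = true
}

func main() {
	cp := &ManagedControlPlane{ControlPlaneEndpoint: APIEndpoint{Host: "example.eks.amazonaws.com", Port: 443}}
	ic := &ManagedCluster{}
	reconcile(cp, ic)
	fmt.Printf("ready=%v endpoint=%s:%d\n", ic.Ready, ic.ControlPlaneEndpoint.Host, ic.ControlPlaneEndpoint.Port)
}
```

This is why the text above calls the infra kind "a passthrough": its reconciler adds no provider behavior beyond relaying a value the control plane already knows.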

@@ -273,6 +277,7 @@ Note that CAPZ had a similar discussion and an [issue](https://github.com/kubern
 **Cons**

 - Doesn’t work with the current implementation of ClusterClass, which expects a separation of ControlPlane and Infrastructure.
+  - When CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented this con will no longer be true.
 - Doesn’t provide separation of responsibilities between creating the general cloud infrastructure for the cluster and the actual cluster control plane.
 - Managed Kubernetes looks different from unmanaged Kubernetes, where two separate kinds are used for a control plane and infrastructure. This would impact products building on top of CAPI.

@@ -331,6 +336,8 @@ type GCPManagedClusterSpec struct {
 - Need to maintain Infra cluster kind, which is a pass-through layer and has no other functions. In addition to the CRD, controllers, webhooks and conversion webhooks need to be maintained.
 - Infra provider doesn’t provision infrastructure and whilst it may meet the CAPI contract, it doesn’t actually create infrastructure as this is done via the control plane.

+Note: when CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented this option will no longer be relevant, as we can simply drop the InfraCluster altogether.
+
 #### Option 3: Two kinds with a Managed Control Plane and Managed Infra Cluster with Better Separation of Responsibilities

 This option more closely follows the original separation of concerns with the different CAPI provider types. With this option, 2 new resource kinds will be introduced:
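
To make the separation concrete, here is a rough sketch of the two kinds Option 3 introduces (ours; field sets are illustrative and loosely based on the GCP fragments quoted in this diff, not the actual CAPG API):

```go
package sketch

// GCPManagedClusterSpec would own the general cloud infrastructure in which
// the cluster lives: project, region, and network plumbing.
type GCPManagedClusterSpec struct {
	Project string      `json:"project"`
	Region  string      `json:"region"`
	Network NetworkSpec `json:"network"`
}

// GCPManagedControlPlaneSpec would own the managed control plane itself,
// e.g. the GKE cluster and its addons.
type GCPManagedControlPlaneSpec struct {
	// AddonsConfig defines the addons to enable with the GKE cluster.
	// +optional
	AddonsConfig *AddonsConfig `json:"addonsConfig,omitempty"`
}

// NetworkSpec and AddonsConfig are minimal placeholders so the sketch is
// self-contained.
type NetworkSpec struct {
	Name string `json:"name"`
}

type AddonsConfig struct {
	HTTPLoadBalancingEnabled bool `json:"httpLoadBalancingEnabled,omitempty"`
}
```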

@@ -435,11 +442,9 @@ The reasons for this recommendation are as follows:
 - The infra cluster provisions and manages the general infrastructure required for the cluster but not the control plane.
 - By having a separate infra cluster API definition, it allows differences in the API between managed and unmanaged clusters.

-Providers like CAPZ and CAPA have already implemented managed Kubernetes support and there should be no requirement on them to move to Option 3. Both Options 2 and 4 are solutions that would work with ClusterClass and so could be used if required.
-
-Option 1 is the only option that will not work with ClusterClass and would require a change to CAPI. Therefore this option is not recommended.
+Providers like CAPZ and CAPA have already implemented managed Kubernetes support and there should be no requirement on them to move to Option 3. Option 4 also works well with ClusterClass and so could be used if required. Option 2 will no longer be relevant once CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented.

-*** This means that CAPA will have to make changes to move away from Option 1 if it wants to support ClusterClass.
+Option 1 will be available when CAEP https://github.com/kubernetes-sigs/cluster-api/pull/8500 is implemented. Until then it is not recommended.

 ### Additional notes on option 3

New file (+207 lines)

@@ -0,0 +1,207 @@
---
title: Make Cluster Infra Resource Optional
authors:
  - "@jackfrancis"
reviewers:
  - "@richardcase"
  - "@pydctw"
  - "@mtougeron"
  - "@CecileRobertMichon"
  - "@fabriziopandini"
  - "@sbueringer"
  - "@killianmuldoon"
  - "@mboersma"
  - "@nojnhuh"
creation-date: 2023-04-07
last-updated: 2023-04-07
status: provisional
see-also:
  - "/docs/proposals/20220725-managed-kubernetes.md"
---

# Make Cluster Infra Resource Optional

## Table of Contents

A table of contents is helpful for quickly jumping to sections of a proposal and for highlighting
any additional information provided beyond the standard proposal template.
[Tools for generating](https://github.com/ekalinin/github-markdown-toc) a table of contents from markdown are available.

- [Make Cluster Infra Resource Optional](#make-cluster-infra-resource-optional)
  - [Table of Contents](#table-of-contents)
  - [Glossary](#glossary)
  - [Summary](#summary)
  - [Motivation](#motivation)
    - [Goals](#goals)
    - [Non-Goals](#non-goals)
    - [Future work](#future-work)
  - [Proposal](#proposal)
    - [User Stories](#user-stories)
      - [Story 1](#story-1)
      - [Story 2](#story-2)
      - [Story 3](#story-3)
      - [Story 4](#story-4)
      - [Story 5](#story-5)
      - [Story 6](#story-6)
    - [Requirements (Optional)](#requirements-optional)
      - [Functional Requirements](#functional-requirements)
        - [FR1](#fr1)
        - [FR2](#fr2)
      - [Non-Functional Requirements](#non-functional-requirements)
        - [NFR1](#nfr1)
        - [NFR2](#nfr2)
    - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    - [Security Model](#security-model)
    - [Risks and Mitigations](#risks-and-mitigations)
  - [Alternatives](#alternatives)
  - [Upgrade Strategy](#upgrade-strategy)
  - [Additional Details](#additional-details)
    - [Test Plan [optional]](#test-plan-optional)
    - [Graduation Criteria [optional]](#graduation-criteria-optional)
    - [Version Skew Strategy [optional]](#version-skew-strategy-optional)
  - [Implementation History](#implementation-history)

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

The following terms will be used in this document.

- `<Infra>Cluster`
  - When we say `<Infra>Cluster` we refer to any provider's infra-specific implementation of the Cluster API `Cluster` resource spec. When you see `<Infra>`, interpret that as a placeholder for any provider implementation. Some concrete examples of provider infra cluster implementations are Azure's CAPZ provider (e.g., `AzureCluster` and `AzureManagedCluster`), AWS's CAPA provider (e.g., `AWSCluster` and `AWSManagedCluster`), and Google Cloud's CAPG provider (e.g., `GCPCluster` and `GCPManagedCluster`). Rather than referencing any one of the preceding actual implementations of infra cluster resources, we prefer to generalize to `<Infra>Cluster` so that we don't suggest any provider-specific bias informing our conclusions.
- Managed Kubernetes
  - Managed Kubernetes refers to any Kubernetes Cluster provisioning and maintenance platform that is exposed by a service API. For example: [EKS](https://aws.amazon.com/eks/), [OKE](https://www.oracle.com/cloud/cloud-native/container-engine-kubernetes/), [AKS](https://azure.microsoft.com/en-us/products/kubernetes-service), [GKE](https://cloud.google.com/kubernetes-engine), [IBM Cloud Kubernetes Service](https://www.ibm.com/cloud/kubernetes-service), [DOKS](https://www.digitalocean.com/products/kubernetes), and many more throughout the Kubernetes Cloud Native ecosystem.
- _Kubernetes Cluster Infrastructure_
  - When we refer to _Kubernetes Cluster Infrastructure_ we aim to distinguish required environmental infrastructure (e.g., cloud virtual networks) in which a Kubernetes cluster resides as a "set of child resources" from the Kubernetes cluster resources themselves (e.g., virtual machines that underlie nodes, managed by Cluster API). Sometimes this is referred to as "BYO Infrastructure"; essentially, we are talking about **infrastructure that supports a Kubernetes cluster, but is not actively managed by Cluster API**. As we will see, this boundary is different when discussing Managed Kubernetes: more infrastructure resources are not managed by Cluster API when running Managed Kubernetes.
- e.g.
  - This just means "For example:"!

## Summary

We propose to make provider `<Infra>Cluster` resources optional in order to better represent Managed Kubernetes scenarios where all _Kubernetes Cluster Infrastructure_ is managed by the service provider, and not by Cluster API.
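
For context, a paraphrased excerpt (ours, not part of the proposal text) of the relevant references in Cluster API's `ClusterSpec`: both references are already optional pointers in the Go types, so the change proposed here is to the behavioral contract rather than to the data types.

```go
package v1beta1

import corev1 "k8s.io/api/core/v1"

// ClusterSpec (paraphrased excerpt; comments ours).
type ClusterSpec struct {
	// InfrastructureRef points to the provider-specific <Infra>Cluster.
	// Under this proposal, leaving it nil becomes a supported, first-class
	// configuration for Managed Kubernetes.
	// +optional
	InfrastructureRef *corev1.ObjectReference `json:"infrastructureRef,omitempty"`

	// ControlPlaneRef points to the provider's control plane resource
	// (e.g. AWSManagedControlPlane), which would then also be responsible
	// for reporting the API server endpoint.
	// +optional
	ControlPlaneRef *corev1.ObjectReference `json:"controlPlaneRef,omitempty"`
}
```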

## Motivation

The implementation of Managed Kubernetes scenarios by Cluster API providers occurred after the architectural design of Cluster API, and thus that design process did not consider these Managed Kubernetes scenarios as a user story. In practice, Cluster API's specification has allowed Managed Kubernetes solutions to emerge that aid running fleets of clusters at scale, with CAPA's `AWSManagedCluster` and CAPZ's `AzureManagedCluster` being notable examples. However, because these Managed Kubernetes solutions arrived after the Cluster API contract was defined, providers have not settled on a consistent rendering of how a "Service-Managed Kubernetes" specification fits into a "Cluster API-Managed Kubernetes" surface area.

One particular part of the existing Cluster API surface area that is inconsistent with most Managed Kubernetes user experiences is the accounting of the [Kubernetes API server](https://kubernetes.io/docs/concepts/overview/components/#kube-apiserver). In the canonical "self-managed" user story that Cluster API addresses, it is the provider implementation of Cluster API (e.g., CAPA) that is responsible for scaffolding the necessary _Kubernetes Cluster Infrastructure_ that is required in order to create the Kubernetes API server (e.g., a Load Balancer and a public IP address). This provider responsibility is declared in the `<Infra>Cluster` resource, and carried out via its controllers; this reconciliation is then synchronized with the parent `Cluster` Cluster API resource.
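
To make that contract concrete, here is a condensed sketch (ours, not from the proposal) of its endpoint-reporting portion; the `APIEndpoint` shape and the `controlPlaneEndpoint`/`ready` field names follow the published cluster-infrastructure contract, while the wrapping `InfraCluster*` type names are illustrative placeholders.

```go
package contract

// APIEndpoint is the host/port pair at which the cluster's API server is
// reachable; this shape follows the Cluster API contract.
type APIEndpoint struct {
	Host string `json:"host"`
	Port int32  `json:"port"`
}

// InfraClusterSpec stands in for any provider's <Infra>Cluster spec: once
// infrastructure exists, the provider reports the API server endpoint here,
// and Cluster API copies it up to the parent Cluster resource.
type InfraClusterSpec struct {
	// +optional
	ControlPlaneEndpoint APIEndpoint `json:"controlPlaneEndpoint,omitempty"`
}

// InfraClusterStatus stands in for any provider's <Infra>Cluster status;
// Ready signals that the cluster infrastructure has been provisioned.
type InfraClusterStatus struct {
	Ready bool `json:"ready"`
}
```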

Because there exist Managed Kubernetes scenarios that handle all _Kubernetes Cluster Infrastructure_ responsibilities themselves, Cluster API's requirement of an `<Infra>Cluster` resource leads to awkward implementation decisions, because in these scenarios there is no actual work for a Cluster API provider to do to scaffold _Kubernetes Cluster Infrastructure_.

### Goals

- Make `<Infra>Cluster` resources optional.
- Enable API Server endpoint reporting from a provider's Control Plane resource rather than from its `<Infra>Cluster` resource (see the sketch below).
- Ensure any changes to the current behavioral contract are backwards-compatible.
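
As a sketch of what the second goal could look like in practice (an assumption for illustration, not a settled API): the provider's control plane resource would surface the endpoint that the managed service returns, alongside a signal that the control plane is externally managed.

```go
package sketch

// All names below are illustrative assumptions, not an agreed contract.

// APIEndpoint mirrors the host/port shape used elsewhere in Cluster API.
type APIEndpoint struct {
	Host string `json:"host"`
	Port int32  `json:"port"`
}

// ManagedControlPlaneSpec shows a control plane resource reporting the API
// server endpoint itself (e.g. the endpoint returned by the EKS/AKS/GKE
// service API), removing the need for an <Infra>Cluster to relay it.
type ManagedControlPlaneSpec struct {
	// +optional
	ControlPlaneEndpoint APIEndpoint `json:"controlPlaneEndpoint,omitempty"`
}

// ManagedControlPlaneStatus signals readiness and that no CAPI-managed
// machines back this control plane.
type ManagedControlPlaneStatus struct {
	Ready                       bool `json:"ready"`
	ExternalManagedControlPlane bool `json:"externalManagedControlPlane,omitempty"`
}
```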

### Non-Goals

- Change the Cluster API data type specification.
- Introduce new "Managed Kubernetes" data types in Cluster API.

### Future Work

- Detailed documentation that references the flavors of Managed Kubernetes scenarios and how they can be implemented in Cluster API, with provider examples.

## Proposal

### User Stories

- Detail the things that people will be able to do if this proposal is implemented.
- Include as much detail as possible so that people can understand the "how" of the system.
- The goal here is to make this feel real for users without getting bogged down.

#### Story 1

As a cluster operator, I want to use Cluster API to provision and manage the lifecycle of a control plane that utilizes my service provider's managed Kubernetes control plane (e.g., EKS, AKS, GKE), so that I don’t have to worry about the management/provisioning of control plane nodes, and so I can take advantage of any value-add services offered by my cloud provider.

#### Story 2

As a cluster operator, I want to be able to provision both "unmanaged" and "managed" Kubernetes clusters from the same management cluster, so that I can support different requirements and use cases as needed whilst using a single operating model.

#### Story 3

As a Cluster API provider developer, I want guidance on how to incorporate a managed Kubernetes service into my provider, so that its usage is compatible with Cluster API architecture/features and its usage is consistent with other providers.

#### Story 4

As a Cluster API provider developer, I want to enable the ClusterClass feature for a Managed Kubernetes service, so that users can take advantage of an improved UX with ClusterClass-based clusters.

#### Story 5

As a cluster operator, I want to use Cluster API to provision and manage the lifecycle of worker nodes that utilize my cloud provider's managed instances (if it supports them), so that I don't have to worry about the management of these instances.

#### Story 6

As a service provider, I want to be able to offer Managed Kubernetes clusters by using CAPI, referencing my own managed control plane implementation that satisfies Cluster API contracts.

### Requirements (Optional)

Some authors may wish to use requirements in addition to user stories.
Technical requirements should be derived from user stories, and provide a trace from
use case to design, implementation and test case. Requirements can be prioritised
using the MoSCoW (MUST, SHOULD, COULD, WON'T) criteria.

The FR and NFR notation is intended to be used as cross-references across a CAEP.

The difference between goals and requirements is that between an executive summary
and the body of a document. Each requirement should be in support of a goal,
but narrowly scoped in a way that is verifiable or, ideally, testable.

#### Functional Requirements

TODO

##### FR1

TODO

##### FR2

TODO

#### Non-Functional Requirements

TODO

##### NFR1

TODO

##### NFR2

TODO

### Implementation Details/Notes/Constraints

- TODO

### Security Model

TODO

### Risks and Mitigations

- TODO

## Alternatives

TODO

## Upgrade Strategy

TODO

## Additional Details

### Test Plan [optional]

TODO

### Graduation Criteria [optional]

TODO

### Version Skew Strategy [optional]

TODO

## Implementation History

- [ ] 01/11/2023: Compile a Google Doc to organize thoughts prior to CAEP ([link here](https://docs.google.com/document/d/1rqzZfsO6k_RmOHUxx47cALSr_6SeTG89e9C44-oHHdQ/))
