✨ feat: Add IRSA support for self-managed clusters (rebase) #5109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Conversation
Hi @sl1pm4t. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test

Force-pushed from 184576c to 986add5
/test pull-cluster-api-provider-aws-e2e

There is some investigation of the failing e2e test happening over here.

Force-pushed from 986add5 to d265b48
/retest-required

Force-pushed from d265b48 to 909a464
/test pull-cluster-api-provider-aws-e2e

Force-pushed from 909a464 to 8d30f71
/retest

Force-pushed from 8d30f71 to 540420c
/test pull-cluster-api-provider-aws-e2e-eks

@sl1pm4t do you need help with rebase?

No, I will be able to rebase. I had just forgotten this was needed. I'll try to do this in the next day or two.

Force-pushed from 540420c to 71a97d5
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing

I rebased and re-ran:

/test pull-cluster-api-provider-aws-e2e-eks

/test pull-cluster-api-provider-aws-e2e-eks

Both e2e tests have passed 🎉
@@ -37,12 +35,6 @@ func (b *S3Bucket) Validate() []*field.Error {
		errs = append(errs, field.Required(field.NewPath("spec", "s3Bucket", "name"), "can't be empty"))
	}

	// Feature gate is not enabled but ignition is enabled then send a forbidden error.
	if !feature.Gates.Enabled(feature.BootstrapFormatIgnition) {
Was this removed on purpose?
I'm not sure; this PR was based on work started by @luthermonson, so I don't have the full history.
I see the check still exists in the v1beta2 validation, so I'll revert this and do some tests.
feature/feature.go
Outdated
@@ -67,6 +67,9 @@ const (
	// BootstrapFormatIgnition will allow an user to enable alternate machine bootstrap format, viz. Ignition.
	BootstrapFormatIgnition featuregate.Feature = "BootstrapFormatIgnition"

	// OIDCProviderSupport will allow a user to enable OIDC provider support for kubeadm clusters.
	OIDCProviderSupport featuregate.Feature = "OIDCProviderSupport"
As we have OIDC provider support in EKS, should the feature gate be renamed so that it's explicitly non-EKS (i.e. unmanaged)?
I appreciate the comment says it's for kubeadm clusters, but I think we need to be clearer, as the gate name is used in the code.
Good point, I'll update this.
This feature flag has been renamed to OIDCProviderUnmanagedClusters
@@ -185,6 +185,11 @@ func (t Template) ControllersPolicy() *iamv1.PolicyDocument {
		"ec2:DeleteLaunchTemplateVersions",
		"ec2:DescribeKeyPairs",
		"ec2:ModifyInstanceMetadataOptions",
		"iam:CreateOpenIDConnectProvider",
As the OIDC feature flag is false by default, we may want to not include these extra permissions by default, and instead only add them if enabled via the config file.
This has been fixed in the latest commit.
controllers/awscluster_controller.go
Outdated
@@ -256,6 +267,14 @@ func (r *AWSClusterReconciler) reconcileDelete(ctx context.Context, clusterScope
		allErrs = append(allErrs, errors.Wrap(err, "error deleting network"))
	}

	if err := iamService.DeleteOIDCProvider(ctx); err != nil {
Does this need to be dependent on the feature flag?
I added a feature flag gate here, and in the ReconcileNormal
### 4 - Set Service Account Issuer URL in KubeadmControlPlane Configuration

The KubeadmControlPlane configuration should be updated to set the apiServer `service-account-issuer` argument to the S3 bucket's URL:
It might also be nice to create a new cluster flavor (i.e. template) in the templates directory.
A new cluster flavor has been added. There is one gotcha, though: because I've included a ClusterResourceSet for aws-pod-identity-webhook, I also had to add cert-manager to allow the former's webhooks to work. The cert-manager CRD blows out the kubectl last-applied annotation limit, so the manifest must be applied with the --server-side flag, e.g.:

clusterctl generate cluster capa-oidc-test --from=templates/cluster-template-oidc-provider-unmanaged.yaml | k apply -f - --server-side

I was tempted to make this a HelmChartProxy addon instead, but that introduces a dependency on CAAPH.
controllers/awscluster_controller.go
Outdated
@@ -352,6 +372,12 @@ func (r *AWSClusterReconciler) reconcileNormal(clusterScope *scope.ClusterScope)
	}
	conditions.MarkTrue(awsCluster, infrav1.S3BucketReadyCondition)

	if err := iamService.ReconcileOIDCProvider(context.TODO()); err != nil {
Should we wrap this with a test of the feature being enabled, to be explicit? I know the validating webhook shouldn't allow setting the associate field when the feature flag is disabled. Undecided about this if I'm honest ;)
Done
// 4. copy Service Account public signing JWKs to the s3 bucket.
func (s *Service) ReconcileOIDCProvider(ctx context.Context) error {
	if !s.scope.AssociateOIDCProvider() {
		return nil
It could be good to have a log entry, at a high log verbosity level, saying it's disabled, for debugging purposes.
I added:
log.V(4).Info("OIDC provider association is disabled, skipping reconciliation")
controllers/awsmachine_controller.go
Outdated
@@ -673,6 +673,50 @@ func (r *AWSMachineReconciler) reconcileOperationalState(ec2svc services.EC2Inte
		return err
	}

	// check if the remote kubeconfig works and annotate the cluster
	if _, ok := machineScope.InfraCluster.InfraCluster().GetAnnotations()[scope.KubeconfigReadyAnnotation]; !ok && machineScope.IsControlPlane() {
I'm not sure about this as the mechanism to decide if we carry on with the OIDC reconciliation.
Added a suggestion in the awscluster_controller.go
This has been removed.
controllers/awscluster_controller.go
Outdated
if err := iamService.ReconcileOIDCProvider(context.TODO()); err != nil {
	conditions.MarkFalse(awsCluster, infrav1.OIDCProviderReadyCondition, infrav1.OIDCProviderReconciliationFailedReason, clusterv1.ConditionSeverityError, "%s", err.Error())
	clusterScope.Error(err, "failed to reconcile OIDC provider")
	return reconcile.Result{RequeueAfter: 15 * time.Second}, nil
You could move `if err := iamService.ReconcileOIDCProvider(context.TODO()); err != nil {` to after `awsCluster.Status.Ready = true`, and then within `iamService.ReconcileOIDCProvider` instead return a ctrl.Result, so that you can say RequeueAfter if the downstream cluster isn't available yet. This would allow provisioning to continue for the control plane; then, when the cluster is accessible, `iamService.ReconcileOIDCProvider` would do the second part, deploy the config map, and return an empty ctrl.Result so there is no requeue.
The benefit is that this removes the need for the AWSMachines controller annotating AWSCluster, and keeps everything in the control of the AWSCluster controller.
Hopefully that all makes sense. If it doesn't, ping me on slack and we can chat or arrange a call.
I implemented your suggestion, with a slight modification. In the IAM reconciler I changed the way we build the jwks document, so that it can be done entirely locally. We already have the Service Account signing key in the `<cluster>-sa` secret, so I added logic to build the JWKS keyset in code instead of reaching across to the workload cluster to retrieve it. I lifted code from the kubernetes codebase to calculate key hashes etc., so the result is identical to what ends up installed in the cluster.
This means the IdentityProvider and S3 objects can be fully provisioned before the workload cluster is available, so I removed the ready check and pause, and therefore didn't end up returning a ctrl.Result.
I'm tempted to also remove the creation of the boilerplate-oidc-trust-policy ConfigMap on the remote cluster, because I'm not really sure what value it provides?!
@sl1pm4t do you need help with rebase?

Yes @STASiAN, I haven't had time to rebase and address the PR review comments.

I'm picking this back up today. Will hopefully be able to push changes to address the review comments this week.

That will be great @sl1pm4t, thank you. I was chatting with someone on Friday and they are really looking forward to this feature.
Force-pushed from 4456a29 to eeed105
/test pull-cluster-api-provider-aws-e2e

/retest-required

/test pull-cluster-api-provider-aws-e2e
@richardcase I think this is ready for re-review. I have rebased and I think I have addressed your review comments.

/test pull-cluster-api-provider-aws-e2e-eks
Fix s3 tests

Force-pushed from 406a61e to 4bc93b4
Co-authored-by: Richard Case <[email protected]>

- Make OIDC IAM permissions optional
- rename feature to distinguish from managed mode
- Rework OIDC reconcile to work mgmt side only
- Fixes and tidy up
- Add OIDC provider flavor
- use keyIDFromPublicKey from k8s code
- Gate OIDC reconcile behind feature flag.
- Revert unneeded awsmachine_controller test changes
- fix clusterawsadm template tests
- Fix lints & tests
- bump pod ID webhook to latest (v0.6.6)
- make generate
- Re-order reconcile steps. The Identity provider needs the S3 objects, so let's create them first and avoid flakey failures.
- Fixes after rebase
Force-pushed from 4bc93b4 to 809d4ed

/test pull-cluster-api-provider-aws-e2e

friendly ping
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds IRSA functionality to self-managed clusters. This brings self-managed clusters in line with managed (EKS) clusters, which provide most of this functionality out of the box.
With this PR, the following new resources are created:
- `<cluster_name>/.well-known/openid-configuration` - OpenID Connect discovery document
- `<cluster_name>/openid/v1/jwks` - Service Account signing public key

This is a continuation of an old unmerged PR #4094, with some fixes and some functionality removed to reduce the scope of the PR.
The functionality removed includes:
- Installing the amazon-pod-identity-webhook addon to the workload cluster. I felt there are already many ways to manage cluster addons, including ClusterResourceSets or CAAPH, and that it was unnecessary to install the addon via the controller, which then becomes an ongoing maintenance burden. Instead, the requirement for the addon has been added to the documentation.
- Setting the `service-account-issuer` argument through kubeadm patches. This is easily covered in the documentation and only requires a single line of config to be added to the AWSCluster resource; also, during testing I experienced issues with this being applied inconsistently, resulting in different values across the control plane nodes.

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #3560
Supersedes #4094
Special notes for your reviewer:
This PR adds a new ReconcileOIDCProvider to the AWSCluster reconciliation loop.
Checklist:
Release note: