
MGMT-19980: TNA clusters #7402

Open: wants to merge 1 commit into base: master

Conversation

giladravid16
Contributor

Enhancement to define the implementation of installing TNA (Two Node OpenShift with Arbiter) clusters.
See OpenShift's enhancement for more details about TNA clusters.

List all the issues related to this PR

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 13, 2025
@openshift-ci-robot

openshift-ci-robot commented Mar 13, 2025

@giladravid16: This pull request references MGMT-19980 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.19.0" version, but no target version was set.

In response to this:

Enhancement to define the implementation of installing TNA clusters.
See openshift's enhancement for more details about TNA clusters.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 13, 2025

openshift-ci bot commented Mar 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giladravid16

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 13, 2025
@openshift-ci openshift-ci bot requested review from danmanor and rccrdpccl March 13, 2025 15:23
@giladravid16
Contributor Author

/cc @carbonin @eranco74

the worker's requirements to avoid issues mid-development).
- The HostTypeHardwareRequirementsWrapper structs used by the operators.

We will auto-assign the arbiter role to a host if the following are true:
Contributor

Probably we need to assign it to the host with the lowest HW in case all 3 match the HW requirements to be masters?

Contributor Author

My initial idea was to check for the lowest HW that passes the requirements, but then I saw that right now for masters we auto-select the first hosts we find without comparing them to the other hosts (even if it's a total waste of resources). I assume that's because it can be difficult to determine which host is smallest (we would need to compare ClusterHostRequirementsDetails).
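The first-found strategy described in this thread can be sketched roughly as follows. Everything here is illustrative: the Host struct, the requirement thresholds, and the function names are placeholders, not the actual assisted-service types or values.

```go
package main

import "fmt"

// Host is a simplified stand-in for a discovered host.
type Host struct {
	ID       string
	CPUCores int64
	RAMMiB   int64
}

// meetsMasterRequirements is a stand-in for the real per-role
// hardware-requirements check (ClusterHostRequirementsDetails);
// the thresholds are made up for illustration.
func meetsMasterRequirements(h Host) bool {
	return h.CPUCores >= 4 && h.RAMMiB >= 16*1024
}

// meetsArbiterRequirements assumes the arbiter role has lower
// requirements than a master.
func meetsArbiterRequirements(h Host) bool {
	return h.CPUCores >= 2 && h.RAMMiB >= 8*1024
}

// autoAssignRoles mirrors the first-found approach: take the first
// hosts that pass the master requirements, then give the arbiter role
// to a remaining host that passes the (lower) arbiter requirements,
// without searching for the globally smallest host.
func autoAssignRoles(hosts []Host, controlPlaneCount int) map[string]string {
	roles := make(map[string]string)
	masters := 0
	for _, h := range hosts {
		if masters < controlPlaneCount && meetsMasterRequirements(h) {
			roles[h.ID] = "master"
			masters++
		}
	}
	for _, h := range hosts {
		if _, assigned := roles[h.ID]; !assigned && meetsArbiterRequirements(h) {
			roles[h.ID] = "arbiter"
			break
		}
	}
	return roles
}

func main() {
	hosts := []Host{
		{ID: "a", CPUCores: 8, RAMMiB: 32 * 1024},
		{ID: "b", CPUCores: 8, RAMMiB: 32 * 1024},
		{ID: "c", CPUCores: 2, RAMMiB: 8 * 1024},
	}
	fmt.Println(autoAssignRoles(hosts, 2)) // map[a:master b:master c:arbiter]
}
```

Picking the smallest qualifying host instead would require a total ordering over ClusterHostRequirementsDetails, which is what the comment above says is hard to define.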

This enhancement will include the following:
- In the API we will:
- Add a new host_role for arbiter.
- Add a new high_availability_mode for TNA.
Contributor

The high_availability_mode field is deprecated, and we will use control_plane_count instead.
I'm not sure if 3 CP == 2 CP with 1 arbiter.
@danmanor WDYT?

Contributor

The most explicit thing to do is add arbiter_count. I wouldn't use control_plane_count=2 alone for TNA, because one day OpenShift might support a 2-node cluster without an arbiter (this was demoed some time back).

Contributor Author

That's why I thought of using high_availability_mode; in openshift-installer they set ControlPlaneTopology to HighlyAvailableArbiter and InfrastructureTopology to HighlyAvailable for this reason.

Contributor

As @gamli75 said, high_availability_mode is deprecated in the Assisted API.

Contributor Author

I added arbiter_count instead.

- Deprecate the existing parameters and create new ones for hyperthreading
and disk encryption (they will be maps of host_role to bool).
- Add a new feature_support_level_id for TNA.
- We will update the clusters' validations and transitions for TNA clusters.
Contributor

Can you add more details about the validations that will be affected by this change?

Contributor Author

I go into more detail in the Implementation Details section.


Our validations will need to be updated as follows:
- A host can only be assigned the arbiter role if the cluster is TNA.
- TNA clusters must have at least 1 arbiter node.
Contributor

Can we add a host with an arbiter role on day 2? If not, let's add it here

Member

I'd imagine not. Maybe add this as a non-goal.

Contributor Author

I checked, and adding arbiter nodes as day-2 nodes is allowed if the cluster was installed as TNA.

Contributor

If we want to handle it in day 2, let's add it to the enhancement.

Contributor Author

I added it as a goal.
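The validations discussed across these threads (arbiter role only on TNA clusters, at least one arbiter, 2-5 control plane nodes) could be sketched roughly as below. All type and function names here are hypothetical, not the actual assisted-service code; in particular, treating "arbiter_count > 0" as the TNA marker is an assumption based on this conversation.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a simplified stand-in for the cluster model.
type Cluster struct {
	ControlPlaneCount int
	ArbiterCount      int
	HostRoles         []string // "master", "arbiter", "worker"
}

// IsTNA assumes a cluster is Two Node with Arbiter when it requests arbiters.
func (c Cluster) IsTNA() bool { return c.ArbiterCount > 0 }

// validateTopology applies the three validations from the enhancement text.
func validateTopology(c Cluster) error {
	arbiters := 0
	for _, r := range c.HostRoles {
		if r == "arbiter" {
			arbiters++
		}
	}
	// A host can only be assigned the arbiter role if the cluster is TNA.
	if arbiters > 0 && !c.IsTNA() {
		return errors.New("arbiter role requires a TNA cluster")
	}
	if c.IsTNA() {
		// TNA clusters must have at least 1 arbiter node.
		if arbiters < 1 {
			return errors.New("TNA clusters must have at least 1 arbiter node")
		}
		// TNA clusters can have 2-5 control plane nodes.
		if c.ControlPlaneCount < 2 || c.ControlPlaneCount > 5 {
			return errors.New("TNA clusters can have 2-5 control plane nodes")
		}
	}
	return nil
}

func main() {
	tna := Cluster{ControlPlaneCount: 2, ArbiterCount: 1,
		HostRoles: []string{"master", "master", "arbiter"}}
	fmt.Println(validateTopology(tna)) // <nil>
}
```

Note there is deliberately no upper bound on arbiters here, matching the comment below that openshift-installer only requires at least one.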


### User Stories

- As a solutions architect for a retail organization, I want to deploy OpenShift
Member

I'd rather see user stories associated with assisted installer here. I'm not sure how these are helpful for framing how we're going to implement this.

These are good to justify why OpenShift might want to allow this kind of topology, but that's not really relevant here.

Contributor Author

@gamli75 are there any user stories of our customers wanting a topology like this?




Our validations will need to be updated as follows:
- A host can only be assigned the arbiter role if the cluster is TNA.
- TNA clusters must have at least 1 arbiter node.
Member

Why at least? Are we expecting more than one?

Contributor Author

The openshift-installer does not set a max limit on the number of arbiter nodes; there just has to be at least 1.

Our validations will need to be updated as follows:
- A host can only be assigned the arbiter role if the cluster is TNA.
- TNA clusters must have at least 1 arbiter node.
- TNA clusters can have 2-5 control plane nodes.
Member

Similar question here. Why have an arbiter node if we have 3 or more CP nodes?

Contributor Author

The openshift-installer allows it, so I assume we should also allow it.

Comment on lines 123 to 157
A few drawbacks we have: we will be creating a new variant of OpenShift
that implements a new, unique way of doing HA for Kubernetes. This does mean
Member

This isn't a drawback of our implementation. This sounds like it's talking more about TNA generally.

This enhancement should focus on the changes in assisted installer rather than TNA as an OpenShift topology.


Comment on lines 41 to 45
- Deprecate the existing parameters and create new ones for hyperthreading
and disk encryption (they will be maps of host_role to bool).
Contributor

Why?

Contributor Author
@giladravid16 giladravid16 Mar 19, 2025

Currently hyperthreading is one of ['masters', 'workers', 'none', 'all']. I think it's better to define a struct that has a boolean for each role than to add more combinations with arbiter.
Disk encryption is similar with its enable_on field; I would prefer adding a new field.

Contributor

From what I can tell, hyperthreading will only affect the REST API, so that's easy. Disk encryption would break the k8s API, which we should avoid.

I recommend extending the possibilities to avoid breaking the API. You can add:
'masters,workers'
'masters,arbiter'
'workers,arbiter'
'masters,workers,arbiter'

'all' would now mean 'masters,workers,arbiter'.
Using 'masters,workers,arbiter' would be future-proof in case we add another type later.

Contributor Author
@giladravid16 giladravid16 Mar 20, 2025

Since it's problematic to deprecate disk encryption, I would rather not deprecate hyperthreading either. Either way the API will have the extra enum values that I tried to avoid.
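The non-breaking extension proposed above (keep the existing values, add comma-separated combinations, and have 'all' cover masters, workers, and arbiters) might be parsed like this. The function name and the exact value spellings are assumptions for illustration, not the actual API definition.

```go
package main

import (
	"fmt"
	"strings"
)

// rolesEnabled expands a hyperthreading/disk-encryption-style enum value
// into per-role booleans. It accepts the existing single values plus the
// proposed comma-separated combinations such as "masters,arbiter".
func rolesEnabled(value string) map[string]bool {
	enabled := map[string]bool{"masters": false, "workers": false, "arbiters": false}
	switch value {
	case "none":
		return enabled
	case "all":
		// "all" now means masters, workers, and arbiters.
		for role := range enabled {
			enabled[role] = true
		}
		return enabled
	}
	for _, part := range strings.Split(value, ",") {
		switch strings.TrimSpace(part) {
		case "masters":
			enabled["masters"] = true
		case "workers":
			enabled["workers"] = true
		case "arbiter", "arbiters":
			enabled["arbiters"] = true
		}
	}
	return enabled
}

func main() {
	fmt.Println(rolesEnabled("masters,arbiter"))
}
```

Because old values like 'masters' and 'all' still resolve to a sensible role set, existing clients keep working, which is the point of extending rather than deprecating.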

@giladravid16 giladravid16 force-pushed the MGMT-19980 branch 2 times, most recently from 2571c9e to 4031a3b Compare March 20, 2025 09:38
@giladravid16
Contributor Author

/retest-required


openshift-ci bot commented Mar 26, 2025

@giladravid16: all tests passed!


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@carbonin
Member

This looks fine to me

@tsorya @avishayt @gamli75 do you want to take another look?

- In the host discovery we need to allow setting a host's role to be arbiter.
- In the cluster details:
- we need to allow setting the arbiter_count, with zero as the default.
- We need to allow setting control_plane_count to 2, but then force setting
Contributor

@carbonin Will this be correct when we have the Two Node OpenShift with Fencing?

Contributor Author

When we have TNF we'll need to add another parameter for it, and then the validation will need to check for either TNA or TNF. Right now only TNA is an option.

Member

Right, I think this is fine for now when we only have TNA, but I don't think there's anything better we can do until we also need to handle TNF.

Member

I suppose I take this back based on the conversation in the kickoff.

It might be better to just validate that we have a good combination rather than forcing arbiters to 1 since that's something we'd need to pull out once we have TNF.

This enhancement will include the following:
- In the API we will:
- Add a new host_role for arbiter.
- Add a new field for clusters named arbiter_count. The default value will
Member

Based on the kickoff it would be better to not require this up front.

We should be able to infer when an arbiter would be required and do autoassign and validations accordingly after discovery.


### Implementation Details/Notes/Constraints

The enums for cluster's hyperthreading and disk encryption will be:
Member

During kickoff it was raised that we may want to evaluate the usage of the disk encryption feature, but I'm not sure that really affects what we want to handle in this feature.


### Open Questions

N/A
Member

We should add ABI concerns here. We should either ensure it works or disable it explicitly.

Member

Same with kube-api for the first pass. We should disable this explicitly in the controllers if we're not going to support it on the first pass.

at the edge. They can support running 2 node clusters for redundancy but would like
the option to deploy a lower cost node as an arbiter to supply the 3 nodes for etcd quorum.

### Goals
Member

Maybe add a goal for tracking this feature usage in metrics.


6 participants