MGMT-19980: TNA clusters #7402
base: master
Conversation
@giladravid16: This pull request references MGMT-19980 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: giladravid16. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
6bdb5e0 to 8005ed0
the worker's requirements to avoid issues mid-development).
- The HostTypeHardwareRequirementsWrapper structs used by the operators.

We will auto-assign the arbiter role to a host if the following are true:
Probably we need to assign it to the host with the lowest HW in case all 3 match the HW requirements to be masters?
My initial idea was to check for the lowest HW that passes the requirements, but then I saw that right now for masters we auto-select the first ones we find without comparing them to the other hosts (even if it's a total waste of resources). I assume that's because it can be difficult to determine which host is smallest (we would need to compare ClusterHostRequirementsDetails).
This enhancement will include the following:
- In the API we will:
  - Add a new host_role for arbiter.
  - Add a new high_availability_mode for TNA.
The high_availability_mode is deprecated, and we will use control_plane_count instead.
I'm not sure if 3 CP == 2 CP with 1 arbiter.
@danmanor WDYT?
The most explicit thing to do is add arbiter_count. I wouldn't use control_plane_count=2 alone for TNA, because one day OpenShift might support a 2-node cluster without an arbiter (this was demoed some time back).
That's why I thought of using high_availability_mode, in openshift-installer they set the ControlPlaneTopology to HighlyAvailableArbiter and InfrastructureTopology to HighlyAvailable for this reason.
As @gamli75 said, high_availability_mode is deprecated in the Assisted API.
I added arbiter_count instead
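A minimal Go sketch of the idea settled on in this thread: key the TNA topology off a dedicated arbiter_count rather than control_plane_count == 2, so a future two-node-without-arbiter cluster stays distinguishable. Struct and method names here are illustrative, not the actual assisted-service API.

```go
package main

import "fmt"

// Cluster is a hypothetical shape; field names mirror the proposed API
// parameters but are illustrative only.
type Cluster struct {
	ControlPlaneCount int
	ArbiterCount      int // proposed new field, defaults to 0
}

// IsTNA keys the topology off ArbiterCount rather than ControlPlaneCount,
// so control_plane_count=2 without an arbiter (a possible future topology)
// is not conflated with TNA.
func (c Cluster) IsTNA() bool {
	return c.ArbiterCount > 0
}

func main() {
	tna := Cluster{ControlPlaneCount: 2, ArbiterCount: 1}
	standard := Cluster{ControlPlaneCount: 3}
	fmt.Println(tna.IsTNA(), standard.IsTNA()) // true false
}
```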
  - Deprecate the existing parameters and create new ones for hyperthreading
    and disk encryption (they will be maps of host_role to bool).
  - Add a new feature_support_level_id for TNA.
- We will update the clusters' validations and transitions for TNA clusters.
Can you add more details about the validations that will be affected by this change?
I go into more detail in the Implementation Details section.
Our validations will need to be updated as follows:
- A host can only be assigned the arbiter role if the cluster is TNA.
- TNA clusters must have at least 1 arbiter node.
Can we add a host with an arbiter role on day 2? If not, let's add it here
I'd imagine not. Maybe add this as a non-goal.
I checked and it is allowed to add arbiter nodes as day2 nodes, if the cluster was installed as TNA.
If we want to handle it in day2 - let's add it to the enhancement
I added it as a goal
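The two validations quoted above, plus the day-2 allowance confirmed in this thread, can be sketched in Go. Function names are hypothetical; the real checks would live in assisted-service's host and cluster validators.

```go
package main

import (
	"errors"
	"fmt"
)

// validateArbiterRole: a host can only be assigned the arbiter role if the
// cluster is TNA. This applies on day 1 and day 2 alike, since day-2 arbiter
// hosts are allowed when the cluster was installed as TNA.
func validateArbiterRole(role string, clusterIsTNA bool) error {
	if role == "arbiter" && !clusterIsTNA {
		return errors.New("arbiter role is only valid on TNA clusters")
	}
	return nil
}

// validateArbiterCount: TNA clusters must have at least 1 arbiter node.
func validateArbiterCount(clusterIsTNA bool, arbiterHosts int) error {
	if clusterIsTNA && arbiterHosts < 1 {
		return errors.New("TNA clusters must have at least 1 arbiter node")
	}
	return nil
}

func main() {
	fmt.Println(validateArbiterRole("arbiter", false) != nil) // true: rejected on non-TNA
	fmt.Println(validateArbiterCount(true, 1) == nil)         // true: accepted
}
```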
### User Stories

- As a solutions architect for a retail organization, I want to deploy OpenShift
I'd rather see user stories associated with assisted installer here. I'm not sure how these are helpful for framing how we're going to implement this.
These are good to justify why OpenShift might want to allow this kind of topology, but that's not really relevant here.
@gamli75 are there any user stories of our customers wanting a topology like this?
Our validations will need to be updated as follows:
- A host can only be assigned the arbiter role if the cluster is TNA.
- TNA clusters must have at least 1 arbiter node.
Why at least? Are we expecting more than one?
The openshift-installer does not set a max limit on the number of arbiter nodes; it just has to be at least 1.
Our validations will need to be updated as follows:
- A host can only be assigned the arbiter role if the cluster is TNA.
- TNA clusters must have at least 1 arbiter node.
- TNA clusters can have 2-5 control plane nodes.
Similar question here. Why have an arbiter node if we have 3 or more CP nodes?
The openshift-installer allows it, so I assume we should also allow it.
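Taken together, the count constraints discussed in this thread could be validated as in the following Go sketch: at least one arbiter with no upper limit (matching the openshift-installer), and 2-5 control plane nodes. The function name is hypothetical.

```go
package main

import "fmt"

// validateTNACounts mirrors the constraints quoted above: at least one
// arbiter node (no maximum, like the openshift-installer) and between
// 2 and 5 control plane nodes.
func validateTNACounts(controlPlaneCount, arbiterCount int) error {
	if arbiterCount < 1 {
		return fmt.Errorf("TNA clusters must have at least 1 arbiter node, got %d", arbiterCount)
	}
	if controlPlaneCount < 2 || controlPlaneCount > 5 {
		return fmt.Errorf("TNA clusters can have 2-5 control plane nodes, got %d", controlPlaneCount)
	}
	return nil
}

func main() {
	fmt.Println(validateTNACounts(2, 1) == nil) // true: the canonical TNA layout
	fmt.Println(validateTNACounts(3, 2) == nil) // true: allowed, as the installer permits it
	fmt.Println(validateTNACounts(1, 1) == nil) // false: too few control planes
}
```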
A few drawbacks we have is that we will be creating a new variant of OpenShift
that implements a new unique way of doing HA for kubernetes. This does mean
This isn't a drawback of our implementation. This sounds like it's talking more about TNA generally.
This enhancement should focus on the changes in assisted installer rather than TNA as an OpenShift topology.
- Deprecate the existing parameters and create new ones for hyperthreading
  and disk encryption (they will be maps of host_role to bool).
Why?
Currently hyperthreading is one of ['masters', 'workers', 'none', 'all']. I think it's better to define a struct that has a boolean for each role than to add more combinations with arbiter.
Disk encryption is similar with its enable_on field; I would prefer adding a new field.
From what I can tell, hyperthreading will only affect the REST API, so that's easy. Disk encryption would break the k8s API, which we should avoid.
I recommend extending the possibilities to avoid breaking the API. You can add:
'masters,workers'
'masters,arbiter'
'workers,arbiter'
'masters,workers,arbiter'
'all' would now mean 'masters,workers,arbiter'.
Using 'masters,workers,arbiter' would be future-proof in case we add another type later.
Since it's problematic to deprecate the disk-encryption, I would rather also not deprecate the hyperthreading. Either way the API will have the extra enum values that I tried to avoid.
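The extend-don't-deprecate approach suggested in this thread could be interpreted as in the Go sketch below, where 'all' keeps working but now expands to 'masters,workers,arbiter'. The helper name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// hyperthreadingEnabledFor interprets the extended enum values suggested
// above ('masters,arbiter', 'workers,arbiter', 'masters,workers,arbiter', ...)
// without breaking the existing ones.
func hyperthreadingEnabledFor(value, role string) bool {
	if value == "none" {
		return false
	}
	if value == "all" {
		value = "masters,workers,arbiter" // 'all' now includes arbiter
	}
	for _, group := range strings.Split(value, ",") {
		// the enum uses plural 'masters'/'workers' but singular 'arbiter',
		// while host roles are singular, so compare with the 's' trimmed
		if group == role || strings.TrimSuffix(group, "s") == role {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hyperthreadingEnabledFor("masters,arbiter", "arbiter")) // true
	fmt.Println(hyperthreadingEnabledFor("all", "worker"))              // true
	fmt.Println(hyperthreadingEnabledFor("masters,arbiter", "worker"))  // false
}
```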
2571c9e to 4031a3b
4031a3b to 7ae70b4
/retest-required
@giladravid16: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
- In the host discovery we need to allow setting a host's role to be arbiter.
- In the cluster details:
  - we need to allow setting the arbiter_count, with zero as the default.
  - We need to allow setting control_plane_count to 2, but then force setting
@carbonin Will this be correct when we have the Two Node OpenShift with Fencing?
When we have TNF we'll need to add another parameter for it, and then the validation will need to check for either TNA or TNF. Right now only TNA is an option.
Right, I think this is fine for now when we only have TNA, but I don't think there's anything better we can do until we also need to handle TNF.
I suppose I take this back based on the conversation in the kickoff.
It might be better to just validate that we have a good combination rather than forcing arbiters to 1 since that's something we'd need to pull out once we have TNF.
This enhancement will include the following:
- In the API we will:
  - Add a new host_role for arbiter.
  - Add a new field for clusters named arbiter_count. The default value will
Based on the kickoff it would be better to not require this up front.
We should be able to infer when an arbiter would be required and do autoassign and validations accordingly after discovery.
### Implementation Details/Notes/Constraints

The enums for cluster's hyperthreading and disk encryption will be:
During kickoff it was raised that we may want to evaluate the usage of the disk encryption feature, but I'm not sure that really affects what we want to handle in this feature.
### Open Questions

N/A
We should add ABI concerns here. We should either ensure it works or disable it explicitly.
Same with kube-api for the first pass. We should disable this explicitly in the controllers if we're not going to support it on the first pass.
at the edge. They can support running 2 node clusters for redundancy but would like
the option to deploy a lower cost node as an arbiter to supply the 3 nodes for etcd quorum.

### Goals
Maybe add a goal for tracking this feature usage in metrics.
Enhancement to define the implementation of installing TNA clusters.
See OpenShift's enhancement for more details about TNA clusters.