
Commit 3abe305

Merge pull request kubernetes-sigs#10651 from dhij/dhij/restructure-release-doc
📖 release: restructure release docs team roles
2 parents defa62d + 9ae96b5 commit 3abe305

9 files changed: +580 -564 lines changed

CHANGELOG/README.md

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@

 This folder contains release notes for past releases. Changes to this folder in the main branch trigger a GitHub Action that creates release tags and a draft release.

-See [release documentation](../docs/release/release-tasks.md) for more information.
+See [release documentation](../docs/release/release-team.md) for more information.

docs/release/release-tasks.md

Lines changed: 0 additions & 551 deletions
This file was deleted.

docs/release/release-team-onboarding.md

Lines changed: 2 additions & 2 deletions
@@ -25,7 +25,7 @@ through at the beginning of the cycle:
 - Kubernetes SIG membership:
   - Try to become an official member of the Kubernetes SIG, if possible. More information on the membership and requirements can be found [here](https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/release/release-team.md#cluster-api-release-team-vs-kuberneteskubernetes-sig-membership).
 - Familiarize yourself with the Release Process:
-  - Review the release [tasks document](../release/release-tasks.md) which explains the responsibilities and tasks for each role within the release team.
+  - Review the release [team roles](../release/release-team.md#team-roles) which explains the responsibilities and tasks for each role within the release team.
 - Check the Release Timeline:
   - Go through the [release timeline](../release/releases) of the release cycle you are involved in (i.e., check out `release-1.6.md` if you are part of the 1.6 cycle release team) to better understand the key milestones and deadlines.

@@ -44,7 +44,7 @@ Now, let's dive into the specific onboarding notes for each sub-team below.

 - Understand Release Process:
   - Get to know how the project's release process works.
-  - Walk through the [release note generation process](../release/release-tasks.md#create-pr-for-release-notes) and try to generate notes by yourself. This is the most important process the comms team is in charge of.
+  - Walk through the [release note generation process](../release/role-handbooks/communications/README.md#create-pr-for-release-notes) and try to generate notes by yourself. This is the most important process the comms team is in charge of.
   - Familiarize yourself with the release notes tool [code](https://github.com/kubernetes-sigs/cluster-api/tree/main/hack/tools/release). You'll probably need to update this code during the release cycle to cover new cases or add new features.
 - Documentation familiarity:
   - Explore the project's documentation and start learning how to update and maintain it.

docs/release/release-team.md

Lines changed: 12 additions & 8 deletions
@@ -1,3 +1,5 @@
+# Cluster API Release Team
+
 <!-- START doctoc generated TOC please keep comment here to allow auto update -->
 <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

@@ -19,8 +21,6 @@

 <!-- END doctoc generated TOC please keep comment here to allow auto update -->

-# Cluster API Release Team
-
 ## Overview

 In the past, releasing Cluster API has been mostly ad-hoc and relied on one or more contributors to do most of the chore work necessary to prepare the release. One of the major downsides of this approach is that it is often difficult for users and providers to plan around Cluster API releases as they often have little visibility around when a release will happen.
@@ -42,7 +42,7 @@ This document introduces the concept of a release team with the following goals

 Note that this document is intended to be a starting point for the release team. It is not a complete release process document.

-More details on the CAPI release process can be found in the [release cycle](./release-cycle.md) and [release task](./release-tasks.md) documentation.
+More details on the CAPI release process can be found in the [release cycle](./release-cycle.md) and the respective [role handbooks](./role-handbooks) documentation.

 ## Duration of Term

@@ -67,11 +67,15 @@ As noted above, making changes to the CAPI release cadence is out of scope for

 ## Team Roles

-- **Release Lead**: responsible for coordinating release activities, assembling the release team, taking ultimate accountability for all release tasks to be completed on time, and ensuring that a retrospective happens. The lead is also responsible for ensuring a successor is selected and trained for future release cycles.
-- **Communications/Docs/Release Notes Manager**: Responsible for communicating key dates to the community, improving release process documentation, and polishing release notes. Also responsible for ensuring the user-facing Netlify book and provider upgrade documentation are up to date.
-- **CI Signal/Bug Triage/Automation Manager**: Assumes the responsibility of the quality gate for the release and makes sure blocking issues and bugs are triaged and dealt with in a timely fashion. Helps improve release automation and tools.
-- **Team member**: Any Release Team lead or manager may select one or more additional members to help with their tasks. These team members will help fulfill future Release Team staffing requirements and continue to grow the CAPI community in general.
-*Note*: This is also documented in [Release tasks](./release-tasks.md) together with a mapping to specific tasks.
+**Notes**:
+
+* The examples in these documents are based on the v1.6 release cycle.
+
+| Role | Handbook |
+|---|---|
+| Release Lead | [Lead Handbook](role-handbooks/release-lead/README.md) |
+| CI Signal | [CI Signal Handbook](role-handbooks/ci-signal/README.md) |
+| Communications | [Communications Handbook](role-handbooks/communications/README.md) |

 ## Team repo permissions
 - Release notes (`CHANGELOG` folder)

docs/release/role-handbooks/ci-signal/README.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
# CI Signal/Bug Triage/Automation Manager

## Overview

* A task prefixed with `[Track]` must be tracked to completion by the people in the corresponding role, but they are not responsible for doing it themselves.

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->

- [Responsibilities](#responsibilities)
- [Tasks](#tasks)
  - [Setup jobs and dashboards for a new release branch](#setup-jobs-and-dashboards-for-a-new-release-branch)
  - [[Continuously] Monitor CI signal](#continuously-monitor-ci-signal)
    - [[Continuously] Reduce the amount of flaky tests](#continuously-reduce-the-amount-of-flaky-tests)
  - [[Continuously] Bug triage](#continuously-bug-triage)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Responsibilities

* Signal:
  * Responsibility for the quality of the release
  * Continuously monitor CI signal, so a release can be cut at any time
  * Add CI signal for new release branches
* Bug Triage:
  * Make sure blocking issues and bugs are triaged and dealt with in a timely fashion
* Automation:
  * Maintain and improve release automation, tooling & related developer docs

## Tasks

### Setup jobs and dashboards for a new release branch

The goal of this task is to have test coverage for the new release branch and to have its results show up in testgrid.
While we add test coverage for the new release branch, we will also drop the tests for old release branches if necessary.

1. Create new jobs based on the jobs running against our `main` branch:
    1. Copy the `main` branch entry as `release-1.6` in the `cluster-api-prowjob-gen.yaml` file in [test-infra](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes-sigs/cluster-api/).
    2. Modify the following in the `release-1.6` branch entry:
        * Change intervals (let's use the same as for `release-1.5`).
2. Create a new dashboard for the new branch in: `test-infra/config/testgrids/kubernetes/sig-cluster-lifecycle/config.yaml` (`dashboard_groups` and `dashboards`).
3. Remove old release branches and unused versions from the `cluster-api-prowjob-gen.yaml` file in [test-infra](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes-sigs/cluster-api/) according to our policy documented in [Support and guarantees](../../../../CONTRIBUTING.md#support-and-guarantees). For example, let's assume we just added `release-1.6`, then we can now drop test coverage for the `release-1.3` branch.
4. Regenerate the prowjob configuration by running the `make generate-test-infra-prowjobs` command from the cluster-api repository. Before running this command, make sure to export the `TEST_INFRA_DIR` variable so that it points to the location of the [test-infra](https://github.com/kubernetes/test-infra/) repository in your environment. For further information, refer to [this PR](https://github.com/kubernetes-sigs/cluster-api/pull/9937).

    ```sh
    TEST_INFRA_DIR=../../k8s.io/test-infra make generate-test-infra-prowjobs
    ```
5. Verify the jobs and dashboards a day later by taking a look at: `https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.6`
6. Update the following workflows for the currently supported branches: `.github/workflows/weekly-security-scan.yaml` (sets up Trivy and govulncheck scanning), `.github/workflows/weekly-md-link-check.yaml` (sets up link checking in the CAPI book), and `.github/workflows/weekly-test-release.yaml` (verifies that the release target is working).
7. Update the [PR markdown link checker](https://github.com/kubernetes-sigs/cluster-api/blob/main/.github/workflows/pr-md-link-check.yaml) accordingly (e.g. `main` -> `release-1.6`).
   <br>Prior art: [Update branch for link checker](https://github.com/kubernetes-sigs/cluster-api/pull/9206)

Prior art:

* [Add jobs for CAPI release 1.6](https://github.com/kubernetes/test-infra/pull/31208)
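
The preparation around steps 3 and 4 can be sketched in shell. This is a minimal sketch under stated assumptions, not part of the repository tooling: `check_test_infra_dir` and `branch_to_drop` are hypothetical helper names, the directory check simply mirrors the `config/jobs/kubernetes-sigs/cluster-api` path from the test-infra link above, and the default window of three branches with test coverage is inferred from the `release-1.6` / `release-1.3` example. The policy in "Support and guarantees" remains the source of truth.

```sh
#!/bin/sh
# Hypothetical helpers; neither function exists in the cluster-api repository.

# Succeeds only if the given path looks like a test-infra checkout that
# contains the Cluster API job configuration directory.
check_test_infra_dir() {
  dir="${1:-}"
  if [ -z "$dir" ]; then
    echo "TEST_INFRA_DIR is not set" >&2
    return 1
  fi
  if [ ! -d "$dir/config/jobs/kubernetes-sigs/cluster-api" ]; then
    echo "no Cluster API job config found under $dir" >&2
    return 1
  fi
}

# Prints the release branch whose test coverage can be dropped after adding
# release-1.<new_minor>. The default window of 3 branches is an assumption
# inferred from the example above (adding release-1.6 drops release-1.3).
branch_to_drop() {
  new_minor="$1"
  window="${2:-3}"
  echo "release-1.$((new_minor - window))"
}

branch_to_drop 6   # prints: release-1.3

if check_test_infra_dir "${TEST_INFRA_DIR:-}"; then
  echo "TEST_INFRA_DIR looks usable; run: make generate-test-infra-prowjobs"
fi
```

Failing early on a missing or wrong `TEST_INFRA_DIR` is a convenience only; the `make` target itself is unchanged.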

### [Continuously] Monitor CI signal

The goal of this task is to keep our tests running in CI stable.

**Note**: To be very clear, this is not meant to be an on-call role for Cluster API tests.

1. Add yourself to the [Cluster API alert mailing list](https://github.com/kubernetes/k8s.io/blob/151899b2de933e58a4dfd1bfc2c133ce5a8bbe22/groups/sig-cluster-lifecycle/groups.yaml#L20-L35)
   <br>**Note**: An alternative to the alert mailing list is manually monitoring the [testgrid dashboards](https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api)
   (also dashboards of previous releases). Using the alert mailing list has proven to be a lot less effort, though.
2. Subscribe to `CI Activity` notifications for the Cluster API repo.
3. Check the existing **failing-test** and **flaking-test** issue templates under the `.github/ISSUE_TEMPLATE/` folder of the repo, which are used to create issues for failing or flaking tests, respectively. Please make sure they are up-to-date and, if not, send a PR to update or improve them.
4. Check if there are any existing jobs that got stuck (have been running for more than 12 hours) in a ['pending'](https://prow.k8s.io/?repo=kubernetes-sigs%2Fcluster-api&state=pending) state:
    - If that is the case, notify the maintainers and ask them to manually cancel and re-run the stuck jobs.
5. Triage CI failures reported by mail alerts or found by monitoring the testgrid dashboards:
    1. Create an issue using an appropriate template (failing-test) in the Cluster API repository to surface the CI failure.
    2. Identify whether the issue is a known issue, a new issue, or a regression.
    3. Mark the issue as `release-blocking` if applicable.
6. Triage periodic GitHub Actions failures, paying special attention to image scan results;
   open issues as described above where needed.
7. Run periodic deep-dive sessions with the CI team to investigate failing and flaking tests. Example session recording: https://www.youtube.com/watch?v=YApWftmiDTg
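
For step 4, the "more than 12 hours" rule is simple enough to script once you have the pending jobs and their start times. The following is a rough sketch, assuming (hypothetically) that the jobs have been exported as lines of `<job-name> <start-time-in-Unix-epoch-seconds>`. How that list is obtained from Prow is out of scope here, and `stuck_jobs` as well as the job names and timestamps are made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: reads "<job-name> <start-epoch-seconds>" lines on stdin
# and prints the names of jobs that started more than 12 hours before "now".
stuck_jobs() {
  now="$1"
  awk -v now="$now" -v limit="$((12 * 3600))" \
    'NF >= 2 && now - $2 > limit { print $1 }'
}

# Example with made-up job names and timestamps; only the first job is older
# than 12 hours (43200 seconds) relative to the given "now".
printf '%s\n' \
  'pull-cluster-api-e2e-main 1700000000' \
  'pull-cluster-api-test-main 1700040000' |
  stuck_jobs 1700050000
# prints: pull-cluster-api-e2e-main
```

In day-to-day use the first argument would be the current time, for example `stuck_jobs "$(date +%s)"` on systems whose `date` supports `%s`.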

#### [Continuously] Reduce the amount of flaky tests

The Cluster API tests are pretty stable, but there are still some flaky tests from time to time.

To reduce the number of flakes, please periodically:

1. Take a look at recent CI failures via `k8s-triage`:
    * [main: e2e, e2e-mink8s, test, test-mink8s](https://storage.googleapis.com/k8s-triage/index.html?job=.*cluster-api.*(test%7Ce2e)-(mink8s-)*main&xjob=.*-provider-.*)
2. Open issues using an appropriate template (flaking-test) for occurring flakes and ideally fix them or find someone who can.

**Note**: Given resource limitations in the Prow cluster, it might not be possible to fix all flakes.
Let's just try to pragmatically keep the number of flakes pretty low.
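
The `k8s-triage` link above carries two patterns in its query string, `job` and `xjob` (`%7C` is the URL-encoded `|`). To check which job names such a pattern pair would select before bookmarking a variant of the link, the same regular expressions can be tried locally. This is only a sketch, assuming `job` acts as an include filter and `xjob` as an exclude filter; `triage_filter` and the job names are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: keeps job names matching the `job` pattern from the
# link above and drops those matching the `xjob` pattern.
triage_filter() {
  grep -E '.*cluster-api.*(test|e2e)-(mink8s-)*main' |
    grep -Ev '.*-provider-.*'
}

# Made-up job names, one per line.
printf '%s\n' \
  'periodic-cluster-api-e2e-main' \
  'periodic-cluster-api-test-mink8s-main' \
  'periodic-cluster-api-provider-aws-e2e-main' \
  'periodic-cluster-api-e2e-release-1-5' |
  triage_filter
# prints:
#   periodic-cluster-api-e2e-main
#   periodic-cluster-api-test-mink8s-main
```

The provider job is dropped by the exclude pattern, and the release-branch job is dropped because the include pattern requires the name to contain `main` after the test type.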

### [Continuously] Bug triage

The goal of bug triage is to triage incoming issues and, if necessary, flag them with `release-blocking`
and add them to the milestone of the current release.

We probably still have to figure out some details about the overlap between the bug triage task here, the release leads,
and the Cluster API maintainers.
