Commit 054e021

fixup: Finalize README
1 parent 628804f commit 054e021

1 file changed

+114
-48
lines changed

tests/fixture/bootstrapmonitor/README.md

Lines changed: 114 additions & 48 deletions
@@ -14,6 +14,9 @@ regressions in compatibility.
1414

1515
### Types of bootstrap testing for C-Chain
1616

17+
The X-Chain and P-Chain always synchronize all state, but the bulk of
18+
data for testnet and mainnet is on the C-Chain and there are 2 options:
19+
1720
#### State Sync
1821

1922
A bootstrap with state sync enabled (the default) ensures that only
@@ -26,50 +29,103 @@ default) not all history will be stored.
2629

2730
To enable, supply `state-sync-enabled: false` as C-Chain configuration.
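As a minimal sketch of how this setting could be interpreted, the following Go function reads the `state-sync-enabled` key from a C-Chain configuration supplied as plain JSON and treats a missing key as enabled, matching the default described above. The type and function names are hypothetical, and any extra encoding applied to the configuration before it reaches the node is not handled here.

```go
package main

import "encoding/json"

// cChainConfig models only the single field of interest; the name of
// the type is hypothetical.
type cChainConfig struct {
	// A pointer distinguishes "key absent" from an explicit false.
	StateSyncEnabled *bool `json:"state-sync-enabled"`
}

// stateSyncEnabled reports whether the given C-Chain configuration
// (plain JSON) leaves state sync enabled. An empty configuration or a
// missing key is treated as enabled.
func stateSyncEnabled(configJSON string) (bool, error) {
	if configJSON == "" {
		return true, nil
	}
	var cfg cChainConfig
	if err := json.Unmarshal([]byte(configJSON), &cfg); err != nil {
		return false, err
	}
	if cfg.StateSyncEnabled == nil {
		return true, nil
	}
	return *cfg.StateSyncEnabled, nil
}
```

Calling `stateSyncEnabled` with `{"state-sync-enabled": false}` reports a full sync test, while `{}` or an empty string reports a state sync test.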
2831

29-
## Architecture TODO(marun) Rename
32+
## Overview
3033

31-
The intention of `bootstrap-monitor` is to enable a statefulset to
34+
The intention of `bootstrap-monitor` is to enable a `StatefulSet` to
3235
perform continuous bootstrap testing for a given avalanchego
33-
configuration. It ensures that a new testing pod either starts or
34-
resumes a test, and upon completion of a test, polls for a new image
35-
to test and initiates a new test when one is found.
36-
37-
- `bootstrap-monitor init` intended to run as init container of an avalanchego node
38-
- mounts the same data volume
39-
- if the image of the avalanchego container is tagged with `latest`
40-
- runs an avalanchego pod with `latest` to retrieve the
41-
associated image id
42-
- not possible to retrieve the image id from the running pod
43-
since it won't be populated until after the init container
44-
has finished execution
45-
- updates the managing stateful set with the image id
46-
- attempts to read an image name from a file on the data volume
47-
- if the file is not present or the image name differs from the image name of the pod
48-
- write the image name to a file on the data volume
49-
- clear the data volume
50-
- report that a new bootstrap test is starting for the current image
51-
- if the image name was present on disk and differs from the current image
52-
- report that a bootstrap test is being resumed
36+
configuration. It ensures that a testing pod either starts or resumes
37+
a test, and upon completion of a test, polls for a new image to test
38+
and initiates a new test when one is found.
39+
40+
- Both the `init` and `wait-for-completion` commands of the
41+
`bootstrap-monitor` binary are intended to run as containers of a
42+
pod alongside an avalanchego container. The pod is expected to be
43+
managed by a `StatefulSet` to ensure the pod is restarted on
44+
failure and that only a single pod runs at a time to avoid
45+
contention for the backing data volume. Both commands derive the
46+
configuration of a bootstrap test from the pod:
47+
- The network targeted by the test is determined by the value of
48+
the `AVAGO_NETWORK_NAME` env var set for the avalanchego
49+
container.
50+
- Whether state sync is enabled is determined by the value of the
51+
`AVAGO_CHAIN_CONFIG_CONTENT` env var set for the avalanchego
52+
container.
53+
- The image used by the test is determined by the image configured
54+
for the avalanchego container.
55+
- The versions of the avalanchego image used by the test are
56+
determined by the pod annotation with key
57+
`avalanche.avax.network/avalanchego-versions`.
58+
- When a bootstrap testing pod is inevitably rescheduled or
59+
restarted, the contents of the `PersistentVolumeClaim` configured
60+
by the managing `StatefulSet` will persist across pod restarts to
61+
allow resumption of the interrupted test.
62+
- Both the `init` and `wait-for-completion` commands of the
63+
`bootstrap-monitor` attempt to read serialized test details (namely
64+
the image used for the test and the start time of the test) from
65+
the same data volume used by the avalanchego node.
66+
- The `bootstrap-monitor init` command is intended to run as
67+
an init container of an avalanchego node and ensure that the ID of
68+
the image and its associated versions are recorded for the test and
69+
that the contents of the pod's data volume are either cleared for a
70+
new test or retained to enable resuming a previously started
71+
test. It accomplishes this by:
72+
- Mounting the same data volume as the avalanchego node
73+
- Reading bootstrap test configuration as described previously
74+
- Determining the image ID and versions for an image if the
75+
avalanchego image for the pod uses the `latest` tag. This will
76+
only need to be performed for the first pod that a bootstrap testing
77+
`StatefulSet` runs. Subsequent pods from the same `StatefulSet`
78+
should have an image qualified with its SHA and version details
79+
set by the previous test run's `wait-for-completion` pod.
80+
- A new pod will be started with the `latest` image to
81+
execute `avalanchego --versions-json` to determine the exact
82+
version of the image and update the `StatefulSet` managing the
83+
pod which will prompt a pod restart. This ensures both that a
84+
test result can be associated with a specific image SHA and the
85+
avalanchego versions (including commit hash) of the binary that
86+
the image provides.
87+
- A separate pod is used because the image ID of a non-init
88+
avalanchego container using a `latest`-tagged image is only
89+
available when that container runs rather than when an init container runs.
90+
- While it would be possible to add an init container running the
91+
same avalanchego image as the primary avalanchego container,
92+
have it run the version command, and then have the
93+
`bootstrap-monitor init` container read those results, the
94+
method of discovering the versions and image of the
95+
avalanchego image currently tagged with `latest` would still be
96+
required by the `wait-for-completion` command (described in a
97+
subsequent section) to enable discovery of a new image to
98+
test. It seemed preferable to have only a single way to
99+
discover image details.
100+
- Attempting to read the serialized test details from a file on the
101+
data volume. This file will not exist if the data volume has not
102+
been used before.
103+
- Comparing the image from the serialized test details to the image
104+
in the test configuration.
105+
- If the images differ (or the file was not present), the data
106+
volume is initialized for a new test:
107+
- The data volume is cleared
108+
- The image from the test configuration and the start time are written to the data volume
109+
- If the images are the same, the data volume is used as-is to
110+
enable resuming the in-progress test.
53111
- `bootstrap-monitor wait-for-completion` is intended to run as a
54-
sidecar of the avalanchego container and mount the same data volume read-only
55-
- every health check interval
56-
- checks the health of the node
57-
- logs the disk usage of the data volume
58-
- once the node is healthy
59-
- every image check interval
60-
- starts a pod with the avalanchego image tagged `latest` to find a new image to test
61-
- once a new image is found
62-
- updates the managing stateful set with the new image to prompt a new bootstrap test
112+
sidecar of the avalanchego container. It polls the health of the
113+
node container to detect when a bootstrap test has completed
114+
successfully, then polls for a new image to test and when one is
115+
found, updates the managing `StatefulSet` with that image to
116+
trigger the start of a new test. The process to detect a new image
117+
is the same as was described for the `init` command.
63118

64119
## Package details
65120

66-
| Filename | Purpose |
67-
|:----------------|:---------------------------------------------------------------|
68-
| common.go | Defines code common between init and wait |
69-
| init.go | Defines how a bootstrap test is initialized |
70-
| wait.go | Defines the loop that waits for completion of a bootstrap test |
71-
| cmd/main.go | The binary entrypoint for the bootstrap-monitor |
72-
| e2e/e2e_test.go | The e2e test that validates the bootstrap-monitor |
121+
| Filename | Purpose |
122+
|:-------------------------|:-----------------------------------------------------------------------|
123+
| bootstrap_test_config.go | Defines how the configuration for a bootstrap test is read from a pod. |
124+
| common.go | Defines code common between init and wait |
125+
| init.go | Defines how a bootstrap test is initialized |
126+
| wait.go | Defines the loop that waits for completion of a bootstrap test |
127+
| cmd/main.go | The binary entrypoint for the bootstrap-monitor |
128+
| e2e/e2e_test.go | The e2e test that validates the bootstrap-monitor |
73129

74130
## Supporting files
75131

@@ -86,16 +142,26 @@ to test and initiates a new test when one is found.
86142

87143
## Alternatives considered
88144

89-
### Run bootstrap tests on hosted github workers
145+
### Run bootstrap tests on github workers
90146

91-
- allow triggering / reporting to happen with github
92-
- but 5 day limit on job duration wouldn't probably wouldn't support full sync testing
147+
- Public github workers are not compatible with bootstrap testing due
148+
to the available storage of 30GB being insufficient for even state
149+
sync bootstrap.
150+
- Self-hosted github workers are not compatible with bootstrap testing
151+
due to the 5 day maximum duration for a job running on a self-hosted
152+
runner. State sync bootstrap usually completes within 5 days, but full
153+
sync bootstrap usually takes much longer.
93154

94155
### Adding a 'bootstrap mode' to avalanchego
95-
- with a --bootstrap-mode flag, exit on successful bootstrap
96-
- but using it without a controller would require using `latest` to
97-
ensure that the node version could change on restarts
98-
- but when using `latest` there is no way to avoid having pod
99-
restart preventing the completion of an in-process bootstrap
100-
test. Only by using a specific image tag will it be possible for
101-
a restarted pod to reliably resume a bootstrap test.
156+
157+
If avalanchego supported a `--bootstrap-mode` flag that exited on
158+
successful bootstrap, and a pod configured with this flag used an
159+
image with a `latest` tag, the pod would continuously bootstrap, exit,
160+
and restart with the current latest image. While appealingly simple,
161+
this approach doesn't directly support:
162+
163+
- a mechanism for resuming a long-running bootstrap. Given the
164+
expected duration of a bootstrap test, and the fact that a workload on
165+
Kubernetes is not guaranteed to run without interruption, a separate
166+
init process is suggested to enable resumption of an interrupted test.
167+
- a mechanism for reporting disk usage and duration of execution
