@@ -14,6 +14,9 @@ regressions in compatibility.

  ### Types of bootstrap testing for C-Chain

+ The X-Chain and P-Chain always synchronize all state, but the bulk of
+ data for testnet and mainnet is on the C-Chain, where there are two options:
+
  #### State Sync

  A bootstrap with state sync enabled (the default) ensures that only
@@ -26,50 +29,103 @@ default) not all history will be stored.

  To enable, supply `state-sync-enabled: false` as C-Chain configuration.
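
As a rough illustration of how the `state-sync-enabled` setting could be interpreted, the sketch below parses a C-Chain JSON configuration and treats a missing key as "enabled", matching the default described above. The function name `stateSyncEnabled` is hypothetical and is not taken from the `bootstrap-monitor` package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stateSyncEnabled reports whether a C-Chain JSON configuration leaves state
// sync on. State sync is described as the default, so this sketch assumes an
// empty config or a missing key means enabled.
func stateSyncEnabled(cChainConfig string) (bool, error) {
	if cChainConfig == "" {
		return true, nil
	}
	var cfg struct {
		// A pointer distinguishes "key absent" from an explicit false.
		StateSyncEnabled *bool `json:"state-sync-enabled"`
	}
	if err := json.Unmarshal([]byte(cChainConfig), &cfg); err != nil {
		return false, fmt.Errorf("parsing C-Chain config: %w", err)
	}
	if cfg.StateSyncEnabled == nil {
		return true, nil
	}
	return *cfg.StateSyncEnabled, nil
}

func main() {
	for _, cfg := range []string{`{}`, `{"state-sync-enabled": false}`} {
		enabled, err := stateSyncEnabled(cfg)
		fmt.Println(cfg, "state sync enabled:", enabled, "err:", err)
	}
}
```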

- ## Architecture TODO(marun) Rename
+ ## Overview

- The intention of `bootstrap-monitor` is to enable a statefulset to
+ The intention of `bootstrap-monitor` is to enable a `StatefulSet` to
  perform continuous bootstrap testing for a given avalanchego
- configuration. It ensures that a new testing pod either starts or
- resumes a test, and upon completion of a test, polls for a new image
- to test and initiates a new test when one is found.
-
- - `bootstrap-monitor init` intended to run as init container of an avalanchego node
- - mounts the same data volume
- - if the image of the avalanchego container is tagged with `latest`
- - runs an avalanchego pod with `latest` to retrieve the
-   associated image id
- - not possible to retrieve the image id from the running pod
-   since it won't be populated until after the init container
-   has finished execution
- - updates the managing stateful set with the image id
- - attempts to read an image name from a file on the data volume
- - if the file is not present or the image name differs from the image name of the pod
- - write the image name to a file on the data volume
- - clear the data volume
- - report that a new bootstrap test is starting for the current image
- - if the image name was present on disk and differs from the current image
- - report that a bootstrap test is being resumed
+ configuration. It ensures that a testing pod either starts or resumes
+ a test, and upon completion of a test, polls for a new image to test
+ and initiates a new test when one is found.
+
+ - Both the `init` and `wait-for-completion` commands of the
+   `bootstrap-monitor` binary are intended to run as containers of a
+   pod alongside an avalanchego container. The pod is expected to be
+   managed by a `StatefulSet` to ensure the pod is restarted on
+   failure and that only a single pod runs at a time to avoid
+   contention for the backing data volume. Both commands derive the
+   configuration of a bootstrap test from the pod:
+   - The network targeted by the test is determined by the value of
+     the `AVAGO_NETWORK_NAME` env var set for the avalanchego
+     container.
+   - Whether state sync is enabled is determined by the value of the
+     `AVAGO_CHAIN_CONFIG_CONTENT` env var set for the avalanchego
+     container.
+   - The image used by the test is determined by the image configured
+     for the avalanchego container.
+   - The versions of the avalanchego image used by the test are
+     determined by the pod annotation with key
+     `avalanche.avax.network/avalanchego-versions`.
+ - When a bootstrap testing pod is inevitably rescheduled or
+   restarted, the contents of the `PersistentVolumeClaim` configured
+   by the managing `StatefulSet` persist to allow resumption of the
+   interrupted test.
+ - Both the `init` and `wait-for-completion` commands of
+   `bootstrap-monitor` attempt to read serialized test details (namely
+   the image used for the test and the start time of the test) from
+   the same data volume used by the avalanchego node.
+ - The `bootstrap-monitor init` command is intended to run as an init
+   container of an avalanchego node and to ensure that the ID of the
+   image and its associated versions are recorded for the test and
+   that the contents of the pod's data volume are either cleared for a
+   new test or retained to enable resuming a previously started
+   test. It accomplishes this by:
+   - Mounting the same data volume as the avalanchego node
+   - Reading bootstrap test configuration as described previously
+   - Determining the image ID and versions for an image if the
+     avalanchego image for the pod uses the `latest` tag. This only
+     needs to be performed for the first pod that a bootstrap testing
+     `StatefulSet` runs. Subsequent pods from the same `StatefulSet`
+     should have an image qualified with its SHA and version details
+     set by the previous test run's `wait-for-completion` pod.
+     - A new pod will be started with the `latest` image to
+       execute `avalanchego --versions-json` to determine the exact
+       version of the image and update the `StatefulSet` managing the
+       pod, which will prompt a pod restart. This ensures that a
+       test result can be associated with both a specific image SHA
+       and the avalanchego versions (including commit hash) of the
+       binary that the image provides.
+     - A separate pod is used because the image ID of a non-init
+       avalanchego container using a `latest`-tagged image is only
+       available when that container runs rather than when an init
+       container runs.
+     - While it would be possible to add an init container running the
+       same avalanchego image as the primary avalanchego container,
+       have it run the version command, and then have the
+       `bootstrap-monitor init` container read those results, the
+       method of discovering the versions and image of the
+       avalanchego image currently tagged with `latest` would still be
+       required by the `wait-for-completion` command (described in a
+       subsequent section) to enable discovery of a new image to
+       test. It seemed preferable to have only a single way to
+       discover image details.
+   - Attempting to read the serialized test details from a file on the
+     data volume. This file will not exist if the data volume has not
+     been used before.
+   - Comparing the image from the serialized test details to the image
+     in the test configuration.
+     - If the images differ (or the file was not present), the data
+       volume is initialized for a new test:
+       - The data volume is cleared
+       - The image from the test configuration and the start time are
+         written to the data volume
+     - If the images are the same, the data volume is used as-is to
+       enable resuming the in-progress test.
  - `bootstrap-monitor wait-for-completion` is intended to run as a
-   sidecar of the avalanchego container and mount the same data volume read-only
- - every health check interval
- - checks the health of the node
- - logs the disk usage of the data volume
- - once the node is healthy
- - every image check interval
- - starts a pod with the avalanchego image tagged `latest` to find a new image to test
- - once a new image is found
- - updates the managing stateful set with the new image to prompt a new bootstrap test
+   sidecar of the avalanchego container. It polls the health of the
+   node container to detect when a bootstrap test has completed
+   successfully, then polls for a new image to test and, when one is
+   found, updates the managing `StatefulSet` with that image to
+   trigger the start of a new test. The process to detect a new image
+   is the same as described for the `init` command.

  ## Package details

- | Filename        | Purpose                                                        |
- | :-------------- | :------------------------------------------------------------- |
- | common.go       | Defines code common between init and wait                      |
- | init.go         | Defines how a bootstrap test is initialized                    |
- | wait.go         | Defines the loop that waits for completion of a bootstrap test |
- | cmd/main.go     | The binary entrypoint for the bootstrap-monitor                |
- | e2e/e2e_test.go | The e2e test that validates the bootstrap-monitor              |
+ | Filename                 | Purpose                                                               |
+ | :----------------------- | :-------------------------------------------------------------------- |
+ | bootstrap_test_config.go | Defines how the configuration for a bootstrap test is read from a pod |
+ | common.go                | Defines code common between init and wait                             |
+ | init.go                  | Defines how a bootstrap test is initialized                           |
+ | wait.go                  | Defines the loop that waits for completion of a bootstrap test        |
+ | cmd/main.go              | The binary entrypoint for the bootstrap-monitor                       |
+ | e2e/e2e_test.go          | The e2e test that validates the bootstrap-monitor                     |

  ## Supporting files

@@ -86,16 +142,26 @@ to test and initiates a new test when one is found.

  ## Alternatives considered

- ### Run bootstrap tests on hosted github workers
+ ### Run bootstrap tests on GitHub workers

- - allow triggering / reporting to happen with github
- - but 5 day limit on job duration wouldn't probably wouldn't support full sync testing
+ - Public GitHub workers are not compatible with bootstrap testing
+   because the 30GB of available storage is insufficient for even a
+   state sync bootstrap.
+ - Self-hosted GitHub workers are not compatible with bootstrap testing
+   due to the 5-day maximum duration for a job running on a self-hosted
+   runner. State sync bootstrap usually completes within 5 days, but full
+   sync bootstrap usually takes much longer.

  ### Adding a 'bootstrap mode' to avalanchego
- - with a --bootstrap-mode flag, exit on successful bootstrap
- - but using it without a controller would require using `latest` to
-   ensure that the node version could change on restarts
- - but when using `latest` there is no way to avoid having pod
-   restart preventing the completion of an in-process bootstrap
-   test. Only by using a specific image tag will it be possible for
-   a restarted pod to reliably resume a bootstrap test.
+
+ If avalanchego supported a `--bootstrap-mode` flag that exited on
+ successful bootstrap, and a pod configured with this flag used an
+ image with a `latest` tag, the pod would continuously bootstrap, exit,
+ and restart with the current latest image. While appealingly simple,
+ this approach doesn't directly support:
+
+ - a mechanism for resuming a long-running bootstrap. Given the
+   expected duration of a bootstrap test, and the fact that a workload on
+   Kubernetes is not guaranteed to run without interruption, a separate
+   init process is suggested to enable resumption of an interrupted test.
+ - a mechanism for reporting disk usage and duration of execution