The Insights Operator is an OpenShift cloud-native application based on the [Operator Framework](https://github.com/operator-framework).
The Operator Framework is a toolkit for managing other cloud-native applications.

Tip:
Try to install operator-sdk and generate a new operator using https://sdk.operatorframework.io/docs/building-operators/golang/quickstart/
You will see how much code is generated by operator-sdk by default and what is provided by default in an operator.

The main goal of the Insights Operator is to periodically gather anonymized data from applications in the cluster and upload it
to `cloud.redhat.com` for analysis.

The Insights Operator itself does not manage any applications; it only runs on the Operator Framework infrastructure.
As is the convention for operator applications, most of the code is structured in the `pkg` package, and `pkg/controller/operator.go`
hosts the Operator controller. Typically, operator controllers read configuration and start some periodic tasks.

15
## How Insights Operator reads configuration
16
-
In case of Insights Operator, configuration is a combination of file [config/pod.yaml](config/pod.yaml) and configuration stored in
16
+
In case of Insights Operator, configuration is a combination of file [config/pod.yaml](config/pod.yaml) and configuration stored in
17
17
Namespace openshift-config in secret support. In the secret support is the endpoint and interval. The secret doesn't exist by default,
18
18
but when exists it overrides default settings which IO reads from config/pod.yaml.
The support secret has:

- endpoint (where to upload to)
- interval (baseline for how often to gather and upload)
- httpProxy, httpsProxy, noProxy, optionally, to set a custom proxy which overrides the cluster proxy just for Insights Operator uploads
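For illustration, a hypothetical `support` secret with the keys listed above might look like this (the values, including the endpoint URL, are made up here; the real defaults come from `config/pod.yaml`):

```yaml
# Illustrative only -- key names match the list above, values are invented.
apiVersion: v1
kind: Secret
metadata:
  name: support
  namespace: openshift-config
type: Opaque
stringData:
  endpoint: https://upload.example.redhat.com/api/ingress   # where to upload to
  interval: 2h                                              # gather/upload baseline
  httpProxy: http://proxy.example.com:3128                  # optional custom proxy
```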
## How is Insights Operator scheduling gathering
A commonly used pattern in the Insights Operator is that a task is started as a goroutine and runs its own cycle of periodic actions.
These actions are mostly started from `operator.go`.
They usually use `wait.Until`, which runs a function periodically after a short delay until the end is signalled.
These main tasks are scheduled:

- Gatherer
### Scheduling of Gatherer
The Gatherer uses this logic to start gathering information from the cluster; it is handled in [periodic.go](pkg/controller/periodic/periodic.go).

So far there is only one Gatherer (called `clusterconfig`); it has several gather-functions, each collecting different data from the cluster.
The workflow of the gather-functions is managed by the Gatherer.
Only one Gatherer runs at a time, because there is only one Gatherer at the moment (i.e. concurrency can be added here when it is needed).
When the Insights Operator starts, there is an initial delay before the first `Gather` happens; after that, a `Gather` is initiated every interval. This is done by `periodicTrigger`.
`periodic.Run` handles the initial delay and starts the `periodicTrigger` like `go wait.Until(func() { c.periodicTrigger(stopCh) }, time.Second, stopCh)`.

`Gather` uses `ExponentialBackoff` to retry (up to the count specified in `status.GatherFailuresCountThreshold`) if a Gatherer returns any errors. These errors are mostly caused by a collected resource not being ready yet, so it cannot be collected right now and should be retried later.
It is important that all retries finish before the next gather period starts, so that there are no potential conflicts; the backoff is calibrated to take this into account.
Errors that occurred during a gather-function are logged in the metadata part of the Insights Operator archive (`insights-operator/gathers.json`).
### Scheduling and running of Uploader
`operator.go` starts a background task defined in `pkg/insights/insightsuploader/insightsuploader.go`. The insights uploader periodically checks whether there is any data to upload by calling the summarizer.
If no data to upload is found, the uploader continues with the next cycle.
The uploader cycle runs the `wait.Poll` function, which waits until the config changes or until it is time to upload. The time to upload is set by initialDelay.
If this is the first upload (the lastReportedTime from status is not set), the uploader uses `interval/8 + random(interval/8*2)` as the next upload time. This can be reset to 0, though, if it is safe to upload immediately. If an upload was already reported, the next upload interval is going to be `now - lastReported + interval + 1.2 Jitter`.
## Scheduling the ConfigObserver
Another background task started from the Observer comes from `pkg/config/configobserver/configobserver.go`. The observer creates the configObserver by calling `configObserver.New`, which sets the default observing interval to 5 minutes.
The Run method again runs `wait.Poll` every 5 minutes and reads both the support and pull-secret secrets.
## Scheduling diskpruner and what it does
By default, the Insights Operator Gather calls diskrecorder to save newly collected data in a new file, but it does not remove old ones. That is the task of the diskpruner. The Observer calls the `recorder.PeriodicallyPrune()` function. It again uses the wait.Until pattern and runs approximately after every second interval.
```
$ oc get co insights -o=json | jq '.status.conditions'
```
The status is updated by `pkg/controller/status/status.go`. Status has a background task which periodically updates
the Operator status from its internal list of Sources. Any component which wants to participate in the Operator's status adds a
SimpleReporter, which returns its actual Status. The SimpleReporter is defined in controllerstatus.
It relies on the fact that updateStatus is called at the start of the status cycle.

## How is Insights Operator using various Api Clients
Internally, the Insights Operator talks to the Kubernetes API server over HTTP REST queries. Each query is authenticated by a Bearer token.
To see an actual REST query being used, you can try:
But adding the Bearer token and creating the REST query is all handled automatically for us by using clients, which are generated, type-safe golang libraries,
like [github.com/openshift/client-go](github.com/openshift/client-go) or [github.com/kubernetes/client-go](github.com/kubernetes/client-go).
Both of these libraries are generated by automation, which specifies from which API repo and which API Group they are generated.

All clients are created near or at the place where they are going to be used; instead of the clients, the configs created from the KUBECONFIG envvar defined in the cluster are passed around.
The reason for doing this is that there are many clients, every one of which is cheap to create, and passing around the config is simple while also not changing much over time.
On the other hand, it is quite cumbersome to pass around a bunch of clients, the number of which changes by the day, with no benefit.
## How are the credentials used in clients
In the IO deployment [manifest](manifests/06-deployment.yaml), the service account operator is specified (serviceAccountName: operator). This is the account under which the Insights Operator runs, reads its configuration, and also reads the metrics.
To test whether the client has the right permissions, the command mentioned above with the verb, API and service account can be used.

Note: I was only able to test missing permissions on OCP 4.3; later versions seem to always pass fine. Maybe higher versions
don't have RBAC enabled.

Code: Example error returned from the API, in this case when getting the config from imageregistry.
## Gathering the data
### clusterconfig

When `periodic.go` calls the Gather method of the `clusterconfig` Gatherer, it is handled [here](https://github.com/openshift/insights-operator/blob/master/pkg/gather/clusterconfig/0_gatherer.go#L99).

The clusterconfig Gatherer starts each gather-function in its own separate goroutine with a dedicated channel to send back its results.
Each gather-function is its own separate entity; each creates its own clients using the configs present in the `Gatherer` object that was passed down as a parameter.
The gather-functions are further divided into 2 main parts:

1. the 'adapter-part' that is called by `Gatherer.Gather`, named `Gather<Something>`; it handles the creation of the clients and the communication with the `Gatherer`.
2. the 'core-part' that holds the actual logic of what to gather, named `gather<Something>`; the clients required for this are passed in as arguments by the 'adapter-part'.

Gather-functions are IO-bound and don't use much CPU, so giving each of them a goroutine doesn't stress the CPU but gives us an 'async' way of making REST calls, which improves performance greatly.

After starting the goroutines, the Gatherer starts monitoring the channels; when it receives a result it will:

- Store the received `record`s using the provided `record.Interface`'s `Record` method.
- Store some metadata about the gather-function.
- Collect the errors accordingly. Errors are accumulated over all the gather-functions and returned as one summed-up error.

Each result is stored into record.Item as a Marshalable item, using either the golang JSON marshaller or K8s API serializers. Those have to be explicitly registered in an init func. The record is created in the archive under its Name, specifying the full relative path including folders. The extension for a particular record file is defined by the GetExtension() func; most of them are "json" today, except metrics or id.

The `gatherFunctions` map is where all the gather-functions within the `clusterconfig` package are referenced.
Each has an id (the key in the map); these can be used to execute only a selection of the gather-functions (according to the default config, all gather-functions will be executed).
Furthermore, each gather-function is categorized as either:

- `important`, meaning if that gather-function has an error we will notify `periodic.go` about it, which will handle it accordingly.
- `failable`, meaning if that gather-function has an error we will just log it and add it to our metadata.

This is necessary as we are expanding into gathering data about resources that are not guaranteed to be present on the cluster. By default, if a resource is not present we shouldn't see an error, but it's better to be safe.
## Downloading and exposing Archive Analysis
After the successful upload of an archive, the progress-monitoring task starts. By default, it waits 1m before checking whether the results of the analysis of the archive (done by an external pipeline in cloud.redhat.com) are available. The report contains a LastUpdatedAt timestamp, which is used to verify whether the report has changed its state (for this cluster) since the last time. If there was no
update (yet), it retries the download of the analysis, because the data was uploaded but no analysis has been provided back yet.