docs/arch.md
Insights Operator does not manage any application. As usual with operator applications, most of the code is structured in the `pkg` package, and `pkg/controller/operator.go` hosts the operator controller. Typically, operator controllers read configuration and start periodic tasks.
## How the Insights operator reads configuration
The Insights Operator's configuration is a combination of the file [config/pod.yaml](../config/pod.yaml) (basically the default configuration hardcoded in the image) and the configuration stored in either the `support` secret in the `openshift-config` namespace or the `insights-config` configmap in the `openshift-insights` namespace. Neither the secret nor the configmap exists by default, but when they do, they override the default settings which IO reads from `config/pod.yaml`. The configmap takes precedence over the secret.
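The precedence just described can be sketched with a tiny helper; `pick` is a hypothetical illustration of the override order, not code from the operator:

```go
package main

import "fmt"

// pick illustrates the override order described above: a value from the
// insights-config configmap wins over the support secret, which wins over
// the config/pod.yaml default. (Hypothetical helper, not the operator's code.)
func pick(fromConfigMap, fromSecret, fromPodYAML string) string {
	if fromConfigMap != "" {
		return fromConfigMap
	}
	if fromSecret != "" {
		return fromSecret
	}
	return fromPodYAML
}

func main() {
	// Only the pod.yaml default is set, so the default wins.
	fmt.Println(pick("", "", "2h")) // 2h
	// The secret overrides the default; the configmap overrides both.
	fmt.Println(pick("1h", "3h", "2h")) // 1h
}
```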
The `insights-config` configmap provides the following configuration structure:
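For illustration, a minimal `insights-config` configmap might look as follows. This is a sketch assembled from the keys listed below (`dataReporting/interval`, `dataReporting/uploadEndpoint`); the exact data key and nesting may differ from the operator's actual schema:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: insights-config
  namespace: openshift-insights
data:
  config.yaml: |
    dataReporting:
      interval: 2h
      uploadEndpoint: https://console.redhat.com/api/ingress/v1/upload
```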
The `support` secret provides the following configuration attributes:
- `endpoint` - the upload endpoint. Overridden by `dataReporting/uploadEndpoint` from the configmap. Default is `https://console.redhat.com/api/ingress/v1/upload`.
- `interval` - the data gathering & uploading frequency. Overridden by `dataReporting/interval` from the configmap. Default is `2h`.
- `httpProxy`, `httpsProxy`, `noProxy` - optionally set a custom proxy, which overrides the cluster proxy just for the Insights Operator. Overridden by `proxy/httpProxy`, `proxy/httpsProxy` and `proxy/noProxy`, respectively, from the configmap.
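As a sketch, a `support` secret overriding two of these attributes could be created from a manifest like the following (the values shown are the documented defaults):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: support
  namespace: openshift-config
type: Opaque
stringData:
  endpoint: https://console.redhat.com/api/ingress/v1/upload
  interval: 2h
```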
```shell script
oc get secret support -n openshift-config -o=json | jq -r .data.endpoint | base64 -d
```
```shell
https://console.redhat.com/api/ingress/v1/upload
```
```shell script
oc get secret support -n openshift-config -o=json | jq -r .data.interval | base64 -d
```
```shell
2h
```
Internally the `configObserver` has an array of subscribers, so all of them will get the signal.
## How the Insights operator schedules tasks
A commonly used pattern in the Insights Operator is that a task runs as a goroutine and performs its own cycle of periodic actions.
These actions are mostly started from `operator.go`. They usually use `wait.Until`, which runs a function periodically, after a short delay, until the end is signalled.
The main scheduled tasks are:
- Gatherer
- Uploader
- Downloader (Report gatherer)

### Conditional gatherer

Defined in [conditional_gatherer.go](../pkg/gatherers/conditional/conditional_gatherer.go). This gatherer is run regularly (2h by default), but it only gathers some data when a corresponding condition is met. The conditions and corresponding gathering functions are defined in an external service (https://console.redhat.com/api/gathering/gathering_rules). A typical example of a condition is when an alert is firing. This also means that this gatherer relies on the availability of Prometheus metrics and alerts.
The data from this gatherer is stored under the `/conditional` directory in the archive.
The functionality of this gatherer was extended in version 4.17. The value of `conditionalGathererEndpoint` was updated and the endpoint serves updated content. The main addition is the `container_logs` field in the content provided by the external service. This field was added during the implementation of the [Rapid Recommendations](https://github.com/openshift/enhancements/blob/master/enhancements/insights/rapid-recommendations.md) OpenShift enhancement proposal, and it contains an array of so-called container log requests. A container log request might look as follows:
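For illustration only (the field values below are invented; the field names come from the description that follows):

```json
{
  "namespace": "openshift-monitoring",
  "pod_name_regex": "prometheus-k8s-.*",
  "previous": false,
  "messages": [
    "error.*timed out",
    "failed to scrape"
  ]
}
```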
The `namespace` attribute defines the namespace name, `pod_name_regex` defines a regular expression to match Pod names in the given namespace, and `messages` defines a list of regular expressions to filter all the matching container logs. There is one optional attribute, `previous`, saying whether you want to filter the log of a previous container.
## Downloading and exposing Insights Analysis
After every successful upload of an archive, the operator waits (see the `reportPullingDelay` config attribute) and then tries to download the latest Insights analysis of the latest archive (created by the Insights pipeline in `console.redhat.com`). The report is verified by checking the `LastCheckedAt` timestamp (see the corresponding config attribute). The successfully downloaded Insights report is parsed and the number of
recommendations is exposed via the `health_statuses_insights` Prometheus metric.
Code: Example of reported metrics:
```prometheus
# HELP health_statuses_insights [ALPHA] Information about the cluster health status as detected by Insights tooling.
```

> The alerts are defined [here](../manifests/08-prometheus_rule.yaml)
### Scheduling and running of Uploader
`operator.go` starts a background task defined in `pkg/insights/insightsuploader/insightsuploader.go`. The Insights uploader periodically checks if there is any data to upload. If no data is found, the uploader continues with the next cycle.
The uploader triggers the `wait.Until` function, which waits until the configuration changes or it is time to upload. After the operator starts, there is some waiting time before the very first upload, defined by `initialDelay`. If no error occurred while sending the POST request, the next upload time is set by `wait.Jitter(interval, 1.2)`, where `interval` is the gathering interval.
## How Uploader authenticates to console.redhat.com
The HTTP communication with the external service (e.g. uploading the Insights archive or downloading the Insights analysis) is defined in the [insightsclient package](../pkg/insights/insightsclient/). The HTTP transport is encrypted with TLS (see the `clientTransport()` function defined in `pkg/insights/insightsclient/insightsclient.go`). This function (and the `prepareRequest` function) uses `pkg/authorizer/clusterauthorizer.go` to respect the proxy settings and to authorize the requests (i.e. add the authorization header with the respective token value). The user-defined certificates in `/var/run/configmaps/trusted-ca-bundle/ca-bundle.crt` are taken into account (see the cluster-wide proxy settings in the [OCP documentation](https://docs.openshift.com/container-platform/latest/networking/enable-cluster-wide-proxy.html)).
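A simplified sketch of what such a transport setup involves, using only the standard library (the real logic in `clientTransport()` is more involved; `newTransport` is a hypothetical helper):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
)

// newTransport sketches what a clientTransport-style helper has to set up:
// proxy resolution plus a root CA pool extended with user-defined certificates
// (e.g. the mounted /var/run/configmaps/trusted-ca-bundle/ca-bundle.crt contents).
func newTransport(caBundlePEM []byte) *http.Transport {
	pool, err := x509.SystemCertPool()
	if err != nil || pool == nil {
		pool = x509.NewCertPool()
	}
	// Append the user-defined CA certificates, if any were provided.
	pool.AppendCertsFromPEM(caBundlePEM)
	return &http.Transport{
		Proxy:           http.ProxyFromEnvironment, // respect proxy settings
		TLSClientConfig: &tls.Config{RootCAs: pool},
	}
}

func main() {
	t := newTransport(nil)
	fmt.Println("TLS configured:", t.TLSClientConfig.RootCAs != nil)
}
```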
## Summarising the content before upload
The summarizer is defined in `pkg/recorder/diskrecorder/diskrecorder.go` and merges all existing archives. That is, it merges together all archives with names matching the pattern `insights-*.tar.gz` which weren't removed and which are newer than the last check time. A `mergeReader` then takes one file after another and adds all of them to the archive under their paths.
If the file names are unstable (for example, when reading from an API with a limit and reaching that limit), it could merge together more files than the API limit specifies.
## Scheduling the ConfigObserver
Another background task comes from `pkg/config/configobserver/configobserver.go`. The observer creates a `configObserver` by calling `configObserver.New`, which sets the default observation interval to 5 minutes.
The `Start` method then runs `wait.Until` every 5 minutes and reads both the `support` and `pull-secret` secrets.
## Scheduling diskpruner and what it does
By default, the Insights Operator gather task calls diskrecorder to save newly collected data in a new file, but it doesn't remove old files. That is the task of the diskpruner. The observer calls the `recorder.PeriodicallyPrune()` function, which again uses the `wait.Until` pattern and runs approximately every second interval.
Internally, it calls `diskrecorder.Prune` with `maxAge = interval*6*24` (with a 2h interval this is 12 days); everything older is removed from the archive path (by default `/tmp/insights-operator`).
## How the Insights operator sets operator status
The operator status is based on K8s [Pod conditions](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions).
Code: What the Insights Operator status conditions look like:
```shell script
oc get co insights -o=json | jq '.status.conditions'
```
```json
[
{
"reason": "AsExpected",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2024-09-09T12:16:20Z",
"reason": "AsExpected",
"status": "True",
"type": "RemoteConfigurationValid"
},
{
"lastTransitionTime": "2024-09-09T12:16:20Z",
"reason": "AsExpected",
"status": "True",
"type": "RemoteConfigurationAvailable"
}
]
```
A condition is defined by its type. You may notice that there are some non-standard clusteroperator conditions. They are:
- `SCAAvailable` - based on the SCA (Simple Content Access) controller in `pkg/ocm/sca/sca.go`; provides information about the status of downloading the SCA entitlements.
- `ClusterTransferAvailable` - based on the cluster transfer controller in `pkg/ocm/clustertransfer/cluster_transfer.go`; provides information about the availability of cluster transfers.
- `Disabled` - indicates whether data gathering is disabled or enabled. Note that when the operator is `Disabled=True`, it is still also `Available=True`. This is strange at first glance, but the Cluster Version Operator (CVO) checks that all the clusteroperators are `Available=True` during the OpenShift installation. If they are not, the installation fails, which would happen in disconnected environments/clusters, where the Insights Operator is usually `Disabled=True` (because there is no `cloud.openshift.com` token in the `pull-secret`). You can find more about this topic in:
- https://pkg.go.dev/github.com/openshift/api/config/v1#ClusterStatusConditionType - note that when the Insights Operator is `Disabled=True`, it does not require immediate administrator intervention (and thus it still reports `Available=True`), i.e. nobody should be paged in this situation
- `RemoteConfigurationAvailable` - refers to the remote configuration (originally known as gathering conditions) provided by the external service (see the [Conditional gatherer](#conditional-gatherer)). This condition tells whether the endpoint was available (HTTP 200 status code) or not.
- `RemoteConfigurationValid` - refers to the remote configuration provided by the external service (see the [Conditional gatherer](#conditional-gatherer)). This condition tells whether the content read from the endpoint is valid JSON that can be parsed by the operator.
In addition to the above clusteroperator conditions, there are some intermediate clusteroperator conditions. These are:
- `UploadDegraded` - this condition occurs when an upload of the Insights data fails (if the number of upload attempts is equal to or greater than 5, the operator is marked as **Degraded**). An example:
```json
{
"lastTransitionTime": "2022-05-18T10:12:23Z"
}
```
- `InsightsDownloadDegraded` - this condition occurs when a download of the Insights analysis fails. An example:
```json
{
"lastTransitionTime": "2022-05-18T10:17:49Z"
}
```

the operator status from its internal list of sources. Any component which wants
SimpleReporter, which returns its actual status. The Simple reporter is defined in `controllerstatus`.
Code: In `operator.go`, components add their reporters to the status sources:
```go
statusReporter.AddSources(uploader)
```

This periodic status updater calls `updateStatus`, which sets the operator status.
The uploader's `updateStatus` determines whether it is safe to upload by checking that the cluster operator status is healthy. It relies on the fact that `updateStatus` is called at the start of the status cycle.
## How the Insights Operator uses various API clients
Internally, the Insights Operator talks to the Kubernetes API server over HTTP REST queries. Each query is authenticated by a Bearer token. To see an actual REST query being used, you can try:
```shell script
oc get pods -A -v=9
```
```text
I1006 12:26:33.972634 66541 loader.go:375] Config loaded from file: /home/mkunc/.kube/config
I1006 12:26:36.075230 66541 round_trippers.go:443] GET https://api.sharedocp4upi43.lab.upshift.rdu2.redhat.com:6443/api/v1/pods?limit=500 200 OK in 2097 milliseconds
```

Reason for doing this is that there are many clients, every one of which is cheap
On the other hand, it's quite cumbersome to pass around a bunch of clients, the number of which changes by the day, with no benefit.
## How are the credentials used in clients
The IO deployment [manifest](manifests/06-deployment.yaml) specifies the service account `operator` (`serviceAccountName: operator`). This is the account under which the Insights Operator runs, reads its configuration, and reads the metrics.
Because the Insights Operator needs quite powerful credentials to access cluster-wide resources, it has one more service account, called `gather`. It is created in the [manifest](manifests/03-clusterrole.yaml).

Code: To verify that the `gather` account has the right permissions to call the verb `list` on a resource (for example `machinesets`):
```shell script
kubectl auth can-i list machinesets --as=system:serviceaccount:openshift-insights:gather
```
```shell
yes
```
This account is used to impersonate all the clients used in the gatherers' API calls. The impersonated account is set in `operator.go`:
Code: In `operator.go`, a specific API client uses the impersonated account:

Note: I was only able to test missing permissions on OCP 4.3, the versions above
don't have RBAC enabled.
Code: Example error returned from the API, in this case when getting the config from the imageregistry:
```text
configs.imageregistry.operator.openshift.io "cluster" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "configs" in API group "imageregistry.operator.openshift.io" at the cluster scope
```
## How API extensions work
If a cloud-native application wants to add a Kubernetes API endpoint, it needs to define it using [K8s API extensions](https://kubernetes.io/docs/concepts/extend-kubernetes/) and define a Custom Resource Definition. OpenShift itself defines them in [github.com/openshift/api](github.com/openshift/api) (ClusterOperators, Proxy, Image, ...). Thus, to use an OpenShift API, we need to use OpenShift's generated client-go client.
If we needed to use the API of some other operator, we would need to find out whether that operator defines an API.
Typically, when an operator defines a new CRD type, this type is defined inside its repo (for example [Machine Config Operator's MachineConfig](https://github.com/openshift/machine-config-operator/tree/master/pkg/apis/machineconfiguration.openshift.io)).
To talk to a specific API, we need a generated clientset and generated lister types for the CRD type. There are three possibilities:
- The operator generates neither clientset nor lister types
- The operator generates only lister types
- The operator generates both clientset and lister types
The Machine Config Operator defines:
- its Lister types [here](https://github.com/openshift/machine-config-operator/tree/master/pkg/generated/listers/machineconfiguration.openshift.io/v1)
- its ClientSet [here](https://github.com/openshift/machine-config-operator/blob/master/pkg/generated/clientset/versioned/clientset.go)