Commit 39aaa2c

update docs/arch.md to mention the Rapid Recommendations (openshift#996)
1 parent 6c7cc90 commit 39aaa2c

1 file changed

+69
-8
lines changed

docs/arch.md

@@ -5,9 +5,11 @@ The main goal of the Insights Operator is to periodically gather anonymized data
55
Insights Operator does not manage any application. As usual with operator applications, most of the code is structured in the `pkg` package and `pkg/controller/operator.go` hosts the operator controller. Typically, operator controllers read configuration and start some periodical tasks.
66

77
## How the Insights operator reads configuration
8+
89
The Insights Operator's configuration is a combination of the file [config/pod.yaml](../config/pod.yaml) (the default configuration hardcoded in the image) and the configuration stored in either the `support` secret in the `openshift-config` namespace or the `insights-config` configmap in the `openshift-insights` namespace. Neither the secret nor the configmap exists by default, but when they do, they override the default settings that the Insights Operator reads from `config/pod.yaml`. The configmap takes precedence over the secret.
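
The override order can be sketched in Go as follows. This is a minimal illustration and not the operator's actual code: the `Config` struct and the `merge` and `effective` helpers are hypothetical, and only two settings are modelled.

```go
package main

import "fmt"

// Config is a hypothetical, trimmed-down view of the operator configuration.
// An empty string means "not set at this layer".
type Config struct {
	Endpoint string
	Interval string
}

// merge returns base with every non-empty field of override applied on top.
func merge(base, override Config) Config {
	if override.Endpoint != "" {
		base.Endpoint = override.Endpoint
	}
	if override.Interval != "" {
		base.Interval = override.Interval
	}
	return base
}

// effective applies the layers in increasing priority:
// image defaults, then the support secret, then the insights-config configmap.
func effective(defaults, secret, configMap Config) Config {
	return merge(merge(defaults, secret), configMap)
}

func main() {
	defaults := Config{Endpoint: "https://console.redhat.com/api/ingress/v1/upload", Interval: "2h"}
	secret := Config{Interval: "1h"}
	configMap := Config{Interval: "30m0s"}
	fmt.Printf("%+v\n", effective(defaults, secret, configMap))
}
```

In this sketch the configmap value for the interval wins over the secret, while the endpoint falls back to the image default because neither override sets it.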
910
The `insights-config` configmap provides the following configuration structure:
10-
```
11+
12+
```yaml
1113
dataReporting:
1214
  interval: 30m0s
1315
  uploadEndpoint: https://console.redhat.com/api/ingress/v1/upload
@@ -31,6 +33,7 @@ proxy:
3133
```
3234
3335
The `support` secret provides the following configuration attributes:
36+
3437
- `endpoint` - upload endpoint. Overwritten by `dataReporting/uploadEndpoint` from the configmap. Default is `https://console.redhat.com/api/ingress/v1/upload`.
3538
- `interval` - data gathering & uploading frequency. Overwritten by `dataReporting/interval` from the configmap. Default is `2h`.
3639
- `httpProxy`, `httpsProxy`, `noProxy` - optional custom proxy settings that override the cluster proxy just for the Insights Operator. Overwritten by `proxy/httpProxy`, `proxy/httpsProxy` and `proxy/noProxy`, respectively, from the configmap.
@@ -71,14 +74,16 @@ type: Opaque
7174
```shell script
7275
oc get secret support -n openshift-config -o=json | jq -r .data.endpoint | base64 -d
7376
```
74-
```
77+
78+
```shell
7579
https://console.redhat.com/api/ingress/v1/upload
7680
```
7781

7882
```shell script
7983
oc get secret support -n openshift-config -o=json | jq -r .data.interval | base64 -d
8084
```
81-
```
85+
86+
```shell
8287
2h
8388
```
8489

@@ -107,12 +112,14 @@ The configuration secrets are periodically refreshed by the [configobserver](../
107112
configCh, cancelFn := c.configurator.ConfigChanged()
108113
```
109114

110-
Internally the configObserver has an array of subscribers, so all of them will get the signal.
115+
Internally the `configObserver` has an array of subscribers, so all of them will get the signal.
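
A minimal sketch of such a subscriber mechanism is shown below. It is an assumption-laden illustration rather than the real `configObserver`: the `observer` type, `newObserver` and `notify` are hypothetical names, and a map is used instead of an array so that unsubscribing stays simple.

```go
package main

import (
	"fmt"
	"sync"
)

// observer is a hypothetical stand-in for the configObserver described above.
type observer struct {
	mu        sync.Mutex
	listeners map[chan struct{}]struct{}
}

func newObserver() *observer {
	return &observer{listeners: make(map[chan struct{}]struct{})}
}

// ConfigChanged registers a subscriber and returns its channel together
// with a function that unsubscribes it.
func (o *observer) ConfigChanged() (<-chan struct{}, func()) {
	o.mu.Lock()
	defer o.mu.Unlock()
	ch := make(chan struct{}, 1) // buffered so that notify never blocks
	o.listeners[ch] = struct{}{}
	return ch, func() {
		o.mu.Lock()
		defer o.mu.Unlock()
		delete(o.listeners, ch)
	}
}

// notify signals every subscriber; a pending, unread signal is not duplicated.
func (o *observer) notify() {
	o.mu.Lock()
	defer o.mu.Unlock()
	for ch := range o.listeners {
		select {
		case ch <- struct{}{}:
		default:
		}
	}
}

func main() {
	o := newObserver()
	a, cancelA := o.ConfigChanged()
	b, cancelB := o.ConfigChanged()
	defer cancelA()
	defer cancelB()

	o.notify()
	<-a
	<-b
	fmt.Println("both subscribers were notified")
}
```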
111116

112117
## How the Insights operator schedules tasks
118+
113119
A commonly used pattern in the Insights Operator is that a task runs as a goroutine and performs its own cycle of periodic actions.
114120
These actions are mostly started from `operator.go`. They usually use `wait.Until`, which runs a function periodically, after a short delay, until the end is signalled.
115121
There are these main tasks scheduled:
122+
116123
- Gatherer
117124
- Uploader
118125
- Downloader (Report gatherer)
@@ -145,9 +152,23 @@ The data from this gatherer is stored in the `/config/workload_info.json` file i
145152

146153
Defined in [conditional_gatherer.go](../pkg/gatherers/conditional/conditional_gatherer.go). This gatherer is run regularly (2h by default), but it only gathers some data when a corresponding condition is met. The conditions and corresponding gathering functions are defined in an external service (https://console.redhat.com/api/gathering/gathering_rules). A typical example of a condition is when an alert is firing. This also means that this gatherer relies on the availability of Prometheus metrics and alerts.
147154

148-
The data from this gatherer is stored under the `/conditional` directory in the archive.
155+
The functionality of this gatherer was extended in version 4.17. The value of `conditionalGathererEndpoint` was updated and the endpoint now serves updated content. The main addition is the `container_logs` field in the content provided by the external service. This field was added during the implementation of the [Rapid Recommendations](https://github.com/openshift/enhancements/blob/master/enhancements/insights/rapid-recommendations.md) OpenShift enhancement proposal and contains an array of so-called container log requests. A container log request might look as follows:
156+
157+
```json
158+
{
159+
"namespace": "openshift-cloud-credential-operator",
160+
"pod_name_regex": ".*",
161+
"messages": [
162+
"googleapis\\.com.*proxyconnect\\ tcp:\\ dial\\ tcp.*i/o\\ timeout"
163+
]
164+
}
165+
```
166+
167+
The `namespace` attribute defines the namespace name, `pod_name_regex` defines a regular expression to match Pod names (in the given namespace), and `messages` defines a list of regular expressions used to filter the matching container logs. There is one optional attribute, `previous`, which says whether you want to filter the log of a previous container.
168+
149169

150170
## Downloading and exposing Insights Analysis
171+
151172
After every successful upload of an archive, the operator waits (see the `reportPullingDelay` config attribute) and
152173
then it tries to download the latest Insights analysis result of the latest archive (created by the Insights pipeline
153174
in `console.redhat.com`). The report is verified by checking the `LastCheckedAt` timestamp (see
@@ -157,6 +178,7 @@ attribute). The successfully downloaded Insights report is parsed and the number
157178
recommendations are exposed via `health_statuses_insights` Prometheus metric.
158179

159180
Code: Example of reported metrics:
181+
160182
```prometheus
161183
# HELP health_statuses_insights [ALPHA] Information about the cluster health status as detected by Insights tooling.
162184
# TYPE health_statuses_insights gauge
@@ -188,30 +210,38 @@ health_statuses_insights{metric="total"} 2
188210
> The alerts are defined [here](../manifests/08-prometheus_rule.yaml)
189211

190212
### Scheduling and running of Uploader
213+
191214
`operator.go` starts the background task defined in `pkg/insights/insightsuploader/insightsuploader.go`. The Insights uploader periodically checks whether there is any data to upload. If no data is found, the uploader continues with the next cycle.
192215
The uploader triggers the `wait.Until` function, which waits until the configuration changes or it is time to upload. After the operator starts, there is some waiting time before the very first upload, defined by `initialDelay`. If no error occurred while sending the POST request, the next uploader check is scheduled after `wait.Jitter(interval, 1.2)`, where `interval` is the gathering interval.
193216

194217
## How Uploader authenticates to console.redhat.com
218+
195219
The HTTP communication with the external service (e.g. uploading the Insights archive or downloading the Insights analysis) is defined in the [insightsclient package](../pkg/insights/insightsclient/). The HTTP transport is encrypted with TLS (see the `clientTransport()` function defined in `pkg/insights/insightsclient/insightsclient.go`). This function (and the `prepareRequest` function) uses `pkg/authorizer/clusterauthorizer.go` to respect the proxy settings and to authorize the requests (i.e. add the authorization header with the respective token value). The user-defined certificates in `/var/run/configmaps/trusted-ca-bundle/ca-bundle.crt` are taken into account (see the cluster-wide proxy setting in the [OCP documentation](https://docs.openshift.com/container-platform/latest/networking/enable-cluster-wide-proxy.html)).
196220

197221
## Summarising the content before upload
222+
198223
The summarizer is defined in `pkg/recorder/diskrecorder/diskrecorder.go` and merges all existing archives. That is, it merges together all archives whose names match the pattern `insights-*.tar.gz`, which weren't removed and which are newer than the last check time. Then `mergeReader` takes one file after another and adds each of them to the archive under its path.
199224
If the file names are unstable (for example, when reading from the API with a limit and reaching that limit), it could merge together more files than the API limit specifies.
200225

201226
## Scheduling the ConfigObserver
227+
202228
Another background task is defined in `pkg/config/configobserver/configobserver.go`. The `configObserver` is created by calling `configObserver.New`, which sets the default observing interval to 5 minutes.
203229
The `Start` method again uses `wait.Until` to run every 5 minutes and reads both the `support` and `pull-secret` secrets.
204230

205231
## Scheduling diskpruner and what it does
232+
206233
By default, the Insights Operator gatherer calls diskrecorder to save newly collected data in a new file, but it doesn't remove old files. That is the task of diskpruner. Observer calls the `recorder.PeriodicallyPrune()` function, which again uses the `wait.Until` pattern and runs approximately after every second interval.
207234
Internally it calls `diskrecorder.Prune` with `maxAge = interval*6*24` (with 2h it is 12 days); everything older is removed from the archive path (by default `/tmp/insights-operator`).
208235

209236
## How the Insights operator sets operator status
237+
210238
The operator status is based on K8s [Pod conditions](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions).
211239
Code: What the Insights Operator status conditions look like:
240+
212241
```shell script
213242
oc get co insights -o=json | jq '.status.conditions'
214243
```
244+
215245
```json
216246
[
217247
{
@@ -261,18 +291,36 @@ oc get co insights -o=json | jq '.status.conditions'
261291
"reason": "AsExpected",
262292
"status": "True",
263293
"type": "Available"
264-
}
294+
},
295+
{
296+
"lastTransitionTime": "2024-09-09T12:16:20Z",
297+
"reason": "AsExpected",
298+
"status": "True",
299+
"type": "RemoteConfigurationValid"
300+
},
301+
{
302+
"lastTransitionTime": "2024-09-09T12:16:20Z",
303+
"reason": "AsExpected",
304+
"status": "True",
305+
"type": "RemoteConfigurationAvailable"
306+
}
265307
]
266308
```
309+
267310
A condition is defined by its type. You may notice that there are some non-standard clusteroperator conditions. They are:
311+
268312
- `SCAAvailable` - based on the SCA (Simple Content Access) controller in `pkg/ocm/sca/sca.go` and provides information about the status of downloading the SCA entitlements.
269313
- `ClusterTransferAvailable` - based on the cluster transfer controller in `pkg/ocm/clustertransfer/cluster_transfer.go` and provides information about the availability of cluster transfers.
270314
- `Disabled` - indicates whether data gathering is disabled or enabled. Note that when the operator is `Disabled=True`, it is still also `Available=True`, which is strange at first glance, but the Cluster Version Operator (CVO) checks that all the clusteroperators are `Available=True` during the OpenShift installation. If they are not, the installation will fail, which happens in disconnected environments/clusters where the Insights operator is usually `Disabled=True` (because there is no `cloud.openshift.com` token in the `pull-secret`). You can find more about this topic in:
271315
- https://pkg.go.dev/github.com/openshift/api/config/v1#ClusterStatusConditionType - note that when the Insights Operator is `Disabled=True`, it does not require immediate administrator intervention (and thus it still reports `Available=True`), i.e. nobody should be paged in this situation
272316
- https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-version-operator/dev/clusteroperator.md#conditions
317+
- `RemoteConfigurationAvailable` refers to the remote configuration (originally known as Gathering conditions) provided by the external service (see the [Conditional gatherer](#conditional-gatherer)). This condition tells whether the endpoint was available (HTTP 200 status code) or not.
318+
- `RemoteConfigurationValid` refers to the remote configuration (originally known as Gathering conditions) provided by the external service (see the [Conditional gatherer](#conditional-gatherer)). This condition tells whether the content read from the endpoint is valid JSON and can be parsed by the operator.
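
The relationship between the two remote configuration conditions can be sketched as follows. The `remoteConfigConditions` function is hypothetical and deliberately simplified: it collapses each condition to a boolean and uses `json.Valid` as a stand-in for "can be parsed by the operator", whereas the real conditions carry a status, reason and message and may use other status values.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// remoteConfigConditions derives simplified values of the two conditions
// from an HTTP status code and a response body. The function and its
// name are hypothetical.
func remoteConfigConditions(statusCode int, body []byte) (available, valid bool) {
	available = statusCode == http.StatusOK
	// Content can only be validated when the endpoint actually served it.
	valid = available && json.Valid(body)
	return available, valid
}

func main() {
	cases := []struct {
		code int
		body string
	}{
		{http.StatusOK, `{"container_logs": []}`},
		{http.StatusOK, `not json`},
		{http.StatusServiceUnavailable, ``},
	}
	for _, c := range cases {
		available, valid := remoteConfigConditions(c.code, []byte(c.body))
		fmt.Printf("status=%d available=%t valid=%t\n", c.code, available, valid)
	}
}
```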
273319

274320
In addition to the above clusteroperator conditions, there are some intermediate clusteroperator conditions. These are:
321+
275322
- `UploadDegraded` - this condition occurs when there is any unsuccessful upload of the Insights data (if the number of upload attempts is equal to or greater than 5, the operator is marked as **Degraded**). An example is:
323+
276324
```json
277325
{
278326
"lastTransitionTime": "2022-05-18T10:12:23Z",
@@ -283,7 +331,9 @@ In addition to the above clusteroperator conditions, there are some intermediate
283331
},
284332
285333
```
334+
286335
- `InsightsDownloadDegraded` - this condition occurs when there is any unsuccessful download of the Insights analysis. An example is:
336+
287337
```json
288338
{
289339
"lastTransitionTime": "2022-05-18T10:17:49Z",
@@ -300,6 +350,7 @@ the operator status from its internal list of sources. Any component which wants
300350
SimpleReporter, which returns its actual status. The Simple reporter is defined in `controllerstatus`.
301351

302352
Code: In `operator.go` components are adding their reporters to Status Sources:
353+
303354
```go
304355
statusReporter.AddSources(uploader)
305356
```
@@ -308,13 +359,15 @@ This periodic status updater calls `updateStatus `which sets the operator status
308359
The uploader relies on `updateStatus` to determine whether it is safe to upload, that is, whether the cluster operator status is healthy. It relies on the fact that `updateStatus` is called at the start of the status cycle.
309360

310361
## How is Insights Operator using various API Clients
362+
311363
Internally, the Insights Operator talks to the Kubernetes API server over HTTP REST queries. Each query is authenticated by a Bearer token.
312364
To see an actual REST query being used, you can try:
313365

314366
```shell script
315367
oc get pods -A -v=9
316368
```
317-
```
369+
370+
```text
318371
I1006 12:26:33.972634 66541 loader.go:375] Config loaded from file: /home/mkunc/.kube/config
319372
I1006 12:26:33.977546 66541 round_trippers.go:423] curl -k -v -XGET -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: oc/4.5.0 (linux/amd64) kubernetes/9933eb9" -H "Authorization: Bearer Xy9HoVzNdsRifGr3oCIl7pfxwkeqE2u058avw6o969w" 'https://api.sharedocp4upi43.lab.upshift.rdu2.redhat.com:6443/api/v1/pods?limit=500'
320373
I1006 12:26:36.075230 66541 round_trippers.go:443] GET https://api.sharedocp4upi43.lab.upshift.rdu2.redhat.com:6443/api/v1/pods?limit=500 200 OK in 2097 milliseconds
@@ -337,6 +390,7 @@ Reason for doing this is that there are many clients every one of which is cheap
337390
On the other hand, it's quite cumbersome to pass around a bunch of clients, the number of which is changing by the day, with no benefit.
338391

339392
## How are the credentials used in clients
393+
340394
The Insights Operator deployment [manifest](manifests/06-deployment.yaml) specifies the service account `operator` (`serviceAccountName: operator`). This is the account under which the Insights Operator runs, reads its configuration, and reads the metrics.
341395
Because Insights Operator needs quite powerful credentials to access cluster-wide resources, it has one more service account called gather. It is created
342396
in [manifest](manifests/03-clusterrole.yaml).
@@ -346,12 +400,14 @@ Code: To verify if gather account has right permissions to call verb list from a
346400
```shell script
347401
kubectl auth can-i list machinesets --as=system:serviceaccount:openshift-insights:gather
348402
```
349-
```
403+
404+
```shell
350405
yes
351406
```
352407

353408
This account is used to impersonate any clients used in gather API calls. The impersonated account is set in `operator.go`:
354409
Code: In Operator.go specific Api client is using impersonated account
410+
355411
```go
356412
gatherKubeConfig := rest.CopyConfig(controller.KubeConfig)
357413
if len(s.Impersonate) > 0 {
@@ -362,6 +418,7 @@ Code: In Operator.go specific Api client is using impersonated account
362418
```
363419

364420
Code: The impersonated account is specified in config/pod.yaml (or config/local.yaml) using:
421+
365422
```yaml
366423
impersonate: system:serviceaccount:openshift-insights:gather
367424
```
@@ -372,22 +429,26 @@ Note: I was only able to test missing permissions on OCP 4.3, the versions above
372429
don't have RBAC enabled.
373430

374431
Code: Example error returned from the API, in this case when getting the config from imageregistry.
432+
375433
```
376434
configs.imageregistry.operator.openshift.io "cluster" is forbidden: User "system:serviceaccount:openshift-insights:gather" cannot get resource "configs" in API group "imageregistry.operator.openshift.io" at the cluster scope
377435
```
378436

379437
## How API extensions works
438+
380439
If any cloud native application wants to add a Kubernetes API endpoint, it needs to define it using [K8s API extensions](https://kubernetes.io/docs/concepts/extend-kubernetes/) and it needs to define a Custom Resource Definition (CRD). OpenShift itself defines them in [github.com/openshift/api](https://github.com/openshift/api) (ClusterOperators, Proxy, Image, ..). Thus, to use the OpenShift API, we need to use OpenShift's client-go generated client.
381440
If we need to use the API of some other operator, we need to find out whether that operator defines an API.
382441

383442
Typically, when an operator defines a new CRD type, the type is defined inside its repo (for example [Machine Config Operator's MachineConfig](https://github.com/openshift/machine-config-operator/tree/master/pkg/apis/machineconfiguration.openshift.io)).
384443

385444
To talk to a specific API, we need a generated clientset and generated lister types for the CRD type. There are three possibilities:
445+
386446
- Operator generates neither clientset nor lister types
387447
- Operator generates only lister types
388448
- Operator generates both clientset and lister types
389449

390450
Machine Config Operator defines:
451+
391452
- its Lister types [here](https://github.com/openshift/machine-config-operator/tree/master/pkg/generated/listers/machineconfiguration.openshift.io/v1)
392453
- its ClientSet [here](https://github.com/openshift/machine-config-operator/blob/master/pkg/generated/clientset/versioned/clientset.go)
393454
