Commit adff965

Author: Ricardo Lüders

chore(docs): adding list of insights generated metrics (#645)

* chore(docs): adding list of insights generated metrics
* chore(docs): adding missing alerts and metrics
* chore(docs): improving metrics and alerts docs
* chore(docs): fix typo for insights

1 parent 0929403 commit adff965

3 files changed: +40 -13 lines

Diff for: README.md (+2 -1)

````diff
@@ -103,7 +103,8 @@ gen_cert_file.py kubeconfig.yaml
 ## Prometheus metrics provided by Insights Operator
 
 It is possible to read Prometheus metrics provided by Insights Operator. Example of metrics exposed by
-Insights Operator can be found at [metrics.txt](docs/metrics.txt)
+Insights Operator can be found at [metrics.txt](docs/metrics.txt); a list of possible metrics is
+available in the [architecture document](docs/arch.md).
 
 Depending on how or where the IO is running, you may have different ways to retrieve the metrics.
 Here is a list of some options, so you can find the one that fits you:
````

Diff for: docs/arch.md (+36 -10)

````diff
@@ -8,13 +8,13 @@ Insights Operator does not manage any application. As usual with operator applic
 The Insights Operator's configuration is a combination of the file [config/pod.yaml](../config/pod.yaml) (basically the default configuration hardcoded in the image) and the configuration stored in the `support` secret in the `openshift-config` namespace. The secret doesn't exist by default, but when it does, it overrides the default settings which IO reads from the `config/pod.yaml`.
 The `support` secret provides the following configuration attributes:
 - `endpoint` - upload endpoint - default is `https://console.redhat.com/api/ingress/v1/upload`,
-- `interval` - data gathering & uploading frequency - default is `2h`
+- `interval` - data gathering & uploading frequency - default is `2h`
 - `httpProxy`, `httpsProxy`, `noProxy` - optionally set a custom proxy, which overrides the cluster proxy just for the Insights Operator
 - `username`, `password` - if set, the insights client upload will be authenticated by basic authorization using the username/password. By default, it uses the token (see below) from the `pull-secret` secret.
 - `enableGlobalObfuscation` - to enable the global obfuscation of the IP addresses and the cluster domain name. Default value is `false`
 - `reportEndpoint` - download endpoint. From this endpoint, the Insights operator downloads the latest Insights analysis. Default value is `https://console.redhat.com/api/insights-results-aggregator/v2/cluster/%s/reports` (where `%s` must be replaced with the cluster ID)
 - `reportPullingDelay` - the delay between data upload and download. Default value is `60s`
-- `reportPullingTimeout` - timeout for the Insights download request.
+- `reportPullingTimeout` - timeout for the Insights download request.
 - `reportMinRetryTime` - the time after which the request is retried. Default value is `30s`
 - `scaEndpoint` - the endpoint for downloading the Simple Content Access (SCA) entitlements. Default value is `https://api.openshift.com/api/accounts_mgmt/v1/certificates`
 - `scaInterval` - frequency of the SCA entitlements download. Default value is `8h`.
````
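For illustration, a `support` secret overriding a few of these attributes might look like the sketch below. The attribute names, secret name, and namespace come from the text above; the chosen values are arbitrary examples, not recommendations:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: support
  namespace: openshift-config
type: Opaque
stringData:
  # overrides the 2h default gathering/upload frequency
  interval: "1h"
  # enables IP address and cluster-domain obfuscation (default "false")
  enableGlobalObfuscation: "true"
```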
````diff
@@ -103,7 +103,7 @@ There are these main tasks scheduled:
 Insights operator defines three types of gatherers (see below). Each of them must implement the [Interface](../pkg/gatherers/interface.go#L11) and they are initialized by calling `gather.CreateAllGatherers` in `operator.go`. The actual gathering is triggered in the `Run` method in `pkg/controller/periodic/periodic.go`, but not every gatherer is triggered every time (for example, see the [CustomPeriodGatherer type](../pkg/gatherers/interface.go#L21)).
 
 Each gatherer includes one or more gathering functions. Gathering functions are defined as a map, where the key is the name of the function and the value is the [GatheringClosure type](../pkg/gatherers/interface.go#L34). They are executed concurrently in the `HandleTasksConcurrently` function in `pkg/gather/task_processing.go`.
-One of the attributes of the `GatheringClosure` type is the function that returns the values: `([]record.Record, []error)`. The slice of the records is the result of gathering function. The actual data is in the `Item` attribute of the `Record`. This `Item` is of type `Marshalable` (see the interface in the [record.go](../pkg/record/record.go)) and there are two JSON marshallers used to serialize the data - `JSONMarshaller` and `ResourceMarshaller` which allows you to save few bytes by omitting the `managedFields` during the serialization.
+One of the attributes of the `GatheringClosure` type is the function that returns the values `([]record.Record, []error)`. The slice of records is the result of the gathering function. The actual data is in the `Item` attribute of the `Record`. This `Item` is of type `Marshalable` (see the interface in [record.go](../pkg/record/record.go)) and there are two JSON marshallers used to serialize the data: `JSONMarshaller` and `ResourceMarshaller`, which allows you to save a few bytes by omitting the `managedFields` during serialization.
 Errors, warnings or panics that occurred during a given gathering function are logged in the "metadata" part of the Insights operator archive. See the [sample archive example](../docs/insights-archive-sample/insigths-operator/gathers.json)
 
 ### Clusterconfig gatherer
````
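The map-of-gathering-functions pattern described above can be sketched with plain goroutines. This is a simplified stand-in for `HandleTasksConcurrently`; `Record` and `GatheringClosure` here are illustrative stubs, not the operator's actual types:

```go
package main

import (
	"fmt"
	"sync"
)

// Record is a stub standing in for pkg/record.Record; the real Item is a
// Marshalable, not a string.
type Record struct {
	Name string
	Item string
}

// GatheringClosure mirrors the ([]record.Record, []error) signature
// described above (the real type is a struct wrapping such a function).
type GatheringClosure func() ([]Record, []error)

// runConcurrently executes every gathering function in its own goroutine
// and collects the records; errors would end up in the archive "metadata".
func runConcurrently(tasks map[string]GatheringClosure) map[string][]Record {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string][]Record)
	)
	for name, run := range tasks {
		wg.Add(1)
		go func(name string, run GatheringClosure) {
			defer wg.Done()
			records, errs := run()
			mu.Lock()
			defer mu.Unlock()
			results[name] = records
			for _, err := range errs {
				fmt.Printf("gatherer %s: %v\n", name, err)
			}
		}(name, run)
	}
	wg.Wait()
	return results
}

func main() {
	tasks := map[string]GatheringClosure{
		"clusterconfig/version": func() ([]Record, []error) {
			return []Record{{Name: "config/version", Item: "4.x"}}, nil
		},
	}
	res := runConcurrently(tasks)
	fmt.Println(len(res["clusterconfig/version"])) // 1
}
```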
````diff
@@ -116,16 +116,22 @@ The data from this gatherer is stored under `/config` directory in the archive.
 
 Defined in [workloads_gatherer.go](../pkg/gatherers/workloads/workloads_gatherer.go). This gatherer only runs every 12 hours and the interval is not configurable. This is because running the gatherer more often would significantly increase the amount of data in the archive, which is assumed not to change very often. There is only one gathering function in this gatherer and it gathers workload fingerprint data (SHA of the images, fingerprints of namespaces as the number of pods in a namespace, fingerprints of containers as the first command and first argument).
 
-The data from this gatherer is stored in the `/config/workload_info.json` file in the archive, but please note that not every archive contains this data.
+The data from this gatherer is stored in the `/config/workload_info.json` file in the archive, but please note that not every archive contains this data.
 
 ### Conditional gatherer
 
-Defined in [conditional_gatherer.go](../pkg/gatherers/conditional/conditional_gatherer.go). This gatherer is ran regularly (2h by default), but it only gathers some data when a corresponding condition is met. The conditions and corresponding gathering functions are defined in an external service (https://console.redhat.com/api/gathering/gathering_rules). A typical example of a condition is when an alert is firing. This also means that this gatherer relies on the availability of Prometheus metrics and alerts.
+Defined in [conditional_gatherer.go](../pkg/gatherers/conditional/conditional_gatherer.go). This gatherer is run regularly (every 2h by default), but it only gathers data when a corresponding condition is met. The conditions and corresponding gathering functions are defined in an external service (https://console.redhat.com/api/gathering/gathering_rules). A typical example of a condition is a firing alert. This also means that this gatherer relies on the availability of Prometheus metrics and alerts.
 
-The data from this gatherer is stored under the `/conditional` directory in the archive.
+The data from this gatherer is stored under the `/conditional` directory in the archive.
 
 ## Downloading and exposing Insights Analysis
-After every successful upload of archive, the operator waits for 1m (see the `reportPullingDelay` config attribute) and then it tries to download the latest Insights analysis result of the latest archive (created by the Insights pipeline in `console.redhat.com`). The report is verified by checking the `LastCheckedAt` timestamp (see `pkg/insights/insightsreport/types.go`). If the latest Insights result is not yet available (e.g. the pipeline may be delayed) or there has been some error response, the download request is repeated (see the `reportMinRetryTime` config attribute). The successfully downloaded Insights report is parsed and the numbers of corresponding hitting Insights recommendations are exposed via `health_statuses_insights` Prometheus metric.
+After every successful upload of the archive, the operator waits (see the `reportPullingDelay` config attribute) and
+then tries to download the latest Insights analysis of the latest archive (created by the Insights pipeline
+in `console.redhat.com`). The report is verified by checking the `LastCheckedAt` timestamp (see
+`pkg/insights/insightsreport/types.go`). If the latest Insights result is not yet available (e.g. the pipeline may be
+delayed) or an error response was received, the download request is repeated (see the `reportMinRetryTime` config
+attribute). The successfully downloaded Insights report is parsed and the numbers of matching Insights
+recommendations are exposed via the `health_statuses_insights` Prometheus metric.
 
 Example of reported metrics:
 ```prometheus
````
````diff
@@ -138,6 +144,26 @@ health_statuses_insights{metric="moderate"} 1
 health_statuses_insights{metric="total"} 2
 ```
 
+### Metrics
+
+- `health_statuses_insights` - information about the cluster health status based on the last downloaded report; the number of matching recommendations grouped by severity.
+- `insightsclient_request_send_total` - tracks the number of archives sent.
+- `insightsclient_request_recvreport_total` - tracks the number of Insights reports received/downloaded.
+- `insightsclient_last_gather_time` - the time of the last Insights data gathering.
+- `insights_recommendation_active` - exposes Insights recommendations as Prometheus alerts.
+
+> **Note**
+> The metrics are registered by [the `MustRegisterMetrics` function](../pkg/insights/metrics.go)
+
+### Alerts
+
+- `InsightsDisabled` - the Insights operator is disabled.
+- `SimpleContentAccessNotAvailable` - Simple Content Access certificates are not available.
+- `InsightsRecommendationActive` - an Insights recommendation is active for this cluster.
+
+> **Note**
+> The alerts are defined in [08-prometheus_rule.yaml](../manifests/08-prometheus_rule.yaml)
+
 ### Scheduling and running of Uploader
 The `operator.go` starts a background task defined in `pkg/insights/insightsuploader/insightsuploader.go`. The insights uploader periodically checks whether there is any data to upload. If no data is found, the uploader continues with the next cycle.
-The uploader triggers the `wait.Until` function, which waits until the configuration changes or it is time to upload. After start of the operator, there is some waiting time before the very first upload. This time is defined by `initialDelay`. If no error occurred while sending the POST request, then the next uploader check is defined as `wait.Jitter(interval, 1.2)`, where interval is the gathering interval.
+The uploader triggers the `wait.Until` function, which waits until the configuration changes or it is time to upload. After the start of the operator, there is some waiting time before the very first upload; this time is defined by `initialDelay`. If no error occurred while sending the POST request, the next uploader check is scheduled as `wait.Jitter(interval, 1.2)`, where `interval` is the gathering interval.
````
````diff
@@ -144,6 +170,6 @@
 
 ## How Uploader authenticates to console.redhat.com
-The HTTP communication with the external service (e.g uploading the Insights archive or downloading the Insights analysis) is defined in the [insightsclient package](../pkg/insights/insightsclient/). The HTTP transport is encrypted with TLS (see the `clientTransport()` function defined in the `pkg/insights/insightsclient/insightsclient.go`. This function (and the `prepareRequest` function) uses `pkg/authorizer/clusterauthorizer.go` to respect the proxy settings and to authorize (i.e add the authorization header with respective token value) the requests. The user defined certificates in the `/var/run/configmaps/trusted-ca-bundle/ca-bundle.crt` are taken into account (see the cluster wide proxy setting in the [OCP documentation](https://docs.openshift.com/container-platform/latest/networking/enable-cluster-wide-proxy.html)).
+The HTTP communication with the external service (e.g. uploading the Insights archive or downloading the Insights analysis) is defined in the [insightsclient package](../pkg/insights/insightsclient/). The HTTP transport is encrypted with TLS (see the `clientTransport()` function defined in `pkg/insights/insightsclient/insightsclient.go`). This function (and the `prepareRequest` function) uses `pkg/authorizer/clusterauthorizer.go` to respect the proxy settings and to authorize the requests (i.e. add the authorization header with the respective token value). The user-defined certificates in `/var/run/configmaps/trusted-ca-bundle/ca-bundle.crt` are taken into account (see the cluster-wide proxy setting in the [OCP documentation](https://docs.openshift.com/container-platform/latest/networking/enable-cluster-wide-proxy.html)).
 
 ## Summarising the content before upload
 The summarizer is defined in `pkg/recorder/diskrecorder/diskrecorder.go` and merges all existing archives. That is, it merges together all archives with names matching the pattern `insights-*.tar.gz` which weren't removed and which are newer than the last check time. A `mergeReader` then takes one file after another and adds all of them to the archive under their path.
````
````diff
@@ -218,7 +244,7 @@ oc get co insights -o=json | jq '.status.conditions'
 A condition is defined by its type. You may notice that there are some non-standard clusteroperator conditions. They are:
 - `SCAAvailable` - based on the SCA (Simple Content Access) controller in `pkg/ocm/sca/sca.go`; provides information about the status of downloading the SCA entitlements.
 - `ClusterTransferAvailable` - based on the cluster transfer controller in `pkg/ocm/clustertransfer/cluster_transfer.go`; provides information about the availability of cluster transfers.
-- `Disabled` - indicates whether data gathering is disabled or enabled.
+- `Disabled` - indicates whether data gathering is disabled or enabled.
 
 In addition to the above clusteroperator conditions, there are some intermediate clusteroperator conditions. These are:
 - `UploadDegraded` - this condition occurs when there is an unsuccessful upload of the Insights data (if the number of upload attempts is greater than or equal to 5, the operator is marked as **Degraded**). Example:
````

Diff for: pkg/insights/insightsclient/insightsclient.go (+2 -2)

````diff
@@ -266,11 +266,11 @@ func (c *Client) createAndWriteMIMEHeader(source *Source, mw *multipart.Writer,
 var (
 	counterRequestSend = metrics.NewCounterVec(&metrics.CounterOpts{
 		Name: "insightsclient_request_send_total",
-		Help: "Tracks the number of metrics sends",
+		Help: "Tracks the number of archives sent",
 	}, []string{"client", "status_code"})
 	counterRequestRecvReport = metrics.NewCounterVec(&metrics.CounterOpts{
 		Name: "insightsclient_request_recvreport_total",
-		Help: "Tracks the number of reports requested",
+		Help: "Tracks the number of Insights reports received/downloaded",
 	}, []string{"client", "status_code"})
 )
````
