The Insights Operator is an OpenShift cloud-native application based on the [Operator Framework](https://github.com/operator-framework).
The Operator Framework is a toolkit for managing other cloud-native applications.

Tip:
Try to install operator-sdk and generate a new operator using https://sdk.operatorframework.io/docs/building-operators/golang/quickstart/
You will see how much code is generated by operator-sdk by default and what is provided by default in an operator.

The main goal of the Insights Operator is to periodically gather anonymized data from applications in the cluster and upload it
to `cloud.redhat.com` for analysis.

The Insights Operator itself does not manage any applications; it only runs on the Operator Framework infrastructure.
As is the convention for operator applications, most of the code is structured in the `pkg` package, and `pkg/controller/operator.go`
hosts the Operator controller. Typically, operator controllers read configuration and start some periodic tasks.

15
## How Insights Operator reads configuration
16
-
In case of Insights Operator, configuration is a combination of file [config/pod.yaml](config/pod.yaml) and configuration stored in
16
+
In case of Insights Operator, configuration is a combination of file [config/pod.yaml](config/pod.yaml) and configuration stored in
17
17
Namespace openshift-config in secret support. In the secret support is the endpoint and interval. The secret doesn't exist by default,
18
18
but when exists it overrides default settings which IO reads from config/pod.yaml.
The support secret has:

- endpoint (where to upload to)
- interval (baseline for how often to gather and upload)
- httpProxy, httpsProxy, noProxy, optionally, to set a custom proxy which overrides the cluster proxy just for Insights Operator uploads
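For illustration, a hypothetical `support` secret with the keys listed above might look like this (the values, including the endpoint URL, are made up here; the real defaults come from `config/pod.yaml`):

```yaml
# Illustrative only -- key names match the list above, values are invented.
apiVersion: v1
kind: Secret
metadata:
  name: support
  namespace: openshift-config
type: Opaque
stringData:
  endpoint: https://upload.example.redhat.com/api/ingress   # where to upload to
  interval: 2h                                              # gather/upload baseline
  httpProxy: http://proxy.example.com:3128                  # optional custom proxy
```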
## How is Insights Operator scheduling gathering
A commonly used pattern in the Insights Operator is that a task is started as a goroutine and runs its own cycle of periodic actions.
These actions are mostly started from `operator.go`.
They usually use `wait.Until`, which runs a function periodically after a short delay until the end is signalled.
These main tasks are scheduled:

- Gatherer
### Scheduling of Gatherer
The Gatherer uses this logic to start gathering information from the cluster; it is handled in [periodic.go](pkg/controller/periodic/periodic.go).

So far there is only one Gatherer (called `clusterconfig`); it has several gather-functions, each collecting different data from the cluster.
The workflow of the gather-functions is managed by the Gatherer.
Only one Gatherer runs at a time, because there is only one Gatherer at the moment (i.e. concurrency can be added here when it is needed).
When the Insights Operator starts, there is an initial delay before the first `Gather` happens; after that, a `Gather` is initiated every interval. This is done by `periodicTrigger`.
`periodic.Run` handles the initial delay and starts the `periodicTrigger` like `go wait.Until(func() { c.periodicTrigger(stopCh) }, time.Second, stopCh)`.

`Gather` uses `ExponentialBackoff` to retry (up to the count specified in `status.GatherFailuresCountThreshold`) if a Gatherer returns any errors. These errors are mostly caused by a collected resource not being ready yet, so it cannot be collected right now and should be retried later.
It is important that all retries finish before the next gather period starts, so that there are no potential conflicts; the backoff is calibrated to take this into account.
Errors that occurred during a gather-function are logged in the metadata part of the Insights Operator archive (`insights-operator/gathers.json`).
### Scheduling and running of Uploader
`operator.go` starts a background task defined in `pkg/insights/insightsuploader/insightsuploader.go`. The insights uploader periodically checks whether there is any data to upload by calling the summarizer.
If no data to upload is found, the uploader continues with the next cycle.
The uploader cycle runs the `wait.Poll` function, which waits until the config changes or until it is time to upload. The time to upload is set by initialDelay.
If this is the first upload (the lastReportedTime from status is not set), the uploader uses `interval/8 + random(interval/8*2)` as the next upload time. This can be reset to 0, though, if it is safe to upload immediately. If an upload was already reported, the next upload interval is going to be `now - lastReported + interval + 1.2 Jitter`.
## Scheduling the ConfigObserver
Another background task started from the Observer comes from `pkg/config/configobserver/configobserver.go`. The observer creates the configObserver by calling `configObserver.New`, which sets the default observing interval to 5 minutes.
The Run method again runs `wait.Poll` every 5 minutes and reads both the support and pull-secret secrets.
## Scheduling diskpruner and what it does
By default, the Insights Operator Gather calls diskrecorder to save newly collected data in a new file, but it does not remove old ones. That is the task of the diskpruner. The Observer calls the `recorder.PeriodicallyPrune()` function. It again uses the wait.Until pattern and runs approximately after every second interval.
```
$ oc get co insights -o=json | jq '.status.conditions'
```
The status is updated by `pkg/controller/status/status.go`. Status has a background task which periodically updates
the Operator status from its internal list of Sources. Any component which wants to participate in the Operator's status adds a
SimpleReporter, which returns its actual Status. The SimpleReporter is defined in controllerstatus.
It relies on the fact that updateStatus is called at the start of the status cycle.

## How is Insights Operator using various Api Clients
Internally, the Insights Operator talks to the Kubernetes API server over HTTP REST queries. Each query is authenticated by a Bearer token.
To see an actual REST query being used, you can try:
But adding the Bearer token and creating the REST query is all handled automatically for us by using clients, which are generated, type-safe golang libraries,
like [github.com/openshift/client-go](github.com/openshift/client-go) or [github.com/kubernetes/client-go](github.com/kubernetes/client-go).
Both of these libraries are generated by automation, which specifies from which API repo and which API Group they are generated.

All clients are created near or at the place where they are going to be used; instead of the clients, the configs created from the KUBECONFIG envvar defined in the cluster are passed around.
The reason for doing this is that there are many clients, every one of which is cheap to create, and passing around the config is simple while also not changing much over time.
On the other hand, it is quite cumbersome to pass around a bunch of clients, the number of which changes by the day, with no benefit.
## How are the credentials used in clients
In the IO deployment [manifest](manifests/06-deployment.yaml), the service account operator is specified (serviceAccountName: operator). This is the account under which the Insights Operator runs, reads its configuration, and also reads the metrics.
To test whether the client has the right permissions, the command mentioned above with the verb, API and service account can be used.

Note: I was only able to test missing permissions on OCP 4.3; later versions seem to always pass fine. Maybe higher versions
don't have RBAC enabled.

Code: Example error returned from the API, in this case when getting the config from imageregistry.
## Gathering the data
### clusterconfig

When `periodic.go` calls the Gather method of the `clusterconfig` Gatherer, it is handled [here](https://github.com/openshift/insights-operator/blob/master/pkg/gather/clusterconfig/0_gatherer.go#L99).

The clusterconfig Gatherer starts each gather-function in its own separate goroutine with a dedicated channel to send back its results.
Each gather-function is its own separate entity; each creates its own clients using the configs present in the `Gatherer` object that was passed down as a parameter.
The gather-functions are further divided into 2 main parts:

1. the 'adapter-part' that is called by `Gatherer.Gather`, named `Gather<Something>`; it handles the creation of the clients and the communication with the `Gatherer`.
2. the 'core-part' that holds the actual logic of what to gather, named `gather<Something>`; the clients required for this are passed in as arguments by the 'adapter-part'.

Gather-functions are IO-bound and don't use much CPU, so giving each of them a goroutine doesn't stress the CPU but gives us an 'async' way of making REST calls, which improves performance greatly.

After starting the goroutines, the Gatherer starts monitoring the channels; when it receives a result it will:

- Store the received `record`s using the provided `record.Interface`'s `Record` method.
- Store some metadata about the gather-function.
- Collect the errors accordingly. Errors are accumulated over all the gather-functions and returned as one summed-up error.

Each result is stored into record.Item as a Marshalable item, using either the golang JSON marshaller or K8s API serializers. Those have to be explicitly registered in an init func. The record is created in the archive under its Name, specifying the full relative path including folders. The extension for a particular record file is defined by the GetExtension() func; most of them are "json" today, except metrics or id.

The `gatherFunctions` map is where all the gather-functions within the `clusterconfig` package are referenced.
Each has an id (the key in the map); these can be used to execute only a selection of the gather-functions (according to the default config, all gather-functions will be executed).
Furthermore, each gather-function is categorized as either:

- `important`, meaning if that gather-function has an error we will notify `periodic.go` about it, which will handle it accordingly.
- `failable`, meaning if that gather-function has an error we will just log it and add it to our metadata.

This is necessary as we are expanding into gathering data about resources that are not guaranteed to be present on the cluster. By default, if a resource is not present we shouldn't see an error, but it's better to be safe.
## Downloading and exposing Archive Analysis
After the successful upload of an archive, the progress-monitoring task starts. By default, it waits 1m before checking whether the results of the analysis of the archive (done by an external pipeline in cloud.redhat.com) are available. The report contains a LastUpdatedAt timestamp, which is used to verify whether the report has changed its state (for this cluster) since the last time. If there was no
update (yet), it retries the download of the analysis, because the data was uploaded but no analysis has been provided back yet.