| 1 | +# Observability |
| 2 | + |
| 3 | +[](https://github.com/gitpod-com/observability/actions) |
| 4 | +[](https://gitpod.slack.com/archives/C01KGM9D8LE) |
| 5 | +[](https://gitpod.io/#https://github.com/gitpod-com/observability) |
| 6 | + |
| 7 | +A set of Jsonnet files used to deploy customized [monitoring-satellites](#monitoring-satellite) and [monitoring-centrals](#monitoring-central) into different clusters. |
| 8 | + |
| 9 | +## Table of contents |
| 10 | + |
| 11 | +- [Applications](#applications) |
| 12 | + - [Monitoring-satellite](#monitoring-satellite) |
| 13 | + - [Monitoring-Central](#monitoring-central) |
| 14 | +- [Workflows](#workflows) |
| 15 | + - [Development](#development) |
| 16 | + - [CI](#ci) |
| 17 | + - [Deployment](#deployment) |
| 18 | + |
| 19 | +## Applications |
| 20 | + |
| 21 | +### Monitoring-satellite |
| 22 | + |
| 23 | +Monitoring-satellite is an application responsible for collecting observability signals from Kubernetes clusters. Components included in monitoring-satellite: |
| 24 | + |
| 25 | +* [Prometheus-Operator](https://github.com/prometheus-operator/prometheus-operator) |
| 26 | +* [Prometheus](https://github.com/prometheus/prometheus) |
| 27 | +* [Alertmanager](https://github.com/prometheus/alertmanager) |
| 28 | +* [Node-exporter](https://github.com/prometheus/node_exporter) |
| 29 | +* [Kube-State-Metrics](https://github.com/kubernetes/kube-state-metrics) |
| 30 | +* [Grafana](https://github.com/grafana/grafana) |
| 31 | +* Custom ServiceMonitors for [Gitpod](https://github.com/gitpod-io/gitpod)'s components |
| 32 | + |
| 33 | +Monitoring-satellite can be customized by setting up jsonnet external-variables: |
| 34 | + |
| 35 | +* `namespace` - changes the namespace where monitoring-satellite will be installed |
| 36 | +* `cluster_name` - adds an external label named `cluster` to Prometheus. This label is extremely important for differentiating metrics coming from multiple clusters once they are stored in monitoring-central. |
| 37 | +* `remote_write_url` - When set to a non-empty string, Prometheus sends metrics to a metrics backend, e.g. Thanos or Cortex, through Prometheus' Remote Write protocol. |
| 38 | +* `pagerduty_routing_key` - Used to route critical alerts to PagerDuty. |
| 39 | +* `slack_webhook_url_critical` - When set to a non-empty string, Alertmanager is configured to route alerts to Slack. **Careful:** When declaring this variable, you should also declare `slack_webhook_url_warning` and `slack_webhook_url_info`, which route alerts of lower severities to different channels. |
| 40 | +* `dns_name` - When set to a non-empty string, a set of extra resources is created to expose Grafana to the internet while keeping it secure. When defining this variable, be careful to also declare `grafana_ingress_node_port`, `gcp_external_ip_address`, `IAP_client_id` and `IAP_client_secret`. The components included are: |
| 41 | + * Ingress |
| 42 | + * SSL Certificate (requires cert-manager installed in the cluster) |
| 43 | + * Google Cloud Backend Config |
| 44 | + |
| 45 | +#### Monitoring-satellite RoadMap |
| 46 | + |
| 47 | +As you can see, metrics are the only observability signal collected by monitoring-satellite right now. To make it a complete observability signal collector, we'll extend this application to collect: |
| 48 | + |
| 49 | +* `Logs` - With [Promtail](https://grafana.com/docs/loki/latest/clients/promtail/) or [Fluentd](https://www.fluentd.org/) |
| 50 | +* `Traces` - With [Jaeger Agent](https://www.jaegertracing.io/docs/1.22/deployment/) or [OpenTelemetry Collector](https://github.com/open-telemetry/opentelemetry-collector) |
| 51 | +* `Profiles` - With [ConProf](https://github.com/conprof/conprof) |
| 52 | + |
| 53 | +### Monitoring-Central |
| 54 | + |
| 55 | +Monitoring-central is an application responsible for long-term storage of the signals collected by multiple monitoring-satellites. Monitoring-central is the best place to analyze data during incidents or for historical trend analysis. Components included in monitoring-central: |
| 56 | + |
| 57 | +* [Grafana](https://github.com/grafana/grafana) |
| 58 | +* [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics) |
| 59 | + |
| 60 | +Monitoring-central can be customized by setting up jsonnet external-variables: |
| 61 | + |
| 62 | +* `dns_name` - When set to a non-empty string, a set of extra resources is created to expose Grafana to the internet while keeping it secure. When defining this variable, be careful to also declare `grafana_ingress_node_port`, `gcp_external_ip_address`, `IAP_client_id` and `IAP_client_secret`. The components included are: |
| 63 | + * Ingress |
| 64 | + * SSL Certificate (requires cert-manager installed in the cluster) |
| 65 | + * Google Cloud Backend Config |
| 66 | + |
| 67 | +#### Monitoring-central RoadMap |
| 68 | + |
| 69 | +Like monitoring-satellite, monitoring-central only supports metrics right now. To make it a complete storage backend for observability signals, we'll extend this application to store: |
| 70 | + |
| 71 | +* `Logs` - With [Loki](https://github.com/grafana/loki) |
| 72 | +* `Traces` - With [Jaeger](https://github.com/jaegertracing/jaeger) or [Tempo](https://github.com/grafana/tempo) |
| 73 | +* `Profiles` - With [ConProf](https://github.com/conprof/conprof) |
| 74 | + |
| 75 | +> To accelerate the development of monitoring-central, we are strongly considering teaming up with the Red Hat Monitoring Team to use [Observatorium](https://github.com/observatorium/observatorium) as our storage for all observability signals. |
| 76 | + |
| 77 | +## Workflows |
| 78 | + |
| 79 | +### Development |
| 80 | + |
| 81 | +See [docs/code-design](./docs/code-design.md) for details on our folder structure. |
| 82 | + |
| 83 | +During development we generate YAML files and Grafana dashboards based on our jsonnet templates. |
| 84 | + |
| 85 | +**Notice**: These YAML files are only used during development and CI. For development and CI the entrypoints are `monitoring-*/manifests/*.jsonnet`, whereas for ArgoCD the entrypoint is `monitoring-*/main.jsonnet`. |
| 86 | + |
| 87 | +To generate the YAML files and Grafana dashboards, run the command below. |
| 88 | + |
| 89 | +```sh |
| 90 | +make generate |
| 91 | +``` |
| 92 | + |
| 93 | +The generated files are placed in `monitoring-*/manifests`. While working on the jsonnet templates, it can be helpful to inspect the generated YAML to check that everything looks the way you expect. |
| 94 | + |
| 95 | +If you'd like to test Grafana dashboards during development, you can copy the content of the JSON files located at `components/gitpod/mixin/dashboard_out` and import it into Grafana using the import feature. |
| 96 | + |
| 97 | + |
| 98 | + |
| 99 | + |
| 100 | +To make sure that all our jsonnet templates compile and are correctly formatted, run: |
| 101 | + |
| 102 | +```sh |
| 103 | +make fmt |
| 104 | +``` |
| 105 | + |
| 106 | +If you are changing Prometheus rules, you can additionally run: |
| 107 | + |
| 108 | +```sh |
| 109 | +make promtool-lint |
| 110 | +``` |
| 111 | + |
| 112 | +### CI |
| 113 | + |
| 114 | +We use GitHub Actions to validate PRs. |
| 115 | + |
| 116 | +### Deployment |
| 117 | + |
| 118 | +To roll out changes to the monitoring-satellites and monitoring-centrals spread across our clusters, simply merge a PR to the `main` branch and ArgoCD will automatically synchronize all deployed applications. |
| 119 | + |
| 120 | +If you want to verify that ArgoCD has applied your changes, you can go to [argo-cd.gitpod-io-dev.com](https://argo-cd.gitpod-io-dev.com/) and use the label filter `application=monitoring-satelite` to see the status of all the satellites, or `application=monitoring-central` to see the monitoring-centrals. |
| 121 | + |
| 122 | +The ArgoCD applications are configured in [gitpod-com/gitpod](https://github.com/gitpod-com/gitpod), which is also responsible for setting all the appropriate external variables. |