Support enabling the `query_log_file` config for Prometheus #1373

philipgough · 2021-09-10T14:07:35Z

Enables setting https://prometheus.io/docs/guides/query-log/#enable-the-query-log on/off via the ConfigMap for both platform and user workload monitoring.

I added CHANGELOG entry for this change.
No user facing changes, so no entry in CHANGELOG was needed.

openshift-ci · 2021-09-10T14:07:37Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

examples/user-workload/README.md

pkg/manifests/manifests.go

pkg/manifests/manifests_test.go

test/e2e/config_test.go

sthaha · 2021-09-13T01:44:13Z

test/e2e/config_test.go

@@ -794,6 +800,10 @@ func TestUserWorkloadMonitorPrometheusK8Config(t *testing.T) {
 			name: "assert remote write url value in set in CR",
 			f:    assertRemoteWriteWasSet(f.UserWorkloadMonitoringNs, crName, "https://test.remotewrite.com/api/write"),
 		},
+		{
+			name: "assert query log file value is set and correct",


Should we also have a test that validates that after query_log_file is set, and then reset, the queryLogFile becomes "" ?

Personally, I don't think so, because we don't do it for any other config option, and I'm not entirely sure we should treat this one any differently.

e2e tests are expensive time-wise, and I feel we already rely on them more heavily than we should. Testing the happy path in e2e tests is sufficient in my mind unless it's critical that we cover other cases.

Having said that, let's leave it open and see what others think.

Do we require an e2e test for this? Can't this be done in manifests_test.go itself, or is it already a path that is tested?

It's already a path that is tested. The current pattern for config options is to test only the happy path in e2e, as this test does.

I have already added a unit test in manifests_test.go for the same.
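
The set-then-reset case raised above can be reasoned about without a cluster. The following is a minimal sketch, not the code from this PR: `PrometheusConfig`, `PrometheusSpec`, and `applyQueryLogFile` are hypothetical stand-ins for the ConfigMap-derived settings, the Prometheus CR spec, and the pass-through logic in pkg/manifests. It assumes the value is copied unconditionally, in which case clearing the option in the config also clears it in the spec.

```go
package main

import "fmt"

// PrometheusConfig is a hypothetical stand-in for the settings
// parsed from the monitoring ConfigMap.
type PrometheusConfig struct {
	QueryLogFile string
}

// PrometheusSpec is a hypothetical stand-in for the Prometheus CR spec.
type PrometheusSpec struct {
	QueryLogFile string
}

// applyQueryLogFile copies the configured value unconditionally, so an
// unset option in the config results in an empty value in the spec.
func applyQueryLogFile(cfg PrometheusConfig, spec *PrometheusSpec) {
	spec.QueryLogFile = cfg.QueryLogFile
}

func main() {
	spec := &PrometheusSpec{}

	// Set the option, as the happy-path test does.
	applyQueryLogFile(PrometheusConfig{QueryLogFile: "/tmp/test.log"}, spec)
	fmt.Printf("after set: %q\n", spec.QueryLogFile)

	// Reset the option by applying a config that omits it.
	applyQueryLogFile(PrometheusConfig{}, spec)
	fmt.Printf("after reset: %q\n", spec.QueryLogFile)
}
```

If the real implementation only assigned the field when the config value is non-empty, the reset case would behave differently, which is the scenario the question is probing.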

sthaha

/lgtm
some minor comments that you may want to address

sthaha · 2021-09-13T01:51:19Z

/retest

philipgough · 2021-09-14T19:03:55Z

/retest

philipgough · 2021-09-14T20:13:24Z

/skip

philipgough · 2021-09-15T09:05:38Z

/retest

philipgough · 2021-09-15T13:49:39Z

/retest

sthaha · 2021-09-16T02:18:11Z

/lgtm

openshift-bot · 2021-09-16T02:20:41Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-09-16T04:59:41Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

simonpasquier

/lgtm

openshift-bot · 2021-11-17T06:03:39Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

philipgough · 2021-11-17T09:04:35Z

@sferich888 another nudge on this one for px-approval if we could?

Prometheus, via prometheus-operator, provides the ability to log all PromQL queries to a file. This change enables CMO to pass that feature through to the Prometheus CR and the Prometheus pod for both platform monitoring and UWM. https://prometheus.io/docs/guides/query-log/

We want to allow cluster admins to enable the query log. However, it should be noted that this is a temporary, ad hoc solution for debugging and should not be enabled permanently. https://prometheus.io/docs/guides/query-log/

philipgough · 2021-11-17T15:09:12Z

/retest

simonpasquier · 2021-11-17T15:19:06Z

/lgtm

openshift-ci · 2021-11-17T15:21:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: PhilipGough, prashbnair, simonpasquier, sthaha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [PhilipGough,prashbnair,simonpasquier,sthaha]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2021-11-17T15:41:40Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

philipgough · 2021-11-17T16:25:17Z

/retest

philipgough · 2021-11-17T17:17:18Z

/retest

sferich888 · 2021-11-18T15:24:15Z

/label px-approved

juzhao · 2021-11-24T04:13:21Z

@philipgough
I see that the queryLogFile keeps growing over time. Do we have a limit for it?

sh-4.4$ du -h /tmp/test-cluster.log
6.6M /tmp/test-cluster.log

sh-4.4$ du -h /tmp/test-cluster.log
11M	/tmp/test-cluster.log

juzhao · 2021-11-24T04:42:24Z

Tested with the PR: enabled UWM and set queryLogFile for both the openshift-monitoring Prometheus and the UWM Prometheus. Query logs appear for the openshift-monitoring Prometheus, but none can be found for the UWM Prometheus.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 3h
      logLevel: debug
      queryLogFile: /tmp/test-cluster.log
      volumeClaimTemplate:
        spec:
          volumeMode: Filesystem
          resources:
            requests:
              storage: 10Gi

and

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      logLevel: warn
      queryLogFile: /tmp/test-uwm.log
      volumeClaimTemplate:
        spec:
          volumeMode: Filesystem
          resources:
            requests:
              storage: 10Gi

Create a PrometheusRule in the user namespace:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert
  namespace: ns1
spec:
  groups:
  - name: example
    rules:
    - alert: HighErrors
      expr: (sum(rate(http_requests_total{code!~"2.."}[5m])) / sum(rate(http_requests_total[5m])))
        * 100 > 10
    - alert: TestAlert
      annotations:
        message: This is an alert meant to ensure that the entire alerting pipeline
          is functional.
      expr: vector(1)
      labels:
        severity: none

# oc -n openshift-monitoring rsh -c prometheus prometheus-k8s-0
sh-4.4$ cat /tmp/test-cluster.log | head -n 1
{"params":{"end":"2021-11-24T03:51:39.491Z","query":"label_replace(sum(rate(apiserver_request_total{code=~\"5..\",job=\"apiserver\",verb=~\"LIST|GET\"}[2h])) / scalar(sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\"}[2h]))), \"type\", \"error\", \"_none_\", \"\") ..

# oc -n openshift-user-workload-monitoring rsh -c prometheus prometheus-user-workload-0
sh-4.4$ cat /tmp/test-uwm.log
empty result

philipgough · 2021-11-24T10:09:35Z

@juzhao no, we have no limit for it. We will explicitly call this out in the docs and this should be enabled only as a temporary troubleshooting solution -> https://github.com/openshift/openshift-docs/pull/36495/files#diff-c934df74a3d8f522bb7fae910d0e0b4cb8f5a92facbe61d093a16297e8c9803dR12

philipgough · 2021-11-24T11:38:25Z

@juzhao - following your exact config I am not seeing the same behaviour when this PR is deployed:

oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml

produces (shortened) output showing the configmap has set the correct config as per https://prometheus.io/docs/guides/query-log/#logging-all-the-queries-to-a-file

global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: openshift-user-workload-monitoring/user-workload
    prometheus_replica: prometheus-user-workload-0
  query_log_file: /tmp/test-uwm.log
rule_files:
- /etc/prometheus/rules/prometheus-user-workload-rulefiles-0/*.yaml
scrape_configs: []

Run the following query:

oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- curl 'http://localhost:9090/api/v1/query?query=up'

Followed by:

oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /tmp/test-uwm.log | jq

Results in:

{
  "httpRequest": {
    "clientIP": "127.0.0.1",
    "method": "GET",
    "path": "/api/v1/query"
  },
  "params": {
    "end": "2021-11-24T11:32:15.595Z",
    "query": "up",
    "start": "2021-11-24T11:32:15.595Z",
    "step": 0
  },
  "stats": {
    "timings": {
      "evalTotalTime": 3.5229e-05,
      "resultSortTime": 0,
      "queryPreparationTime": 2.2002e-05,
      "innerEvalTime": 7.712e-06,
      "execQueueTime": 7.8172e-05,
      "execTotalTime": 0.000137558
    }
  },
  "ts": "2021-11-24T11:32:15.595Z"
}
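
For anyone who wants to inspect such entries programmatically rather than with jq, here is a minimal sketch in Go. It assumes the file holds one JSON object per line, as the `head -n 1` output earlier in the thread suggests. `QueryLogEntry` and `parseQueryLog` are hypothetical names that mirror only the fields visible in the sample above; they are not types or functions exported by Prometheus.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// QueryLogEntry mirrors only the fields visible in the sample entry.
// Entries without an httpRequest section leave that part at its zero value.
type QueryLogEntry struct {
	HTTPRequest struct {
		ClientIP string `json:"clientIP"`
		Method   string `json:"method"`
		Path     string `json:"path"`
	} `json:"httpRequest"`
	Params struct {
		Query string  `json:"query"`
		Start string  `json:"start"`
		End   string  `json:"end"`
		Step  float64 `json:"step"`
	} `json:"params"`
	Stats struct {
		Timings map[string]float64 `json:"timings"`
	} `json:"stats"`
	TS string `json:"ts"`
}

// parseQueryLog decodes newline-delimited JSON entries, skipping blank lines.
func parseQueryLog(data string) ([]QueryLogEntry, error) {
	var entries []QueryLogEntry
	sc := bufio.NewScanner(strings.NewReader(data))
	// Long PromQL expressions can exceed the scanner's default 64 KiB line limit.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var e QueryLogEntry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			return nil, fmt.Errorf("decode query log line: %w", err)
		}
		entries = append(entries, e)
	}
	return entries, sc.Err()
}

// sampleLine is the entry from the comment above, with the timings abbreviated.
const sampleLine = `{"httpRequest":{"clientIP":"127.0.0.1","method":"GET","path":"/api/v1/query"},"params":{"end":"2021-11-24T11:32:15.595Z","query":"up","start":"2021-11-24T11:32:15.595Z","step":0},"stats":{"timings":{"evalTotalTime":3.5229e-05,"execTotalTime":0.000137558}},"ts":"2021-11-24T11:32:15.595Z"}`

func main() {
	entries, err := parseQueryLog(sampleLine + "\n")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s %s query=%q execTotalTime=%gs\n",
			e.HTTPRequest.Method, e.HTTPRequest.Path,
			e.Params.Query, e.Stats.Timings["execTotalTime"])
	}
}
```

Keeping the timings in a map rather than named fields avoids hard-coding the set of timing keys, which this sketch does not assume to be fixed.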

philipgough · 2021-11-24T11:48:07Z

juzhao · 2021-11-24T12:03:34Z

/label qe-approved

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2021

openshift-ci bot requested review from prashbnair and sthaha September 10, 2021 14:09

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 10, 2021

philipgough force-pushed the mon-1787 branch from aa7c7f5 to 4a3d124 Compare September 10, 2021 14:11

philipgough marked this pull request as ready for review September 10, 2021 14:13

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2021

philipgough changed the title ~~Mon 1787~~ Support enabling the query_log_file config for Prometheus Sep 10, 2021