Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support enabling the query_log_file config for Prometheus #1373

Merged
merged 3 commits into from
Nov 24, 2021

Conversation

philipgough
Copy link
Contributor

@philipgough philipgough commented Sep 10, 2021

Enables setting https://prometheus.io/docs/guides/query-log/#enable-the-query-log on/off via the ConfigMap for both platform and user workload monitoring.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 10, 2021

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2021
@openshift-ci openshift-ci bot requested review from prashbnair and sthaha September 10, 2021 14:09
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 10, 2021
@philipgough philipgough marked this pull request as ready for review September 10, 2021 14:13
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2021
@philipgough philipgough changed the title Mon 1787 Support enabling the query_log_file config for Prometheus Sep 10, 2021
@@ -794,6 +800,10 @@ func TestUserWorkloadMonitorPrometheusK8Config(t *testing.T) {
name: "assert remote write url value in set in CR",
f: assertRemoteWriteWasSet(f.UserWorkloadMonitoringNs, crName, "https://test.remotewrite.com/api/write"),
},
{
name: "assert query log file value is set and correct",
Copy link
Contributor

@sthaha sthaha Sep 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have a test that validates that after query_log_file is set, and then reset, the queryLogFile becomes "" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally I don' think so, because we don't do it for any other config option and not entirely sure we should treat this any differently.

e2e tests are expensive time wise, and I feel we already rely on them more heavily than we should. Testing the happy path in e2e tests is sufficient in my mind unless its critical we cover other cases.

Having said that, lets leave it open and see what others think.

Copy link
Contributor

@sthaha sthaha Sep 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we require a e2e test for this ? Can't this be done in the manifest_test.go itself or is it already a path that is tested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already a path that is tested. The pattern as of now in regards to config is that we test the happy path as is the case with this test.

I did already add a unit test in manifest_test.go for same.

Copy link
Contributor

@sthaha sthaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
some minor comments that you may want to address

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 13, 2021
@sthaha
Copy link
Contributor

sthaha commented Sep 13, 2021

/retest

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2021
@openshift-ci openshift-ci bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Sep 14, 2021
@philipgough
Copy link
Contributor Author

/retest

@philipgough
Copy link
Contributor Author

/skip

@philipgough
Copy link
Contributor Author

/retest

1 similar comment
@philipgough
Copy link
Contributor Author

/retest

@sthaha
Copy link
Contributor

sthaha commented Sep 16, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 16, 2021
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 16, 2021
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 17, 2021
@philipgough
Copy link
Contributor Author

@sferich888 another nudge on this one for px-approval if we could?

Prometheus and via prometheus-operator, provide the ability to
log all PromQL queries to a file.

This change enables CMO to support passing through that feature
to Prometheus CR and the Prometheus pod for both platform
monitoring and UWM.

https://prometheus.io/docs/guides/query-log/
We want to allow cluster admins to enable the query log,
however it should be noted that this is a temporary solution
for debugging situations ad-hoc and should not be enabled
permanently.

https://prometheus.io/docs/guides/query-log/
@openshift-ci openshift-ci bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 17, 2021
@philipgough
Copy link
Contributor Author

/retest

@simonpasquier
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 17, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 17, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: PhilipGough, prashbnair, simonpasquier, sthaha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [PhilipGough,prashbnair,simonpasquier,sthaha]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@philipgough
Copy link
Contributor Author

/retest

1 similar comment
@philipgough
Copy link
Contributor Author

/retest

@sferich888
Copy link

/label px-approved

@openshift-ci openshift-ci bot added the px-approved Signifies that Product Support has signed off on this PR label Nov 18, 2021
@juzhao
Copy link
Contributor

juzhao commented Nov 24, 2021

@philipgough
I see the queryLogFile size would be increased as time goes by, do we have limit for it?

sh-4.4$ du -h /tmp/test-cluster.log
6.6M /tmp/test-cluster.log

sh-4.4$ du -h /tmp/test-cluster.log
11M	/tmp/test-cluster.log

@juzhao
Copy link
Contributor

juzhao commented Nov 24, 2021

tested with the PR, enabled UWM and set queryLogFile for openshift-monitoring prometheus and UWM prometheus, we can see the query logs for openeshift-monitoring prometheus, but can not find from UWM prometheus

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 3h
      logLevel: debug
      queryLogFile: /tmp/test-cluster.log
      volumeClaimTemplate:
        spec:
          volumeMode: Filesystem
          resources:
            requests:
              storage: 10Gi

and

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      logLevel: warn
      queryLogFile: /tmp/test-uwm.log
      volumeClaimTemplate:
        spec:
          volumeMode: Filesystem
          resources:
           requests:
              storage: 10Gi

create prometheusrules under user namespace

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert
  namespace: ns1
spec:
  groups:
  - name: example
    rules:
    - alert: HighErrors
      expr: (sum(rate(http_requests_total{code!~"2.."}[5m])) / sum(rate(http_requests_total[5m])))
        * 100 > 10
    - alert: TestAlert
      annotations:
        message: This is an alert meant to ensure that the entire alerting pipeline
          is functional.
      expr: vector(1)
      labels:
        severity: none
# oc -n openshift-monitoring rsh -c prometheus prometheus-k8s-0
sh-4.4$ cat /tmp/test-cluster.log | head -n 1
{"params":{"end":"2021-11-24T03:51:39.491Z","query":"label_replace(sum(rate(apiserver_request_total{code=~\"5..\",job=\"apiserver\",verb=~\"LIST|GET\"}[2h])) / scalar(sum(rate(apiserver_request_total{job=\"apiserver\",verb=~\"LIST|GET\"}[2h]))), \"type\", \"error\", \"_none_\", \"\") ..

# oc -n openshift-user-workload-monitoring rsh -c prometheus prometheus-user-workload-0
sh-4.4$ cat /tmp/test-uwm.log
empty result

@philipgough
Copy link
Contributor Author

@juzhao no, we have no limit for it. We will explicitly call this out in the docs and this should be enabled only as a temporary troubleshooting solution -> https://github.com/openshift/openshift-docs/pull/36495/files#diff-c934df74a3d8f522bb7fae910d0e0b4cb8f5a92facbe61d093a16297e8c9803dR12

@philipgough
Copy link
Contributor Author

@juzhao - following your exact config I am not seeing the same behaviour when this PR is deployed:

oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml

produces (shortened) output showing the configmap has set the correct config as per https://prometheus.io/docs/guides/query-log/#logging-all-the-queries-to-a-file

global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: openshift-user-workload-monitoring/user-workload
    prometheus_replica: prometheus-user-workload-0
  query_log_file: /tmp/test-uwm.log
rule_files:
- /etc/prometheus/rules/prometheus-user-workload-rulefiles-0/*.yaml
scrape_configs: []

Run the following query:

 oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- curl 'http://localhost:9090/api/v1/query?query=up'

Followed by:

oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- cat /tmp/test-uwm.log | jq

Results in:

{
  "httpRequest": {
    "clientIP": "127.0.0.1",
    "method": "GET",
    "path": "/api/v1/query"
  },
  "params": {
    "end": "2021-11-24T11:32:15.595Z",
    "query": "up",
    "start": "2021-11-24T11:32:15.595Z",
    "step": 0
  },
  "stats": {
    "timings": {
      "evalTotalTime": 3.5229e-05,
      "resultSortTime": 0,
      "queryPreparationTime": 2.2002e-05,
      "innerEvalTime": 7.712e-06,
      "execQueueTime": 7.8172e-05,
      "execTotalTime": 0.000137558
    }
  },
  "ts": "2021-11-24T11:32:15.595Z"

@philipgough
Copy link
Contributor Author

Screenshot 2021-11-24 at 11 25 27

@juzhao
Copy link
Contributor

juzhao commented Nov 24, 2021

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Nov 24, 2021
@openshift-merge-robot openshift-merge-robot merged commit 5efe381 into openshift:master Nov 24, 2021
@philipgough philipgough deleted the mon-1787 branch November 24, 2021 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. docs-approved Signifies that Docs has signed off on this PR lgtm Indicates that a PR is ready to be merged. px-approved Signifies that Product Support has signed off on this PR qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants