Skip to content

feat(metrics): Emit metrics for CatalogSource state #2152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

anik120
Copy link
Contributor

@anik120 anik120 commented May 10, 2021

Description of the change:

This PR introduces a prometheus gauage vector catalogsource_ready
that is emitted by each CatalogSource to indicate the connectivity.State
of the CatalogSource object. A value of 1 for the vector indicates that
the CatalogSource object is in a READY state, while a value of 0
indicates that the CatalogSource object is in one of IDLE/CONNECTING/
TRANSIENT_FAILURE/SHUTDOWN state. If/When the CatalogSource object
eventually reaches a READY state, the value for the vector is set
to 0.

Motivation for the change:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

@openshift-ci openshift-ci bot requested review from ecordell and timflannagan May 10, 2021 18:42
@anik120 anik120 force-pushed the catsrc-metrics branch 2 times, most recently from b2303bb to 2a1ced4 Compare May 11, 2021 14:09
This PR introduces a prometheus gauage vector `catalogsource_ready`
that is emitted by each CatalogSource to indicate the connectivity.State
of the CatalogSource object. A value of 1 for the vector indicates that
the CatalogSource object is in a READY state, while a value of 0
indicates that the CatalogSource object is in one of IDLE/CONNECTING/
TRANSIENT_FAILURE/SHUTDOWN state. If/When the CatalogSource object
eventually reaches a READY state, the value for the vector is set
to 0.

Signed-off-by: Anik Bhattacharjee <[email protected]>
@anik120
Copy link
Contributor Author

anik120 commented May 11, 2021

/hold waiting for ack from monitoring team

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 11, 2021
@jan--f
Copy link

jan--f commented May 12, 2021

I think this looks good. A second set of eyes would be good though cc @paulfantom

Signed-off-by: Anik Bhattacharjee <[email protected]>
@anik120
Copy link
Contributor Author

anik120 commented May 13, 2021

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 13, 2021
@benluddy
Copy link
Contributor

/approve

@openshift-ci
Copy link

openshift-ci bot commented May 13, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anik120, benluddy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2021
@joelanford
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2021
@openshift-merge-robot openshift-merge-robot merged commit 4168648 into operator-framework:master May 16, 2021
anik120 added a commit to anik120/operator-marketplace that referenced this pull request May 19, 2021
With the [introduction of the `catalogsource_ready` metric in olm](operator-framework/operator-lifecycle-manager#2152), alerts can be
fired for the default CatalogSources marketplace deploys if they are in a non-ready state.
This PR introduces prometheus alerts for any default CatalogSources that have
been in a Non-Ready state for more than 10 mins.
timflannagan pushed a commit to ecordell/operator-marketplace that referenced this pull request Jun 9, 2021
With the [introduction of the `catalogsource_ready` metric in olm](operator-framework/operator-lifecycle-manager#2152), alerts can be
fired for the default CatalogSources marketplace deploys if they are in a non-ready state.
This PR introduces prometheus alerts for any default CatalogSources that have
been in a Non-Ready state for more than 10 mins.
))
Consistently(func() []Metric {
return getMetricsFromPod(c, getPodWithLabel(c, "app=catalog-operator"), "8081")
}, "3m").Should(And(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anik120 Any idea on why "3m" was chosen as the consistently poll here? I was poking around this package trying to rundown some flakes and noticed specific test spec takes roughly 75-80% of the total runtime for this metrics e2e package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timflannagan I can't seem to remember why I chose 3m, but I think I wanted to prove out that the metrics with a particular value is emitted for a considerable amount of time. Looking back at it, I don't think Eventually with a pollAfter duration set is a bad idea either. That'll reduce the run time since I'm assuming the process will sleep till it's time to poll.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there are rules that depend on this metric being having a consistent value over a period of time, eg https://github.com/operator-framework/operator-marketplace/blob/master/manifests/12_prometheus_rule.yaml#L23-L24. We could reduce the value to 2m too if that'll help bring that percentage down to ~50%

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened a PR that's a middle ground solution (I think) #2739

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants