feat(metrics): Emit metrics for CatalogSource state #2152
This PR introduces a Prometheus gauge vector `catalogsource_ready` that is emitted for each CatalogSource to indicate the connectivity.State of the CatalogSource object. A value of 1 indicates that the CatalogSource object is in a READY state, while a value of 0 indicates that it is in one of the IDLE/CONNECTING/TRANSIENT_FAILURE/SHUTDOWN states. If/when the CatalogSource object eventually reaches a READY state, the value is set to 1. Signed-off-by: Anik Bhattacharjee <[email protected]>
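The state-to-value mapping described above can be sketched in a few lines of Go. This is only an illustration using the standard library: the `State` type and the `readyGaugeValue` helper are hypothetical names that do not come from the PR, and the sketch does not register or emit a real Prometheus metric.

```go
package main

import "fmt"

// State is a local stand-in for the connectivity states named in the PR
// description. It is an assumption for this sketch, not the gRPC type.
type State string

const (
	Idle             State = "IDLE"
	Connecting       State = "CONNECTING"
	Ready            State = "READY"
	TransientFailure State = "TRANSIENT_FAILURE"
	Shutdown         State = "SHUTDOWN"
)

// readyGaugeValue (hypothetical helper) returns the value that a
// catalogsource_ready-style gauge would carry for a given state:
// 1 for READY, 0 for every other state.
func readyGaugeValue(s State) float64 {
	if s == Ready {
		return 1
	}
	return 0
}

func main() {
	// Print the gauge value for each state listed in the description.
	for _, s := range []State{Idle, Connecting, Ready, TransientFailure, Shutdown} {
		fmt.Printf("%s -> %v\n", s, readyGaugeValue(s))
	}
}
```

Keeping the gauge binary, rather than encoding each state as a distinct number, makes it straightforward to alert on a value of 0 persisting over time.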
/hold waiting for ack from monitoring team
I think this looks good. A second set of eyes would be good though cc @paulfantom
/hold cancel
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: anik120, benluddy
/lgtm
With the [introduction of the `catalogsource_ready` metric in olm](operator-framework/operator-lifecycle-manager#2152), alerts can be fired for the default CatalogSources that marketplace deploys if they are in a non-ready state. This PR introduces Prometheus alerts for any default CatalogSources that have been in a non-ready state for more than 10 minutes.
))
Consistently(func() []Metric {
	return getMetricsFromPod(c, getPodWithLabel(c, "app=catalog-operator"), "8081")
}, "3m").Should(And(
@anik120 Any idea why "3m" was chosen as the `Consistently` duration here? I was poking around this package trying to run down some flakes and noticed that this specific test spec takes roughly 75-80% of the total runtime for this metrics e2e package.
@timflannagan I can't seem to remember why I chose 3m, but I think I wanted to prove that the metric is emitted with a particular value for a considerable amount of time. Looking back at it, I don't think Eventually with a pollAfter duration set is a bad idea either. That'll reduce the run time, since I'm assuming the process will sleep until it's time to poll.
Actually, there are rules that depend on this metric having a consistent value over a period of time, e.g. https://github.com/operator-framework/operator-marketplace/blob/master/manifests/12_prometheus_rule.yaml#L23-L24. We could reduce the value to 2m too if that'll help bring that percentage down to ~50%.
Opened a PR that's a middle ground solution (I think) #2739
Description of the change:
This PR introduces a Prometheus gauge vector `catalogsource_ready` that is emitted for each CatalogSource to indicate the connectivity.State of the CatalogSource object. A value of 1 indicates that the CatalogSource object is in a READY state, while a value of 0 indicates that it is in one of the IDLE/CONNECTING/TRANSIENT_FAILURE/SHUTDOWN states. If/when the CatalogSource object eventually reaches a READY state, the value is set to 1.
Motivation for the change: