Add metrics for CEL for admission control KEP #112994
Curious to see the buckets.
/triage accepted
/lgtm
Buckets: []float64{0.001, 0.01, 0.1, 1.0},
StabilityLevel: metrics.ALPHA,
},
[]string{"policy", "policy_binding", "validation_expression", "enforcement_action", "params", "state"},
what's the cardinality on "params"?
It should be equal to the number of parameter resources - my understanding is that this number should be no higher than the cardinality of policy_binding
(@jpbetz - pinging you here in case I'm off on this one).
Correct, less than or equal to binding cardinality
most of these labels seem like they would have concerningly high cardinality. "policy", "policy_binding", "params" are all names of instances of objects? so the cardinality is limited only by the number of instances of those types that someone creates in a cluster? what is validation_expression: the string expression, the name of the validation rule that failed, the index of the rule that failed, or something else?
Right, those are all names (validation_expression is the name of the expression being checked). Do you think some of the labels should maybe be removed? I feel like the binding (and maybe the params?) label(s) could be removed without too much impact on debugging.
My understanding was that metric labels tied to object names were undesirable since they have ~unbounded cardinality.
If we were going to have a name as a label, the binding name identifies the policy/params tuple (at least until singleton policies that are effective without a binding get implemented)
If the label cardinality is truly capped to 100s per cluster, then even if the total cardinality is unbounded, it should not pose a terrible problem. We have far worse issues with the resource label for the apiserver request metrics. It basically just means we're not going to be able to do meaningful aggregation across multiple clusters over these dimensions. If the cardinality for a single cluster is actually unbounded, then I would highly recommend against these labels.
I'd consider this equivalent to the resource label in API server request metrics... every new CRD spawns a new resource value, and every new policy or policy_binding instance (depending on which of these labels was selected) would spawn a new label value
It's probably fine then, moderately discouraged but if it would help debugging then I'd be okay with it.
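As a rough way to reason about the per-cluster cost discussed above, the sketch below estimates a worst-case time series count for a histogram with the four quoted buckets. It is only an illustration: the function names and every input number are hypothetical, and it assumes (following the thread) that a binding determines its policy and params, so those two labels add no extra combinations. The one fixed fact it relies on is that a Prometheus histogram exposes one series per explicit bucket, one implicit +Inf bucket, plus _sum and _count.

```go
package main

import "fmt"

// seriesPerCombination is the number of time series a Prometheus histogram
// exposes per label combination: one per explicit bucket, one +Inf bucket,
// plus the _sum and _count series.
func seriesPerCombination(explicitBuckets int) int {
	return explicitBuckets + 1 + 2
}

// estimateSeries multiplies the number of label combinations by the series
// per combination. Assumption: each binding fixes its policy and params, so
// only bindings, expressions, actions, and states multiply.
func estimateSeries(bindings, expressionsPerPolicy, actions, states, explicitBuckets int) int {
	combos := bindings * expressionsPerPolicy * actions * states
	return combos * seriesPerCombination(explicitBuckets)
}

func main() {
	// Hypothetical cluster: 200 bindings, 5 expressions per policy,
	// 1 enforcement action, 2 states, and the 4 buckets quoted in the diff.
	fmt.Println(estimateSeries(200, 5, 1, 2, 4)) // prints 14000
}
```

With cardinality capped in the hundreds of bindings, the estimate stays in the tens of thousands of series; it grows linearly with each additional label value, which is why an unbounded label is the case the reviewers warn against.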
42c28db to 0a65adf
/lgtm (From sig instrumentation)
/retest
4237698 to ac324cb
/lgtm
/assign @lavalamp Would you mind taking a look when you have time? Thank you :)
Help: "Validation admission policy check total, labeled by policy and param resource, and further identified by binding, validation expression, enforcement action taken, and state.",
StabilityLevel: metrics.ALPHA,
},
[]string{"policy", "policy_binding", "validation_expression", "enforcement_action", "params", "state"},
It looks like this is incremented on every admission, even ones that are allowed. When we get to policy evaluations that can fail but still permit the request (e.g. audit or warn or fail open), how will that be reflected in this metric?
Right, it's incremented on every admission - for audit and warn support, I think we should be able to extend the enforcement_action label with those values.
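To make the proposal concrete, here is a toy sketch of a counter keyed by the six labels, where new enforcement_action values simply become new label values on the same metric. It uses only the standard library rather than the real metrics package, and the type names, the "deny" value, the "active" state, and the object names are all hypothetical; only "audit" and "warn" come from the discussion.

```go
package main

import "fmt"

// checkLabels is a hypothetical stand-in for the counter's label set.
type checkLabels struct {
	Policy, Binding, Expression, EnforcementAction, Params, State string
}

// checkCounter is a toy counter vector: one integer per label combination.
type checkCounter map[checkLabels]int

// inc records one policy check for the given label combination.
func (c checkCounter) inc(l checkLabels) { c[l]++ }

// totalByAction sums the counts across all other labels for one action.
func (c checkCounter) totalByAction(action string) int {
	total := 0
	for l, n := range c {
		if l.EnforcementAction == action {
			total += n
		}
	}
	return total
}

func main() {
	c := checkCounter{}
	base := checkLabels{Policy: "p", Binding: "b", Expression: "e", Params: "cfg", State: "active"}

	deny, warn := base, base
	deny.EnforcementAction = "deny" // assumed name for the existing action
	warn.EnforcementAction = "warn" // value proposed in the thread

	c.inc(deny)
	c.inc(warn)
	c.inc(warn)

	fmt.Println(c.totalByAction("deny"), c.totalByAction("warn"), c.totalByAction("audit")) // prints 1 2 0
}
```

Because the action is a label value rather than a separate metric, a check that fails but still admits the request would be counted under its own action and could be summed or filtered independently of denials.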
sorry to keep asking questions, just trying to understand how these fit with the shape we expect validation admission evaluations to take
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: DangerOnTheRanger, jpbetz, liggitt, logicalhan
/retest
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds metrics, as a part of KEP-3488. An additional PR will be needed to integrate the main KEP implementation with metrics; this PR only introduces the metrics themselves.
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: