Add bodysize limit for metric scraping. #1467
cb673d4
to
e65e3f3
Compare
e65e3f3
to
3db7c73
Compare
/retest-required |
2f85824
to
df4fc86
Compare
/test e2e-agnostic-upgrade |
df4fc86
to
90bf787
Compare
90bf787
to
f65074d
Compare
Nice, this looks good to me. Is the scenario of hitting a scrape limit already covered by an existing alert? I think it should definitely be alerted upon; otherwise users might be missing relevant metrics without being aware of it.
f65074d
to
b3aa278
Compare
Thanks for pointing out the alert. I almost forgot it 😀
@@ -326,6 +326,26 @@ local patchedRules = [ | |||
}, | |||
]; | |||
|
|||
local addedRules = [ |
Can this be defined in upstream prometheus mixin?
Yes, I will submit a change to the upstream Prometheus mixin later. Once it lands upstream, we can remove it from CMO.
PR created in Prometheus to add this alert to mixins: prometheus/prometheus#9873
Hope it can be merged someday :)
Can you add a TODO so we don't forget?
Of course. When the PR is merged, we can safely remove this alert patch.
Let's remove the patched alert and rely on upstream.
4ec552d
to
cd27c35
Compare
cd27c35
to
bd06390
Compare
A new version has been pushed :)
edf383b
to
6ae1f21
Compare
pkg/manifests/config.go
Outdated
@@ -315,19 +325,18 @@ func (cfg *TelemeterClientConfig) IsEnabled() bool { | |||
return true | |||
} | |||
|
|||
func NewConfig(content io.Reader) (*Config, error) { | |||
func NewConfig(content io.Reader) (res *Config, err error) { |
I'm not a big fan of named return values, as I find that they make the code less readable. Maybe this is only me, but I also don't feel that this change is required here.
pkg/manifests/config.go
Outdated
} | ||
|
||
func (c *Config) LoadEnforcedBodySizeLimit(pcr PodCapacityReader, ctx context.Context) error { | ||
if c.ClusterMonitoringConfiguration.PrometheusK8sConfig.EnforcedBodySizeLimit == "automatic" { |
Maybe define a const for "automatic".
pkg/manifests/config.go
Outdated
TelemetryMatches []string `json:"-"` | ||
AlertmanagerConfigs []AdditionalAlertmanagerConfig `json:"additionalAlertmanagerConfigs"` | ||
QueryLogFile string `json:"queryLogFile"` | ||
EnforcedBodySizeLimit string `json:"enforcedBodySizeLimit,omitempty"` |
Can you add a comment about how to use the parameter?
- an empty value means no enforcement
- "automatic" means that CMO picks up a value based on the cluster capacity
- A fixed size can be defined too.
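A rough sketch of what such a field comment could look like, together with how the three cases decode. The struct here is a trimmed-down stand-in (`prometheusK8sConfig` and `parse` are hypothetical names for illustration, and `"16MB"` is just an example of a fixed size), not the actual CMO type:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// prometheusK8sConfig is a hypothetical, trimmed-down stand-in for the
// config struct shown in the diff.
type prometheusK8sConfig struct {
	// EnforcedBodySizeLimit caps the size of a scrape response body.
	//   - "" (empty): no limit is enforced.
	//   - "automatic": CMO picks a value based on the cluster capacity.
	//   - any other value: a fixed size in the Prometheus size format.
	EnforcedBodySizeLimit string `json:"enforcedBodySizeLimit,omitempty"`
}

// parse decodes a JSON snippet into the stand-in config.
func parse(raw string) (prometheusK8sConfig, error) {
	var cfg prometheusK8sConfig
	err := json.Unmarshal([]byte(raw), &cfg)
	return cfg, err
}

func main() {
	inputs := []string{
		`{}`,
		`{"enforcedBodySizeLimit":"automatic"}`,
		`{"enforcedBodySizeLimit":"16MB"}`,
	}
	for _, raw := range inputs {
		cfg, err := parse(raw)
		fmt.Printf("limit=%q err=%v\n", cfg.EnforcedBodySizeLimit, err)
	}
}
```

An omitted field decodes to the empty string, which lines up with the "no enforcement" case.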
pkg/manifests/config.go
Outdated
} | ||
|
||
func (c *Config) LoadEnforcedBodySizeLimit(pcr PodCapacityReader, ctx context.Context) error { | ||
if c.ClusterMonitoringConfiguration.PrometheusK8sConfig.EnforcedBodySizeLimit == "automatic" { |
you can handle the "return-early" case first.
if c.ClusterMonitoringConfiguration.PrometheusK8sConfig.EnforcedBodySizeLimit == "automatic" { | |
if c.ClusterMonitoringConfiguration.PrometheusK8sConfig.EnforcedBodySizeLimit == "" { | |
return nil | |
} | |
if c.ClusterMonitoringConfiguration.PrometheusK8sConfig.EnforcedBodySizeLimit == "automatic" { |
pkg/manifests/config.go
Outdated
return nil | ||
} | ||
|
||
func (c *Config) UseMinimalEnforcedBodySizeLimit() { |
This doesn't seem to be used, but it should be incorporated in calculateBodySizeLimit(), I believe.
pkg/manifests/config.go
Outdated
func calculateBodySizeLimit(podCapacity int) string { | ||
const samplesPerPod = 400 // 400 samples per pod | ||
const sizePerSample = 200 // 200 Bytes | ||
const loadFactorPercentage = 60 // assume 80% of the maximum pods capacity per node is used |
I would assume that the full capacity can be used.
pkg/manifests/config.go
Outdated
bodySize := loadFactorPercentage * podCapacity / 100 * samplesPerPod * sizePerSample | ||
if bodySize < minimalSizeLimit { | ||
bodySize = minimalSizeLimit | ||
klog.Infof("Calculated scrape body size limit is too small, using default value %v instead", minimalSizeLimit) |
we could log both values (e.g. "calculated body size limit = ... is too small, using ... instead")
pkg/operator/operator.go
Outdated
err = c.LoadEnforcedBodySizeLimit(o.client, ctx) | ||
if err != nil { | ||
c.ClusterMonitoringConfiguration.PrometheusK8sConfig.EnforcedBodySizeLimit = "" | ||
klog.Warningf("Error loading enforced body size limit, no body size limit will be enforced. Error: %v", err) |
klog.Warningf("Error loading enforced body size limit, no body size limit will be enforced. Error: %v", err) | |
klog.Warningf("Error loading enforced body size limit, no body size limit will be enforced: %v", err) |
f.MustCreateOrUpdateConfigMap(t, configMapWithData(t, data)) | ||
|
||
f.PrometheusK8sClient.WaitForQueryReturn( | ||
t, 5*time.Minute, `ceil(sum(increase(prometheus_target_scrapes_exceeded_body_size_limit_total{job="prometheus-k8s"}[5m])))`, |
t, 5*time.Minute, `ceil(sum(increase(prometheus_target_scrapes_exceeded_body_size_limit_total{job="prometheus-k8s"}[5m])))`, | |
t, 5*time.Minute, `sum(increase(prometheus_target_scrapes_exceeded_body_size_limit_total{job="prometheus-k8s"}[5m]))`, |
We need to keep the ceil function to round the result to an integer, as WaitForQueryReturn requires.
6ae1f21
to
91c7a08
Compare
91c7a08
to
86b83cb
Compare
86b83cb
to
e5cc884
Compare
/test e2e-agnostic-upgrade |
pkg/manifests/config.go
Outdated
const loadFactorPercentage = 100 // assume 100% of the maximum pods capacity per node is used | ||
|
||
bodySize := loadFactorPercentage * podCapacity / 100 * samplesPerPod * sizePerSample |
I think we can get rid of loadFactorPercentage.
const loadFactorPercentage = 100 // assume 100% of the maximum pods capacity per node is used | |
bodySize := loadFactorPercentage * podCapacity / 100 * samplesPerPod * sizePerSample | |
bodySize := podCapacity * samplesPerPod * sizePerSample |
pkg/manifests/config.go
Outdated
// Limit the body size from scrape queries | ||
// Assumptions: one node has in average 110 pods, each pod exposes 400 metrics, each metric is expressed by on average 250 bytes. | ||
// 1.5x the size for a safe margin, it rounds to 16MB (16,500,000 Bytes). | ||
minimalSizeLimit = 1.5 * 110 * 400 * 250 |
We probably need a "safe" lower-bound value for the "automatically computed" value, but I'm not sure that this value is accurate. I would take a typical CI cluster, load it with as many pods/secrets/configmaps as possible, and measure the body size returned by kube-state-metrics /metrics.
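Putting the pieces from this review together, a sketch of the computation with the lower bound might look like the following. The constants come from the diffs above (1.5 * 110 * 400 * 250 = 16,500,000); using the standard `log` package instead of klog and returning the value with a `B` suffix are assumptions made to keep the example self-contained:

```go
package main

import (
	"fmt"
	"log"
)

const (
	samplesPerPod = 400 // samples per pod, from the diff
	sizePerSample = 200 // bytes per sample, from the diff

	// 1.5 * 110 pods * 400 metrics * 250 bytes, from the diff.
	minimalSizeLimit = 16500000
)

// calculateBodySizeLimit derives a limit from the pod capacity and
// falls back to minimalSizeLimit when the result is too small.
func calculateBodySizeLimit(podCapacity int) string {
	bodySize := podCapacity * samplesPerPod * sizePerSample
	if bodySize < minimalSizeLimit {
		// Log both values, as suggested in review.
		log.Printf("calculated body size limit %d is too small, using %d instead",
			bodySize, minimalSizeLimit)
		bodySize = minimalSizeLimit
	}
	return fmt.Sprintf("%dB", bodySize)
}

func main() {
	fmt.Println(calculateBodySizeLimit(110)) // 8,800,000 is below the bound
	fmt.Println(calculateBodySizeLimit(250)) // 20,000,000 is above the bound
}
```

With these constants, a capacity of 110 pods yields 8,800,000 bytes, so the lower bound of 16,500,000 applies; a capacity of 250 pods yields 20,000,000 bytes and is used directly.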
e5cc884
to
c1a9e69
Compare
bodysize when scraping metric. Empty value or 0 means no bodysize limit. "automatic" for automatically deduced bodysize limit.
c1a9e69
to
b95f371
Compare
/test e2e-agnostic-operator |
This lgtm
/lgtm
/hold cancel |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jan--f, raptorsun, simonpasquier The full list of commands accepted by this bot can be found here. The pull request process is described here
/retest-required Please review the full test history for this PR and help us cut down flakes. |
@raptorsun: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
This PR adds an option to the CMO config map to enable a body size limit on metrics scraping, which can prevent potential OOM problems when a scraped metrics endpoint responds with an oversized HTTP body.
The dependency upgrade is in PR #1468. This functionality requires Prometheus Operator 0.51+, so I upgraded it to 0.52.
Here is the JIRA ticket.
Adds the field prometheusK8s.enforcedBodySizeLimit to the CMO ConfigMap, accepting the size format from Prometheus: