feat: alertmanager conditional log gathering #545
@@ -0,0 +1,29 @@
2021-11-15T09:21:29.542014685Z level=info ts=2021-11-15T09:21:29.540Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=rhaos-4.7-rhel-8, revision=7d7727749b9e72d483091a58e1a13cb7d4f4fa62)"
2021-11-15T09:21:29.542014685Z level=info ts=2021-11-15T09:21:29.540Z caller=main.go:217 build_context="(go=go1.15.7, user=root@9e3ad46b3963, date=20210609-08:49:37)"
2021-11-15T09:21:29.688729257Z level=warn ts=2021-11-15T09:21:29.685Z caller=cluster.go:228 component=cluster msg="failed to join cluster" err="3 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-2.alertmanager-operated:9094: lookup alertmanager-main-2.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
2021-11-15T09:21:29.688729257Z level=info ts=2021-11-15T09:21:29.685Z caller=cluster.go:230 component=cluster msg="will retry joining cluster every 10s"
2021-11-15T09:21:29.688729257Z level=warn ts=2021-11-15T09:21:29.685Z caller=main.go:307 msg="unable to join gossip mesh" err="3 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-2.alertmanager-operated:9094: lookup alertmanager-main-2.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
2021-11-15T09:21:29.688729257Z level=info ts=2021-11-15T09:21:29.687Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
2021-11-15T09:21:29.840749198Z level=info ts=2021-11-15T09:21:29.840Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
2021-11-15T09:21:29.841639033Z level=info ts=2021-11-15T09:21:29.840Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
2021-11-15T09:21:29.844588876Z level=info ts=2021-11-15T09:21:29.844Z caller=main.go:485 msg=Listening address=127.0.0.1:9093
2021-11-15T09:21:31.688010032Z level=info ts=2021-11-15T09:21:31.687Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000160539s
2021-11-15T09:21:36.069883883Z level=info ts=2021-11-15T09:21:36.065Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
2021-11-15T09:21:36.069883883Z level=info ts=2021-11-15T09:21:36.065Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
2021-11-15T09:21:39.697216107Z level=info ts=2021-11-15T09:21:39.697Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.009414061s
2021-11-15T09:21:44.715090814Z level=warn ts=2021-11-15T09:21:44.715Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-main-2.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-2.alertmanager-operated:9094: lookup alertmanager-main-2.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
2021-11-15T09:21:59.695862111Z level=warn ts=2021-11-15T09:21:59.695Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-main-2.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-2.alertmanager-operated:9094: lookup alertmanager-main-2.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
2021-11-16T08:34:38.427677160Z level=info ts=2021-11-16T08:34:38.423Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
2021-11-16T08:34:38.427677160Z level=info ts=2021-11-16T08:34:38.423Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
2021-11-16T08:36:49.115902451Z level=warn ts=2021-11-16T08:36:49.115Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out"
2021-11-16T08:36:49.116291870Z level=warn ts=2021-11-16T08:36:49.115Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out"
2021-11-16T08:39:38.427419866Z level=error ts=2021-11-16T08:39:38.427Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Default/webhook[0]: notify retry canceled after 4 attempts: Post \"https://this-endpoint.does/not-exist\": context deadline exceeded"
2021-11-16T08:39:38.428637761Z level=error ts=2021-11-16T08:39:38.428Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Default/webhook[0]: notify retry canceled after 4 attempts: Post \"https://this-endpoint.does/not-exist\": context deadline exceeded"
2021-11-16T08:41:48.124399641Z level=warn ts=2021-11-16T08:41:48.123Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out"
2021-11-16T08:44:38.429245827Z level=error ts=2021-11-16T08:44:38.428Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Default/webhook[0]: notify retry canceled after 4 attempts: Post \"https://this-endpoint.does/not-exist\": context deadline exceeded"
2021-11-16T08:44:44.252018413Z level=warn ts=2021-11-16T08:44:44.251Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out"
2021-11-16T08:44:44.252018413Z level=warn ts=2021-11-16T08:44:44.251Z caller=notify.go:674 component=dispatcher receiver=Critical integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out"
2021-11-16T08:46:47.131868802Z level=warn ts=2021-11-16T08:46:47.131Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out"
2021-11-16T08:47:34.490605785Z level=warn ts=2021-11-16T08:47:34.490Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=3 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: i/o timeout"
2021-11-16T08:47:34.490869829Z level=error ts=2021-11-16T08:47:34.490Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Default/webhook[0]: notify retry canceled after 5 attempts: Post \"https://this-endpoint.does/not-exist\": context deadline exceeded"
2021-11-16T08:47:34.495474536Z level=error ts=2021-11-16T08:47:34.495Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Critical/webhook[0]: notify retry canceled after 4 attempts: Post \"https://this-endpoint.does/not-exist\": context deadline exceeded"
@@ -29,6 +29,7 @@ var gatheringFunctionBuilders = map[GatheringFunctionName]GathererFunctionBuilde
    GatherImageStreamsOfNamespace: (*Gatherer).BuildGatherImageStreamsOfNamespace,
    GatherAPIRequestCounts: (*Gatherer).BuildGatherAPIRequestCounts,
    GatherLogsOfUnhealthyPods: (*Gatherer).BuildGatherLogsOfUnhealthyPods,
    GatherAlertmanagerLogs: (*Gatherer).BuildGatherAlertmanagerLogs,
}

// gatheringRules contains all the rules used to run conditional gatherings.

@@ -60,6 +61,7 @@ var gatheringFunctionBuilders = map[GatheringFunctionName]GathererFunctionBuilde
// per container only if cluster version is 4.8 (not implemented, just an example of possible use) and alert
// ClusterVersionOperatorIsDown is firing
var defaultGatheringRules = []GatheringRule{
    // GatherImageStreamsOfNamespace
    {
        Conditions: []ConditionWithParams{
            {

@@ -79,6 +81,7 @@ var defaultGatheringRules = []GatheringRule{
            },
        },
    },
    // GatherAPIRequestCounts
    {
        Conditions: []ConditionWithParams{
            {

@@ -94,6 +97,7 @@ var defaultGatheringRules = []GatheringRule{
            },
        },
    },
    // GatherLogsOfUnhealthyPods
    {
        Conditions: []ConditionWithParams{
            {

@@ -128,6 +132,39 @@ var defaultGatheringRules = []GatheringRule{
            },
        },
    },
    // AlertManagerLogs
    {
        Conditions: []ConditionWithParams{
            {
                Type: AlertIsFiring,
                Alert: &AlertConditionParams{
                    Name: "AlertmanagerClusterFailedToSendAlerts",
                },
            },
        },
        GatheringFunctions: GatheringFunctions{
            GatherAlertmanagerLogs: GatherAlertmanagerLogsParams{
                AlertName: "AlertmanagerClusterFailedToSendAlerts",
                TailLines: 50,
            },
        },
    },
It's unfortunate that the conditional gatherer doesn't pass the alert name to the gatherer function in some way, because it leads to this ugly code duplication that would be very easy to accidentally mess up (forgetting to set both alert name strings when copy-pasting). Not an issue with this PR, just a general note.

Yeah, I was thinking exactly the same when I was implementing it. Maybe it would be a good idea to create a task to look for a better approach. What do you think?
rluders marked this conversation as resolved.
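As a rough illustration of the duplication the reviewers mention (not part of this PR), a small helper could build an alert-driven rule from a single name, so the alert string appears only once. The helper name and the int64 type for tailLines are assumptions; GatheringRule, ConditionWithParams, AlertIsFiring, AlertConditionParams, GatheringFunctions, GatherAlertmanagerLogs, and GatherAlertmanagerLogsParams are the identifiers used in this diff.

// Hypothetical helper (not in this PR): builds a GatheringRule that gathers
// Alertmanager logs whenever the named alert is firing, so the alert name is
// written once instead of being duplicated in the condition and the params.
func newAlertmanagerLogsRule(alertName string, tailLines int64) GatheringRule {
    return GatheringRule{
        Conditions: []ConditionWithParams{
            {
                Type:  AlertIsFiring,
                Alert: &AlertConditionParams{Name: alertName},
            },
        },
        GatheringFunctions: GatheringFunctions{
            GatherAlertmanagerLogs: GatherAlertmanagerLogsParams{
                AlertName: alertName,
                TailLines: tailLines,
            },
        },
    }
}

With such a helper, the two rules added in this PR would collapse to newAlertmanagerLogsRule("AlertmanagerClusterFailedToSendAlerts", 50) and newAlertmanagerLogsRule("AlertmanagerFailedToSendAlerts", 50).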
    {
        Conditions: []ConditionWithParams{
            {
                Type: AlertIsFiring,
                Alert: &AlertConditionParams{
                    Name: "AlertmanagerFailedToSendAlerts",
                },
            },
        },
        GatheringFunctions: GatheringFunctions{
            GatherAlertmanagerLogs: GatherAlertmanagerLogsParams{
                AlertName: "AlertmanagerFailedToSendAlerts",
                TailLines: 50,
            },
        },
    },
}

const canConditionalGathererFail = false
BTW, I added these comments to help identify the conditions for each gatherer; the list was a little hard for me to read otherwise.
Makes sense and I agree that these lists tend to be difficult to read, especially as they grow longer.
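Purely as an illustration of that readability concern (not proposed in this PR), each rule could also be bound to a named variable and the slice assembled from those names; the variable names below are hypothetical, and the rule bodies are elided.

// Hypothetical restructuring for readability (not in this PR): naming each rule
// makes the long slice easier to scan than comment markers alone.
var (
    imageStreamsOfNamespaceRule = GatheringRule{ /* conditions + gathering functions */ }
    apiRequestCountsRule        = GatheringRule{ /* ... */ }
    logsOfUnhealthyPodsRule     = GatheringRule{ /* ... */ }
    alertmanagerClusterLogsRule = GatheringRule{ /* ... */ }
    alertmanagerLogsRule        = GatheringRule{ /* ... */ }
)

var defaultGatheringRules = []GatheringRule{
    imageStreamsOfNamespaceRule,
    apiRequestCountsRule,
    logsOfUnhealthyPodsRule,
    alertmanagerClusterLogsRule,
    alertmanagerLogsRule,
}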