feat: alertmanager conditional log gathering #545
Merged: openshift-merge-robot merged 11 commits into openshift:master from rluders:ccxdev-6036-alertmanager-logs on Dec 1, 2021
Commits
5bd339f feat: alertmanager conditional log gathering
03c6182 docs: adding sample data
91ee86d refactor: merging with updates
0d856bf refactor: remove unecessary gather
f07aae3 chore: unit tests to alertmanager logs gather
fadea5c fix: yaml loading on get_cert_key.py
b4f7903 docs: adding some bl to readme
a627f95 docs: updating gathered-data
6f75ce2 fix: core review
a25d8af chore: nolintlint is nonsense
859df39 chore: change sample data filename
29 changes: 29 additions & 0 deletions
...nshift-monitoring/pods/alertmanager-main-0/containers/alertmanager/logs/last-50-lines.log
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
2021-11-15T09:21:29.542014685Z level=info ts=2021-11-15T09:21:29.540Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=rhaos-4.7-rhel-8, revision=7d7727749b9e72d483091a58e1a13cb7d4f4fa62)" | ||
2021-11-15T09:21:29.542014685Z level=info ts=2021-11-15T09:21:29.540Z caller=main.go:217 build_context="(go=go1.15.7, user=root@9e3ad46b3963, date=20210609-08:49:37)" | ||
2021-11-15T09:21:29.688729257Z level=warn ts=2021-11-15T09:21:29.685Z caller=cluster.go:228 component=cluster msg="failed to join cluster" err="3 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-2.alertmanager-operated:9094: lookup alertmanager-main-2.alertmanager-operated on 172.30.0.10:53: no such host\n\n" | ||
2021-11-15T09:21:29.688729257Z level=info ts=2021-11-15T09:21:29.685Z caller=cluster.go:230 component=cluster msg="will retry joining cluster every 10s" | ||
2021-11-15T09:21:29.688729257Z level=warn ts=2021-11-15T09:21:29.685Z caller=main.go:307 msg="unable to join gossip mesh" err="3 errors occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-1.alertmanager-operated:9094: lookup alertmanager-main-1.alertmanager-operated on 172.30.0.10:53: no such host\n\t* Failed to resolve alertmanager-main-2.alertmanager-operated:9094: lookup alertmanager-main-2.alertmanager-operated on 172.30.0.10:53: no such host\n\n" | ||
2021-11-15T09:21:29.688729257Z level=info ts=2021-11-15T09:21:29.687Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s | ||
2021-11-15T09:21:29.840749198Z level=info ts=2021-11-15T09:21:29.840Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml | ||
2021-11-15T09:21:29.841639033Z level=info ts=2021-11-15T09:21:29.840Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml | ||
2021-11-15T09:21:29.844588876Z level=info ts=2021-11-15T09:21:29.844Z caller=main.go:485 msg=Listening address=127.0.0.1:9093 | ||
2021-11-15T09:21:31.688010032Z level=info ts=2021-11-15T09:21:31.687Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000160539s | ||
2021-11-15T09:21:36.069883883Z level=info ts=2021-11-15T09:21:36.065Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml | ||
2021-11-15T09:21:36.069883883Z level=info ts=2021-11-15T09:21:36.065Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml | ||
2021-11-15T09:21:39.697216107Z level=info ts=2021-11-15T09:21:39.697Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.009414061s | ||
2021-11-15T09:21:44.715090814Z level=warn ts=2021-11-15T09:21:44.715Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-main-2.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-2.alertmanager-operated:9094: lookup alertmanager-main-2.alertmanager-operated on 172.30.0.10:53: no such host\n\n" | ||
2021-11-15T09:21:59.695862111Z level=warn ts=2021-11-15T09:21:59.695Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-main-2.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-2.alertmanager-operated:9094: lookup alertmanager-main-2.alertmanager-operated on 172.30.0.10:53: no such host\n\n" | ||
2021-11-16T08:34:38.427677160Z level=info ts=2021-11-16T08:34:38.423Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml | ||
2021-11-16T08:34:38.427677160Z level=info ts=2021-11-16T08:34:38.423Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml | ||
2021-11-16T08:36:49.115902451Z level=warn ts=2021-11-16T08:36:49.115Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out" | ||
2021-11-16T08:36:49.116291870Z level=warn ts=2021-11-16T08:36:49.115Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out" | ||
2021-11-16T08:39:38.427419866Z level=error ts=2021-11-16T08:39:38.427Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Default/webhook[0]: notify retry canceled after 4 attempts: Post \"https://this-endpoint.does/not-exist\": context deadline exceeded" | ||
2021-11-16T08:39:38.428637761Z level=error ts=2021-11-16T08:39:38.428Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Default/webhook[0]: notify retry canceled after 4 attempts: Post \"https://this-endpoint.does/not-exist\": context deadline exceeded" | ||
2021-11-16T08:41:48.124399641Z level=warn ts=2021-11-16T08:41:48.123Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out" | ||
2021-11-16T08:44:38.429245827Z level=error ts=2021-11-16T08:44:38.428Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Default/webhook[0]: notify retry canceled after 4 attempts: Post \"https://this-endpoint.does/not-exist\": context deadline exceeded" | ||
2021-11-16T08:44:44.252018413Z level=warn ts=2021-11-16T08:44:44.251Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out" | ||
2021-11-16T08:44:44.252018413Z level=warn ts=2021-11-16T08:44:44.251Z caller=notify.go:674 component=dispatcher receiver=Critical integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out" | ||
2021-11-16T08:46:47.131868802Z level=warn ts=2021-11-16T08:46:47.131Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: connect: connection timed out" | ||
2021-11-16T08:47:34.490605785Z level=warn ts=2021-11-16T08:47:34.490Z caller=notify.go:674 component=dispatcher receiver=Default integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=3 err="Post \"https://this-endpoint.does/not-exist\": dial tcp 200.160.2.95:443: i/o timeout" | ||
2021-11-16T08:47:34.490869829Z level=error ts=2021-11-16T08:47:34.490Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Default/webhook[0]: notify retry canceled after 5 attempts: Post \"https://this-endpoint.does/not-exist\": context deadline exceeded" | ||
2021-11-16T08:47:34.495474536Z level=error ts=2021-11-16T08:47:34.495Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Critical/webhook[0]: notify retry canceled after 4 attempts: Post \"https://this-endpoint.does/not-exist\": context deadline exceeded" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
package conditional | ||
|
||
import ( | ||
"fmt" | ||
|
||
"k8s.io/klog/v2" | ||
) | ||
|
||
func getAlertPodName(labels AlertLabels) (string, error) { | ||
name, ok := labels["pod"] | ||
if !ok { | ||
newErr := fmt.Errorf("alert is missing 'pod' label") | ||
klog.Warningln(newErr.Error()) | ||
return "", newErr | ||
} | ||
return name, nil | ||
} | ||
|
||
func getAlertPodNamespace(labels AlertLabels) (string, error) { | ||
namespace, ok := labels["namespace"] | ||
if !ok { | ||
newErr := fmt.Errorf("alert is missing 'namespace' label") | ||
klog.Warningln(newErr.Error()) | ||
return "", newErr | ||
} | ||
return namespace, nil | ||
} |
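
The two helpers above each resolve one label. As a rough illustration of how a caller might combine them to locate the pod an alert refers to, here is a self-contained sketch. It assumes `AlertLabels` is a plain `map[string]string` (its definition is not part of this diff), and `podRef`, `requireLabel`, and `podRefFromAlert` are hypothetical names introduced only for this example. Logging via klog is omitted to keep the sketch dependency-free.

```go
package main

import "fmt"

// AlertLabels is assumed to be a string map; the real type is defined
// elsewhere in the package and is not shown in this diff.
type AlertLabels map[string]string

// podRef is a hypothetical value type identifying a pod by namespace and name.
type podRef struct {
	Namespace string
	Name      string
}

// requireLabel returns the value stored under key, or an error that names
// the missing label, mirroring the error text used by the helpers in the PR.
func requireLabel(labels AlertLabels, key string) (string, error) {
	value, ok := labels[key]
	if !ok {
		return "", fmt.Errorf("alert is missing '%s' label", key)
	}
	return value, nil
}

// podRefFromAlert (hypothetical) resolves both labels and stops at the
// first one that is missing.
func podRefFromAlert(labels AlertLabels) (podRef, error) {
	namespace, err := requireLabel(labels, "namespace")
	if err != nil {
		return podRef{}, err
	}
	name, err := requireLabel(labels, "pod")
	if err != nil {
		return podRef{}, err
	}
	return podRef{Namespace: namespace, Name: name}, nil
}

func main() {
	labels := AlertLabels{
		"namespace": "openshift-monitoring",
		"pod":       "alertmanager-main-0",
	}
	ref, err := podRefFromAlert(labels)
	if err != nil {
		fmt.Println("skipping alert:", err)
		return
	}
	fmt.Printf("%s/%s\n", ref.Namespace, ref.Name)
}
```

Returning on the first missing label keeps the caller's error handling simple; an alert without either label cannot be mapped to a container log anyway.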
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
package conditional | ||
|
||
import "testing" | ||
|
||
// nolint:dupl | ||
func Test_getAlertPodName(t *testing.T) { | ||
tests := []struct { | ||
name string | ||
labels AlertLabels | ||
want string | ||
wantErr bool | ||
}{ | ||
{ | ||
name: "Pod name exists", | ||
labels: AlertLabels{"pod": "test-name"}, | ||
want: "test-name", | ||
wantErr: false, | ||
}, | ||
{ | ||
name: "Pod name does not exist", | ||
labels: AlertLabels{}, | ||
want: "", | ||
wantErr: true, | ||
}, | ||
} | ||
for _, tt := range tests { | ||
t.Run(tt.name, func(t *testing.T) { | ||
got, err := getAlertPodName(tt.labels) | ||
if (err != nil) != tt.wantErr { | ||
t.Errorf("getAlertPodName() error = %v, wantErr %v", err, tt.wantErr) | ||
return | ||
} | ||
if got != tt.want { | ||
t.Errorf("getAlertPodName() got = %v, want %v", got, tt.want) | ||
} | ||
}) | ||
} | ||
} | ||
|
||
// nolint:dupl | ||
func Test_getAlertPodNamespace(t *testing.T) { | ||
tests := []struct { | ||
name string | ||
labels AlertLabels | ||
want string | ||
wantErr bool | ||
}{ | ||
{ | ||
name: "Pod namespace exists", | ||
labels: AlertLabels{"namespace": "test-namespace"}, | ||
want: "test-namespace", | ||
wantErr: false, | ||
}, | ||
{ | ||
name: "Pod namespace does not exist", | ||
labels: AlertLabels{}, | ||
want: "", | ||
wantErr: true, | ||
}, | ||
} | ||
for _, tt := range tests { | ||
t.Run(tt.name, func(t *testing.T) { | ||
got, err := getAlertPodNamespace(tt.labels) | ||
if (err != nil) != tt.wantErr { | ||
t.Errorf("getAlertPodNamespace() error = %v, wantErr %v", err, tt.wantErr) | ||
return | ||
} | ||
if got != tt.want { | ||
t.Errorf("getAlertPodNamespace() got = %v, want %v", got, tt.want) | ||
} | ||
}) | ||
} | ||
} |
BTW, I added these comments to help identify the conditions for each gather. The list was a bit hard for me to read.
Makes sense and I agree that these lists tend to be difficult to read, especially as they grow longer.