Commit 43dc72f

Fix cluster alert for watcher/monitoring IndexOutOfBoundsExcep… (#47756)
If a cluster that is sending monitoring data is unhealthy and triggers an alert, and then stops sending data, the following exception [1] can occur. This exception stops the current Watch, and that behavior is partly correct precisely because of the exception: simply fixing the exception would introduce incorrect behavior, because a Watch that no longer errors in this case would incorrectly mark the alert as "resolved".

The fix here has two parts: a) fix the exception, and b) fix the incorrect behavior that would follow. Fixing the exception (a) is as simple as checking the size of the array before accessing it. Fixing the follow-on behavior (b) is more intrusive. Note that the UI depends on the success/met state of each condition to decide between an "OK" and a "FIRING" status. In this scenario, where an unhealthy cluster triggers an alert and then goes silent, the alert should keep "FIRING" until the Watch hears back that the cluster is green. To keep the Watch "FIRING", either the index action or the email action needs to fire. Since the Watch represents neither a new alert nor a resolved alert, we do not want to keep sending an email (that would also not be passive). And without completely changing the logic of how an alert is resolved, allowing the index action to run would resolve the alert. Since we can not keep "FIRING" via either the email or the index action (we don't want to resolve the alert, nor rewrite the alert-resolution logic), we introduce a third action: a logging action that WILL fire when the cluster is unhealthy. Specifically, it fires when there is an unresolved alert and the Watch can not find the cluster state. The logging action logs at debug level, so it should not be noticed much. It serves as an "anchor" for the UI, keeping the state "FIRING" until the alert is resolved. This leaves a possible edge case: if a cluster starts firing and then goes completely silent forever, the Watch will be "FIRING" forever. That edge case already exists in some scenarios and requires manual intervention to remove the Watch.

This change also uses a template-like method to populate version_created for the default monitoring watches. The version is set to 7.5, since that is where this change is first introduced.

Fixes #43184
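The core of fix (a) above is a bounds check before reading the first element of the search hits. A minimal sketch of that guard, with illustrative names (`Hit`, `clusterStatus` are not the actual Painless script, which operates on `ctx.payload.check.hits`):

```java
import java.util.Collections;
import java.util.List;

// Sketch of fix (a): guard the hits array before reading element 0, instead of
// letting an empty result throw IndexOutOfBoundsException. When no monitoring
// document was found, the state is reported as 'unknown', mirroring the
// transform script's `found_state` check.
public class GuardedHitsDemo {
    record Hit(String status) {}

    static String clusterStatus(List<Hit> hits) {
        boolean foundState = !hits.isEmpty();
        return foundState ? hits.get(0).status() : "unknown";
    }

    public static void main(String[] args) {
        System.out.println(clusterStatus(List.of(new Hit("red"))));  // red
        System.out.println(clusterStatus(Collections.emptyList()));  // unknown
    }
}
```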
1 parent 2abd9d5 commit 43dc72f

File tree

8 files changed, +31 -9 lines


x-pack/plugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/exporter/ClusterAlertsUtil.java (+10 -1)

@@ -49,11 +49,19 @@ public class ClusterAlertsUtil {
     private static final Pattern UNIQUE_WATCH_ID_PROPERTY =
         Pattern.compile(Pattern.quote("${monitoring.watch.unique_id}"));
 
+    /**
+     * Replace the <code>${monitoring.watch.unique_id}</code> field in the watches.
+     *
+     * @see #createUniqueWatchId(ClusterService, String)
+     */
+    private static final Pattern VERSION_CREATED_PROPERTY =
+        Pattern.compile(Pattern.quote("${monitoring.version_created}"));
+
     /**
      * The last time that all watches were updated. For now, all watches have been updated in the same version and should all be replaced
      * together.
      */
-    public static final int LAST_UPDATED_VERSION = Version.V_7_0_0.id;
+    public static final int LAST_UPDATED_VERSION = Version.V_7_5_0.id;

@@ -113,6 +121,7 @@ public static String loadWatch(final ClusterService clusterService, final String
         source = CLUSTER_UUID_PROPERTY.matcher(source).replaceAll(clusterUuid);
         source = WATCH_ID_PROPERTY.matcher(source).replaceAll(watchId);
         source = UNIQUE_WATCH_ID_PROPERTY.matcher(source).replaceAll(uniqueWatchId);
+        source = VERSION_CREATED_PROPERTY.matcher(source).replaceAll(Integer.toString(LAST_UPDATED_VERSION));
 
         return source;
     } catch (final IOException e) {
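The substitution pattern used by `loadWatch` above is plain `java.util.regex`: `Pattern.quote` escapes the `${...}` placeholder so it is matched literally, and `replaceAll` fills in the concrete value. A self-contained sketch (the class name and sample JSON are illustrative; 7050099 is the internal id for version 7.5.0):

```java
import java.util.regex.Pattern;

// Demonstrates the template-like substitution ClusterAlertsUtil.loadWatch uses
// to populate "version_created" in the default monitoring watches.
public class WatchTemplateDemo {
    private static final Pattern VERSION_CREATED_PROPERTY =
        Pattern.compile(Pattern.quote("${monitoring.version_created}"));

    static String substitute(String source, int versionId) {
        // Pattern.quote above makes "${monitoring.version_created}" a literal
        // match rather than a regex with special characters.
        return VERSION_CREATED_PROPERTY.matcher(source)
            .replaceAll(Integer.toString(versionId));
    }

    public static void main(String[] args) {
        String json = "{\"version_created\": \"${monitoring.version_created}\"}";
        System.out.println(substitute(json, 7050099)); // {"version_created": "7050099"}
    }
}
```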

x-pack/plugin/monitoring/src/main/resources/monitoring/watches/elasticsearch_cluster_status.json (+15 -3)

@@ -7,7 +7,7 @@
       "link": "elasticsearch/indices",
       "severity": 2100,
       "type": "monitoring",
-      "version_created": 7000099,
+      "version_created": "${monitoring.version_created}",
       "watch": "${monitoring.watch.id}"
     }
   },

@@ -134,19 +134,31 @@
   },
   "transform": {
     "script": {
-      "source": "ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ctx.vars.is_new = ctx.vars.fails_check && !ctx.vars.not_resolved;ctx.vars.is_resolved = !ctx.vars.fails_check && ctx.vars.not_resolved;def state = ctx.payload.check.hits.hits[0]._source.cluster_state.status;if (ctx.vars.not_resolved){ctx.payload = ctx.payload.alert.hits.hits[0]._source;if (ctx.vars.fails_check == false) {ctx.payload.resolved_timestamp = ctx.execution_time;}} else {ctx.payload = ['timestamp': ctx.execution_time, 'metadata': ctx.metadata.xpack];}if (ctx.vars.fails_check) {ctx.payload.prefix = 'Elasticsearch cluster status is ' + state + '.';if (state == 'red') {ctx.payload.message = 'Allocate missing primary shards and replica shards.';ctx.payload.metadata.severity = 2100;} else {ctx.payload.message = 'Allocate missing replica shards.';ctx.payload.metadata.severity = 1100;}}ctx.vars.state = state.toUpperCase();ctx.payload.update_timestamp = ctx.execution_time;return ctx.payload;"
+      "source": "ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0 && ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack != null) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ctx.vars.is_new = ctx.vars.fails_check && !ctx.vars.not_resolved;ctx.vars.is_resolved = !ctx.vars.fails_check && ctx.vars.not_resolved;ctx.vars.found_state = ctx.payload.check.hits.total != 0;def state = ctx.vars.found_state ? ctx.payload.check.hits.hits[0]._source.cluster_state.status : 'unknown';if (ctx.vars.not_resolved){ctx.payload = ctx.payload.alert.hits.hits[0]._source;if (ctx.vars.fails_check == false) {ctx.payload.resolved_timestamp = ctx.execution_time;}} else {ctx.payload = ['timestamp': ctx.execution_time, 'metadata': ctx.metadata.xpack];}if (ctx.vars.fails_check) {ctx.payload.prefix = 'Elasticsearch cluster status is ' + state + '.';if (state == 'red') {ctx.payload.message = 'Allocate missing primary shards and replica shards.';ctx.payload.metadata.severity = 2100;} else {ctx.payload.message = 'Allocate missing replica shards.';ctx.payload.metadata.severity = 1100;}}ctx.vars.state = state.toUpperCase();ctx.payload.update_timestamp = ctx.execution_time;return ctx.payload;"
     }
   },
   "actions": {
+    "log_state_not_found": {
+      "condition": {
+        "script": "!ctx.vars.found_state"
+      },
+      "logging" : {
+        "text" : "Watch [{{ctx.metadata.xpack.watch}}] could not determine cluster state for cluster [{{ctx.metadata.xpack.cluster_uuid}}]. This likely means the cluster has not sent any monitoring data recently.",
+        "level" : "debug"
+      }
+    },
     "add_to_alerts_index": {
+      "condition": {
+        "script": "ctx.vars.found_state"
+      },
       "index": {
         "index": ".monitoring-alerts-7",
         "doc_id": "${monitoring.watch.unique_id}"
       }
     },
     "send_email_to_admin": {
       "condition": {
-        "script": "return ctx.vars.email_recipient != null && (ctx.vars.is_new || ctx.vars.is_resolved)"
+        "script": "return ctx.vars.email_recipient != null && ctx.vars.found_state && (ctx.vars.is_new || ctx.vars.is_resolved)"
       },
       "email": {
         "to": "X-Pack Admin <{{ctx.vars.email_recipient}}>",

x-pack/plugin/monitoring/src/main/resources/monitoring/watches/elasticsearch_nodes.json (+1 -1)

@@ -7,7 +7,7 @@
       "link": "elasticsearch/nodes",
       "severity": 1999,
       "type": "monitoring",
-      "version_created": 7000099,
+      "version_created": "${monitoring.version_created}",
       "watch": "${monitoring.watch.id}"
     }
   },

x-pack/plugin/monitoring/src/main/resources/monitoring/watches/elasticsearch_version_mismatch.json (+1 -1)

@@ -7,7 +7,7 @@
       "link": "elasticsearch/nodes",
       "severity": 1000,
       "type": "monitoring",
-      "version_created": 7000099,
+      "version_created": "${monitoring.version_created}",
       "watch": "${monitoring.watch.id}"
     }
   },

x-pack/plugin/monitoring/src/main/resources/monitoring/watches/kibana_version_mismatch.json (+1 -1)

@@ -7,7 +7,7 @@
       "link": "kibana/instances",
       "severity": 1000,
       "type": "monitoring",
-      "version_created": 7000099,
+      "version_created": "${monitoring.version_created}",
       "watch": "${monitoring.watch.id}"
     }
   },

x-pack/plugin/monitoring/src/main/resources/monitoring/watches/logstash_version_mismatch.json (+1 -1)

@@ -7,7 +7,7 @@
       "link": "logstash/instances",
       "severity": 1000,
       "type": "monitoring",
-      "version_created": 7000099,
+      "version_created": "${monitoring.version_created}",
       "watch": "${monitoring.watch.id}"
     }
   },

x-pack/plugin/monitoring/src/main/resources/monitoring/watches/xpack_license_expiration.json (+1 -1)

@@ -8,7 +8,7 @@
       "alert_index": ".monitoring-alerts-7",
       "cluster_uuid": "${monitoring.watch.cluster_uuid}",
       "type": "monitoring",
-      "version_created": 7000099,
+      "version_created": "${monitoring.version_created}",
       "watch": "${monitoring.watch.id}"
     }
   },

x-pack/plugin/monitoring/src/test/java/org/elasticsearch/xpack/monitoring/exporter/ClusterAlertsUtilTests.java (+1)

@@ -68,6 +68,7 @@ public void testLoadWatch() {
         assertThat(watch, notNullValue());
         assertThat(watch, containsString(clusterUuid));
         assertThat(watch, containsString(watchId));
+        assertThat(watch, containsString(String.valueOf(ClusterAlertsUtil.LAST_UPDATED_VERSION)));
 
         if ("elasticsearch_nodes".equals(watchId) == false) {
             assertThat(watch, containsString(clusterUuid + "_" + watchId));
