Add action to decommission legacy monitoring cluster alerts #64373
Conversation
…chronous tasks. HTTP Exporter still WIP.
…ilure to publish resources.
…porters. Specifically refresh alerts as part of the migration instead of re-running resource installation. We don't want to re-publish old templates if all the old monitoring resources have already been removed.
Pinging @elastic/es-core-features (:Core/Features/Monitoring)
Getting a bit lost in the introduction of more state via ExporterResourceStatus.
Having multiple possible deploy states seems to add quite a bit of complexity. Under what circumstances wouldn't doing the main work on the successful setting of MIGRATION_DECOMMISSION_ALERTS be sufficient to avoid race conditions? Is there a reasonable trade-off here, such that we could simplify the code to check only a single state and error early via the migrate REST endpoint if we are not in that state (or cannot obtain that state in a reasonable amount of time)?
EDIT: for example, would it be possible to try to acquire (with a timeout) the semaphore from MonitoringExecution, effectively pausing monitoring while we do the migration and avoiding the race conditions? Since this is a one-time call, I think that would be an OK trade-off. Monitoring is generally no faster than every 10s, and in a healthy system it executes pretty quickly.
EDIT2: ^^ that won't actually work exactly as described due to the "router" also using the exporters... but the idea is the same: is there a trade-off where we can make the use of the contended resources mutually exclusive, to avoid the complexity of managing so much state?
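The suggestion above can be sketched as a single-permit semaphore shared between the export path and the one-time migrate call. This is a hypothetical illustration; `MigrationGate`, `export`, and `runMigration` are invented names, not the actual MonitoringExecution API.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch: exports and the migration contend for one permit, so they are
// mutually exclusive. The migrate call gives up after a timeout so the
// REST endpoint can error early instead of hanging.
class MigrationGate {
    private final Semaphore executionPermit = new Semaphore(1);

    /** Monitoring export path: hold the permit while exporting. */
    void export(Runnable exportTask) throws InterruptedException {
        executionPermit.acquire();
        try {
            exportTask.run();
        } finally {
            executionPermit.release();
        }
    }

    /** One-time migrate call: pause monitoring, or fail fast after the timeout. */
    boolean runMigration(Runnable migration, long timeoutMillis) {
        boolean acquired;
        try {
            acquired = executionPermit.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (acquired == false) {
            return false; // monitoring is busy; caller reports an error
        }
        try {
            migration.run();
            return true;
        } finally {
            executionPermit.release();
        }
    }
}
```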
…l after migration
I've gone through and made some changes. There is now a migration coordinator object that is acquired when a migration action begins. Every exporter checks this coordinator before running its installation steps. If the coordinator is already acquired, the exporter blocks as if its resources are not ready yet. Once the migration operations complete, the migration action releases the coordinator, and exporters are able to perform their installation tasks as normal. Since we no longer have to worry about exporters running their installation tasks DURING a migration, I've removed the retry code and …
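The coordinator described here can be sketched as a single-permit semaphore that is taken non-blockingly by the migration action and polled by the exporters. Names and method shapes are assumptions for illustration, not the actual MonitoringMigrationCoordinator source.

```java
import java.util.concurrent.Semaphore;

// Sketch: the migration action holds the single permit for the duration of
// the migration; exporters never block on it, they just skip installation
// and retry on the next collection cycle while a migration is in flight.
class MigrationCoordinator {
    private final Semaphore migrationBlock = new Semaphore(1);

    /** Migration action: returns false if a migration is already running. */
    public boolean tryBlockInstallationTasks() {
        return migrationBlock.tryAcquire();
    }

    /** Migration action: let exporters resume once the migration completes. */
    public void unblockInstallationTasks() {
        migrationBlock.release();
    }

    /** Exporters: check before installing resources; never blocks. */
    public boolean canInstall() {
        return migrationBlock.availablePermits() > 0;
    }
}
```

The non-blocking `tryAcquire` also makes concurrent migrate requests safe: the second request simply fails to take the permit and can return an error immediately.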
@elasticmachine update branch
a few comments
@jbaiera what would be the ETA for this? We want to assess whether we will be able to make changes in Kibana for 7.11.
@sgrodzicki Sorry for the late response, I've been on vacation for a bit. This should be good to go once it passes review, which I hope shouldn't take much longer.
@elasticmachine update branch |
@elasticmachine update branch |
@elasticmachine run elasticsearch-ci/2 |
@elasticmachine run elasticsearch-ci/bwc |
@elasticmachine update branch |
LGTM (a couple nitpicks around naming), thanks for the iterations.
* elastic/master: (33 commits)
  Add searchable snapshot cache folder to NodeEnvironment (elastic#66297)
  [DOCS] Add dynamic runtime fields to docs (elastic#66194)
  Add HDFS searchable snapshot integration (elastic#66185)
  Support canceling cross-clusters search requests (elastic#66206)
  Mute testCacheSurviveRestart (elastic#66289)
  Fix cat tasks api params in spec and handler (elastic#66272)
  Snapshot of a searchable snapshot should be empty (elastic#66162)
  [ML] DFA _explain API should not fail when none field is included (elastic#66281)
  Add action to decommission legacy monitoring cluster alerts (elastic#64373)
  move rollup_index param out of RollupActionConfig (elastic#66139)
  Improve FieldFetcher retrieval of fields (elastic#66160)
  Remove unsed fields in `RestAnalyzeAction` (elastic#66215)
  Simplify searchable snapshot CacheKey (elastic#66263)
  Autoscaling remove feature flags (elastic#65973)
  Improve searchable snapshot mount time (elastic#66198)
  [ML] Report cause when datafeed extraction encounters error (elastic#66167)
  Remove suggest reference in some API specs (elastic#66180)
  Fix warning when installing a plugin for different ESversion (elastic#66146)
  [ML] make `xpack.ml.max_ml_node_size` and `xpack.ml.use_auto_machine_memory_percent` dynamically settable (elastic#66132)
  [DOCS] Add `require_alias` to Bulk API (elastic#66259)
  ...
…4373) (#66309)
Adds an action that will proactively remove any watches that monitoring has configured. The action toggles on a new setting that informs the cluster to tear down any previously created cluster alerts, and after that is accepted, the action immediately attempts a best-effort refresh of cluster alert resources in order to force their removal in case collection is disabled or delayed. Since resources are controlled lazily by the existing monitoring exporters, extra care was taken to ensure that any in-flight resource management operations do not race against any resource actions taken by the migration action. Resource installation code was updated with callbacks to report any errors instead of just logging them.
This follows on from #62668 by adding an action that will proactively remove any watches that monitoring has configured. The action toggles on the new setting that informs the cluster to tear down any previously created cluster alerts, and after that is accepted, the action immediately attempts a best-effort refresh of cluster alert resources in order to force their removal in case collection is disabled or delayed.
Since resources are controlled lazily by the existing monitoring exporters, extra care was taken to ensure that any in-flight resource management operations do not race against any resource actions taken by the migration action. Resource installation code was updated with callbacks to report any errors instead of just logging them.
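The last change described above, reporting installation errors through callbacks rather than only logging them, might look roughly like the following. This is an illustrative stand-in: `Listener` mirrors the shape of Elasticsearch's real `ActionListener` interface, and `ResourceInstaller` and its parameters are invented for the example.

```java
// Sketch: resource installation reports its outcome to the caller, so the
// migrate action can surface failures in its response instead of the error
// disappearing into the logs.
interface Listener<T> {
    void onResponse(T result);

    void onFailure(Exception e);
}

class ResourceInstaller {
    /** Publish a resource, reporting success or failure to the listener. */
    void installResource(String name, boolean shouldFail, Listener<Boolean> listener) {
        try {
            if (shouldFail) {
                throw new IllegalStateException("failed to publish [" + name + "]");
            }
            listener.onResponse(true);
        } catch (Exception e) {
            listener.onFailure(e); // previously this path was only logged
        }
    }
}
```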