Add action to decommission legacy monitoring cluster alerts #64373
Conversation
…chronous tasks. HTTP Exporter still WIP.
…ilure to publish resources.
…porters. Specifically refresh alerts as part of the migration instead of re-running resource installation. We don't want to re-publish old templates if all the old monitoring resources have already been removed.
Pinging @elastic/es-core-features (:Core/Features/Monitoring)
Getting a bit lost in the introduction of more state via ExporterResourceStatus.
Having multiple possible deploy states seems to add quite a bit of complexity. Under what circumstances wouldn't doing the main work on the successful setting of MIGRATION_DECOMMISSION_ALERTS be sufficient to avoid race conditions? Is there a reasonable trade-off here, such that we could simplify the code to check only a single state and error early via the migrate REST endpoint if we are not in that state (or cannot obtain that state in a reasonable amount of time)?
EDIT: for example, would it be possible to try to acquire (with a timeout) the semaphore from MonitoringExecution, effectively pausing monitoring while we do the migration and avoiding the race conditions? Since this is a one-time call, I think that would be an OK trade-off. Monitoring is generally no faster than every 10s, and in a healthy system it executes pretty quickly.
EDIT2: ^^ that won't actually work exactly as described due to the "router" also using the exporters... but the idea is the same: is there a trade-off where we can make the use of the contended resources mutually exclusive, to avoid the complexity of managing so much state?
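The suggestion above can be sketched as a single-permit semaphore shared between the export path and the one-time migrate call. This is a hypothetical illustration; `MigrationGate`, `export`, and `runMigration` are invented names, not the actual MonitoringExecution API.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch: exports and the migration contend for one permit, so they are
// mutually exclusive. The migrate call gives up after a timeout so the
// REST endpoint can error early instead of hanging.
class MigrationGate {
    private final Semaphore executionPermit = new Semaphore(1);

    /** Monitoring export path: hold the permit while exporting. */
    void export(Runnable exportTask) throws InterruptedException {
        executionPermit.acquire();
        try {
            exportTask.run();
        } finally {
            executionPermit.release();
        }
    }

    /** One-time migrate call: pause monitoring, or fail fast after the timeout. */
    boolean runMigration(Runnable migration, long timeoutMillis) {
        boolean acquired;
        try {
            acquired = executionPermit.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (acquired == false) {
            return false; // monitoring is busy; caller reports an error
        }
        try {
            migration.run();
            return true;
        } finally {
            executionPermit.release();
        }
    }
}
```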
…l after migration
I've gone through and made some changes. There is now a migration coordinator object that is acquired when a migration action begins. Every exporter checks this coordinator before running its installation steps. If the coordinator is already acquired, the exporter blocks as if its resources are not ready yet. Once the migration operations complete, the migration action releases the coordinator, and exporters are able to perform their installation tasks as normal. Since we no longer have to worry about exporters running their installation tasks DURING a migration, I've removed the retry code and …
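The coordinator described here can be sketched as a single-permit semaphore that is taken non-blockingly by the migration action and polled by the exporters. Names and method shapes are assumptions for illustration, not the actual MonitoringMigrationCoordinator source.

```java
import java.util.concurrent.Semaphore;

// Sketch: the migration action holds the single permit for the duration of
// the migration; exporters never block on it, they just skip installation
// and retry on the next collection cycle while a migration is in flight.
class MigrationCoordinator {
    private final Semaphore migrationBlock = new Semaphore(1);

    /** Migration action: returns false if a migration is already running. */
    public boolean tryBlockInstallationTasks() {
        return migrationBlock.tryAcquire();
    }

    /** Migration action: let exporters resume once the migration completes. */
    public void unblockInstallationTasks() {
        migrationBlock.release();
    }

    /** Exporters: check before installing resources; never blocks. */
    public boolean canInstall() {
        return migrationBlock.availablePermits() > 0;
    }
}
```

The non-blocking `tryAcquire` also makes concurrent migrate requests safe: the second request simply fails to take the permit and can return an error immediately.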
@elasticmachine update branch
a few comments
@jbaiera what would be the ETA for this? We want to assess whether we will be able to make changes in Kibana for 7.11.
@sgrodzicki Sorry for the late response, I've been on vacation for a bit. This should be good to go once it passes review, which I hope shouldn't take much longer.
@elasticmachine update branch |
@elasticmachine update branch |
@elasticmachine run elasticsearch-ci/2 |
@elasticmachine run elasticsearch-ci/bwc |
@elasticmachine update branch |
LGTM (a couple nitpicks around naming), thanks for the iterations.
* elastic/master: (33 commits)
  Add searchable snapshot cache folder to NodeEnvironment (elastic#66297)
  [DOCS] Add dynamic runtime fields to docs (elastic#66194)
  Add HDFS searchable snapshot integration (elastic#66185)
  Support canceling cross-clusters search requests (elastic#66206)
  Mute testCacheSurviveRestart (elastic#66289)
  Fix cat tasks api params in spec and handler (elastic#66272)
  Snapshot of a searchable snapshot should be empty (elastic#66162)
  [ML] DFA _explain API should not fail when none field is included (elastic#66281)
  Add action to decommission legacy monitoring cluster alerts (elastic#64373)
  move rollup_index param out of RollupActionConfig (elastic#66139)
  Improve FieldFetcher retrieval of fields (elastic#66160)
  Remove unsed fields in `RestAnalyzeAction` (elastic#66215)
  Simplify searchable snapshot CacheKey (elastic#66263)
  Autoscaling remove feature flags (elastic#65973)
  Improve searchable snapshot mount time (elastic#66198)
  [ML] Report cause when datafeed extraction encounters error (elastic#66167)
  Remove suggest reference in some API specs (elastic#66180)
  Fix warning when installing a plugin for different ESversion (elastic#66146)
  [ML] make `xpack.ml.max_ml_node_size` and `xpack.ml.use_auto_machine_memory_percent` dynamically settable (elastic#66132)
  [DOCS] Add `require_alias` to Bulk API (elastic#66259)
  ...
…4373) (#66309)
Adds an action that will proactively remove any watches that monitoring has configured. The action toggles on a new setting that informs the cluster to tear down any previously created cluster alerts, and after that is accepted, the action immediately attempts a best-effort refresh of cluster alert resources in order to force their removal in case collection is disabled or delayed. Since resources are controlled lazily by the existing monitoring exporters, extra care was taken to ensure that any in-flight resource management operations do not race against any resource actions taken by the migration action. Resource installation code was updated with callbacks to report any errors instead of just logging them.
This follows on from #62668 by adding an action that will proactively remove any watches that monitoring has configured. The action toggles on the new setting that informs the cluster to tear down any previously created cluster alerts, and after that is accepted, the action immediately attempts a best-effort refresh of cluster alert resources in order to force their removal in case collection is disabled or delayed.
Since resources are controlled lazily by the existing monitoring exporters, extra care was taken to ensure that any in-flight resource management operations do not race against any resource actions taken by the migration action. Resource installation code was updated with callbacks to report any errors instead of just logging them.
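The last change described above, reporting installation errors through callbacks rather than only logging them, might look roughly like the following. This is an illustrative stand-in: `Listener` mirrors the shape of Elasticsearch's real `ActionListener` interface, and `ResourceInstaller` and its parameters are invented for the example.

```java
// Sketch: resource installation reports its outcome to the caller, so the
// migrate action can surface failures in its response instead of the error
// disappearing into the logs.
interface Listener<T> {
    void onResponse(T result);

    void onFailure(Exception e);
}

class ResourceInstaller {
    /** Publish a resource, reporting success or failure to the listener. */
    void installResource(String name, boolean shouldFail, Listener<Boolean> listener) {
        try {
            if (shouldFail) {
                throw new IllegalStateException("failed to publish [" + name + "]");
            }
            listener.onResponse(true);
        } catch (Exception e) {
            listener.onFailure(e); // previously this path was only logged
        }
    }
}
```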