SLM policy unexpectedly executed twice #63754

Closed
alisonelizabeth opened this issue Oct 15, 2020 · 1 comment · Fixed by #63762
Labels
>bug · :Data Management/ILM+SLM (Index and Snapshot lifecycle management) · needs:triage (Requires assignment of a team area label) · Team:Data Management (Meta label for data/management team)

Comments

@alisonelizabeth
Contributor

Elasticsearch version (bin/elasticsearch --version): 7.9

Description of the problem including expected versus actual behavior: An SLM policy is triggered twice at the same time every day, even though its schedule should run it only once.

Policy definition:

    "policy" : {
      "name" : "<snapshotrepo01-{now/d}>",
      "schedule" : "0 30 1 * * ?",
      "repository" : "Repo_01",
      "config" : {
        "indices" : [ ]
      },
      "retention" : {
        "expire_after" : "20d",
        "max_count" : 20
      }
    },
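
For reference, "0 30 1 * * ?" is a Quartz-style cron expression (second, minute, hour, day-of-month, month, day-of-week), so the policy is expected to fire exactly once a day, at 01:30:00. A minimal java.time sketch of that next-fire computation (illustrative only; Elasticsearch uses its own cron parser, and the class/method names here are invented):

    // Not Elasticsearch code: a toy computation of the next daily 01:30:00 trigger.
    import java.time.LocalTime;
    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;

    public class NextFireTime {
        static ZonedDateTime nextDailyRun(ZonedDateTime now) {
            // Today at 01:30:00 if that is still ahead of "now", otherwise tomorrow at 01:30:00.
            ZonedDateTime candidate = now.with(LocalTime.of(1, 30, 0));
            return candidate.isAfter(now) ? candidate : candidate.plusDays(1);
        }

        public static void main(String[] args) {
            System.out.println("next trigger: " + nextDailyRun(ZonedDateTime.now(ZoneOffset.UTC)));
        }
    }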

Provide logs (if relevant):

[2020-10-09T19:29:59,873][DEBUG][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy task triggered from job [Daily-Snapshots-2]
[2020-10-09T19:29:59,873][INFO ][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy [Daily-Snapshots] issuing create snapshot [snapshotrepo01-daily-2020.10.10-gcgzdxlaq4iwjcyygzniag]
[2020-10-09T19:29:59,905][INFO ][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy job [Daily-Snapshots-2] issued new snapshot creation for [snapshotrepo01-daily-2020.10.10-gcgzdxlaq4iwjcyygzniag] successfully
[2020-10-09T19:29:59,971][INFO ][o.e.s.SnapshotsService   ] [VST3ESATESTT01] snapshot [Repo_01:snapshotrepo01-daily-2020.10.10-gcgzdxlaq4iwjcyygzniag/sOOFeP8vToeujQIAiSPBWQ] started
[2020-10-09T19:30:00,049][DEBUG][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy task triggered from job [Daily-Snapshots-2]
[2020-10-09T19:30:00,049][INFO ][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy [Daily-Snapshots] issuing create snapshot [snapshotrepo01-daily-2020.10.10-zjit3qp9ro6lnjas-zv-da]
[2020-10-09T19:30:00,078][INFO ][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy job [Daily-Snapshots-2] issued new snapshot creation for [snapshotrepo01-daily-2020.10.10-zjit3qp9ro6lnjas-zv-da] successfully
[2020-10-09T19:30:00,203][INFO ][o.e.s.SnapshotsService   ] [VST3ESATESTT01] snapshot [Repo_01:snapshotrepo01-daily-2020.10.10-zjit3qp9ro6lnjas-zv-da/k0vrURFyR5G3Vr8PxvF6xA] started
[2020-10-09T19:31:00,774][DEBUG][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot response for [Daily-Snapshots]: {"snapshot":{"snapshot":"snapshotrepo01-daily-2020.10.10-gcgzdxlaq4iwjcyygzniag","uuid":"sOOFeP8vToeujQIAiSPBWQ","version_id":7090299,"version":"7.9.2","indices":[".watcher-history-11-2020.10.07","metricbeat-7.5.2-2020.06.12-000006",".slm-history-1-000007",".monitoring-kibana-7-2020.10.04","kibana_sample_data_flights_restored","metricbeat-7.5.1-2020.04.06-000003",".kibana-event-log-7.9.2-000001",".apm-agent-configuration","winlogbeat-7.5.1-2020.10.10-002717","metricbeat-7.5.2-2020.09.10-000009",".slm-history-1-000005","metricbeat-7.4.2-2020.01.26-000002",".kibana_task_manager_2",".slm-history-1-000008",".security-7","metricbeat-7.5.1-2020.05.06-000004","filebeat-7.5.2-2020.07.05-000006",".kibana_3","metricbeat-7.5.2-2020.07.12-000007",".monitoring-alerts-7","ilm-history-2-000001",".monitoring-es-7-2020.10.06","metricbeat-7.4.2-2020.07.24-000008","metricbeat-7.5.1-2020.06.05-000005","metricbeat-7.5.1-2020.08.05-000007",".watcher-history-11-2020.10.05","kibana_sample_data_flights","metricbeat-7.4.2-2020.03.26-000004",".monitoring-es-7-2020.10.05",".slm-history-1-000006","ilm-history-2-000002","filebeat-7.5.2-2020.03.07-000002",".monitoring-kibana-7-2020.10.05","kibana_sample_data_flights_restored1","metricbeat-7.5.1-2020.03.07-000002",".kibana_task_manager_1",".monitoring-kibana-7-2020.10.06",".slm-history-2-000002","filebeat-7.4.2-2020.10.09-002645",".async-search","filebeat-7.9.1-2020.10.03-000002",".monitoring-es-7-2020.10.09","metricbeat-7.5.2-2020.03.14-000003",".monitoring-es-7-2020.10.04",".monitoring-kibana-7-2020.10.08","metricbeat-7.4.2-2020.04.25-000005","winlogbeat-7.5.2-2020.10.10-002724",".monitoring-es-7-2020.10.08","filebeat-7.5.2-2020.08.04-000007",".triggered_watches","filebeat-7.5.2-2020.09.03-000008",".kibana_1",".kibana-event-log-7.9.0-000002",".watcher-history-11-2020.10.08",".kibana-event-log-7.9.1-000002","filebeat-7.4.2-2020.01.14-000001","kibana_sample_data_ecommerce","winlogbeat-7.5.2-2020.10.09-002723","metricbeat-7.5.2-2020.02.13-000002",".monitoring-kibana-7-2020.10.07","metricbeat-7.5.1-2020.09.04-000008",".slm-history-2-000001",".watcher-history-11-2020.06.15",".monitoring-es-7-2020.10.07","winlogbeat-7.4.2-2020.10.10-002648","filebeat-7.5.2-2020.05.06-000004","nuix_stats","winlogbeat-7.4.2-2020.10.09-002647","metricbeat-7.4.2-2020.08.23-000009","my-index-000001",".kibana_1_restored","metricbeat-7.5.1-2020.07.06-000006","soniccopy",".kibana-event-log-7.9.0-000001","filebeat-7.5.2-2020.04.06-000003","metricbeat-7.5.1-2020.02.06-000001",".monitoring-kibana-7-2020.10.10",".watcher-history-11-2020.10.10","filebeat-7.4.2-2020.10.10-002646","winlogbeat-7.4.2-2020.01.14-000001","metricbeat-7.4.2-2020.09.22-000010",".management-beats","metricbeat-7.4.2-2020.06.24-000007",".watcher-history-11-2020.10.04","filebeat-7.9.1-2020.09.03-000001","justin_test","filebeat-7.5.2-2020.06.05-000005",".kibana-event-log-7.9.1-000001","metricbeat-7.4.2-2020.02.25-000003","metricbeat-7.5.2-2020.08.11-000008",".monitoring-kibana-7-2020.10.09",".watches",".watcher-history-11-2020.10.09","kibana_sample_data_logs","metricbeat-7.5.2-2020.04.13-000004","metricbeat-7.5.2-2020.10.10-000010",".kibana_2","metricbeat-7.4.2-2020.05.25-000006","filebeat-7.5.2-2020.10.03-000009",".apm-custom-link","metricbeat-7.5.2-2020.05.13-000005","winlogbeat-7.5.1-2020.10.09-002716","metricbeat-7.5.1-2020.10.04-000009",".watcher-history-11-2020.10.06",".monitoring-es-7-2020.10.10"],"data_strea
ms":[],"include_global_state":true,"metadata":{"policy":"Daily-Snapshots"},"state":"SUCCESS","start_time":"2020-10-10T02:29:59.905Z","start_time_in_millis":1602296999905,"end_time":"2020-10-10T02:30:59.555Z","end_time_in_millis":1602297059555,"duration_in_millis":59650,"failures":[],"shards":{"total":105,"failed":0,"successful":105}}}
[2020-10-09T19:31:00,774][INFO ][o.e.s.SnapshotsService   ] [VST3ESATESTT01] snapshot [Repo_01:snapshotrepo01-daily-2020.10.10-gcgzdxlaq4iwjcyygzniag/sOOFeP8vToeujQIAiSPBWQ] completed with state [SUCCESS]
[2020-10-09T19:31:01,905][DEBUG][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot response for [Daily-Snapshots]: {"snapshot":{"snapshot":"snapshotrepo01-daily-2020.10.10-zjit3qp9ro6lnjas-zv-da","uuid":"k0vrURFyR5G3Vr8PxvF6xA","version_id":7090299,"version":"7.9.2","indices":[".watcher-history-11-2020.10.07","metricbeat-7.5.2-2020.06.12-000006",".slm-history-1-000007",".monitoring-kibana-7-2020.10.04","metricbeat-7.5.1-2020.04.06-000003","kibana_sample_data_flights_restored",".kibana-event-log-7.9.2-000001",".apm-agent-configuration","winlogbeat-7.5.1-2020.10.10-002717","metricbeat-7.5.2-2020.09.10-000009",".slm-history-1-000005","metricbeat-7.4.2-2020.01.26-000002",".kibana_task_manager_2",".slm-history-1-000008",".security-7","metricbeat-7.5.1-2020.05.06-000004","filebeat-7.5.2-2020.07.05-000006",".kibana_3",".monitoring-alerts-7","metricbeat-7.5.2-2020.07.12-000007","ilm-history-2-000001",".monitoring-es-7-2020.10.06","metricbeat-7.4.2-2020.07.24-000008","metricbeat-7.5.1-2020.06.05-000005","metricbeat-7.5.1-2020.08.05-000007",".watcher-history-11-2020.10.05","kibana_sample_data_flights","metricbeat-7.4.2-2020.03.26-000004",".monitoring-es-7-2020.10.05",".slm-history-1-000006","ilm-history-2-000002","filebeat-7.5.2-2020.03.07-000002",".monitoring-kibana-7-2020.10.05","kibana_sample_data_flights_restored1","metricbeat-7.5.1-2020.03.07-000002",".kibana_task_manager_1",".monitoring-kibana-7-2020.10.06",".slm-history-2-000002","filebeat-7.4.2-2020.10.09-002645",".async-search","filebeat-7.9.1-2020.10.03-000002",".monitoring-es-7-2020.10.09","metricbeat-7.5.2-2020.03.14-000003",".monitoring-es-7-2020.10.04","winlogbeat-7.5.2-2020.10.10-002724",".monitoring-kibana-7-2020.10.08","metricbeat-7.4.2-2020.04.25-000005",".monitoring-es-7-2020.10.08","filebeat-7.5.2-2020.08.04-000007",".triggered_watches","filebeat-7.5.2-2020.09.03-000008",".kibana_1",".kibana-event-log-7.9.0-000002",".watcher-history-11-2020.10.08","filebeat-7.4.2-2020.01.14-000001",".kibana-event-log-7.9.1-000002","kibana_sample_data_ecommerce","winlogbeat-7.5.2-2020.10.09-002723","metricbeat-7.5.2-2020.02.13-000002",".monitoring-kibana-7-2020.10.07",".slm-history-2-000001","metricbeat-7.5.1-2020.09.04-000008",".watcher-history-11-2020.06.15",".monitoring-es-7-2020.10.07","filebeat-7.5.2-2020.05.06-000004","winlogbeat-7.4.2-2020.10.10-002648","nuix_stats","metricbeat-7.4.2-2020.08.23-000009","winlogbeat-7.4.2-2020.10.09-002647","my-index-000001","metricbeat-7.5.1-2020.07.06-000006",".kibana_1_restored","soniccopy",".kibana-event-log-7.9.0-000001","filebeat-7.5.2-2020.04.06-000003","metricbeat-7.5.1-2020.02.06-000001",".monitoring-kibana-7-2020.10.10",".watcher-history-11-2020.10.10","filebeat-7.4.2-2020.10.10-002646","winlogbeat-7.4.2-2020.01.14-000001","metricbeat-7.4.2-2020.09.22-000010",".management-beats","metricbeat-7.4.2-2020.06.24-000007",".watcher-history-11-2020.10.04","filebeat-7.9.1-2020.09.03-000001","justin_test",".kibana-event-log-7.9.1-000001","filebeat-7.5.2-2020.06.05-000005","metricbeat-7.4.2-2020.02.25-000003","metricbeat-7.5.2-2020.08.11-000008",".monitoring-kibana-7-2020.10.09",".watches",".watcher-history-11-2020.10.09","metricbeat-7.5.2-2020.04.13-000004","kibana_sample_data_logs","metricbeat-7.5.2-2020.10.10-000010",".kibana_2","metricbeat-7.4.2-2020.05.25-000006","filebeat-7.5.2-2020.10.03-000009","metricbeat-7.5.2-2020.05.13-000005",".apm-custom-link","winlogbeat-7.5.1-2020.10.09-002716","metricbeat-7.5.1-2020.10.04-000009",".watcher-history-11-2020.10.06",".monitoring-es-7-2020.10.10"],"data_strea
ms":[],"include_global_state":true,"metadata":{"policy":"Daily-Snapshots"},"state":"SUCCESS","start_time":"2020-10-10T02:30:00.109Z","start_time_in_millis":1602297000109,"end_time":"2020-10-10T02:31:00.586Z","end_time_in_millis":1602297060586,"duration_in_millis":60477,"failures":[],"shards":{"total":105,"failed":0,"successful":105}}}
[2020-10-09T19:31:01,905][INFO ][o.e.s.SnapshotsService   ] [VST3ESATESTT01] snapshot [Repo_01:snapshotrepo01-daily-2020.10.10-zjit3qp9ro6lnjas-zv-da/k0vrURFyR5G3Vr8PxvF6xA] completed with state [SUCCESS]

Note that job Daily-Snapshots-2 is triggered twice, less than a second apart: once at 2020-10-09T19:29:59 and again at 2020-10-09T19:30:00.

alisonelizabeth added the >bug and needs:triage labels on Oct 15, 2020
original-brownbear added the :Data Management/ILM+SLM label on Oct 15, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

elasticmachine added the Team:Data Management label on Oct 15, 2020
jaymode added a commit to jaymode/elasticsearch that referenced this issue Oct 15, 2020
This commit ensures that jobs within the SchedulerEngine do not
continue to run after they are cancelled. There was no synchronization
between the cancel method of an ActiveSchedule and the run method, so
an actively running schedule would go ahead and reschedule itself even
if the cancel method had been called.

This commit adds synchronization between cancelling and the scheduling
of the next run to ensure that the job is cancelled. In real life
scenarios this could manifest as a job running multiple times for
SLM. This could happen if a job had been triggered and was cancelled
prior to completing its run such as if the node was no longer the
master node or if SLM was stopping/stopped.

Closes elastic#63754
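
The race described in this commit message can be pictured with a small, self-contained sketch (this is not the actual SchedulerEngine/ActiveSchedule code; names and structure are simplified for illustration):

    // A hedged sketch of the cancel/reschedule race, not Elasticsearch's implementation.
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    class ActiveScheduleSketch implements Runnable {
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private ScheduledFuture<?> future;
        private boolean cancelled = false;

        synchronized void start(long delayMillis) {
            future = scheduler.schedule(this, delayMillis, TimeUnit.MILLISECONDS);
        }

        @Override
        public void run() {
            triggerJob();                     // e.g. kick off a snapshot
            synchronized (this) {             // the fix: check the cancel flag and reschedule atomically
                if (cancelled == false) {     // without this lock, a cancelled job could reschedule itself
                    future = scheduler.schedule(this, nextDelayMillis(), TimeUnit.MILLISECONDS);
                }
            }
        }

        synchronized void cancel() {
            cancelled = true;                 // visible to run() because both paths hold the same lock
            if (future != null) {
                future.cancel(false);
            }
            scheduler.shutdown();
        }

        void triggerJob() { System.out.println("job triggered"); }

        long nextDelayMillis() { return TimeUnit.DAYS.toMillis(1); }
    }

If cancel() arrives while run() is between triggering the job and rescheduling, the shared lock guarantees that run() observes the cancelled flag before creating the next schedule, which is the behavior the commit describes.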
jaymode added a commit that referenced this issue Oct 15, 2020 (same commit message as above; closes #63754)
jaymode added a commit to jaymode/elasticsearch that referenced this issue Oct 15, 2020 (same commit message; closes elastic#63754)
jaymode added a commit that referenced this issue Oct 15, 2020 (same commit message; closes #63754, backport of #63762)
jaymode added a commit that referenced this issue Oct 15, 2020 (same commit message; closes #63754, backport of #63762)
jaymode added a commit to jaymode/elasticsearch that referenced this issue Nov 2, 2020
The SchedulerEngine used by SLM uses a custom runnable that will
schedule itself for its next execution if there is one to run. For the
majority of jobs, this scheduling could be many hours or days away. Due
to the scheduling so far in advance, there is a chance that time drifts
on the machine or even that time varies core to core so there is no
guarantee that the job actually runs on or after the scheduled time.
This can cause some jobs to reschedule themselves for the same
scheduled time even if they ran only a millisecond prior to the
scheduled time, which causes unexpected actions to be taken such as
what appears as duplicated snapshots.

This change resolves this by checking the triggered time against the
scheduled time and using the appropriate value to ensure that we do
not have unexpected job runs.

Relates elastic#63754
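
To make the drift scenario concrete, here is a hedged sketch (not the actual SchedulerEngine code; the toy "next midnight" schedule stands in for a real cron expression). If the trigger fires one millisecond before the scheduled time and the next occurrence is computed from the trigger time alone, the same slot is selected again; basing it on the later of the two timestamps avoids the duplicate run:

    import java.util.function.LongUnaryOperator;

    public class NextRunSketch {
        // Base the next occurrence on whichever of the trigger time and the scheduled time
        // is later, so a trigger that fires slightly early cannot re-select the same slot.
        static long nextRun(long triggeredMillis, long scheduledMillis, LongUnaryOperator nextAfter) {
            return nextAfter.applyAsLong(Math.max(triggeredMillis, scheduledMillis));
        }

        public static void main(String[] args) {
            long day = 24 * 60 * 60 * 1000L;
            LongUnaryOperator dailyAtMidnight = t -> ((t / day) + 1) * day;     // toy daily schedule
            long scheduled = 3 * day;            // run scheduled for midnight of day 3
            long triggered = scheduled - 1;      // fired 1 ms early because of clock drift
            System.out.println(dailyAtMidnight.applyAsLong(triggered));         // 259200000: day 3 again (duplicate run)
            System.out.println(nextRun(triggered, scheduled, dailyAtMidnight)); // 345600000: day 4, as intended
        }
    }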
jaymode added a commit that referenced this issue Nov 4, 2020 (same commit message as above; relates #63754)
jaymode added a commit to jaymode/elasticsearch that referenced this issue Nov 4, 2020 (same commit message; relates elastic#63754, backport of elastic#64501)
jaymode added a commit that referenced this issue Nov 4, 2020 (same commit message; relates #63754, backport of #64501)
jaymode added a commit that referenced this issue Nov 4, 2020 (same commit message; relates #63754, backport of #64501)