SLM policy unexpectedly executed twice #63754

Closed
alisonelizabeth opened this issue Oct 15, 2020 · 1 comment · Fixed by #63762
Labels
>bug · :Data Management/ILM+SLM (Index and Snapshot lifecycle management) · needs:triage (Requires assignment of a team area label) · Team:Data Management (Meta label for data/management team)

Comments

@alisonelizabeth
Contributor

Elasticsearch version (bin/elasticsearch --version): 7.9

Description of the problem including expected versus actual behavior: An SLM policy is triggered twice at the same time every day, even though its schedule should run it only once.

Policy definition:

    "policy" : {
      "name" : "<snapshotrepo01-{now/d}>",
      "schedule" : "0 30 1 * * ?",
      "repository" : "Repo_01",
      "config" : {
        "indices" : [ ]
      },
      "retention" : {
        "expire_after" : "20d",
        "max_count" : 20
      }
    },
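
For reference, "0 30 1 * * ?" is a Quartz-style cron expression (second, minute, hour, day-of-month, month, day-of-week), so the policy is expected to fire exactly once a day, at 01:30:00. A minimal java.time sketch of that next-fire computation (illustrative only; Elasticsearch uses its own cron parser, and the class/method names here are invented):

    // Not Elasticsearch code: a toy computation of the next daily 01:30:00 trigger.
    import java.time.LocalTime;
    import java.time.ZoneOffset;
    import java.time.ZonedDateTime;

    public class NextFireTime {
        static ZonedDateTime nextDailyRun(ZonedDateTime now) {
            // Today at 01:30:00 if that is still ahead of "now", otherwise tomorrow at 01:30:00.
            ZonedDateTime candidate = now.with(LocalTime.of(1, 30, 0));
            return candidate.isAfter(now) ? candidate : candidate.plusDays(1);
        }

        public static void main(String[] args) {
            System.out.println("next trigger: " + nextDailyRun(ZonedDateTime.now(ZoneOffset.UTC)));
        }
    }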

Provide logs (if relevant):

[2020-10-09T19:29:59,873][DEBUG][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy task triggered from job [Daily-Snapshots-2]
[2020-10-09T19:29:59,873][INFO ][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy [Daily-Snapshots] issuing create snapshot [snapshotrepo01-daily-2020.10.10-gcgzdxlaq4iwjcyygzniag]
[2020-10-09T19:29:59,905][INFO ][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy job [Daily-Snapshots-2] issued new snapshot creation for [snapshotrepo01-daily-2020.10.10-gcgzdxlaq4iwjcyygzniag] successfully
[2020-10-09T19:29:59,971][INFO ][o.e.s.SnapshotsService   ] [VST3ESATESTT01] snapshot [Repo_01:snapshotrepo01-daily-2020.10.10-gcgzdxlaq4iwjcyygzniag/sOOFeP8vToeujQIAiSPBWQ] started
[2020-10-09T19:30:00,049][DEBUG][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy task triggered from job [Daily-Snapshots-2]
[2020-10-09T19:30:00,049][INFO ][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy [Daily-Snapshots] issuing create snapshot [snapshotrepo01-daily-2020.10.10-zjit3qp9ro6lnjas-zv-da]
[2020-10-09T19:30:00,078][INFO ][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot lifecycle policy job [Daily-Snapshots-2] issued new snapshot creation for [snapshotrepo01-daily-2020.10.10-zjit3qp9ro6lnjas-zv-da] successfully
[2020-10-09T19:30:00,203][INFO ][o.e.s.SnapshotsService   ] [VST3ESATESTT01] snapshot [Repo_01:snapshotrepo01-daily-2020.10.10-zjit3qp9ro6lnjas-zv-da/k0vrURFyR5G3Vr8PxvF6xA] started
[2020-10-09T19:31:00,774][DEBUG][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot response for [Daily-Snapshots]: {"snapshot":{"snapshot":"snapshotrepo01-daily-2020.10.10-gcgzdxlaq4iwjcyygzniag","uuid":"sOOFeP8vToeujQIAiSPBWQ","version_id":7090299,"version":"7.9.2","indices":[".watcher-history-11-2020.10.07","metricbeat-7.5.2-2020.06.12-000006",".slm-history-1-000007",".monitoring-kibana-7-2020.10.04","kibana_sample_data_flights_restored","metricbeat-7.5.1-2020.04.06-000003",".kibana-event-log-7.9.2-000001",".apm-agent-configuration","winlogbeat-7.5.1-2020.10.10-002717","metricbeat-7.5.2-2020.09.10-000009",".slm-history-1-000005","metricbeat-7.4.2-2020.01.26-000002",".kibana_task_manager_2",".slm-history-1-000008",".security-7","metricbeat-7.5.1-2020.05.06-000004","filebeat-7.5.2-2020.07.05-000006",".kibana_3","metricbeat-7.5.2-2020.07.12-000007",".monitoring-alerts-7","ilm-history-2-000001",".monitoring-es-7-2020.10.06","metricbeat-7.4.2-2020.07.24-000008","metricbeat-7.5.1-2020.06.05-000005","metricbeat-7.5.1-2020.08.05-000007",".watcher-history-11-2020.10.05","kibana_sample_data_flights","metricbeat-7.4.2-2020.03.26-000004",".monitoring-es-7-2020.10.05",".slm-history-1-000006","ilm-history-2-000002","filebeat-7.5.2-2020.03.07-000002",".monitoring-kibana-7-2020.10.05","kibana_sample_data_flights_restored1","metricbeat-7.5.1-2020.03.07-000002",".kibana_task_manager_1",".monitoring-kibana-7-2020.10.06",".slm-history-2-000002","filebeat-7.4.2-2020.10.09-002645",".async-search","filebeat-7.9.1-2020.10.03-000002",".monitoring-es-7-2020.10.09","metricbeat-7.5.2-2020.03.14-000003",".monitoring-es-7-2020.10.04",".monitoring-kibana-7-2020.10.08","metricbeat-7.4.2-2020.04.25-000005","winlogbeat-7.5.2-2020.10.10-002724",".monitoring-es-7-2020.10.08","filebeat-7.5.2-2020.08.04-000007",".triggered_watches","filebeat-7.5.2-2020.09.03-000008",".kibana_1",".kibana-event-log-7.9.0-000002",".watcher-history-11-2020.10.08",".kibana-event-log-7.9.1-000002","filebeat-7.4.2-2020.01.14-000001","kibana_sample_data_ecommerce","winlogbeat-7.5.2-2020.10.09-002723","metricbeat-7.5.2-2020.02.13-000002",".monitoring-kibana-7-2020.10.07","metricbeat-7.5.1-2020.09.04-000008",".slm-history-2-000001",".watcher-history-11-2020.06.15",".monitoring-es-7-2020.10.07","winlogbeat-7.4.2-2020.10.10-002648","filebeat-7.5.2-2020.05.06-000004","nuix_stats","winlogbeat-7.4.2-2020.10.09-002647","metricbeat-7.4.2-2020.08.23-000009","my-index-000001",".kibana_1_restored","metricbeat-7.5.1-2020.07.06-000006","soniccopy",".kibana-event-log-7.9.0-000001","filebeat-7.5.2-2020.04.06-000003","metricbeat-7.5.1-2020.02.06-000001",".monitoring-kibana-7-2020.10.10",".watcher-history-11-2020.10.10","filebeat-7.4.2-2020.10.10-002646","winlogbeat-7.4.2-2020.01.14-000001","metricbeat-7.4.2-2020.09.22-000010",".management-beats","metricbeat-7.4.2-2020.06.24-000007",".watcher-history-11-2020.10.04","filebeat-7.9.1-2020.09.03-000001","justin_test","filebeat-7.5.2-2020.06.05-000005",".kibana-event-log-7.9.1-000001","metricbeat-7.4.2-2020.02.25-000003","metricbeat-7.5.2-2020.08.11-000008",".monitoring-kibana-7-2020.10.09",".watches",".watcher-history-11-2020.10.09","kibana_sample_data_logs","metricbeat-7.5.2-2020.04.13-000004","metricbeat-7.5.2-2020.10.10-000010",".kibana_2","metricbeat-7.4.2-2020.05.25-000006","filebeat-7.5.2-2020.10.03-000009",".apm-custom-link","metricbeat-7.5.2-2020.05.13-000005","winlogbeat-7.5.1-2020.10.09-002716","metricbeat-7.5.1-2020.10.04-000009",".watcher-history-11-2020.10.06",".monitoring-es-7-2020.10.10"],"data_strea
ms":[],"include_global_state":true,"metadata":{"policy":"Daily-Snapshots"},"state":"SUCCESS","start_time":"2020-10-10T02:29:59.905Z","start_time_in_millis":1602296999905,"end_time":"2020-10-10T02:30:59.555Z","end_time_in_millis":1602297059555,"duration_in_millis":59650,"failures":[],"shards":{"total":105,"failed":0,"successful":105}}}
[2020-10-09T19:31:00,774][INFO ][o.e.s.SnapshotsService   ] [VST3ESATESTT01] snapshot [Repo_01:snapshotrepo01-daily-2020.10.10-gcgzdxlaq4iwjcyygzniag/sOOFeP8vToeujQIAiSPBWQ] completed with state [SUCCESS]
[2020-10-09T19:31:01,905][DEBUG][o.e.x.s.SnapshotLifecycleTask] [VST3ESATESTT01] snapshot response for [Daily-Snapshots]: {"snapshot":{"snapshot":"snapshotrepo01-daily-2020.10.10-zjit3qp9ro6lnjas-zv-da","uuid":"k0vrURFyR5G3Vr8PxvF6xA","version_id":7090299,"version":"7.9.2","indices":[".watcher-history-11-2020.10.07","metricbeat-7.5.2-2020.06.12-000006",".slm-history-1-000007",".monitoring-kibana-7-2020.10.04","metricbeat-7.5.1-2020.04.06-000003","kibana_sample_data_flights_restored",".kibana-event-log-7.9.2-000001",".apm-agent-configuration","winlogbeat-7.5.1-2020.10.10-002717","metricbeat-7.5.2-2020.09.10-000009",".slm-history-1-000005","metricbeat-7.4.2-2020.01.26-000002",".kibana_task_manager_2",".slm-history-1-000008",".security-7","metricbeat-7.5.1-2020.05.06-000004","filebeat-7.5.2-2020.07.05-000006",".kibana_3",".monitoring-alerts-7","metricbeat-7.5.2-2020.07.12-000007","ilm-history-2-000001",".monitoring-es-7-2020.10.06","metricbeat-7.4.2-2020.07.24-000008","metricbeat-7.5.1-2020.06.05-000005","metricbeat-7.5.1-2020.08.05-000007",".watcher-history-11-2020.10.05","kibana_sample_data_flights","metricbeat-7.4.2-2020.03.26-000004",".monitoring-es-7-2020.10.05",".slm-history-1-000006","ilm-history-2-000002","filebeat-7.5.2-2020.03.07-000002",".monitoring-kibana-7-2020.10.05","kibana_sample_data_flights_restored1","metricbeat-7.5.1-2020.03.07-000002",".kibana_task_manager_1",".monitoring-kibana-7-2020.10.06",".slm-history-2-000002","filebeat-7.4.2-2020.10.09-002645",".async-search","filebeat-7.9.1-2020.10.03-000002",".monitoring-es-7-2020.10.09","metricbeat-7.5.2-2020.03.14-000003",".monitoring-es-7-2020.10.04","winlogbeat-7.5.2-2020.10.10-002724",".monitoring-kibana-7-2020.10.08","metricbeat-7.4.2-2020.04.25-000005",".monitoring-es-7-2020.10.08","filebeat-7.5.2-2020.08.04-000007",".triggered_watches","filebeat-7.5.2-2020.09.03-000008",".kibana_1",".kibana-event-log-7.9.0-000002",".watcher-history-11-2020.10.08","filebeat-7.4.2-2020.01.14-000001",".kibana-event-log-7.9.1-000002","kibana_sample_data_ecommerce","winlogbeat-7.5.2-2020.10.09-002723","metricbeat-7.5.2-2020.02.13-000002",".monitoring-kibana-7-2020.10.07",".slm-history-2-000001","metricbeat-7.5.1-2020.09.04-000008",".watcher-history-11-2020.06.15",".monitoring-es-7-2020.10.07","filebeat-7.5.2-2020.05.06-000004","winlogbeat-7.4.2-2020.10.10-002648","nuix_stats","metricbeat-7.4.2-2020.08.23-000009","winlogbeat-7.4.2-2020.10.09-002647","my-index-000001","metricbeat-7.5.1-2020.07.06-000006",".kibana_1_restored","soniccopy",".kibana-event-log-7.9.0-000001","filebeat-7.5.2-2020.04.06-000003","metricbeat-7.5.1-2020.02.06-000001",".monitoring-kibana-7-2020.10.10",".watcher-history-11-2020.10.10","filebeat-7.4.2-2020.10.10-002646","winlogbeat-7.4.2-2020.01.14-000001","metricbeat-7.4.2-2020.09.22-000010",".management-beats","metricbeat-7.4.2-2020.06.24-000007",".watcher-history-11-2020.10.04","filebeat-7.9.1-2020.09.03-000001","justin_test",".kibana-event-log-7.9.1-000001","filebeat-7.5.2-2020.06.05-000005","metricbeat-7.4.2-2020.02.25-000003","metricbeat-7.5.2-2020.08.11-000008",".monitoring-kibana-7-2020.10.09",".watches",".watcher-history-11-2020.10.09","metricbeat-7.5.2-2020.04.13-000004","kibana_sample_data_logs","metricbeat-7.5.2-2020.10.10-000010",".kibana_2","metricbeat-7.4.2-2020.05.25-000006","filebeat-7.5.2-2020.10.03-000009","metricbeat-7.5.2-2020.05.13-000005",".apm-custom-link","winlogbeat-7.5.1-2020.10.09-002716","metricbeat-7.5.1-2020.10.04-000009",".watcher-history-11-2020.10.06",".monitoring-es-7-2020.10.10"],"data_strea
ms":[],"include_global_state":true,"metadata":{"policy":"Daily-Snapshots"},"state":"SUCCESS","start_time":"2020-10-10T02:30:00.109Z","start_time_in_millis":1602297000109,"end_time":"2020-10-10T02:31:00.586Z","end_time_in_millis":1602297060586,"duration_in_millis":60477,"failures":[],"shards":{"total":105,"failed":0,"successful":105}}}
[2020-10-09T19:31:01,905][INFO ][o.e.s.SnapshotsService   ] [VST3ESATESTT01] snapshot [Repo_01:snapshotrepo01-daily-2020.10.10-zjit3qp9ro6lnjas-zv-da/k0vrURFyR5G3Vr8PxvF6xA] completed with state [SUCCESS]

Note that job Daily-Snapshots-2 is triggered twice, less than a second apart: once at 2020-10-09T19:29:59 and again at 2020-10-09T19:30:00.

alisonelizabeth added the >bug and needs:triage labels on Oct 15, 2020
original-brownbear added the :Data Management/ILM+SLM label on Oct 15, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

elasticmachine added the Team:Data Management label on Oct 15, 2020
jaymode added a commit to jaymode/elasticsearch that referenced this issue Oct 15, 2020
This commit ensures that jobs within the SchedulerEngine do not
continue to run after they are cancelled. There was no synchronization
between the cancel method of an ActiveSchedule and the run method, so
an actively running schedule would go ahead and reschedule itself even
if the cancel method had been called.

This commit adds synchronization between cancelling and the scheduling
of the next run to ensure that the job is cancelled. In real life
scenarios this could manifest as a job running multiple times for
SLM. This could happen if a job had been triggered and was cancelled
prior to completing its run such as if the node was no longer the
master node or if SLM was stopping/stopped.

Closes elastic#63754
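
The race described in this commit message can be pictured with a small, self-contained sketch (this is not the actual SchedulerEngine/ActiveSchedule code; names and structure are simplified for illustration):

    // A hedged sketch of the cancel/reschedule race, not Elasticsearch's implementation.
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    class ActiveScheduleSketch implements Runnable {
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private ScheduledFuture<?> future;
        private boolean cancelled = false;

        synchronized void start(long delayMillis) {
            future = scheduler.schedule(this, delayMillis, TimeUnit.MILLISECONDS);
        }

        @Override
        public void run() {
            triggerJob();                     // e.g. kick off a snapshot
            synchronized (this) {             // the fix: check the cancel flag and reschedule atomically
                if (cancelled == false) {     // without this lock, a cancelled job could reschedule itself
                    future = scheduler.schedule(this, nextDelayMillis(), TimeUnit.MILLISECONDS);
                }
            }
        }

        synchronized void cancel() {
            cancelled = true;                 // visible to run() because both paths hold the same lock
            if (future != null) {
                future.cancel(false);
            }
            scheduler.shutdown();
        }

        void triggerJob() { System.out.println("job triggered"); }

        long nextDelayMillis() { return TimeUnit.DAYS.toMillis(1); }
    }

If cancel() arrives while run() is between triggering the job and rescheduling, the shared lock guarantees that run() observes the cancelled flag before creating the next schedule, which is the behavior the commit describes.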
jaymode added a commit that referenced this issue Oct 15, 2020 (same commit message as above; closes #63754)
jaymode added a commit to jaymode/elasticsearch that referenced this issue Oct 15, 2020 (same commit message; closes elastic#63754)
jaymode added a commit that referenced this issue Oct 15, 2020 (same commit message; closes #63754, backport of #63762)
jaymode added a commit that referenced this issue Oct 15, 2020 (same commit message; closes #63754, backport of #63762)
jaymode added a commit to jaymode/elasticsearch that referenced this issue Nov 2, 2020
The SchedulerEngine used by SLM uses a custom runnable that will
schedule itself for its next execution if there is one to run. For the
majority of jobs, this scheduling could be many hours or days away. Due
to the scheduling so far in advance, there is a chance that time drifts
on the machine or even that time varies core to core so there is no
guarantee that the job actually runs on or after the scheduled time.
This can cause some jobs to reschedule themselves for the same
scheduled time even if they ran only a millisecond prior to the
scheduled time, which causes unexpected actions to be taken such as
what appears as duplicated snapshots.

This change resolves this by checking the triggered time against the
scheduled time and using the appropriate value to ensure that we do
not have unexpected job runs.

Relates elastic#63754
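
To make the drift scenario concrete, here is a hedged sketch (not the actual SchedulerEngine code; the toy "next midnight" schedule stands in for a real cron expression). If the trigger fires one millisecond before the scheduled time and the next occurrence is computed from the trigger time alone, the same slot is selected again; basing it on the later of the two timestamps avoids the duplicate run:

    import java.util.function.LongUnaryOperator;

    public class NextRunSketch {
        // Base the next occurrence on whichever of the trigger time and the scheduled time
        // is later, so a trigger that fires slightly early cannot re-select the same slot.
        static long nextRun(long triggeredMillis, long scheduledMillis, LongUnaryOperator nextAfter) {
            return nextAfter.applyAsLong(Math.max(triggeredMillis, scheduledMillis));
        }

        public static void main(String[] args) {
            long day = 24 * 60 * 60 * 1000L;
            LongUnaryOperator dailyAtMidnight = t -> ((t / day) + 1) * day;     // toy daily schedule
            long scheduled = 3 * day;            // run scheduled for midnight of day 3
            long triggered = scheduled - 1;      // fired 1 ms early because of clock drift
            System.out.println(dailyAtMidnight.applyAsLong(triggered));         // 259200000: day 3 again (duplicate run)
            System.out.println(nextRun(triggered, scheduled, dailyAtMidnight)); // 345600000: day 4, as intended
        }
    }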
jaymode added a commit that referenced this issue Nov 4, 2020 (same commit message as above; relates #63754)
jaymode added a commit to jaymode/elasticsearch that referenced this issue Nov 4, 2020 (same commit message; relates elastic#63754, backport of elastic#64501)
jaymode added a commit that referenced this issue Nov 4, 2020 (same commit message; relates #63754, backport of #64501)
jaymode added a commit that referenced this issue Nov 4, 2020 (same commit message; relates #63754, backport of #64501)