SLM policy unexpectedly executed twice #63754
Labels
>bug
:Data Management/ILM+SLM (Index and Snapshot lifecycle management)
needs:triage (Requires assignment of a team area label)
Team:Data Management (Meta label for data/management team)
Comments
Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)
jaymode added a commit to jaymode/elasticsearch that referenced this issue on Oct 15, 2020:
This commit ensures that jobs within the SchedulerEngine do not continue to run after they are cancelled. There was no synchronization between the cancel method of an ActiveSchedule and the run method, so an actively running schedule would go ahead and reschedule itself even if the cancel method had been called. This commit adds synchronization between cancelling and the scheduling of the next run to ensure that the job is cancelled. In real life scenarios this could manifest as a job running multiple times for SLM. This could happen if a job had been triggered and was cancelled prior to completing its run such as if the node was no longer the master node or if SLM was stopping/stopped. Closes elastic#63754
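The race described in the commit message can be sketched as follows. This is an illustration with hypothetical names, not the actual Elasticsearch `SchedulerEngine` code: a self-rescheduling job whose `run()` queues the next execution under the same lock that `cancel()` takes, so a run that is in flight when `cancel()` is called can no longer schedule a next run.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of a cancellable, self-rescheduling schedule (names are hypothetical).
// Without the synchronized blocks, a run() executing concurrently with
// cancel() could reschedule itself after cancellation -- the bug this issue
// describes, visible as an SLM job running more than once.
class ActiveScheduleSketch implements Runnable {
    private final ScheduledExecutorService scheduler;
    private final Runnable job;
    private final long periodMillis;
    private boolean cancelled = false;   // guarded by `this`
    private ScheduledFuture<?> next;     // guarded by `this`

    ActiveScheduleSketch(ScheduledExecutorService scheduler, Runnable job, long periodMillis) {
        this.scheduler = scheduler;
        this.job = job;
        this.periodMillis = periodMillis;
    }

    synchronized void start() {
        if (!cancelled) {
            next = scheduler.schedule(this, periodMillis, TimeUnit.MILLISECONDS);
        }
    }

    @Override
    public void run() {
        job.run();
        // Rescheduling holds the same lock as cancel(): once `cancelled` is
        // set, no further run can be queued, even by an in-flight run().
        synchronized (this) {
            if (!cancelled) {
                next = scheduler.schedule(this, periodMillis, TimeUnit.MILLISECONDS);
            }
        }
    }

    synchronized void cancel() {
        cancelled = true;
        if (next != null) {
            next.cancel(false);   // do not interrupt a run already executing
        }
    }
}
```

The key design point is that the `cancelled` flag and the pending `ScheduledFuture` are only ever read or written while holding the lock, which makes "cancel" and "schedule next run" mutually exclusive rather than racy.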
jaymode added a commit that referenced this issue on Oct 15, 2020:
This commit ensures that jobs within the SchedulerEngine do not continue to run after they are cancelled. There was no synchronization between the cancel method of an ActiveSchedule and the run method, so an actively running schedule would go ahead and reschedule itself even if the cancel method had been called. This commit adds synchronization between cancelling and the scheduling of the next run to ensure that the job is cancelled. In real life scenarios this could manifest as a job running multiple times for SLM. This could happen if a job had been triggered and was cancelled prior to completing its run such as if the node was no longer the master node or if SLM was stopping/stopped. Closes #63754
jaymode added a commit to jaymode/elasticsearch that referenced this issue on Oct 15, 2020:
This commit ensures that jobs within the SchedulerEngine do not continue to run after they are cancelled. There was no synchronization between the cancel method of an ActiveSchedule and the run method, so an actively running schedule would go ahead and reschedule itself even if the cancel method had been called. This commit adds synchronization between cancelling and the scheduling of the next run to ensure that the job is cancelled. In real life scenarios this could manifest as a job running multiple times for SLM. This could happen if a job had been triggered and was cancelled prior to completing its run such as if the node was no longer the master node or if SLM was stopping/stopped. Closes elastic#63754
jaymode added a commit that referenced this issue on Oct 15, 2020:
This commit ensures that jobs within the SchedulerEngine do not continue to run after they are cancelled. There was no synchronization between the cancel method of an ActiveSchedule and the run method, so an actively running schedule would go ahead and reschedule itself even if the cancel method had been called. This commit adds synchronization between cancelling and the scheduling of the next run to ensure that the job is cancelled. In real life scenarios this could manifest as a job running multiple times for SLM. This could happen if a job had been triggered and was cancelled prior to completing its run such as if the node was no longer the master node or if SLM was stopping/stopped. Closes #63754 Backport of #63762
jaymode added a commit that referenced this issue on Oct 15, 2020:
This commit ensures that jobs within the SchedulerEngine do not continue to run after they are cancelled. There was no synchronization between the cancel method of an ActiveSchedule and the run method, so an actively running schedule would go ahead and reschedule itself even if the cancel method had been called. This commit adds synchronization between cancelling and the scheduling of the next run to ensure that the job is cancelled. In real life scenarios this could manifest as a job running multiple times for SLM. This could happen if a job had been triggered and was cancelled prior to completing its run such as if the node was no longer the master node or if SLM was stopping/stopped. Closes #63754 Backport of #63762
jaymode added a commit to jaymode/elasticsearch that referenced this issue on Nov 2, 2020:
The SchedulerEngine used by SLM uses a custom runnable that will schedule itself for its next execution if there is one to run. For the majority of jobs, this scheduling could be many hours or days away. Due to the scheduling so far in advance, there is a chance that time drifts on the machine or even that time varies core to core so there is no guarantee that the job actually runs on or after the scheduled time. This can cause some jobs to reschedule themselves for the same scheduled time even if they ran only a millisecond prior to the scheduled time, which causes unexpected actions to be taken such as what appears as duplicated snapshots. This change resolves this by checking the triggered time against the scheduled time and using the appropriate value to ensure that we do not have unexpected job runs. Relates elastic#63754
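The drift bug and its fix can be illustrated numerically for a fixed-interval schedule (a stand-in for the real cron schedule; all names here are hypothetical, not the actual `SchedulerEngine` API). If a job scheduled for time T fires at T minus 1 ms, computing "next slot after the trigger time" yields T again, producing a second run of the same slot; computing it from `max(triggered, scheduled)` cannot.

```java
// Sketch of the time-drift fix for a fixed-interval schedule (hypothetical
// names). A job scheduled for slot T that fires at T - 1 ms must not be
// handed slot T again as its "next" run.
final class NextRunSketch {

    /** First interval boundary strictly after {@code timeMillis}. */
    static long nextAfter(long timeMillis, long startMillis, long intervalMillis) {
        long elapsed = timeMillis - startMillis;
        long intervals = elapsed < 0 ? 0 : elapsed / intervalMillis + 1;
        return startMillis + intervals * intervalMillis;
    }

    /** Buggy: an early trigger (triggered < scheduled) yields the same slot again. */
    static long nextRunBuggy(long triggeredMillis, long scheduledMillis,
                             long startMillis, long intervalMillis) {
        return nextAfter(triggeredMillis, startMillis, intervalMillis);
    }

    /** Fixed: never compute the next run from a time earlier than the slot just run. */
    static long nextRunFixed(long triggeredMillis, long scheduledMillis,
                             long startMillis, long intervalMillis) {
        return nextAfter(Math.max(triggeredMillis, scheduledMillis), startMillis, intervalMillis);
    }
}
```

With a 1000 ms interval starting at 0, a trigger at 999 for the slot at 1000 gives `nextRunBuggy(...) == 1000` (the duplicate run) but `nextRunFixed(...) == 2000`, matching the one-second-apart duplicate snapshots reported below (19:29:59 and 19:30:00).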
jaymode added a commit that referenced this issue on Nov 4, 2020:
The SchedulerEngine used by SLM uses a custom runnable that will schedule itself for its next execution if there is one to run. For the majority of jobs, this scheduling could be many hours or days away. Due to the scheduling so far in advance, there is a chance that time drifts on the machine or even that time varies core to core so there is no guarantee that the job actually runs on or after the scheduled time. This can cause some jobs to reschedule themselves for the same scheduled time even if they ran only a millisecond prior to the scheduled time, which causes unexpected actions to be taken such as what appears as duplicated snapshots. This change resolves this by checking the triggered time against the scheduled time and using the appropriate value to ensure that we do not have unexpected job runs. Relates #63754
jaymode added a commit to jaymode/elasticsearch that referenced this issue on Nov 4, 2020:
The SchedulerEngine used by SLM uses a custom runnable that will schedule itself for its next execution if there is one to run. For the majority of jobs, this scheduling could be many hours or days away. Due to the scheduling so far in advance, there is a chance that time drifts on the machine or even that time varies core to core so there is no guarantee that the job actually runs on or after the scheduled time. This can cause some jobs to reschedule themselves for the same scheduled time even if they ran only a millisecond prior to the scheduled time, which causes unexpected actions to be taken such as what appears as duplicated snapshots. This change resolves this by checking the triggered time against the scheduled time and using the appropriate value to ensure that we do not have unexpected job runs. Relates elastic#63754 Backport of elastic#64501
jaymode added a commit that referenced this issue on Nov 4, 2020:
The SchedulerEngine used by SLM uses a custom runnable that will schedule itself for its next execution if there is one to run. For the majority of jobs, this scheduling could be many hours or days away. Due to the scheduling so far in advance, there is a chance that time drifts on the machine or even that time varies core to core so there is no guarantee that the job actually runs on or after the scheduled time. This can cause some jobs to reschedule themselves for the same scheduled time even if they ran only a millisecond prior to the scheduled time, which causes unexpected actions to be taken such as what appears as duplicated snapshots. This change resolves this by checking the triggered time against the scheduled time and using the appropriate value to ensure that we do not have unexpected job runs. Relates #63754 Backport of #64501
jaymode added a commit that referenced this issue on Nov 4, 2020:
The SchedulerEngine used by SLM uses a custom runnable that will schedule itself for its next execution if there is one to run. For the majority of jobs, this scheduling could be many hours or days away. Due to the scheduling so far in advance, there is a chance that time drifts on the machine or even that time varies core to core so there is no guarantee that the job actually runs on or after the scheduled time. This can cause some jobs to reschedule themselves for the same scheduled time even if they ran only a millisecond prior to the scheduled time, which causes unexpected actions to be taken such as what appears as duplicated snapshots. This change resolves this by checking the triggered time against the scheduled time and using the appropriate value to ensure that we do not have unexpected job runs. Relates #63754 Backport of #64501
Elasticsearch version (bin/elasticsearch --version): 7.9

Description of the problem including expected versus actual behavior: An SLM policy is triggered twice at the same time daily.

Policy definition:

Provide logs (if relevant):

Note that the job Daily-Snapshots-2 is triggered at both 2020-10-09T19:29:59 and 2020-10-09T19:30:00.