Register SLM run before snapshotting to save stats (#110216)
The SLM health indicator relies on policyMetadata.getInvocationsSinceLastSuccess to determine whether the last several snapshots have failed. If a snapshot fails and the master is shut down before invocationsSinceLastSuccess is set, the fact that the failure occurred is lost.
To solve this, before snapshotting we register in the cluster state custom metadata that a snapshot is about to run. If the run fails and invocationsSinceLastSuccess is not updated before a master shutdown, the failure is no longer lost. On completion of a subsequent snapshot run, SnapshotLifecycleTask will observe that there is a registered snapshot which is no longer running. It will infer that the snapshot failed, and update invocationsSinceLastSuccess and other stats accordingly.
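The inference step can be illustrated with a minimal, self-contained sketch. It is not the code in this change: the class SlmRegistrationSketch, the PolicyStats holder, its failures counter, and the reconcile method are hypothetical stand-ins, and the choice to reset invocationsSinceLastSuccess to zero on success is an assumption about the bookkeeping.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class SlmRegistrationSketch {
    /** Hypothetical stand-in for the per-policy metadata and stats. */
    public static final class PolicyStats {
        public long invocationsSinceLastSuccess;
        public long failures;
        public final Set<String> registered = new HashSet<>();
    }

    /** Record that a snapshot with this uuid is about to run. */
    public static void register(PolicyStats stats, String uuid) {
        stats.registered.add(uuid);
    }

    /**
     * Called when a run completes. Any registered snapshot that is neither
     * the one that just completed nor still running is assumed to have failed.
     * Returns the number of failures inferred this way.
     */
    public static int reconcile(PolicyStats stats, Set<String> running,
                                String completedUuid, boolean succeeded) {
        int inferredFailures = 0;
        Iterator<String> it = stats.registered.iterator();
        while (it.hasNext()) {
            String uuid = it.next();
            if (uuid.equals(completedUuid)) {
                it.remove();            // outcome is known, drop the registration
            } else if (!running.contains(uuid)) {
                it.remove();            // registered but no longer running: inferred failure
                inferredFailures++;
            }
        }
        stats.failures += inferredFailures + (succeeded ? 0 : 1);
        // Assumption: a success resets the counter, a failure extends it.
        stats.invocationsSinceLastSuccess = succeeded
            ? 0
            : stats.invocationsSinceLastSuccess + inferredFailures + 1;
        return inferredFailures;
    }
}
```

In this sketch, a snapshot whose failure was never recorded (for example because the master shut down) is still counted the next time any run for the same policy completes.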
A few parts of this change touch general snapshot code, and are worth noting:
* Snapshots can only be uniquely identified by a uuid, which was previously generated in SnapshotService. This uuid is needed when a snapshot fails, but it was not available to SLM because it was only returned in the SnapshotInfo after a success. To make it available, the uuid is now generated in the CreateSnapshotRequest constructor and passed to SnapshotService. In mixed version clusters, there is a special case where the uuid is still generated in SnapshotService.
* If a snapshot were registered before calling SnapshotService, there would be a small period of time during which the snapshot is registered but not yet stored in SnapshotsInProgress. During this time, another snapshot from the same policy might incorrectly infer that the snapshot failed and update its stats accordingly. To avoid this race condition, we register snapshots within SnapshotService, in the same cluster state update that updates SnapshotsInProgress.
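The two points above can be sketched together in a few lines. This is a simplified model, not the actual SnapshotService code: AtomicRegistrationSketch, CreateRequest, and State are hypothetical names, java.util.UUID stands in for whatever uuid generator the real request uses, and an AtomicReference swap stands in for a single cluster state update.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicReference;

public class AtomicRegistrationSketch {
    /** Hypothetical request: the uuid exists before the service is called. */
    public static final class CreateRequest {
        public final String policyId;
        public final String uuid = UUID.randomUUID().toString();
        public CreateRequest(String policyId) { this.policyId = policyId; }
    }

    /** Immutable stand-in for cluster state: registrations plus in-progress snapshots. */
    public static final class State {
        public final Set<String> registered;
        public final Set<String> inProgress;
        State(Set<String> registered, Set<String> inProgress) {
            this.registered = Set.copyOf(registered);
            this.inProgress = Set.copyOf(inProgress);
        }
        /** True if some registered snapshot is not running, i.e. would be inferred as failed. */
        public boolean hasRegisteredButNotRunning() {
            return !inProgress.containsAll(registered);
        }
    }

    private final AtomicReference<State> state =
        new AtomicReference<>(new State(Set.of(), Set.of()));

    /** Registers the snapshot and marks it in progress in one atomic state swap. */
    public State start(CreateRequest request) {
        return state.updateAndGet(old -> {
            Set<String> registered = new HashSet<>(old.registered);
            Set<String> inProgress = new HashSet<>(old.inProgress);
            registered.add(request.uuid);
            inProgress.add(request.uuid);
            return new State(registered, inProgress);
        });
    }

    public State current() { return state.get(); }
}
```

Because both sets change in the same swap, no reader of the state in this model can observe a snapshot that is registered but not yet in progress, which is the window the second bullet describes.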