Tighten up threading in snapshot finalization #124403

Conversation

DaveCTurner
Contributor

Today snapshot finalization does nontrivial work on the calling thread
(often the cluster applier thread) and also in theory may fork back to
the cluster applier thread in `getRepositoryData`, yet it always forks
at least one task (the `SnapshotInfo` write) to the `SNAPSHOT` pool
anyway. With this change we fork to the `SNAPSHOT` pool straight away
and then make sure to stay on this pool throughout.
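The thread-hopping pattern described above can be sketched with plain `java.util.concurrent` primitives. This is a minimal illustration, not Elasticsearch code: `snapshotPool`, `ForkOnceSketch`, and the step runnables are hypothetical stand-ins for the `SNAPSHOT` pool and the finalization steps. The point is that one up-front fork lets every subsequent step run inline on the same pool thread, with no work left on the caller and no second fork needed.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class ForkOnceSketch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for the SNAPSHOT pool: a dedicated executor with named threads.
        ExecutorService snapshotPool =
            Executors.newFixedThreadPool(1, r -> new Thread(r, "snapshot-pool"));

        AtomicReference<String> firstStep = new AtomicReference<>();
        AtomicReference<String> secondStep = new AtomicReference<>();

        // Fork once up front, then run every finalization step inline on the same
        // pool thread instead of doing work on the caller and forking again later.
        snapshotPool.execute(() -> {
            firstStep.set(Thread.currentThread().getName());   // e.g. preparatory calculations
            secondStep.set(Thread.currentThread().getName());  // e.g. the SnapshotInfo write
        });

        snapshotPool.shutdown();
        snapshotPool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(firstStep.get() + "," + secondStep.get());
    }
}
```

Both steps report the same pool thread, which is exactly the "stay on this pool throughout" property the PR enforces.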
@DaveCTurner DaveCTurner added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.19.0 v9.1.0 labels Mar 8, 2025
@DaveCTurner DaveCTurner requested review from DiannaHohensee and removed request for DiannaHohensee March 8, 2025 09:22
@DaveCTurner
Contributor Author

Somewhat relates to #108907: moving this code into a method object will let us split it up into more manageable pieces and eventually move it out of `SnapshotsService` altogether, which will give us more opportunities to simplify finalization. Not doing that yet because it'd be quite noisy, and I've crafted this PR to have a minimal diff for ease of review.

@DaveCTurner DaveCTurner marked this pull request as ready for review March 10, 2025 10:46
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Mar 10, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

```java
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(ActionRunnable.supply(metadataListener, () -> {
    // This listener is kinda unnecessary since we now always complete it synchronously. It's only here to catch exceptions.
    // TODO simplify this.
    ActionListener.completeWith(metadataListener, () -> {
```
Contributor Author
FWIW this is equivalent to `threadPool.executor(EsExecutors.DIRECT_EXECUTOR_SERVICE).execute(ActionRunnable.supply(metadataListener, () -> {`, i.e. the only change from before is the move from forking to `ThreadPool.Names.SNAPSHOT` to running this on the current thread.
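The equivalence claimed here rests on the contract of a "direct" executor: it runs the submitted task synchronously on the calling thread. A minimal sketch (hypothetical `DIRECT` executor and class name, standing in for `EsExecutors.DIRECT_EXECUTOR_SERVICE`) demonstrates that no fork happens:

```java
import java.util.concurrent.Executor;

public class DirectExecutorSketch {
    // Hypothetical equivalent of a direct executor: execute() runs the task
    // synchronously on the calling thread rather than forking to a pool.
    static final Executor DIRECT = Runnable::run;

    public static void main(String[] args) {
        String caller = Thread.currentThread().getName();
        final String[] taskThread = new String[1];
        DIRECT.execute(() -> taskThread[0] = Thread.currentThread().getName());
        // The task observed the caller's own thread: no fork happened.
        System.out.println(caller.equals(taskThread[0]));
    }
}
```

Since the outer finalization work now already runs on the `SNAPSHOT` pool, running this inner task directly keeps it on that same pool thread.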

Contributor

@DiannaHohensee DiannaHohensee left a comment


Looks good!

I just left some comments picking on the comments :) Whatever improvements you make should be good 👍

```java
 * and triggering the next snapshot-related activity.
 */
// This only really makes sense to run against a BlobStoreRepository, and the division of work between this class and
// BlobStoreRepository#finalizeSnapshot is kind of awkward and artificial; TODO consolidate all this stuff into one place and simplify
```
Contributor

Maybe a little more concise, like:

```java
// TODO: This only makes sense to run against a BlobStoreRepository. This logic should be consolidated into the
// BlobStoreRepository#finalizeSnapshot method, and hopefully simplified thereby.
```

Contributor Author

I don't think it'll work just to move this into `BlobStoreRepository#finalizeSnapshot`, so I didn't want to prescribe that as the solution.

```java
/**
 * Implements the finalization process for a snapshot: does some preparatory calculations, builds a {@link SnapshotInfo} and a
 * {@link FinalizeSnapshotContext}, calls {@link Repository#finalizeSnapshot} and handles the outcome by notifying waiting listeners
 * and triggering the next snapshot-related activity.
```
Contributor

This comment is vague. What's the next snapshot-related activity? What are the preparatory calculations (maybe remove this if not important to know)? There's no context about `FinalizeSnapshotContext` and why it's relevant to mention on the interface here.

How do I use this thing and why?

Contributor Author

The comment is more to orient the reader about what the method does, because it's a little long right now (it will be shorter soon). It's an `AbstractRunnable`, so the only way to use it is to run it, and the why is that you are finalizing a snapshot.

The next snapshot-related activity is deliberately vague because it kinda could be anything that was blocked on this finalization. This includes completing another snapshot, starting a batch of deletes, running a cleanup, likely others too.
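The "whatever was blocked on this finalization runs next" idea can be sketched as a queue of pending activities drained on completion. All names here (`blockedActivities`, `onFinalizationDone`, `NextActivitySketch`) are hypothetical illustrations, not the actual `SnapshotsService` mechanism:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class NextActivitySketch {
    // Hypothetical queue of snapshot-related activities blocked on this finalization:
    // completing another snapshot, starting a batch of deletes, running a cleanup...
    static final Queue<Runnable> blockedActivities = new ArrayDeque<>();

    // When finalization completes, whatever was waiting gets to run next.
    static void onFinalizationDone() {
        Runnable next = blockedActivities.poll();
        if (next != null) {
            next.run();
        }
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        blockedActivities.add(() -> log.append("start-batched-deletes"));
        onFinalizationDone();
        System.out.println(log);
    }
}
```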

```java
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(ActionRunnable.supply(metadataListener, () -> {
    // This listener is kinda unnecessary since we now always complete it synchronously. It's only here to catch exceptions.
    // TODO simplify this.
    ActionListener.completeWith(metadataListener, () -> {
```
Contributor

[opt] If you know what this code does, it might be worth documenting what originally caused us to fork this work onto the SNAPSHOT pool -- CPU- or I/O-intensive? But perhaps that's overkill.

Contributor Author

This will go away in a (hopefully fairly immediate) follow-up. It used to fork because `repo.getSnapshotIndexMetaData` interacts with the repository.

```java
@Override
public void onRejection(Exception e) {
    if (e instanceof EsRejectedExecutionException esre && esre.isExecutorShutdown()) {
        logger.debug("failing finalization of {} due to shutdown", snapshot);
```
Contributor

Nice extra handling.

To verify my understanding, and make a note of the behavior change: previously we'd log a warning about failing due to shutdown; now we're making that quiet under debug. That seems reasonable.

Contributor Author

Previously we didn't fork here, we just ran all this on the calling thread, so rejection was not a thing that could happen.
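The new failure mode is the standard executor behavior: once a pool is shut down, submitting a task throws a rejection, which the new `onRejection` override turns into a quiet debug-level failure. A minimal sketch using plain JDK classes (hypothetical class name; `RejectedExecutionException` plays the role of `EsRejectedExecutionException` with `isExecutorShutdown()` true):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectionSketch {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.shutdown(); // simulate node shutdown: the pool now rejects new tasks

        try {
            pool.execute(() -> {});
            System.out.println("accepted");
        } catch (RejectedExecutionException e) {
            // In the PR this surfaces via onRejection(): rejection at shutdown is
            // expected, so it is logged at debug rather than treated as an error.
            System.out.println("rejected due to shutdown");
        }
    }
}
```

Before the change there was no fork at this point, so this code path could never be reached; after it, shutdown-time rejection becomes possible and needs the explicit handling shown in the diff.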

@DaveCTurner DaveCTurner added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport Automatically create backport pull requests when merged labels Mar 11, 2025
@elasticsearchmachine elasticsearchmachine merged commit f1f2df7 into elastic:main Mar 11, 2025
17 checks passed
@DaveCTurner DaveCTurner deleted the 2025/03/08/snapshot-finalization-threading branch March 11, 2025 07:13
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 11, 2025
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch 8.x: success

elasticsearchmachine pushed a commit that referenced this pull request Mar 11, 2025
albertzaharovits pushed a commit to albertzaharovits/elasticsearch that referenced this pull request Mar 13, 2025
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Mar 13, 2025