
Delay shard reassignment from nodes which are known to be restarting #75606


Merged

Conversation

gwbrown
Contributor

@gwbrown gwbrown commented Jul 21, 2021

This PR makes the delayed allocation infrastructure aware of registered node shutdowns, so that reallocation of shards will be further delayed for nodes which are known to be restarting.

To make this more configurable, the Node Shutdown APIs now support an `allocation_delay` parameter, which defaults to 5 minutes. For example:

```
PUT /_nodes/USpTGYaBSIKbgSUJR2Z9lg/shutdown
{
  "type": "restart",
  "reason": "Demonstrating how the node shutdown API works",
  "allocation_delay": "20m"
}
```

This will cause reallocation of shards assigned to that node to be delayed by 20 minutes. Note that this delay will only be used if it's *longer* than the index-level allocation delay, set via `index.unassigned.node_left.delayed_timeout`.

The `allocation_delay` parameter is only valid for `restart`-type shutdown registrations, and the request will be rejected if it's used with another shutdown type.

Relates #70338

@gwbrown gwbrown added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown v7.15.0 labels Jul 21, 2021
@gwbrown gwbrown marked this pull request as ready for review July 22, 2021 22:38
@elasticmachine elasticmachine added Team:Core/Infra Meta label for core/infra team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Jul 22, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@gwbrown
Contributor Author

gwbrown commented Jul 22, 2021

@henningandersen I've pinged you for a review on this PR mostly because I don't know who would be most appropriate from the Distributed team to review this - feel free to reassign as you see fit.

Henning is unavailable this week so switching the distrib-team review request to someone else.

@gwbrown
Contributor Author

gwbrown commented Jul 26, 2021

Failure is #75667.

@elasticmachine run elasticsearch-ci/part-1

@gwbrown gwbrown requested review from DaveCTurner and removed request for henningandersen July 26, 2021 17:12
@gwbrown
Contributor Author

gwbrown commented Jul 26, 2021

@DaveCTurner I've pinged you for a review on this PR mostly because I don't know who would be most appropriate from the Distributed team to review this - feel free to reassign as you see fit.

Member

@dakrone dakrone left a comment

Thanks for working on this, Gordon! I left some pretty minor comments, but nothing major.

Comment on lines 265 to 266
if (Type.RESTART.equals(type) && delayOrDefault == null) {
delayOrDefault = DEFAULT_RESTART_SHARD_ALLOCATION_DELAY;
Member

This feels like the wrong place to implement this default. Rather than here, I think it should go into the getter inside of SingleNodeShutdownMetadata; otherwise calling toXContent on the metadata makes it appear that the delay has been explicitly set (when in reality the default value will be used).
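A minimal sketch of what that could look like (names taken from the excerpts in this thread; illustrative, not the final implementation):

```java
// Inside SingleNodeShutdownMetadata: apply the default lazily in the getter, while
// toXContent keeps serializing the raw (possibly null) field, so "unset" and
// "explicitly set to the default" remain distinguishable.
public TimeValue getShardReallocationDelay() {
    if (shardReallocationDelay == null && Type.RESTART.equals(type)) {
        return DEFAULT_RESTART_SHARD_ALLOCATION_DELAY;
    }
    return shardReallocationDelay;
}
```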

Contributor Author

Good call, I'll change it.

import java.util.Set;

/**
* Holds additional information as to why the shard is in unassigned state.
*/
public final class UnassignedInfo implements ToXContentFragment, Writeable {

private static final Version LAST_ALLOCATED_NODE_VERSION = Version.V_8_0_0;
Member

This feels a little strange to me; do we need this, rather than just using the version directly in the two serialization methods?

Contributor Author

I like doing this because it 1) adds semantic information to the version check, and 2) is less error-prone when switching the serialization version following a backport, especially if there are other serialization version checks in the same class.

I can make it inline if you feel strongly about it.

Member

I think it's fine to keep it, but I would suggest two things. First, we should add a comment about what it is for, and second, I think renaming it to something like VERSION_LAST_ALLOCATED_NODE_ADDED would be better.

Just looking at it with its current name makes me think this is the version of the node where the shard was last allocated.

final Settings indexSettings,
final Map<String, SingleNodeShutdownMetadata> nodesShutdownMap
) {
Map<String, SingleNodeShutdownMetadata> nodeShutdowns = nodesShutdownMap != null ? nodesShutdownMap : Collections.emptyMap();
Member

I don't think we need this null check here, since Metadata#nodeShutdowns() returns an empty map if there are no shutdowns. Maybe we can replace it with an assert instead?
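A sketch of that suggestion, assuming the parameter keeps its current shape:

```java
// Metadata#nodeShutdowns() returns an empty map when there are no shutdowns, so a
// null here would indicate a caller bug; document the contract with an assertion
// instead of silently substituting an empty map.
assert nodesShutdownMap != null : "Metadata#nodeShutdowns() returns an empty map, never null";
```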

Contributor Author

👍 I think this is left over from an intermediate version.

Comment on lines 164 to 155
// Verify that the shard's allocation is still delayed
assertBusy(
() -> { assertThat(client().admin().cluster().prepareHealth().get().getDelayedUnassignedShards(), equalTo(1)); },
2,
TimeUnit.SECONDS
);
Member

I think we should remove this assert, since a long GC or other CI slowness could cause this to be flaky. I think just the ensureGreen below is enough, because it was changed from 3 hours to 10 seconds and the change took effect.

Comment on lines 207 to 208
2,
TimeUnit.SECONDS
Member

I think we can remove the 2 second limit here, since we have 3 hours to check this :)

@gwbrown gwbrown requested a review from dakrone July 28, 2021 15:24
Member

@dakrone dakrone left a comment

LGTM!

@gwbrown
Contributor Author

gwbrown commented Jul 29, 2021

@elasticmachine update branch

@gwbrown
Contributor Author

gwbrown commented Jul 29, 2021

@DaveCTurner Would you prefer us to wait for a review from the Distributed team on this, or should we go ahead and merge?

@henningandersen henningandersen self-requested a review August 2, 2021 14:38
Contributor

@henningandersen henningandersen left a comment

I left a number of comments. I would like your input on whether we could instead capture which delay to use at the time we lose the node (see comments inline).

I did not find any shutdown documentation, but if it exists (I did not search exhaustively), it would be good to update it as part of this PR.

@@ -145,7 +182,8 @@ public boolean equals(Object o) {
     return getStartedAtMillis() == that.getStartedAtMillis()
         && getNodeId().equals(that.getNodeId())
         && getType() == that.getType()
-        && getReason().equals(that.getReason());
+        && getReason().equals(that.getReason())
+        && Objects.equals(getShardReallocationDelay(), that.getShardReallocationDelay());
Contributor

I would prefer to use the native field here, to avoid two objects that are only semantically equivalent (one with a null delay, one with the default set explicitly) comparing as equal. Not that I know a case where it will cause issues, but I find it odd to have two equal objects producing different xcontent. hashCode needs updating too.
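A sketch of the suggested change (field name assumed to be `shardReallocationDelay`), comparing the raw field and folding it into `hashCode` so an unset delay and an explicitly set default no longer compare equal:

```java
// equals(): compare the raw field rather than the defaulted getter.
return getStartedAtMillis() == that.getStartedAtMillis()
    && getNodeId().equals(that.getNodeId())
    && getType() == that.getType()
    && getReason().equals(that.getReason())
    && Objects.equals(shardReallocationDelay, that.shardReallocationDelay);

// hashCode(): must use the same raw field to stay consistent with equals().
@Override
public int hashCode() {
    return Objects.hash(getStartedAtMillis(), getNodeId(), getType(), getReason(), shardReallocationDelay);
}
```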

.filter(shutdownMetadata -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownMetadata.getType()))
.map(SingleNodeShutdownMetadata::getShardReallocationDelay)
.map(TimeValue::nanos)
.orElse(INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexSettings).nanos());
Contributor

In the case where the INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING value is greater than the shutdown reallocation delay, this means that the shutdown delay overrides INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING. That seems wrong? I think we should take the greater of the shutdown delay and the delayed allocation delay.

Member

> That seems wrong? I think we should take the greater of the shutdown delay and the delayed allocation delay.

I think it should be the explicitly set value in the node shutdown, regardless of the cluster-level setting for index delayed node left timeout. In that way we are doing exactly what the user asked (you asked me to wait 10 minutes, so I will wait 10 minutes regardless of the other setting) which is more expected. Why do you think it should be the max? That feels a little like action-at-a-distance when a user could not even be aware of the other cluster-wide setting.

Contributor

index.unassigned.node_left.delayed_timeout is an index level setting. The shutdown allocation delay is less likely to be controllable by a user (I believe cloud will simply use our default value) and I think of it as a "default value" for the index specific delay, allowing some extra (but never less) leeway to handle a planned restart.

If a user setup index.unassigned.node_left.delayed_timeout=15m for some indices, it seems odd that a planned restart would use a smaller value for the specific index/shards.

Also, if the shutdown API was called with allocation_timeout=0, it would be somewhat strange to disregard the index specific delay. That would effectively mean that we would be treating a planned restart as a worse event than a crash.

I think of restarts as something happening automatically, based on autoscaling, upgrades or other maintenance (though it could be an admin doing this, but without discussing with users). So if users (who should control the index level settings) set up their indices with higher delays than the shutdown uses, I think that higher delay should be used.

Contributor Author

I've changed this to use the max of index level delay and restart delay.
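For reference, a sketch of the resolved rule (setting and getter names from the excerpt above):

```java
// Use the greater of the index-level delayed-allocation timeout and the restart
// shutdown's reallocation delay, so a registered restart can extend, but never
// shorten, the delay a user configured on the index.
long delayTimeoutNanos = Math.max(
    INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexSettings).nanos(),
    shutdownMetadata.getShardReallocationDelay().nanos()
);
```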

}

public static SingleNodeShutdownMetadata parse(XContentParser parser) {
return PARSER.apply(parser, null);
}

public static final TimeValue DEFAULT_RESTART_SHARD_ALLOCATION_DELAY = TimeValue.timeValueMinutes(10);
Contributor

10 minutes seems a bit long; I would advocate just 2 minutes, which should be enough to restart a single node with some safety margin?

Member

@dakrone dakrone Aug 2, 2021

We did actually discuss this in the meeting. Cloud uses a 5 minute plan timeout right now and we wanted something a bit higher than that and settled on 10 minutes. Is there a reason you think this should not be 10 minutes?

Contributor

Nothing very substantial, it just feels like a long time to expect a restart to take. Our documentation speaks of 5 minutes too. Our default delayed allocation is 1 minute. And if cloud uses 5 minute timeout, 10 minutes seems excessive.

We should remember that once the timeout is hit, nothing really bad happens, we "just" start doing more work than absolutely necessary in order to restore availability.

If cloud uses a 5 minute timeout, I would advocate using the same timeout here.

Contributor Author

I've changed the default value to 5m.

Comment on lines 502 to 506
boolean delayed = Optional.ofNullable(nodesShutdownMetadata.get(node.nodeId()))
// If we know this node is restarting, then the allocation should be delayed
.map(shutdownMetadata -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownMetadata.getType()))
// Otherwise, use the "normal" allocation delay logic
.orElse(INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexMetadata.getSettings()).nanos() > 0);
Contributor

I think this means that if the shutdown type is REMOVE or REPLACE, we now ignore the delayed left timeout?

Also, I wonder if we should only use the shutdown indication if it had completed, i.e., the node was ready to restart when this happened? Since if it crashes on its own and they set up no delay for their indices, we should probably honor that. A flapping node during a restart could result in a long period of not assigning the shard, since we recalculate the delay every time the shard is recovered and the node subsequently dies.

Finally, if the shutdown reallocation delay and the delayed allocation delay are both 0, we should ensure delayed=false here.
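A sketch addressing the first and third points (names from the excerpt; the "completed shutdown" question from the second point is worked out in the replies below):

```java
// Fall back to the index-level timeout for non-RESTART (or absent) shutdowns, and
// take the max for known restarts; delayed is then false exactly when both delays
// are zero.
long indexDelayNanos = INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexMetadata.getSettings()).nanos();
SingleNodeShutdownMetadata shutdown = nodesShutdownMetadata.get(node.nodeId());
boolean knownRestart = shutdown != null && SingleNodeShutdownMetadata.Type.RESTART.equals(shutdown.getType());
long restartDelayNanos = knownRestart ? shutdown.getShardReallocationDelay().nanos() : 0L;
boolean delayed = Math.max(indexDelayNanos, restartDelayNanos) > 0;
```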

Contributor Author

> I think this means that if the shutdown type is REMOVE or REPLACE, we now ignore the delayed left timeout?

Good catch. I'll fix.

> Also, I wonder if we should only use the shutdown indication if it had completed, i.e., the node was ready to restart when this happened?

We don't actually keep state about whether the shutdown preparation is complete or not; we compute the status when the Get Status API is called. So we'd have to move that logic somewhere it's accessible here, and ensure it always remains fast, since this is called on the cluster state update thread.

It would likely be better if we checked the status, I agree, but RESTART-type shutdowns should be relatively quick to prepare. Do you think this is something we need for v1, or could we introduce it as the shutdown process becomes more involved?

> A flapping node during a restart could result in a long period of not assigning the shard, since we recalculate the delay every time the shard is recovered and the node subsequently dies.

Can't this happen today with the existing allocation delay? Admittedly, it's less likely with a shorter timeout.

Contributor Author

@gwbrown gwbrown Aug 2, 2021

There's also the case of someone intentionally restarting the node before it's ready. Should we use the general delay in that case? It's not clear to me that one behavior is more intuitive than the other.

Contributor

> Do you think this is something we need for v1, or could we introduce it as the shutdown process becomes more involved?

I do not think it is mandatory for v1.

> Can't this happen today with the existing allocation delay? Admittedly, it's less likely with a shorter timeout.

This was specifically for the case where "they set up no delay for their indices". We would be allocating those immediately. Obviously, during a cloud restart today, there would be no such allocation going on though.

Given the technical difficulty in doing this, we can defer to a follow-up.

Contributor

> There's also the case of someone intentionally restarting the node before it's ready. Should we use the general delay in that case? It's not clear to me that one behavior is more intuitive than the other.

Thanks, good point. I suppose that for a RESTART shutdown, the node is really "ready" for restart immediately (since we always anticipate a restart), however, optimally cloud would wait a little before forcing it through. So relying on the presence of the shutdown indication seems good here.

Contributor Author

I've changed this to track whether we knew the last allocated node was restarting at the time the shard became unassigned, and if not, to always use the index-level delay rather than the restart-level delay.
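In other words (a sketch; the reason is then what selects between the two delay sources):

```java
// Record NODE_RESTARTING only when the node had a RESTART shutdown registered at
// the moment it left; otherwise keep NODE_LEFT and thus the index-level delay.
UnassignedInfo.Reason reason = delayedDueToKnownRestart
    ? UnassignedInfo.Reason.NODE_RESTARTING
    : UnassignedInfo.Reason.NODE_LEFT;
```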

// Actually stop the node
internalCluster().stopRandomNode(InternalTestCluster.nameFilter(nodeToRestartName));

// Verify that the shard's allocation is delayed - but with a shorter wait than the reallocation timeout
Contributor

The part of this comment on "shorter wait time" seems off here?

ensureGreen(TimeValue.timeValueSeconds(30), "test");
}

public void testShardAllocationStartsImmediatelyIfShutdownDeleted() throws Exception {
Contributor

This test looks nearly identical to the previous test. I think we should either:

  1. Refactor to share most of the code.
  2. Make just one test that randomly either lowers the timeout or deletes the shutdown indication.

Contributor Author

I'll update this to go with option 1; I don't like relying on randomization to check actually-different code paths like that.

public TimeValue getShardReallocationDelay() {
return shardReallocationDelay;
}

@Override
public ActionRequestValidationException validate() {
Contributor

Perhaps add the validation of type vs reallocation delay here?
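A sketch of what that validation could look like in the request class (error message assumed):

```java
@Override
public ActionRequestValidationException validate() {
    ActionRequestValidationException exception = null;
    // Reject a reallocation delay on any shutdown type other than RESTART.
    if (shardReallocationDelay != null && SingleNodeShutdownMetadata.Type.RESTART.equals(type) == false) {
        exception = ValidateActions.addValidationError(
            "shard reallocation delay is only valid for RESTART-type shutdowns",
            exception
        );
    }
    return exception;
}
```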

) {
long delayTimeoutNanos = Optional.ofNullable(lastAllocatedNodeId)
.map(nodesShutdownMap::get)
.filter(shutdownMetadata -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownMetadata.getType()))
Contributor

It is a bit odd that if a node crashes and then has a shutdown indication added, we extend the node left timeout to be the possibly longer shutdown node left timeout.

I wonder if we could instead capture whether the node was ready for shutdown when we lost it and use that to determine which of the delays to use?

Contributor Author

See my above comments for the difficulty with doing so, but as an alternative, we could at least capture whether the node was registered for shutdown at all when the node went offline.

Contributor

Yes, given your earlier comments, that approach makes sense to me.

@@ -35,14 +38,16 @@
public static final ParseField REASON_FIELD = new ParseField("reason");
public static final String STARTED_AT_READABLE_FIELD = "shutdown_started";
public static final ParseField STARTED_AT_MILLIS_FIELD = new ParseField(STARTED_AT_READABLE_FIELD + "millis");
public static final ParseField SHARD_REALLOCATION_DELAY = new ParseField("shard_reallocation_delay");
Contributor

Perhaps this argument could be just "allocation_delay"?

@gwbrown
Contributor Author

gwbrown commented Aug 2, 2021

Hm, I appear to have oops'd the commit history. Please stand by.

Edit: Fixed, apologies for the force push - I think GitHub handles that a bit better these days, at least.

@henningandersen henningandersen self-requested a review August 12, 2021 10:07
Contributor

@henningandersen henningandersen left a comment

LGTM. Thanks for the extra iterations. I wonder if @dakrone should have a quick re-review given that it changed substantially.

-    INDEX_CLOSED
+    INDEX_CLOSED,
+    /**
+     * Similar to NODE_LEFT, but at the time the node left, it had been registered for a restart via the Node Shutdown API.
Contributor

Perhaps expand the comment here to include the detail that it might be a crash happening during a node restart procedure.

@@ -242,8 +266,12 @@ public UnassignedInfo(Reason reason, @Nullable String message, @Nullable Excepti
this.failedAllocations = failedAllocations;
this.lastAllocationStatus = Objects.requireNonNull(lastAllocationStatus);
this.failedNodeIds = Collections.unmodifiableSet(failedNodeIds);
assert (failedAllocations > 0) == (reason == Reason.ALLOCATION_FAILED) :
"failedAllocations: " + failedAllocations + " for reason " + reason;
this.lastAllocatedNodeId = lastAllocatedNodeId;
Contributor

I think you added the opposite. The assertion above should now be:

assert reason != Reason.NODE_RESTARTING || lastAllocatedNodeId != null

meaning that if reason == restarting, we require a lastAllocatedNodeId.

Unfortunately we cannot require that for NODE_LEFT due to bwc.

@@ -719,6 +719,12 @@ public IndexMetadata getIndexSafe(Index index) {
.orElse(Collections.emptyMap());
}

public Map<String, SingleNodeShutdownMetadata> nodeShutdowns() {
Contributor

This is not really part of this review, but I wonder if we could risk seeing multiple shutdown indications for the same node, for instance both a RESTART and a REMOVE or REPLACE? I think of ECK in particular here, but it might also be relevant in cloud.

Contributor Author

@gwbrown gwbrown Aug 13, 2021

No, there are a couple of things that prevent this:

  1. In TransportPutShutdownNodeAction when we get a PUT for a node that already has a record, it's updated rather than added to, and
  2. The data structure used to store the SingleNodeShutdownMetadata (the Map in the line you're commenting on) is keyed by node UUID, so it should be impossible to have multiple records for the same key/nodeId.

Since the node id is duplicated in the SingleNodeShutdownMetadata as well, it's conceivable that in the case of a bug we could end up with a mismatch between the id used for keying and the id used in the object, but I don't think that's likely.

Contributor

Yeah, sorry if this was unclear; what I meant was whether it could be a reasonable use case to have both a RESTART and one of the two others at the same time. Not that there is anything wrong in this PR; I more wanted to bring this to your attention for possible discussion (and maybe you discussed it already and discarded the use case?).

UnassignedInfo unassignedInfo = new UnassignedInfo(UnassignedInfo.Reason.NODE_LEFT, "node_left [" + node.nodeId() + "]",
null, 0, allocation.getCurrentNanoTime(), System.currentTimeMillis(), delayed, AllocationStatus.NO_ATTEMPT,
Collections.emptySet());
boolean delayedDueToKnownRestart = Optional.ofNullable(nodesShutdownMetadata.get(node.nodeId()))
Contributor

I think this is constant across all shards on the node and could go outside the loop over node.copyShards.
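A sketch of the hoist (loop shape assumed from the surrounding method):

```java
// The shutdown lookup depends only on the node, so compute the flag once rather
// than once per shard copy.
boolean delayedDueToKnownRestart = Optional.ofNullable(nodesShutdownMetadata.get(node.nodeId()))
    .map(shutdown -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdown.getType()))
    .orElse(false);
for (ShardRouting shardRouting : node.copyShards()) {
    // ...build each shard's UnassignedInfo using the precomputed flag...
}
```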

Contributor Author

Good catch, thanks!

Comment on lines 104 to 106
boolean nodeIsRestarting = Optional.ofNullable(metadata.nodeShutdowns().get(shard.currentNodeId()))
.map(shutdownInfo -> shutdownInfo.getType().equals(SingleNodeShutdownMetadata.Type.RESTART))
.orElse(false);
Contributor

This is now unused

Suggested change
boolean nodeIsRestarting = Optional.ofNullable(metadata.nodeShutdowns().get(shard.currentNodeId()))
.map(shutdownInfo -> shutdownInfo.getType().equals(SingleNodeShutdownMetadata.Type.RESTART))
.orElse(false);

new UnassignedInfo(reason, randomBoolean() ? randomAlphaOfLength(4) : null, null,
failedAllocations, System.nanoTime(), System.currentTimeMillis(), false, AllocationStatus.NO_ATTEMPT, failedNodes):
new UnassignedInfo(reason, randomBoolean() ? randomAlphaOfLength(4) : null);
String lastAssignedNodeId = randomBoolean() ? randomAlphaOfLength(10) : null;
Contributor

I think my suggested assertion will fail here. I suggest to refine the test to only fill in lastAssignedNodeId for the right two reason types.
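A sketch of that refinement, assuming the "right two reason types" are NODE_LEFT and NODE_RESTARTING (the reasons that carry a last-allocated node id):

```java
// Only generate a lastAssignedNodeId for reasons that can legitimately carry one,
// so the new assertion (NODE_RESTARTING implies a non-null id) always holds.
String lastAssignedNodeId = (reason == Reason.NODE_LEFT || reason == Reason.NODE_RESTARTING)
    ? randomAlphaOfLength(10)
    : null;
```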


// Generate a random time value - but don't use nanos as extremely small values of nanos can break assertion calculations
final TimeValue shutdownDelay = TimeValue.parseTimeValue(
randomTimeValue(2, 1000, "d", "h", "ms", "s", "m", "micros"),
Contributor

To avoid risking too many iterations on the randomValueOtherThanMany I propose:

Suggested change
randomTimeValue(2, 1000, "d", "h", "ms", "s", "m", "micros"),
randomTimeValue(100, 1000, "d", "h", "ms", "s", "m", "micros"),

nodeToRestartId,
SingleNodeShutdownMetadata.Type.RESTART,
this.getTestName(),
TimeValue.timeValueSeconds(1)
Contributor

Why not just 1 millisecond?

@@ -76,10 +82,20 @@ public void testSerialization() throws Exception {
int failedAllocations = randomIntBetween(1, 100);
Contributor

I wonder if we should add a test of bwc serialization? Since the functionality will only be enabled in 7.15 and I am unaware of any shutdown rolling upgrade tests, we could in theory be exposed to a bwc bug in serialization (not that I saw any, looks all good to me).

Contributor Author

I'm not sure it's worth adding a unit test for this, but I do intend to add some rolling and full-restart upgrade tests for this soon.

@gwbrown
Contributor Author

gwbrown commented Aug 16, 2021

Per brief discussion elsewhere, I'm going to go ahead and merge this - @dakrone can review the recent changes when he's back online, and any changes can be made as a follow-up.

@gwbrown gwbrown merged commit 58f66cf into elastic:master Aug 16, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Aug 16, 2021
wjp719 added a commit to wjp719/elasticsearch that referenced this pull request Aug 17, 2021
gwbrown added a commit that referenced this pull request Aug 17, 2021
gwbrown added a commit that referenced this pull request Aug 17, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Aug 17, 2021
gwbrown added a commit that referenced this pull request Aug 17, 2021
This PR changes the serialization version for the contents of #75606 and re-enables BWC tests following the backport of that PR (backport in #76587).