Delay shard reassignment from nodes which are known to be restarting #75606
Conversation
Pinging @elastic/es-core-infra (Team:Core/Infra)
Pinging @elastic/es-distributed (Team:Distributed)
Henning is unavailable this week so switching the distrib-team review request to someone else.
Failure is #75667. @elasticmachine run elasticsearch-ci/part-1
@DaveCTurner I've pinged you for a review on this PR mostly because I don't know who would be most appropriate from the Distributed team to review this - feel free to reassign as you see fit.
Thanks for working on this Gordon! I left some pretty minor comments, but nothing major
    if (Type.RESTART.equals(type) && delayOrDefault == null) {
        delayOrDefault = DEFAULT_RESTART_SHARD_ALLOCATION_DELAY;
This feels like the wrong place to implement this default; I think rather than here, it should go into the getter inside of SingleNodeShutdownMetadata, otherwise calling toXContent on the metadata makes it appear that it has been set (when in reality the default value will be used).
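A rough sketch of that suggestion, assuming a nullable shardReallocationDelay field (illustrative, not the final implementation): the default is applied lazily in the getter while toXContent keeps reading the raw field, so the stored metadata only reflects what the user actually set.

```java
// Hypothetical getter in SingleNodeShutdownMetadata: apply the RESTART default lazily.
// The raw shardReallocationDelay field stays null unless the user set it explicitly.
public TimeValue getShardReallocationDelay() {
    if (shardReallocationDelay != null) {
        return shardReallocationDelay;
    }
    return Type.RESTART.equals(type) ? DEFAULT_RESTART_SHARD_ALLOCATION_DELAY : null;
}
```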
Good call, I'll change it.
import java.util.Set;

/**
 * Holds additional information as to why the shard is in unassigned state.
 */
public final class UnassignedInfo implements ToXContentFragment, Writeable {

    private static final Version LAST_ALLOCATED_NODE_VERSION = Version.V_8_0_0;
This feels a little strange to me, do we need this rather than just using the version directly in the two serialization methods?
I like doing this because it 1) adds semantic information to the version check, and 2) is less error-prone when switching the serialization version following a backport, especially if there are other serialization version checks in the same class.
I can make it inline if you feel strongly about it.
I think it's fine to keep it, but I would suggest two things. First, we should add a comment about what it is for, and second, I think renaming it to something like VERSION_LAST_ALLOCATED_NODE_ADDED would be better. Just looking at it with its current name makes me think this is the version of the node where the shard was last allocated.
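For context, a named version constant like this is typically referenced from both serialization paths; a sketch (field and constant names here are illustrative) might look like:

```java
// Writing side: only send the new field to nodes that are recent enough to read it.
// A matching onOrAfter() check would guard the read in the StreamInput constructor.
@Override
public void writeTo(StreamOutput out) throws IOException {
    // ... existing fields ...
    if (out.getVersion().onOrAfter(VERSION_LAST_ALLOCATED_NODE_ADDED)) {
        out.writeOptionalString(lastAllocatedNodeId);
    }
}
```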
        final Settings indexSettings,
        final Map<String, SingleNodeShutdownMetadata> nodesShutdownMap
    ) {
        Map<String, SingleNodeShutdownMetadata> nodeShutdowns = nodesShutdownMap != null ? nodesShutdownMap : Collections.emptyMap();
I don't think we need this null check here, since Metadata#nodeShutdowns() returns an empty map if there are no shutdowns. Maybe we can replace it with an assert instead?
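Presumably something along these lines (a sketch of the suggested change, not the actual code):

```java
// Trust Metadata#nodeShutdowns() to never return null and fail loudly in tests instead
// of silently substituting an empty map.
assert nodesShutdownMap != null : "node shutdown metadata should never be null here";
```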
👍 I think this is left over from an intermediate version.
        // Verify that the shard's allocation is still delayed
        assertBusy(
            () -> { assertThat(client().admin().cluster().prepareHealth().get().getDelayedUnassignedShards(), equalTo(1)); },
            2,
            TimeUnit.SECONDS
        );
I think we should remove this assert, since a long GC or other CI slowness could cause this to be flaky. I think just the ensureGreen below is enough, because it was changed from 3 hours to 10 seconds and the change took effect.
            2,
            TimeUnit.SECONDS
I think we can remove the 2 second limit here, since we have 3 hours to check this :)
LGTM!
@elasticmachine update branch
@DaveCTurner Would you prefer us to wait for a review from the Distributed team on this, or should we go ahead and merge?
I left a number of comments. I would like your input on whether we could instead capture which delay to use when we lose the node (see comments inline)?
I did not find any shutdown documentation, but if it is there (did not search exhaustively) it would be good to update as part of this PR.
@@ -145,7 +182,8 @@ public boolean equals(Object o) {
        return getStartedAtMillis() == that.getStartedAtMillis()
            && getNodeId().equals(that.getNodeId())
            && getType() == that.getType()
            && getReason().equals(that.getReason())
            && Objects.equals(getShardReallocationDelay(), that.getShardReallocationDelay());
I would prefer to use the native field here, to avoid two objects that are merely semantically equivalent (an explicit value vs. the implied default) comparing as equal. Not that I know a case where it will cause issues, but I find it odd to have two equal objects producing different xcontent. Need to update hashCode too.
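In other words, roughly the following shape, where shardReallocationDelay is the raw, possibly-null field rather than the defaulting getter (a sketch, not the actual class):

```java
@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    SingleNodeShutdownMetadata that = (SingleNodeShutdownMetadata) o;
    return getStartedAtMillis() == that.getStartedAtMillis()
        && getNodeId().equals(that.getNodeId())
        && getType() == that.getType()
        && getReason().equals(that.getReason())
        && Objects.equals(shardReallocationDelay, that.shardReallocationDelay); // raw field, not the getter
}

@Override
public int hashCode() {
    // Hash the same raw field so equals() and hashCode() stay consistent.
    return Objects.hash(getNodeId(), getType(), getReason(), getStartedAtMillis(), shardReallocationDelay);
}
```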
            .filter(shutdownMetadata -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownMetadata.getType()))
            .map(SingleNodeShutdownMetadata::getShardReallocationDelay)
            .map(TimeValue::nanos)
            .orElse(INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexSettings).nanos());
In the case where the INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING setting is greater than the shutdown reallocation delay, this means that the shutdown delay overrides INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING. That seems wrong? I think we should take the greater of the shutdown delay and the delayed allocation delay.
> That seems wrong? I think we should take the greater of the shutdown delay and the delayed allocation delay.
I think it should be the explicitly set value in the node shutdown, regardless of the cluster-level setting for index delayed node left timeout. In that way we are doing exactly what the user asked (you asked me to wait 10 minutes, so I will wait 10 minutes regardless of the other setting) which is more expected. Why do you think it should be the max? That feels a little like action-at-a-distance when a user could not even be aware of the other cluster-wide setting.
index.unassigned.node_left.delayed_timeout is an index-level setting. The shutdown allocation delay is less likely to be controllable by a user (I believe cloud will simply use our default value) and I think of it as a "default value" for the index-specific delay, allowing some extra (but never less) leeway to handle a planned restart.
If a user set up index.unassigned.node_left.delayed_timeout=15m for some indices, it seems odd that a planned restart would use a smaller value for those specific indices/shards.
Also, if the shutdown API was called with allocation_timeout=0, it would be somewhat strange to disregard the index-specific delay. That would effectively mean that we would be treating a planned restart as a worse event than a crash.
I think of restarts as something happening automatically, based on autoscaling, upgrades or other maintenance (though it could be an admin doing this, but without discussing with users). So if users (who should control the index-level settings) set up their indices with higher delays than the shutdown uses, I think that higher delay should be used.
I've changed this to use the max of index level delay and restart delay.
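The resulting behaviour is roughly the following (a sketch using the names from the snippet above; a planned restart can only extend, never shorten, the delay a user configured):

```java
// Take the longer of the index-level node-left delay and the restart allocation delay.
long indexLevelDelayNanos = INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexSettings).nanos();
long delayTimeoutNanos = Optional.ofNullable(lastAllocatedNodeId)
    .map(nodesShutdownMap::get)
    .filter(shutdown -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdown.getType()))
    .map(SingleNodeShutdownMetadata::getShardReallocationDelay)
    .map(TimeValue::nanos)
    .map(restartDelayNanos -> Math.max(restartDelayNanos, indexLevelDelayNanos))
    .orElse(indexLevelDelayNanos);
```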
    }

    public static SingleNodeShutdownMetadata parse(XContentParser parser) {
        return PARSER.apply(parser, null);
    }

    public static final TimeValue DEFAULT_RESTART_SHARD_ALLOCATION_DELAY = TimeValue.timeValueMinutes(10);
10 minutes seems a bit long; I would advocate for just 2 minutes, which should be enough to restart a single node with some safety margin?
We did actually discuss this in the meeting. Cloud uses a 5 minute plan timeout right now and we wanted something a bit higher than that and settled on 10 minutes. Is there a reason you think this should not be 10 minutes?
Nothing very substantial, it just feels like a long time to expect a restart to take. Our documentation speaks of 5 minutes too. Our default delayed allocation is 1 minute. And if cloud uses 5 minute timeout, 10 minutes seems excessive.
We should remember that once the timeout is hit, nothing really bad happens, we "just" start doing more work than absolutely necessary in order to restore availability.
If cloud uses a 5 minute timeout, I would advocate using the same timeout here.
I've changed the default value to 5m.
            boolean delayed = Optional.ofNullable(nodesShutdownMetadata.get(node.nodeId()))
                // If we know this node is restarting, then the allocation should be delayed
                .map(shutdownMetadata -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownMetadata.getType()))
                // Otherwise, use the "normal" allocation delay logic
                .orElse(INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexMetadata.getSettings()).nanos() > 0);
I think this means that if the shutdown type is REMOVE or REPLACE, we now ignore the delayed left timeout?
Also, I wonder if we should only use the shutdown indication if it had completed, i.e., the node was ready to restart when this happened? Since if it crashes on its own and they set up no delay for their indices, we should probably honor that. A flapping node during a restart could result in a long period of not assigning the shard, since we recalculate the delay every time the shard is recovered and the node subsequently dies.
Finally, if the shutdown reallocation delay and the delayed allocation delay are both 0, we should ensure delayed=false here.
> I think this means that if the shutdown type is REMOVE or REPLACE, we now ignore the delayed left timeout?
Good catch. I'll fix.
> Also, I wonder if we should only use the shutdown indication if it had completed, i.e., the node was ready to restart when this happened?
We don't actually keep state about whether the shutdown preparation is complete or not, we compute the status when the Get Status API is called. So we'd have to move that logic to somewhere where it's accessible here, and ensure it always remains fast since this is called on the cluster state update thread.
It would likely be better if we checked the status, I agree, but RESTART-type shutdowns should be relatively quick to prepare. Do you think this is something we need for v1, or could we introduce it as the shutdown process becomes more involved?
> A flapping node during a restart could result in a long period of not assigning the shard, since we recalculate the delay every time the shard is recovered and the node subsequently dies.
Can't this happen today with the existing allocation delay? Admittedly, it's less likely with a shorter timeout.
There's also the case of someone intentionally restarting the node before it's ready. Should we use the general delay in that case? It's not clear to me that one behavior is more intuitive than the other.
> Do you think this is something we need for v1, or could we introduce it as the shutdown process becomes more involved?
I do not think it is mandatory for v1.
> Can't this happen today with the existing allocation delay? Admittedly, it's less likely with a shorter timeout.

This was specifically for the case where "they set up no delay for their indices". We would be allocating those immediately. Obviously, during a cloud restart today there would be no such allocation going on, though.
Given the technical difficulty in doing this, we can defer to a follow-up.
> There's also the case of someone intentionally restarting the node before it's ready. Should we use the general delay in that case? It's not clear to me that one behavior is more intuitive than the other.
Thanks, good point. I suppose that for a RESTART shutdown, the node is really "ready" for restart immediately (since we always anticipate a restart); however, optimally cloud would wait a little before forcing it through. So relying on the presence of the shutdown indication seems good here.
I've changed this to track whether we knew the last allocated node was restarting at the time the shard became unassigned, and if not, to always use the index-level delay rather than the restart-level delay.
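A rough sketch of that shape (variable names are illustrative, not the exact code):

```java
// Only a RESTART-type shutdown that was registered when the node left marks the shard as
// NODE_RESTARTING; REMOVE/REPLACE shutdowns and plain crashes keep the normal NODE_LEFT path.
SingleNodeShutdownMetadata shutdown = nodesShutdownMetadata.get(node.nodeId());
boolean knownRestart = shutdown != null && shutdown.getType() == SingleNodeShutdownMetadata.Type.RESTART;
// If both the restart delay and the index-level delay are zero, delayed stays false.
boolean delayed = (knownRestart && shutdown.getShardReallocationDelay().nanos() > 0)
    || INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexMetadata.getSettings()).nanos() > 0;
UnassignedInfo.Reason reason = knownRestart ? UnassignedInfo.Reason.NODE_RESTARTING : UnassignedInfo.Reason.NODE_LEFT;
```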
server/src/test/java/org/elasticsearch/cluster/metadata/NodesShutdownMetadataTests.java
        // Actually stop the node
        internalCluster().stopRandomNode(InternalTestCluster.nameFilter(nodeToRestartName));

        // Verify that the shard's allocation is delayed - but with a shorter wait than the reallocation timeout
The part of this comment on "shorter wait time" seems off here?
        ensureGreen(TimeValue.timeValueSeconds(30), "test");
    }

    public void testShardAllocationStartsImmediatelyIfShutdownDeleted() throws Exception {
This test looks nearly identical to the previous test. I think we should either:
- Refactor to share most of the code.
- Make just one test that randomly either lowers the timeout or deletes the shutdown indication.
I'll update this to go with option 1, I don't like relying on randomization to check actually-different code paths like that.
    public TimeValue getShardReallocationDelay() {
        return shardReallocationDelay;
    }

    @Override
    public ActionRequestValidationException validate() {
Perhaps add the validation of type vs reallocation delay here?
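Something along these lines could live in that validate() method (a sketch; the error message wording is made up):

```java
@Override
public ActionRequestValidationException validate() {
    ActionRequestValidationException validationException = null;
    // The reallocation delay only makes sense for RESTART-type shutdowns.
    if (shardReallocationDelay != null && SingleNodeShutdownMetadata.Type.RESTART.equals(type) == false) {
        validationException = ValidateActions.addValidationError(
            "shard reallocation delay is only valid for RESTART-type shutdown requests",
            validationException
        );
    }
    return validationException;
}
```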
    ) {
        long delayTimeoutNanos = Optional.ofNullable(lastAllocatedNodeId)
            .map(nodesShutdownMap::get)
            .filter(shutdownMetadata -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownMetadata.getType()))
It is a bit odd that if a node crashes and then has a shutdown indication added, we extend the node left timeout to be the possibly longer shutdown node left timeout.
I wonder if we could instead capture whether the node was ready for shutdown when we lost it and use that to determine which of the delays to use?
See my above comments for the difficulty with doing so, but as an alternative, we could at least capture whether the node was registered for shutdown at all when the node went offline.
Yes, given your earlier comments, that approach makes sense to me.
@@ -35,14 +38,16 @@
    public static final ParseField REASON_FIELD = new ParseField("reason");
    public static final String STARTED_AT_READABLE_FIELD = "shutdown_started";
    public static final ParseField STARTED_AT_MILLIS_FIELD = new ParseField(STARTED_AT_READABLE_FIELD + "millis");
    public static final ParseField SHARD_REALLOCATION_DELAY = new ParseField("shard_reallocation_delay");
Perhaps this argument could be just "allocation_delay"?
Hm, I appear to have oops'd the commit history. Please stand by. Edit: Fixed, apologies for the force push - I think GitHub handles that a bit better these days, at least.
LGTM. Thanks for the extra iterations. I wonder if @dakrone should have a quick re-review given that it changed substantially.
        INDEX_CLOSED,
        /**
         * Similar to NODE_LEFT, but at the time the node left, it had been registered for a restart via the Node Shutdown API.
Perhaps expand the comment here to include the detail that it might be a crash happening during a node restart procedure.
@@ -242,8 +266,12 @@ public UnassignedInfo(Reason reason, @Nullable String message, @Nullable Excepti
        this.failedAllocations = failedAllocations;
        this.lastAllocationStatus = Objects.requireNonNull(lastAllocationStatus);
        this.failedNodeIds = Collections.unmodifiableSet(failedNodeIds);
        assert (failedAllocations > 0) == (reason == Reason.ALLOCATION_FAILED) :
            "failedAllocations: " + failedAllocations + " for reason " + reason;
        this.lastAllocatedNodeId = lastAllocatedNodeId;
I think you added the opposite. The assertion above should now be:
assert reason != Reason.NODE_RESTARTING || lastAllocatedNodeId != null
meaning that if reason == restarting, we require a lastAllocatedNodeId. Unfortunately we cannot require that for NODE_LEFT due to bwc.
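That is, roughly (the assertion message is illustrative):

```java
// NODE_RESTARTING entries must record the node the shard was last allocated to; NODE_LEFT
// cannot require this because entries written by older nodes never carry one.
assert reason != Reason.NODE_RESTARTING || lastAllocatedNodeId != null
    : "NODE_RESTARTING unassigned info must include the last allocated node id";
```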
@@ -719,6 +719,12 @@ public IndexMetadata getIndexSafe(Index index) {
            .orElse(Collections.emptyMap());
    }

    public Map<String, SingleNodeShutdownMetadata> nodeShutdowns() {
This is not really part of this review, but I wonder whether we couldn't end up with multiple shutdown indications for the same node, for instance both a RESTART and a REMOVE or REPLACE? I think of ECK in particular here, but it might also be relevant in cloud.
No, there are a couple of things that prevent this:
- In TransportPutShutdownNodeAction, when we get a PUT for a node that already has a record, it's updated rather than added to, and
- The data structure used to store the SingleNodeShutdownMetadata (the Map in the line you're commenting on) is keyed by node UUID, so it should be impossible to have multiple records for the same key/nodeId.

Since the node id is duplicated in the SingleNodeShutdownMetadata as well, it's conceivable that in the case of a bug we could end up with a mismatch between the id used for keying and the id used in the object, but I don't think that's likely.
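Roughly speaking, the put path behaves like this (a simplified sketch, not the actual TransportPutShutdownNodeAction code; the request and record names are made up):

```java
// Because the registry is a Map keyed by node id, a second PUT for the same node replaces
// the existing record instead of accumulating a duplicate entry.
Map<String, SingleNodeShutdownMetadata> shutdowns = new HashMap<>(currentShutdowns);
shutdowns.put(request.getNodeId(), newShutdownRecord);
```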
Yeah, sorry if this was unclear, what I meant was whether it could be a reasonable use case to have both a RESTART and one of the two others at the same time. Not that there is anything wrong in this PR or anything, more wanted to bring this to your attention for possible discussion (and maybe you discussed it already and discarded the use case?).
            UnassignedInfo unassignedInfo = new UnassignedInfo(UnassignedInfo.Reason.NODE_LEFT, "node_left [" + node.nodeId() + "]",
                null, 0, allocation.getCurrentNanoTime(), System.currentTimeMillis(), delayed, AllocationStatus.NO_ATTEMPT,
                Collections.emptySet());
            boolean delayedDueToKnownRestart = Optional.ofNullable(nodesShutdownMetadata.get(node.nodeId()))
I think this is constant across all shards on the node and could go outside the loop over node.copyShards.
Good catch, thanks!
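That is, something like the following restructuring (a sketch):

```java
// Compute the per-node restart flag once, then reuse it for every shard copy on that node.
final boolean delayedDueToKnownRestart = Optional.ofNullable(nodesShutdownMetadata.get(node.nodeId()))
    .map(shutdown -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdown.getType()))
    .orElse(false);
for (ShardRouting shardRouting : node.copyShards()) {
    // ... build the UnassignedInfo for each shard using delayedDueToKnownRestart ...
}
```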
        boolean nodeIsRestarting = Optional.ofNullable(metadata.nodeShutdowns().get(shard.currentNodeId()))
            .map(shutdownInfo -> shutdownInfo.getType().equals(SingleNodeShutdownMetadata.Type.RESTART))
            .orElse(false);
This is now unused
            new UnassignedInfo(reason, randomBoolean() ? randomAlphaOfLength(4) : null, null,
                failedAllocations, System.nanoTime(), System.currentTimeMillis(), false, AllocationStatus.NO_ATTEMPT, failedNodes):
            new UnassignedInfo(reason, randomBoolean() ? randomAlphaOfLength(4) : null);
        String lastAssignedNodeId = randomBoolean() ? randomAlphaOfLength(10) : null;
I think my suggested assertion will fail here. I suggest refining the test to only fill in lastAssignedNodeId for the right two reason types.
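A possible refinement against the test snippet above (a sketch):

```java
// Only generate a last-assigned node id for the two reasons that can legitimately carry one,
// and always provide one for NODE_RESTARTING so the constructor assertion holds.
String lastAssignedNodeId = reason == UnassignedInfo.Reason.NODE_RESTARTING
    || (reason == UnassignedInfo.Reason.NODE_LEFT && randomBoolean())
        ? randomAlphaOfLength(10)
        : null;
```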
        // Generate a random time value - but don't use nanos as extremely small values of nanos can break assertion calculations
        final TimeValue shutdownDelay = TimeValue.parseTimeValue(
            randomTimeValue(2, 1000, "d", "h", "ms", "s", "m", "micros"),
To avoid risking too many iterations on the randomValueOtherThanMany, I propose:
-            randomTimeValue(2, 1000, "d", "h", "ms", "s", "m", "micros"),
+            randomTimeValue(100, 1000, "d", "h", "ms", "s", "m", "micros"),
            nodeToRestartId,
            SingleNodeShutdownMetadata.Type.RESTART,
            this.getTestName(),
            TimeValue.timeValueSeconds(1)
Why not just 1 millisecond?
@@ -76,10 +82,20 @@ public void testSerialization() throws Exception {
        int failedAllocations = randomIntBetween(1, 100);
I wonder if we should add a test of bwc serialization? Since the functionality will only be enabled in 7.15 and I am unaware of any shutdown rolling upgrade tests, we could in theory be exposed to a bwc bug in serialization (not that I saw any, looks all good to me).
I'm not sure it's worth adding a unit test for this, but I do intend to add some rolling and full-restart upgrade tests for this soon.
Rather than only allowing NODE_LEFT/NODE_RESTARTING to have a lastAllocatedNodeId
…E_LEFT or NODE_RESTARTING" This reverts commit 8bc7442.
A number of tests created `UnassignedInfo`s without regard for the `reason`, which started failing after adjusting the assertion. These tests now generate their `UnassignedInfo`s via methods which handle the requirements.
Per brief discussion elsewhere, I'm going to go ahead and merge this - @dakrone can review the recent changes when he's back online, and any changes can be made as a follow-up.
…lastic#75606) This PR makes the delayed allocation infrastructure aware of registered node shutdowns, so that reallocation of shards will be further delayed for nodes which are known to be restarting.

To make this more configurable, the Node Shutdown APIs now support an `allocation_delay` parameter, which defaults to 5 minutes. For example:

```
PUT /_nodes/USpTGYaBSIKbgSUJR2Z9lg/shutdown
{
  "type": "restart",
  "reason": "Demonstrating how the node shutdown API works",
  "allocation_delay": "20m"
}
```

will cause reallocation of shards assigned to that node to another node to be delayed by 20 minutes. Note that this delay will only be used if it's *longer* than the index-level allocation delay, set via `index.unassigned.node_left.delayed_timeout`.

The `allocation_delay` parameter is only valid for `restart`-type shutdown registrations, and the request will be rejected if it's used with another shutdown type.
* master: (868 commits)
  Query API key - Rest spec and yaml tests (elastic#76238)
  Delay shard reassignment from nodes which are known to be restarting (elastic#75606)
  Reenable bwc tests for elastic#76475 (elastic#76576)
  Set version to 7.15 in BWC code (elastic#76577)
  Don't remove warning headers on all failure (elastic#76434)
  Disable bwc tests for elastic#76475 (elastic#76541)
  Re-enable bwc tests (elastic#76567)
  Keep track of data recovered from snapshots in RecoveryState (elastic#76499)
  [Transform] Align transform checkpoint range with date_histogram interval for better performance (elastic#74004)
  EQL: Remove "wildcard" function (elastic#76099)
  Fix 'accept' and 'content_type' fields for search_mvt API
  Add persistent licensed feature tracking (elastic#76476)
  Add system data streams to feature state snapshots (elastic#75902)
  fix the error message for instance methods that don't exist (elastic#76512)
  ILM: Add validation of the number_of_shards parameter in Shrink Action of ILM (elastic#74219)
  remove dashboard only reserved role (elastic#76507)
  Fix Stack Overflow in UnassignedInfo in Corner Case (elastic#76480)
  Add (Extended)KeyUsage KeyUsage, CipherSuite & Protocol to SSL diagnostics (elastic#65634)
  Add recovery from snapshot to tests (elastic#76535)
  Reenable BwC Tests after elastic#76532 (elastic#76534)
  ...
This reverts commit 7576621.
This PR makes the delayed allocation infrastructure aware of registered node shutdowns, so that reallocation of shards will be further delayed for nodes which are known to be restarting.

To make this more configurable, the Node Shutdown APIs now support an `allocation_delay` parameter, which defaults to 5 minutes. For example:

```
PUT /_nodes/USpTGYaBSIKbgSUJR2Z9lg/shutdown
{
  "type": "restart",
  "reason": "Demonstrating how the node shutdown API works",
  "allocation_delay": "20m"
}
```

will cause reallocation of shards assigned to that node to another node to be delayed by 20 minutes. Note that this delay will only be used if it's *longer* than the index-level allocation delay, set via `index.unassigned.node_left.delayed_timeout`.

The `allocation_delay` parameter is only valid for `restart`-type shutdown registrations, and the request will be rejected if it's used with another shutdown type.

Relates #70338