
Delay shard reassignment from nodes which are known to be restarting #75606


Merged

Conversation

gwbrown
Contributor

@gwbrown gwbrown commented Jul 21, 2021

This PR makes the delayed allocation infrastructure aware of registered node shutdowns, so that reallocation of shards will be further delayed for nodes which are known to be restarting.

To make this more configurable, the Node Shutdown APIs now support an `allocation_delay` parameter, which defaults to 5 minutes. For example:

```
PUT /_nodes/USpTGYaBSIKbgSUJR2Z9lg/shutdown
{
  "type": "restart",
  "reason": "Demonstrating how the node shutdown API works",
  "allocation_delay": "20m"
}
```

This will cause reallocation of shards assigned to that node to be delayed by 20 minutes. Note that this delay will only be used if it's *longer* than the index-level allocation delay, set via `index.unassigned.node_left.delayed_timeout`.

The `allocation_delay` parameter is only valid for `restart`-type shutdown registrations, and the request will be rejected if it's used with another shutdown type.

Relates #70338

@gwbrown gwbrown added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown v7.15.0 labels Jul 21, 2021
@gwbrown gwbrown marked this pull request as ready for review July 22, 2021 22:38
@elasticmachine elasticmachine added Team:Core/Infra Meta label for core/infra team Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Jul 22, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@gwbrown
Contributor Author

gwbrown commented Jul 22, 2021

@henningandersen I've pinged you for a review on this PR mostly because I don't know who would be most appropriate from the Distributed team to review this - feel free to reassign as you see fit.

Henning is unavailable this week so switching the distrib-team review request to someone else.

@gwbrown
Contributor Author

gwbrown commented Jul 26, 2021

Failure is #75667.

@elasticmachine run elasticsearch-ci/part-1

@gwbrown gwbrown requested review from DaveCTurner and removed request for henningandersen July 26, 2021 17:12
@gwbrown
Contributor Author

gwbrown commented Jul 26, 2021

@DaveCTurner I've pinged you for a review on this PR mostly because I don't know who would be most appropriate from the Distributed team to review this - feel free to reassign as you see fit.

Member

@dakrone dakrone left a comment

Thanks for working on this, Gordon! I left some pretty minor comments, but nothing major.

Comment on lines 265 to 266
if (Type.RESTART.equals(type) && delayOrDefault == null) {
delayOrDefault = DEFAULT_RESTART_SHARD_ALLOCATION_DELAY;
Member

This feels like the wrong place to implement this default. Rather than here, I think it should go into the getter inside of SingleNodeShutdownMetadata; otherwise calling toXContent on the metadata makes it appear that the delay has been explicitly set (when in reality the default value will be used).
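A minimal sketch of what that could look like (names taken from the excerpts in this thread; illustrative, not the final implementation):

```java
// Inside SingleNodeShutdownMetadata: apply the default lazily in the getter, while
// toXContent keeps serializing the raw (possibly null) field, so "unset" and
// "explicitly set to the default" remain distinguishable.
public TimeValue getShardReallocationDelay() {
    if (shardReallocationDelay == null && Type.RESTART.equals(type)) {
        return DEFAULT_RESTART_SHARD_ALLOCATION_DELAY;
    }
    return shardReallocationDelay;
}
```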

Contributor Author

Good call, I'll change it.

import java.util.Set;

/**
* Holds additional information as to why the shard is in unassigned state.
*/
public final class UnassignedInfo implements ToXContentFragment, Writeable {

private static final Version LAST_ALLOCATED_NODE_VERSION = Version.V_8_0_0;
Member

This feels a little strange to me; do we need this, rather than just using the version directly in the two serialization methods?

Contributor Author

I like doing this because it 1) adds semantic information to the version check, and 2) is less error-prone when switching the serialization version following a backport, especially if there are other serialization version checks in the same class.

I can make it inline if you feel strongly about it.

Member

I think it's fine to keep it, but I would suggest two things. First, we should add a comment about what it is for, and second, I think renaming it to something like VERSION_LAST_ALLOCATED_NODE_ADDED would be better.

Just looking at it with its current name makes me think this is the version of the node where the shard was last allocated.

final Settings indexSettings,
final Map<String, SingleNodeShutdownMetadata> nodesShutdownMap
) {
Map<String, SingleNodeShutdownMetadata> nodeShutdowns = nodesShutdownMap != null ? nodesShutdownMap : Collections.emptyMap();
Member

I don't think we need this null check here, since Metadata#nodeShutdowns() returns an empty map if there are no shutdowns. Maybe we can replace it with an assert instead?
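A sketch of that suggestion, assuming the parameter keeps its current shape:

```java
// Metadata#nodeShutdowns() returns an empty map when there are no shutdowns, so a
// null here would indicate a caller bug; document the contract with an assertion
// instead of silently substituting an empty map.
assert nodesShutdownMap != null : "Metadata#nodeShutdowns() returns an empty map, never null";
```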

Contributor Author

👍 I think this is left over from an intermediate version.

Comment on lines 164 to 155
// Verify that the shard's allocation is still delayed
assertBusy(
() -> { assertThat(client().admin().cluster().prepareHealth().get().getDelayedUnassignedShards(), equalTo(1)); },
2,
TimeUnit.SECONDS
);
Member

I think we should remove this assert, since a long GC or other CI slowness could cause this to be flaky. I think just the ensureGreen below is enough, because it was changed from 3 hours to 10 seconds and the change took effect.

Comment on lines 207 to 208
2,
TimeUnit.SECONDS
Member

I think we can remove the 2 second limit here, since we have 3 hours to check this :)

@gwbrown gwbrown requested a review from dakrone July 28, 2021 15:24
Member

@dakrone dakrone left a comment

LGTM!

@gwbrown
Contributor Author

gwbrown commented Jul 29, 2021

@elasticmachine update branch

@gwbrown
Contributor Author

gwbrown commented Jul 29, 2021

@DaveCTurner Would you prefer us to wait for a review from the Distributed team on this, or should we go ahead and merge?

@henningandersen henningandersen self-requested a review August 2, 2021 14:38
Contributor

@henningandersen henningandersen left a comment

I left a number of comments. I would like your input on whether we could instead capture which delay to use at the time we lose the node (see comments inline).

I did not find any shutdown documentation, but if it exists (I did not search exhaustively), it would be good to update it as part of this PR.

@@ -145,7 +182,8 @@ public boolean equals(Object o) {
     return getStartedAtMillis() == that.getStartedAtMillis()
         && getNodeId().equals(that.getNodeId())
         && getType() == that.getType()
-        && getReason().equals(that.getReason());
+        && getReason().equals(that.getReason())
+        && Objects.equals(getShardReallocationDelay(), that.getShardReallocationDelay());
Contributor

I would prefer to use the native field here, to avoid two objects that are only semantically equivalent (one with a null delay, one with the default set explicitly) comparing as equal. Not that I know a case where it will cause issues, but I find it odd to have two equal objects producing different xcontent. hashCode needs updating too.
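A sketch of the suggested change (field name assumed to be `shardReallocationDelay`), comparing the raw field and folding it into `hashCode` so an unset delay and an explicitly set default no longer compare equal:

```java
// equals(): compare the raw field rather than the defaulted getter.
return getStartedAtMillis() == that.getStartedAtMillis()
    && getNodeId().equals(that.getNodeId())
    && getType() == that.getType()
    && getReason().equals(that.getReason())
    && Objects.equals(shardReallocationDelay, that.shardReallocationDelay);

// hashCode(): must use the same raw field to stay consistent with equals().
@Override
public int hashCode() {
    return Objects.hash(getStartedAtMillis(), getNodeId(), getType(), getReason(), shardReallocationDelay);
}
```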

.filter(shutdownMetadata -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownMetadata.getType()))
.map(SingleNodeShutdownMetadata::getShardReallocationDelay)
.map(TimeValue::nanos)
.orElse(INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexSettings).nanos());
Contributor

In the case where the INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING value is greater than the shutdown reallocation delay, this means that the shutdown delay overrides INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING. That seems wrong? I think we should take the greater of the shutdown delay and the delayed allocation delay.

Member

> That seems wrong? I think we should take the greater of the shutdown delay and the delayed allocation delay.

I think it should be the explicitly set value in the node shutdown, regardless of the cluster-level setting for index delayed node left timeout. In that way we are doing exactly what the user asked (you asked me to wait 10 minutes, so I will wait 10 minutes regardless of the other setting) which is more expected. Why do you think it should be the max? That feels a little like action-at-a-distance when a user could not even be aware of the other cluster-wide setting.

Contributor

index.unassigned.node_left.delayed_timeout is an index level setting. The shutdown allocation delay is less likely to be controllable by a user (I believe cloud will simply use our default value) and I think of it as a "default value" for the index specific delay, allowing some extra (but never less) leeway to handle a planned restart.

If a user setup index.unassigned.node_left.delayed_timeout=15m for some indices, it seems odd that a planned restart would use a smaller value for the specific index/shards.

Also, if the shutdown API was called with allocation_timeout=0, it would be somewhat strange to disregard the index specific delay. That would effectively mean that we would be treating a planned restart as a worse event than a crash.

I think of restarts as something happening automatically, based on autoscaling, upgrades or other maintenance (though it could be an admin doing this, but without discussing with users). So if users (who should control the index level settings) set up their indices with higher delays than the shutdown uses, I think that higher delay should be used.

Contributor Author

I've changed this to use the max of index level delay and restart delay.
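For reference, a sketch of the resolved rule (setting and getter names from the excerpt above):

```java
// Use the greater of the index-level delayed-allocation timeout and the restart
// shutdown's reallocation delay, so a registered restart can extend, but never
// shorten, the delay a user configured on the index.
long delayTimeoutNanos = Math.max(
    INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexSettings).nanos(),
    shutdownMetadata.getShardReallocationDelay().nanos()
);
```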

}

public static SingleNodeShutdownMetadata parse(XContentParser parser) {
return PARSER.apply(parser, null);
}

public static final TimeValue DEFAULT_RESTART_SHARD_ALLOCATION_DELAY = TimeValue.timeValueMinutes(10);
Contributor

10 minutes seems a bit long; I would advocate just 2 minutes, which should be enough to restart a single node with some safety margin?

Member

@dakrone dakrone Aug 2, 2021

We did actually discuss this in the meeting. Cloud uses a 5 minute plan timeout right now and we wanted something a bit higher than that and settled on 10 minutes. Is there a reason you think this should not be 10 minutes?

Contributor

Nothing very substantial, it just feels like a long time to expect a restart to take. Our documentation speaks of 5 minutes too. Our default delayed allocation is 1 minute. And if cloud uses 5 minute timeout, 10 minutes seems excessive.

We should remember that once the timeout is hit, nothing really bad happens, we "just" start doing more work than absolutely necessary in order to restore availability.

If cloud uses a 5 minute timeout, I would advocate using the same timeout here.

Contributor Author

I've changed the default value to 5m.

Comment on lines 502 to 506
boolean delayed = Optional.ofNullable(nodesShutdownMetadata.get(node.nodeId()))
// If we know this node is restarting, then the allocation should be delayed
.map(shutdownMetadata -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownMetadata.getType()))
// Otherwise, use the "normal" allocation delay logic
.orElse(INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexMetadata.getSettings()).nanos() > 0);
Contributor

I think this means that if the shutdown type is REMOVE or REPLACE, we now ignore the delayed left timeout?

Also, I wonder if we should only use the shutdown indication if it had completed, i.e., the node was ready to restart when this happened? Since if it crashes on its own and they set up no delay for their indices, we should probably honor that. A flapping node during a restart could result in a long period of not assigning the shard, since we recalculate the delay every time the shard is recovered and the node subsequently dies.

Finally, if the shutdown reallocation delay and the delayed allocation delay are both 0, we should ensure delayed=false here.
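A sketch addressing the first and third points (names from the excerpt; the "completed shutdown" question from the second point is worked out in the replies below):

```java
// Fall back to the index-level timeout for non-RESTART (or absent) shutdowns, and
// take the max for known restarts; delayed is then false exactly when both delays
// are zero.
long indexDelayNanos = INDEX_DELAYED_NODE_LEFT_TIMEOUT_SETTING.get(indexMetadata.getSettings()).nanos();
SingleNodeShutdownMetadata shutdown = nodesShutdownMetadata.get(node.nodeId());
boolean knownRestart = shutdown != null && SingleNodeShutdownMetadata.Type.RESTART.equals(shutdown.getType());
long restartDelayNanos = knownRestart ? shutdown.getShardReallocationDelay().nanos() : 0L;
boolean delayed = Math.max(indexDelayNanos, restartDelayNanos) > 0;
```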

Contributor Author

> I think this means that if the shutdown type is REMOVE or REPLACE, we now ignore the delayed left timeout?

Good catch. I'll fix.

> Also, I wonder if we should only use the shutdown indication if it had completed, i.e., the node was ready to restart when this happened?

We don't actually keep state about whether the shutdown preparation is complete or not; we compute the status when the Get Status API is called. So we'd have to move that logic somewhere it's accessible here, and ensure it always remains fast, since this is called on the cluster state update thread.

It would likely be better if we checked the status, I agree, but RESTART-type shutdowns should be relatively quick to prepare. Do you think this is something we need for v1, or could we introduce it as the shutdown process becomes more involved?

> A flapping node during a restart could result in a long period of not assigning the shard, since we recalculate the delay every time the shard is recovered and the node subsequently dies.

Can't this happen today with the existing allocation delay? Admittedly, it's less likely with a shorter timeout.

Contributor Author

@gwbrown gwbrown Aug 2, 2021

There's also the case of someone intentionally restarting the node before it's ready. Should we use the general delay in that case? It's not clear to me that one behavior is more intuitive than the other.

Contributor

> Do you think this is something we need for v1, or could we introduce it as the shutdown process becomes more involved?

I do not think it is mandatory for v1.

> Can't this happen today with the existing allocation delay? Admittedly, it's less likely with a shorter timeout.

This was specifically for the case where "they set up no delay for their indices". We would be allocating those immediately. Obviously, during a cloud restart today, there would be no such allocation going on though.

Given the technical difficulty in doing this, we can defer to a follow-up.

Contributor

> There's also the case of someone intentionally restarting the node before it's ready. Should we use the general delay in that case? It's not clear to me that one behavior is more intuitive than the other.

Thanks, good point. I suppose that for a RESTART shutdown, the node is really "ready" for restart immediately (since we always anticipate a restart), however, optimally cloud would wait a little before forcing it through. So relying on the presence of the shutdown indication seems good here.

Contributor Author

I've changed this to track whether we knew the last allocated node was restarting at the time the shard became unassigned, and if not, to always use the index-level delay rather than the restart-level delay.
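In other words (a sketch; the reason is then what selects between the two delay sources):

```java
// Record NODE_RESTARTING only when the node had a RESTART shutdown registered at
// the moment it left; otherwise keep NODE_LEFT and thus the index-level delay.
UnassignedInfo.Reason reason = delayedDueToKnownRestart
    ? UnassignedInfo.Reason.NODE_RESTARTING
    : UnassignedInfo.Reason.NODE_LEFT;
```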

// Actually stop the node
internalCluster().stopRandomNode(InternalTestCluster.nameFilter(nodeToRestartName));

// Verify that the shard's allocation is delayed - but with a shorter wait than the reallocation timeout
Contributor

The part of this comment on "shorter wait time" seems off here?

ensureGreen(TimeValue.timeValueSeconds(30), "test");
}

public void testShardAllocationStartsImmediatelyIfShutdownDeleted() throws Exception {
Contributor

This test looks nearly identical to the previous test. I think we should either:

  1. Refactor to share most of the code.
  2. Make just one test that randomly either lowers the timeout or deletes the shutdown indication.

Contributor Author

I'll update this to go with option 1; I don't like relying on randomization to check actually-different code paths like that.

public TimeValue getShardReallocationDelay() {
return shardReallocationDelay;
}

@Override
public ActionRequestValidationException validate() {
Contributor

Perhaps add the validation of type vs reallocation delay here?
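A sketch of what that validation could look like in the request class (error message assumed):

```java
@Override
public ActionRequestValidationException validate() {
    ActionRequestValidationException exception = null;
    // Reject a reallocation delay on any shutdown type other than RESTART.
    if (shardReallocationDelay != null && SingleNodeShutdownMetadata.Type.RESTART.equals(type) == false) {
        exception = ValidateActions.addValidationError(
            "shard reallocation delay is only valid for RESTART-type shutdowns",
            exception
        );
    }
    return exception;
}
```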

) {
long delayTimeoutNanos = Optional.ofNullable(lastAllocatedNodeId)
.map(nodesShutdownMap::get)
.filter(shutdownMetadata -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownMetadata.getType()))
Contributor

It is a bit odd that if a node crashes and then has a shutdown indication added, we extend the node left timeout to be the possibly longer shutdown node left timeout.

I wonder if we could instead capture whether the node was ready for shutdown when we lost it and use that to determine which of the delays to use?

Contributor Author

See my above comments for the difficulty with doing so, but as an alternative, we could at least capture whether the node was registered for shutdown at all when the node went offline.

Contributor

Yes, given your earlier comments, that approach makes sense to me.

@@ -35,14 +38,16 @@
public static final ParseField REASON_FIELD = new ParseField("reason");
public static final String STARTED_AT_READABLE_FIELD = "shutdown_started";
public static final ParseField STARTED_AT_MILLIS_FIELD = new ParseField(STARTED_AT_READABLE_FIELD + "millis");
public static final ParseField SHARD_REALLOCATION_DELAY = new ParseField("shard_reallocation_delay");
Contributor

Perhaps this argument could be just "allocation_delay"?

@gwbrown
Contributor Author

gwbrown commented Aug 2, 2021

Hm, I appear to have oops'd the commit history. Please stand by.

Edit: Fixed, apologies for the force push - I think GitHub handles that a bit better these days, at least.

@henningandersen henningandersen self-requested a review August 12, 2021 10:07
Contributor

@henningandersen henningandersen left a comment

LGTM. Thanks for the extra iterations. I wonder if @dakrone should have a quick re-review given that it changed substantially.

-    INDEX_CLOSED
+    INDEX_CLOSED,
+    /**
+     * Similar to NODE_LEFT, but at the time the node left, it had been registered for a restart via the Node Shutdown API.
Contributor

Perhaps expand the comment here to include the detail that it might be a crash happening during a node restart procedure.

@@ -242,8 +266,12 @@ public UnassignedInfo(Reason reason, @Nullable String message, @Nullable Excepti
this.failedAllocations = failedAllocations;
this.lastAllocationStatus = Objects.requireNonNull(lastAllocationStatus);
this.failedNodeIds = Collections.unmodifiableSet(failedNodeIds);
assert (failedAllocations > 0) == (reason == Reason.ALLOCATION_FAILED) :
"failedAllocations: " + failedAllocations + " for reason " + reason;
this.lastAllocatedNodeId = lastAllocatedNodeId;
Contributor

I think you added the opposite. The assertion above should now be:

assert reason != Reason.NODE_RESTARTING || lastAllocatedNodeId != null

meaning that if reason == restarting, we require a lastAllocatedNodeId.

Unfortunately we cannot require that for NODE_LEFT due to bwc.

@@ -719,6 +719,12 @@ public IndexMetadata getIndexSafe(Index index) {
.orElse(Collections.emptyMap());
}

public Map<String, SingleNodeShutdownMetadata> nodeShutdowns() {
Contributor

This is not really part of this review, but I wonder if we could risk seeing multiple shutdown indications for the same node, for instance both a RESTART and a REMOVE or REPLACE? I think of ECK in particular here, but it might also be relevant in cloud.

Contributor Author

@gwbrown gwbrown Aug 13, 2021

No, there are a couple of things that prevent this:

  1. In TransportPutShutdownNodeAction when we get a PUT for a node that already has a record, it's updated rather than added to, and
  2. The data structure used to store the SingleNodeShutdownMetadata (the Map in the line you're commenting on) is keyed by node UUID, so it should be impossible to have multiple records for the same key/nodeId.

Since the node id is duplicated in the SingleNodeShutdownMetadata as well, it's conceivable that in the case of a bug we could end up with a mismatch between the id used for keying and the id used in the object, but I don't think that's likely.

Contributor

Yeah, sorry if this was unclear; what I meant was whether it could be a reasonable use case to have both a RESTART and one of the two others at the same time. Not that there is anything wrong in this PR; I more wanted to bring this to your attention for possible discussion (and maybe you discussed it already and discarded the use case?).

UnassignedInfo unassignedInfo = new UnassignedInfo(UnassignedInfo.Reason.NODE_LEFT, "node_left [" + node.nodeId() + "]",
null, 0, allocation.getCurrentNanoTime(), System.currentTimeMillis(), delayed, AllocationStatus.NO_ATTEMPT,
Collections.emptySet());
boolean delayedDueToKnownRestart = Optional.ofNullable(nodesShutdownMetadata.get(node.nodeId()))
Contributor

I think this is constant across all shards on the node and could go outside the loop over node.copyShards.
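A sketch of the hoist (loop shape assumed from the surrounding method):

```java
// The shutdown lookup depends only on the node, so compute the flag once rather
// than once per shard copy.
boolean delayedDueToKnownRestart = Optional.ofNullable(nodesShutdownMetadata.get(node.nodeId()))
    .map(shutdown -> SingleNodeShutdownMetadata.Type.RESTART.equals(shutdown.getType()))
    .orElse(false);
for (ShardRouting shardRouting : node.copyShards()) {
    // ...build each shard's UnassignedInfo using the precomputed flag...
}
```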

Contributor Author

Good catch, thanks!

Comment on lines 104 to 106
boolean nodeIsRestarting = Optional.ofNullable(metadata.nodeShutdowns().get(shard.currentNodeId()))
.map(shutdownInfo -> shutdownInfo.getType().equals(SingleNodeShutdownMetadata.Type.RESTART))
.orElse(false);
Contributor

This is now unused

Suggested change
boolean nodeIsRestarting = Optional.ofNullable(metadata.nodeShutdowns().get(shard.currentNodeId()))
.map(shutdownInfo -> shutdownInfo.getType().equals(SingleNodeShutdownMetadata.Type.RESTART))
.orElse(false);

new UnassignedInfo(reason, randomBoolean() ? randomAlphaOfLength(4) : null, null,
failedAllocations, System.nanoTime(), System.currentTimeMillis(), false, AllocationStatus.NO_ATTEMPT, failedNodes):
new UnassignedInfo(reason, randomBoolean() ? randomAlphaOfLength(4) : null);
String lastAssignedNodeId = randomBoolean() ? randomAlphaOfLength(10) : null;
Contributor

I think my suggested assertion will fail here. I suggest to refine the test to only fill in lastAssignedNodeId for the right two reason types.
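A sketch of that refinement, assuming the "right two reason types" are NODE_LEFT and NODE_RESTARTING (the reasons that carry a last-allocated node id):

```java
// Only generate a lastAssignedNodeId for reasons that can legitimately carry one,
// so the new assertion (NODE_RESTARTING implies a non-null id) always holds.
String lastAssignedNodeId = (reason == Reason.NODE_LEFT || reason == Reason.NODE_RESTARTING)
    ? randomAlphaOfLength(10)
    : null;
```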


// Generate a random time value - but don't use nanos as extremely small values of nanos can break assertion calculations
final TimeValue shutdownDelay = TimeValue.parseTimeValue(
randomTimeValue(2, 1000, "d", "h", "ms", "s", "m", "micros"),
Contributor

To avoid risking too many iterations on the randomValueOtherThanMany I propose:

Suggested change
randomTimeValue(2, 1000, "d", "h", "ms", "s", "m", "micros"),
randomTimeValue(100, 1000, "d", "h", "ms", "s", "m", "micros"),

nodeToRestartId,
SingleNodeShutdownMetadata.Type.RESTART,
this.getTestName(),
TimeValue.timeValueSeconds(1)
Contributor

Why not just 1 millisecond?

@@ -76,10 +82,20 @@ public void testSerialization() throws Exception {
int failedAllocations = randomIntBetween(1, 100);
Contributor

I wonder if we should add a test of bwc serialization? Since the functionality will only be enabled in 7.15 and I am unaware of any shutdown rolling upgrade tests, we could in theory be exposed to a bwc bug in serialization (not that I saw any, looks all good to me).

Contributor Author

I'm not sure it's worth adding a unit test for this, but I do intend to add some rolling and full-restart upgrade tests for this soon.

@gwbrown
Contributor Author

gwbrown commented Aug 16, 2021

Per brief discussion elsewhere, I'm going to go ahead and merge this - @dakrone can review the recent changes when he's back online, and any changes can be made as a follow-up.

@gwbrown gwbrown merged commit 58f66cf into elastic:master Aug 16, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Aug 16, 2021
wjp719 added a commit to wjp719/elasticsearch that referenced this pull request Aug 17, 2021
gwbrown added a commit that referenced this pull request Aug 17, 2021
gwbrown added a commit that referenced this pull request Aug 17, 2021
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Aug 17, 2021
gwbrown added a commit that referenced this pull request Aug 17, 2021
This PR changes the serialization version for the contents of #75606 and re-enables BWC tests following the backport of that PR (backport in #76587).