Skip to content

Commit 6e875d0

Browse files
authored
Add node REPLACE shutdown implementation (#76247)
* WIP, basic implementation * Pull `if` branch into a variable * Remove outdated javadoc * Remove map iteration, use target name instead of id (whoops) * Remove streaming from isReplacementSource * Simplify getReplacementName * Only calculate node shutdowns if canRemain==false and forceMove==false * Move canRebalance comment in BalancedShardsAllocator * Rename canForceDuringVacate -> canForceAllocateDuringReplace * Add comment to AwarenessAllocationDecider.canForceAllocateDuringReplace * Revert changes to ClusterRebalanceAllocationDecider * Change "no replacement" decision message in NodeReplacementAllocationDecider * Only construct shutdown map once in isReplacementSource * Make node shutdowns and target shutdowns available within RoutingAllocation * Add randomization for adding the filter that is overridden in test * Add integration test with replicas: 1 * Go nuts with the verbosity of allocation decisions * Also check NODE_C in unit test * Test with randomly assigned shard * Fix test for extra verbose decision messages * Remove canAllocate(IndexMetadat, RoutingNode, RoutingAllocation) overriding * Spotless :| * Implement 100% disk usage check during force-replace-allocate * Add rudimentary documentation for "replace" shutdown type * Use RoutingAllocation shutdown map in BalancedShardsAllocator * Add canForceAllocateDuringReplace to AllocationDeciders & add test * Switch from percentage to bytes in DiskThresholdDecider force check * Enhance docs with note about rollover, creation, & shrink * Clarify decision messages, add test for target-only allocation * Simplify NodeReplacementAllocationDecider.replacementOngoing * Start nodeC before nodeB in integration test * Spotleeeessssssss! You get me every time! * Remove outdated comment
1 parent f16a699 commit 6e875d0

26 files changed

+928
-43
lines changed

docs/reference/shutdown/apis/shutdown-put.asciidoc

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,7 @@ Migrates ongoing tasks and index shards to other nodes as needed
2626
to prepare a node to be restarted or shut down and removed from the cluster.
2727
This ensures that {es} can be stopped safely with minimal disruption to the cluster.
2828

29-
You must specify the type of shutdown: `restart` or `remove`.
29+
You must specify the type of shutdown: `restart`, `remove`, or `replace`.
3030
If a node is already being prepared for shutdown,
3131
you can use this API to change the shutdown type.
3232

@@ -58,12 +58,16 @@ include::{es-repo-dir}/rest-api/common-parms.asciidoc[tag=timeoutparms]
5858

5959
`type`::
6060
(Required, string)
61-
Valid values are `restart` and `remove`.
61+
Valid values are `restart`, `remove`, or `replace`.
6262
Use `restart` when you need to temporarily shut down a node to perform an upgrade,
6363
make configuration changes, or perform other maintenance.
6464
Because the node is expected to rejoin the cluster, data is not migrated off of the node.
6565
Use `remove` when you need to permanently remove a node from the cluster.
6666
The node is not marked ready for shutdown until data is migrated off of the node
67+
Use `replace` to do a 1:1 replacement of a node with another node. Certain allocation decisions will
68+
be ignored (such as disk watermarks) in the interest of true replacement of the source node with the
69+
target node. During a replace-type shutdown, rollover and index creation may result in unassigned
70+
shards, and shrink may fail until the replacement is complete.
6771

6872
`reason`::
6973
(Required, string)
@@ -76,6 +80,13 @@ it does not affect the shut down process.
7680
Only valid if `type` is `restart`. Controls how long {es} will wait for the node to restart and join the cluster before reassigning its shards to other nodes. This works the same as
7781
<<delayed-allocation,delaying allocation>> with the `index.unassigned.node_left.delayed_timeout` setting. If you specify both a restart allocation delay and an index-level allocation delay, the longer of the two is used.
7882

83+
`target_node_name`::
84+
(Optional, string)
85+
Only valid if `type` is `replace`. Specifies the name of the node that is replacing the node being
86+
shut down. Shards from the shut down node are only allowed to be allocated to the target node, and
87+
no other data will be allocated to the target node. During relocation of data certain allocation
88+
rules are ignored, such as disk watermarks or user attribute filtering rules.
89+
7990
[[put-shutdown-api-example]]
8091
==== {api-examples-title}
8192

server/src/main/java/org/elasticsearch/cluster/ClusterModule.java

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,7 @@
3838
import org.elasticsearch.cluster.routing.allocation.decider.EnableAllocationDecider;
3939
import org.elasticsearch.cluster.routing.allocation.decider.FilterAllocationDecider;
4040
import org.elasticsearch.cluster.routing.allocation.decider.MaxRetryAllocationDecider;
41+
import org.elasticsearch.cluster.routing.allocation.decider.NodeReplacementAllocationDecider;
4142
import org.elasticsearch.cluster.routing.allocation.decider.NodeShutdownAllocationDecider;
4243
import org.elasticsearch.cluster.routing.allocation.decider.NodeVersionAllocationDecider;
4344
import org.elasticsearch.cluster.routing.allocation.decider.RebalanceOnlyWhenActiveAllocationDecider;
@@ -49,7 +50,6 @@
4950
import org.elasticsearch.cluster.routing.allocation.decider.SnapshotInProgressAllocationDecider;
5051
import org.elasticsearch.cluster.routing.allocation.decider.ThrottlingAllocationDecider;
5152
import org.elasticsearch.cluster.service.ClusterService;
52-
import org.elasticsearch.common.xcontent.ParseField;
5353
import org.elasticsearch.common.inject.AbstractModule;
5454
import org.elasticsearch.common.io.stream.NamedWriteable;
5555
import org.elasticsearch.common.io.stream.NamedWriteableRegistry.Entry;
@@ -60,6 +60,7 @@
6060
import org.elasticsearch.common.settings.Settings;
6161
import org.elasticsearch.common.util.concurrent.ThreadContext;
6262
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
63+
import org.elasticsearch.common.xcontent.ParseField;
6364
import org.elasticsearch.gateway.GatewayAllocator;
6465
import org.elasticsearch.indices.SystemIndices;
6566
import org.elasticsearch.ingest.IngestMetadata;
@@ -202,6 +203,7 @@ public static Collection<AllocationDecider> createAllocationDeciders(Settings se
202203
addAllocationDecider(deciders, new SnapshotInProgressAllocationDecider());
203204
addAllocationDecider(deciders, new RestoreInProgressAllocationDecider());
204205
addAllocationDecider(deciders, new NodeShutdownAllocationDecider());
206+
addAllocationDecider(deciders, new NodeReplacementAllocationDecider());
205207
addAllocationDecider(deciders, new FilterAllocationDecider(settings, clusterSettings));
206208
addAllocationDecider(deciders, new SameShardAllocationDecider(settings, clusterSettings));
207209
addAllocationDecider(deciders, new DiskThresholdDecider(settings, clusterSettings));

server/src/main/java/org/elasticsearch/cluster/metadata/SingleNodeShutdownMetadata.java

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@
1212
import org.elasticsearch.Version;
1313
import org.elasticsearch.cluster.AbstractDiffable;
1414
import org.elasticsearch.cluster.Diffable;
15+
import org.elasticsearch.common.Strings;
1516
import org.elasticsearch.common.io.stream.StreamInput;
1617
import org.elasticsearch.common.io.stream.StreamOutput;
1718
import org.elasticsearch.common.xcontent.ConstructingObjectParser;
@@ -114,7 +115,7 @@ private SingleNodeShutdownMetadata(
114115
if (targetNodeName != null && type != Type.REPLACE) {
115116
throw new IllegalArgumentException(new ParameterizedMessage("target node name is only valid for REPLACE type shutdowns, " +
116117
"but was given type [{}] and target node name [{}]", type, targetNodeName).getFormattedMessage());
117-
} else if (targetNodeName == null && type == Type.REPLACE) {
118+
} else if (Strings.hasText(targetNodeName) == false && type == Type.REPLACE) {
118119
throw new IllegalArgumentException("target node name is required for REPLACE type shutdowns");
119120
}
120121
this.targetNodeName = targetNodeName;

server/src/main/java/org/elasticsearch/cluster/routing/allocation/RoutingAllocation.java

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@
1212
import org.elasticsearch.cluster.ClusterState;
1313
import org.elasticsearch.cluster.RestoreInProgress;
1414
import org.elasticsearch.cluster.metadata.Metadata;
15+
import org.elasticsearch.cluster.metadata.SingleNodeShutdownMetadata;
1516
import org.elasticsearch.cluster.node.DiscoveryNodes;
1617
import org.elasticsearch.cluster.routing.RoutingChangesObserver;
1718
import org.elasticsearch.cluster.routing.RoutingNodes;
@@ -24,6 +25,7 @@
2425
import org.elasticsearch.snapshots.RestoreService.RestoreInProgressUpdater;
2526
import org.elasticsearch.snapshots.SnapshotShardSizeInfo;
2627

28+
import java.util.Collections;
2729
import java.util.HashMap;
2830
import java.util.HashSet;
2931
import java.util.Map;
@@ -71,6 +73,9 @@ public class RoutingAllocation {
7173
nodesChangedObserver, indexMetadataUpdater, restoreInProgressUpdater
7274
);
7375

76+
private final Map<String, SingleNodeShutdownMetadata> nodeShutdowns;
77+
private final Map<String, SingleNodeShutdownMetadata> nodeReplacementTargets;
78+
7479

7580
/**
7681
* Creates a new {@link RoutingAllocation}
@@ -90,6 +95,14 @@ public RoutingAllocation(AllocationDeciders deciders, RoutingNodes routingNodes,
9095
this.clusterInfo = clusterInfo;
9196
this.shardSizeInfo = shardSizeInfo;
9297
this.currentNanoTime = currentNanoTime;
98+
this.nodeShutdowns = metadata.nodeShutdowns();
99+
Map<String, SingleNodeShutdownMetadata> targetNameToShutdown = new HashMap<>();
100+
for (SingleNodeShutdownMetadata shutdown : this.nodeShutdowns.values()) {
101+
if (shutdown.getType() == SingleNodeShutdownMetadata.Type.REPLACE) {
102+
targetNameToShutdown.put(shutdown.getTargetNodeName(), shutdown);
103+
}
104+
}
105+
this.nodeReplacementTargets = Collections.unmodifiableMap(targetNameToShutdown);
93106
}
94107

95108
/** returns the nano time captured at the beginning of the allocation. used to make sure all time based decisions are aligned */
@@ -145,6 +158,20 @@ public SnapshotShardSizeInfo snapshotShardSizeInfo() {
145158
return shardSizeInfo;
146159
}
147160

161+
/**
162+
* Returns the map of node id to shutdown metadata currently in the cluster
163+
*/
164+
public Map<String, SingleNodeShutdownMetadata> nodeShutdowns() {
165+
return this.nodeShutdowns;
166+
}
167+
168+
/**
169+
* Returns a map of target node name to replacement shutdown
170+
*/
171+
public Map<String, SingleNodeShutdownMetadata> replacementTargetShutdowns() {
172+
return this.nodeReplacementTargets;
173+
}
174+
148175
@SuppressWarnings("unchecked")
149176
public <T extends ClusterState.Custom> T custom(String key) {
150177
return (T) customs.get(key);

server/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java

Lines changed: 29 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,7 @@
1414
import org.apache.lucene.util.IntroSorter;
1515
import org.elasticsearch.cluster.metadata.IndexMetadata;
1616
import org.elasticsearch.cluster.metadata.Metadata;
17+
import org.elasticsearch.cluster.metadata.SingleNodeShutdownMetadata;
1718
import org.elasticsearch.cluster.routing.RoutingNode;
1819
import org.elasticsearch.cluster.routing.RoutingNodes;
1920
import org.elasticsearch.cluster.routing.ShardRouting;
@@ -30,12 +31,12 @@
3031
import org.elasticsearch.cluster.routing.allocation.decider.Decision;
3132
import org.elasticsearch.cluster.routing.allocation.decider.Decision.Type;
3233
import org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider;
33-
import org.elasticsearch.core.Tuple;
3434
import org.elasticsearch.common.inject.Inject;
3535
import org.elasticsearch.common.settings.ClusterSettings;
3636
import org.elasticsearch.common.settings.Setting;
3737
import org.elasticsearch.common.settings.Setting.Property;
3838
import org.elasticsearch.common.settings.Settings;
39+
import org.elasticsearch.core.Tuple;
3940
import org.elasticsearch.gateway.PriorityComparator;
4041

4142
import java.util.ArrayList;
@@ -47,6 +48,7 @@
4748
import java.util.List;
4849
import java.util.Map;
4950
import java.util.Set;
51+
import java.util.function.BiFunction;
5052
import java.util.stream.StreamSupport;
5153

5254
import static org.elasticsearch.cluster.routing.ShardRoutingState.RELOCATING;
@@ -671,7 +673,6 @@ public MoveDecision decideMove(final ShardRouting shardRouting) {
671673
return MoveDecision.NOT_TAKEN;
672674
}
673675

674-
final boolean explain = allocation.debugDecision();
675676
final ModelNode sourceNode = nodes.get(shardRouting.currentNodeId());
676677
assert sourceNode != null && sourceNode.containsShard(shardRouting);
677678
RoutingNode routingNode = sourceNode.getRoutingNode();
@@ -687,15 +688,29 @@ public MoveDecision decideMove(final ShardRouting shardRouting) {
687688
* This is not guaranteed to be balanced after this operation we still try best effort to
688689
* allocate on the minimal eligible node.
689690
*/
691+
MoveDecision moveDecision = decideMove(shardRouting, sourceNode, canRemain, this::decideCanAllocate);
692+
if (moveDecision.canRemain() == false && moveDecision.forceMove() == false) {
693+
final SingleNodeShutdownMetadata shutdown = allocation.nodeShutdowns().get(shardRouting.currentNodeId());
694+
final boolean shardsOnReplacedNode = shutdown != null &&
695+
shutdown.getType().equals(SingleNodeShutdownMetadata.Type.REPLACE);
696+
if (shardsOnReplacedNode) {
697+
return decideMove(shardRouting, sourceNode, canRemain, this::decideCanForceAllocateForVacate);
698+
}
699+
}
700+
return moveDecision;
701+
}
702+
703+
private MoveDecision decideMove(ShardRouting shardRouting, ModelNode sourceNode, Decision remainDecision,
704+
BiFunction<ShardRouting, RoutingNode, Decision> decider) {
705+
final boolean explain = allocation.debugDecision();
690706
Type bestDecision = Type.NO;
691707
RoutingNode targetNode = null;
692708
final List<NodeAllocationResult> nodeExplanationMap = explain ? new ArrayList<>() : null;
693709
int weightRanking = 0;
694710
for (ModelNode currentNode : sorter.modelNodes) {
695711
if (currentNode != sourceNode) {
696712
RoutingNode target = currentNode.getRoutingNode();
697-
// don't use canRebalance as we want hard filtering rules to apply. See #17698
698-
Decision allocationDecision = allocation.deciders().canAllocate(shardRouting, target, allocation);
713+
Decision allocationDecision = decider.apply(shardRouting, target);
699714
if (explain) {
700715
nodeExplanationMap.add(new NodeAllocationResult(
701716
currentNode.getRoutingNode().node(), allocationDecision, ++weightRanking));
@@ -715,10 +730,19 @@ public MoveDecision decideMove(final ShardRouting shardRouting) {
715730
}
716731
}
717732

718-
return MoveDecision.cannotRemain(canRemain, AllocationDecision.fromDecisionType(bestDecision),
733+
return MoveDecision.cannotRemain(remainDecision, AllocationDecision.fromDecisionType(bestDecision),
719734
targetNode != null ? targetNode.node() : null, nodeExplanationMap);
720735
}
721736

737+
private Decision decideCanAllocate(ShardRouting shardRouting, RoutingNode target) {
738+
// don't use canRebalance as we want hard filtering rules to apply. See #17698
739+
return allocation.deciders().canAllocate(shardRouting, target, allocation);
740+
}
741+
742+
private Decision decideCanForceAllocateForVacate(ShardRouting shardRouting, RoutingNode target) {
743+
return allocation.deciders().canForceAllocateDuringReplace(shardRouting, target, allocation);
744+
}
745+
722746
/**
723747
* Builds the internal model from all shards in the given
724748
* {@link Iterable}. All shards in the {@link Iterable} must be assigned

server/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/AllocationDecider.java

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -104,4 +104,22 @@ public Decision canForceAllocatePrimary(ShardRouting shardRouting, RoutingNode n
104104
return decision;
105105
}
106106
}
107+
108+
/**
109+
* Returns a {@link Decision} whether the given shard can be forced to the
110+
* given node in the event that the shard's source node is being replaced.
111+
* This allows nodes using a replace-type node shutdown to
112+
* override certain deciders in the interest of moving the shard away from
113+
* a node that *must* be removed.
114+
*
115+
* It defaults to returning "YES" and must be overridden by deciders that
116+
* opt-out to having their other NO decisions *not* overridden while vacating.
117+
*
118+
* The caller is responsible for first checking:
119+
* - that a replacement is ongoing
120+
* - the shard routing's current node is the source of the replacement
121+
*/
122+
public Decision canForceAllocateDuringReplace(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
123+
return Decision.YES;
124+
}
107125
}

server/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/AllocationDeciders.java

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -212,6 +212,25 @@ public Decision canForceAllocatePrimary(ShardRouting shardRouting, RoutingNode n
212212
return ret;
213213
}
214214

215+
@Override
216+
public Decision canForceAllocateDuringReplace(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
217+
Decision.Multi ret = new Decision.Multi();
218+
for (AllocationDecider allocationDecider : allocations) {
219+
Decision decision = allocationDecider.canForceAllocateDuringReplace(shardRouting, node, allocation);
220+
// short track if a NO is returned.
221+
if (decision.type() == Decision.Type.NO) {
222+
if (allocation.debugDecision() == false) {
223+
return Decision.NO;
224+
} else {
225+
ret.add(decision);
226+
}
227+
} else {
228+
addDecision(ret, decision, allocation);
229+
}
230+
}
231+
return ret;
232+
}
233+
215234
private void addDecision(Decision.Multi ret, Decision decision, RoutingAllocation allocation) {
216235
// We never add ALWAYS decisions and only add YES decisions when requested by debug mode (since Multi default is YES).
217236
if (decision != Decision.ALWAYS

server/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/AwarenessAllocationDecider.java

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -120,6 +120,14 @@ public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, Routing
120120
return underCapacity(shardRouting, node, allocation, true);
121121
}
122122

123+
@Override
124+
public Decision canForceAllocateDuringReplace(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
125+
// We need to meet the criteria for shard awareness even during a replacement so that all
126+
// copies of a shard do not get allocated to the same host/rack/AZ, so this explicitly
127+
// checks the awareness 'canAllocate' to ensure we don't violate that constraint.
128+
return canAllocate(shardRouting, node, allocation);
129+
}
130+
123131
@Override
124132
public Decision canRemain(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
125133
return underCapacity(shardRouting, node, allocation, false);

server/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/DiskThresholdDecider.java

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -317,6 +317,33 @@ public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, Routing
317317
new ByteSizeValue(freeBytesAfterShard));
318318
}
319319

320+
@Override
321+
public Decision canForceAllocateDuringReplace(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
322+
ImmutableOpenMap<String, DiskUsage> usages = allocation.clusterInfo().getNodeMostAvailableDiskUsages();
323+
final Decision decision = earlyTerminate(allocation, usages);
324+
if (decision != null) {
325+
return decision;
326+
}
327+
328+
if (allocation.metadata().index(shardRouting.index()).ignoreDiskWatermarks()) {
329+
return YES_DISK_WATERMARKS_IGNORED;
330+
}
331+
332+
final DiskUsageWithRelocations usage = getDiskUsage(node, allocation, usages, false);
333+
final long shardSize = getExpectedShardSize(shardRouting, 0L,
334+
allocation.clusterInfo(), allocation.snapshotShardSizeInfo(), allocation.metadata(), allocation.routingTable());
335+
assert shardSize >= 0 : shardSize;
336+
final long freeBytesAfterShard = usage.getFreeBytes() - shardSize;
337+
if (freeBytesAfterShard < 0) {
338+
return Decision.single(Decision.Type.NO, NAME,
339+
"unable to force allocate shard to [%s] during replacement, " +
340+
"as allocating to this node would cause disk usage to exceed 100%% ([%s] bytes above available disk space)",
341+
node.nodeId(), -freeBytesAfterShard);
342+
} else {
343+
return super.canForceAllocateDuringReplace(shardRouting, node, allocation);
344+
}
345+
}
346+
320347
private static final Decision YES_NOT_MOST_UTILIZED_DISK = Decision.single(Decision.Type.YES, NAME,
321348
"this shard is not allocated on the most utilized disk and can remain");
322349

server/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/EnableAllocationDecider.java

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,8 +8,6 @@
88

99
package org.elasticsearch.cluster.routing.allocation.decider;
1010

11-
import java.util.Locale;
12-
1311
import org.elasticsearch.cluster.metadata.IndexMetadata;
1412
import org.elasticsearch.cluster.routing.RecoverySource;
1513
import org.elasticsearch.cluster.routing.RoutingNode;
@@ -20,6 +18,8 @@
2018
import org.elasticsearch.common.settings.Setting.Property;
2119
import org.elasticsearch.common.settings.Settings;
2220

21+
import java.util.Locale;
22+
2323
/**
2424
* This allocation decider allows shard allocations / rebalancing via the cluster wide settings
2525
* {@link #CLUSTER_ROUTING_ALLOCATION_ENABLE_SETTING} / {@link #CLUSTER_ROUTING_REBALANCE_ENABLE_SETTING} and the per index setting

server/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/MaxRetryAllocationDecider.java

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,4 +73,9 @@ public Decision canForceAllocatePrimary(ShardRouting shardRouting, RoutingNode n
7373
// if so, we don't want to force the primary allocation here
7474
return canAllocate(shardRouting, node, allocation);
7575
}
76+
77+
@Override
78+
public Decision canForceAllocateDuringReplace(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
79+
return canAllocate(shardRouting, node, allocation);
80+
}
7681
}

0 commit comments

Comments
 (0)