Use primary terms as authority to fail shards #19715
Conversation
@@ -308,6 +308,11 @@ public static boolean isConflictException(Throwable t) {
    ShardRouting routingEntry();

    /**
     * primary term for this primary (fixed when primary shard is created)
     */
    long primaryTerm();
this also means that we can fold setting the primary term on a request into the ReplicationOperation - we currently only assert that it was set.
conversely, we could as well use the primaryTerm on the replicaRequest when failing the shard. This means that we don't need to extend the `Primary` interface with `primaryTerm`.
+1 to explore that option. Although I have always doubted whether we should set the primary terms at the ReplicationOperation level, I didn't do it because it wasn't needed; I reluctantly opted to keep things as simple as possible and left it at the assert we have now. With the move to primary-term-based failures we bring the terms into scope, which makes me more inclined to also set them. As said, I'm still on the fence about it myself, so I will go along with not doing it if you feel differently.
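A minimal sketch of the alternative being discussed here (the replication operation stamps the term onto the replica request, and the shard-fail path reads it back from that request), using simplified stand-in types rather than the real Elasticsearch classes:

```java
// Simplified stand-ins; not the actual ReplicationOperation / ReplicaRequest classes.
class ReplicaRequestSketch {
    long primaryTerm; // set once per operation by the primary
}

class ReplicationOperationSketch {
    private final long primaryTerm; // fixed when the primary shard was created

    ReplicationOperationSketch(long primaryTerm) {
        this.primaryTerm = primaryTerm;
    }

    void replicate(ReplicaRequestSketch request) {
        // folded in here instead of only asserting that the caller set it
        request.primaryTerm = primaryTerm;
        // ... send the request to each replica ...
    }

    void onReplicaFailure(ReplicaRequestSketch request, String replicaAllocationId) {
        // the term travels with the request, so Primary would not need a primaryTerm() method
        failShard(replicaAllocationId, request.primaryTerm);
    }

    private void failShard(String allocationId, long primaryTerm) {
        System.out.printf("ask master to fail %s, primary term %d%n", allocationId, primaryTerm);
    }
}
```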
Thanks @ywelsch. I started having a look, but I wonder why the complete shard routing for the failing/starting shard needs to be removed in favor of the new ShardAllocationId class. I like the full ShardRouting because it allows us to log more debug information / throw exceptions with more information when needed. What does the extra class buy us?
* @param shardRouting the shard to fail
* @param sourceShardRouting the source shard requesting the failure (must be the shard itself, or the primary shard)
* @param shardRouting the shard to fail
* @param primaryTerm the primary term associated with the primary shard when the shard is failed by the primary. Use 0L if the
Using 0 as a non-value (because we serialize using `out.writeVLong`) seems brittle to me. I'd rather use -1 and serialize using a normal `out.writeLong`. With 0 we will always be asking ourselves if it's a valid value. We could also potentially have two methods here - failReplica and failShard, where failReplica requires a valid primary term and failShard doesn't take one (but uses -1 internally) so users won't need to know about this implementation detail.
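A rough sketch of the suggested failReplica / failShard split, keeping the -1 sentinel as an internal detail; names and types here are simplified stand-ins, not the actual ShardStateAction API:

```java
// Hedged sketch: callers that are not a primary never see the "no term" sentinel.
public class ShardFailureApiSketch {

    public static final long NO_PRIMARY_TERM = -1L; // serialized with writeLong, not writeVLong

    // Called by a primary that failed to replicate a write to one of its replicas.
    public void failReplica(String shardId, String allocationId, long primaryTerm, String message) {
        assert primaryTerm > 0 : "a replica failure must carry a valid primary term";
        sendShardFailed(shardId, allocationId, primaryTerm, message);
    }

    // Called for any other failure (e.g. a shard failing itself); no term is required.
    public void failShard(String shardId, String allocationId, String message) {
        sendShardFailed(shardId, allocationId, NO_PRIMARY_TERM, message);
    }

    private void sendShardFailed(String shardId, String allocationId, long primaryTerm, String message) {
        // transport request to the master would go here
        System.out.printf("fail %s/%s term=%d: %s%n", shardId, allocationId, primaryTerm, message);
    }
}
```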
@bleskes In a follow-up PR, I want to extend
boolean changed = false;
// as failing primaries also fail associated replicas, we fail replicas first here so that their nodes are added to ignore list |
why do you feel this is not important?
@ywelsch discussed this and I asked to remove
@bleskes I've pushed 3c0172b that makes
        batchResultBuilder.success(task);
    } else {
        // non-local requests
        if (task.primaryTerm > 0 && task.primaryTerm != indexMetaData.primaryTerm(task.shardId.id())) {
can we assert in the if clause that the primary term is smaller than the current one?
I like how this turned out a lot. I left some minor comments and a few important ones.
@bleskes I've pushed another commit addressing review comments. It mainly contains improvements to the logging.
if (task.primaryTerm > 0) {
    long currentPrimaryTerm = indexMetaData.primaryTerm(task.shardId.id());
    if (currentPrimaryTerm != task.primaryTerm) {
        assert currentPrimaryTerm >= task.primaryTerm : "received a primary term with a higher term than in the " +
nit: `>=` -> `>` (equality is not an option :))
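For reference, a self-contained sketch of this master-side check with the nit applied (the assert uses > rather than >=); the task and metadata types are simplified stand-ins for the real cluster-state task classes:

```java
// Terms in the cluster state only grow, so a mismatching term from a sender must be stale.
class FailShardTaskSketch {
    final int shardId;
    final long primaryTerm; // 0 means "no term supplied"

    FailShardTaskSketch(int shardId, long primaryTerm) {
        this.shardId = shardId;
        this.primaryTerm = primaryTerm;
    }
}

class ShardFailedExecutorSketch {
    private final long[] currentPrimaryTerms; // per shard id, as found in the index metadata

    ShardFailedExecutorSketch(long[] currentPrimaryTerms) {
        this.currentPrimaryTerms = currentPrimaryTerms;
    }

    boolean shouldApply(FailShardTaskSketch task) {
        if (task.primaryTerm > 0) {
            long currentPrimaryTerm = currentPrimaryTerms[task.shardId];
            if (currentPrimaryTerm != task.primaryTerm) {
                assert currentPrimaryTerm > task.primaryTerm
                        : "received a primary term newer than the one in the cluster state";
                return false; // stale sender: ignore the failure request
            }
        }
        return true; // no term supplied, or term matches the current primary
    }
}
```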
LGTM. great work. 🎉
44b144f to 4ee490a
A primary shard currently instructs the master to fail a replica shard that it fails to replicate writes to before acknowledging the writes to the client. To ensure that the primary instructing the master to fail the replica is still the current primary in the cluster state on the master, it submits not only the identity of the replica shard to fail to the master but also its own shard identity. This can be problematic however when the primary is relocating. After primary relocation handoff but before the primary relocation target is activated, the primary relocation target is replicating writes through the authority of the primary relocation source. This means that the primary relocation target should probably send the identity of the primary relocation source as authority. However, this is not good enough either, as primary shard activation and shard failure instructions can arrive out-of-order. This means that the relocation target would have to send both relocation source and target identity as authority.

Fortunately, there is another concept in the cluster state that represents this joint authority, namely primary terms. The primary term is only increased on initial assignment or when a replica is promoted. It stays the same however when a primary relocates. This commit changes ShardStateAction to rely on primary terms for shard authority.

It also changes the wire format to only transmit ShardId and allocation id of the shard to fail (instead of the full ShardRouting), so that the same action can be used in a subsequent PR to remove allocation ids from the active allocation set for which there exists no ShardRouting in the cluster anymore. Last but not least, this commit also makes AllocationService less lenient, requiring ShardRouting instances that are passed to its applyStartedShards and applyFailedShards methods to exist in the routing table. ShardStateAction, which is calling these methods, now has the responsibility to resolve the ShardRouting objects that are to be started / failed, and remove duplicates.
4ee490a to 1128707
@bleskes Thanks for the review and the suggestion to resolve ShardRouting instances in ShardStateAction! This creates a clear separation of concerns between AllocationService and ShardStateAction.
…me batch update

PR #19715 made AllocationService less lenient, requiring ShardRouting instances that are passed to its applyStartedShards and applyFailedShards methods to exist in the routing table. As primary shard failures also fail initializing replica shards, concurrent replica shard failures that are treated in the same cluster state update might not reference existing replica entries in the routing table anymore. To solve this, PR #19715 ordered the failures by handling replica failures before primary failures. There are other failures that influence more than one routing entry, however. When we have a failed shard entry for both a relocation source and target, then, depending on the order, either one or the other might point to an outdated shard entry. As finding a good order is more difficult than applying the failures, this commit re-adds parts of the ShardRouting re-resolve logic so that the applyFailedShards method can properly treat shard failure batches.
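A rough sketch of the re-resolve idea from this follow-up commit, with simplified types: each failure in the batch is looked up again in the current (mutable) routing state before being applied, so entries already removed by an earlier failure in the same batch are skipped rather than rejected:

```java
import java.util.List;
import java.util.Map;

class BatchFailureSketch {

    // What the master receives per failure: just the shard id and the allocation id.
    record FailedShard(String shardId, String allocationId) {}

    // routing maps allocation id -> current routing entry (a plain String here for brevity)
    static void applyFailures(Map<String, String> routing, List<FailedShard> failures) {
        for (FailedShard failure : failures) {
            String current = routing.get(failure.allocationId());
            if (current == null) {
                // already gone, e.g. failed as a side effect of an earlier entry in this batch
                continue;
            }
            routing.remove(failure.allocationId()); // stands in for applyFailedShards on one entry
        }
    }
}
```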
A primary shard currently instructs the master to fail a replica shard that it fails to replicate writes to before acknowledging the writes to the client. To ensure that the primary instructing the master to fail the replica is still the current primary in the cluster state on the master, it submits not only the identity of the replica shard to fail to the master but also its own shard identity. This can be problematic however when the primary is relocating. After primary relocation handoff but before the primary relocation target is activated, the primary relocation target is replicating writes through the authority of the primary relocation source. This means that the primary relocation target should probably send the identity of the primary relocation source as authority. However, this is not good enough either, as primary shard activation and shard failure instructions can arrive out-of-order. This means that the relocation target would have to send both relocation source and target identity as authority. Fortunately, there is another concept in the cluster state that represents this joint authority, namely primary terms. The primary term is only increased on initial assignment or when a replica is promoted. It stays the same however when a primary relocates.
This PR changes ShardStateAction to rely on primary terms for shard authority. It also changes the wire format to only transmit `ShardId` and allocation id of the shard to fail (instead of the full `ShardRouting`), so that the same action can be used in a subsequent PR to remove allocation ids from the active allocation set for which there exists no `ShardRouting` in the cluster anymore. Last but not least, this PR also makes AllocationService less lenient, requiring ShardRouting instances that are passed to its applyStartedShards and applyFailedShards methods to exist in the routing table. ShardStateAction, which is calling these methods, now has the responsibility to resolve the ShardRouting objects that are to be started / failed, and to remove duplicates.
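To illustrate the slimmed-down wire format described above, a hypothetical minimal shard-failed message carrying only what the master needs; the field names are illustrative and not the actual serialization code:

```java
// Hedged sketch: identifies the shard copy to fail without shipping a full ShardRouting.
class ShardFailedMessageSketch {
    final String index;
    final int shardId;
    final String allocationId; // which copy of the shard to fail
    final long primaryTerm;    // authority of the sender; 0 if the failure is not term-based
    final String message;

    ShardFailedMessageSketch(String index, int shardId, String allocationId,
                             long primaryTerm, String message) {
        this.index = index;
        this.shardId = shardId;
        this.allocationId = allocationId;
        this.primaryTerm = primaryTerm;
        this.message = message;
    }
}
```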