Use primary terms as authority to fail shards #19715


Conversation

@ywelsch (Contributor) commented Aug 1, 2016

A primary shard currently instructs the master to fail a replica shard that it fails to replicate writes to before acknowledging the writes to the client. To ensure that the primary instructing the master to fail the replica is still the current primary in the cluster state on the master, it submits not only the identity of the replica shard to fail to the master but also its own shard identity. This can be problematic however when the primary is relocating. After primary relocation handoff but before the primary relocation target is activated, the primary relocation target is replicating writes through the authority of the primary relocation source. This means that the primary relocation target should probably send the identity of the primary relocation source as authority. However, this is not good enough either, as primary shard activation and shard failure instructions can arrive out-of-order. This means that the relocation target would have to send both relocation source and target identity as authority. Fortunately, there is another concept in the cluster state that represents this joint authority, namely primary terms. The primary term is only increased on initial assignment or when a replica is promoted. It stays the same however when a primary relocates.

This PR changes ShardStateAction to rely on primary terms for shard authority. It also changes the wire format to only transmit ShardId and allocation id of the shard to fail (instead of the full ShardRouting), so that the same action can be used in a subsequent PR to remove allocation ids from the active allocation set for which there exist no ShardRouting in the cluster anymore. Last but not least, this PR also makes AllocationService less lenient, requiring ShardRouting instances that are passed to its applyStartedShards and applyFailedShards methods to exist in the routing table. ShardStateAction, which is calling these methods, now has the responsibility to resolve the ShardRouting objects that are to be started / failed, and remove duplicates.
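The invariant the description relies on can be illustrated with a toy simulation (a hypothetical sketch with made-up names, not Elasticsearch code): the primary term is bumped on initial assignment and on replica promotion, but not on relocation, so the relocation source and target share a single authority value.

```java
// Toy model of the primary-term invariant (illustrative only):
// assignment and promotion increase the term, relocation does not,
// so source and target of a relocation share one authority value.
final class PrimaryTermSim {
    private long term = 0;

    long initialAssignment() { return ++term; } // new primary assigned: bump term
    long promoteReplica()    { return ++term; } // replica promoted: bump term
    long relocatePrimary()   { return term; }   // handoff to target: term unchanged

    public static void main(String[] args) {
        PrimaryTermSim shard = new PrimaryTermSim();
        long t1 = shard.initialAssignment(); // term = 1
        long t2 = shard.relocatePrimary();   // still 1: source and target share authority
        long t3 = shard.promoteReplica();    // term = 2: the old primary's reports are now stale
        System.out.println(t1 + " " + t2 + " " + t3); // prints "1 1 2"
    }
}
```

This is why a single term value can stand in for "relocation source or target" where sending either shard identity alone would be ambiguous.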

@@ -308,6 +308,11 @@ public static boolean isConflictException(Throwable t) {
ShardRouting routingEntry();

/**
* primary term for this primary (fixed when primary shard is created)
*/
long primaryTerm();
Contributor

this also means that we can fold setting the primary term on a request into the ReplicationOperation - we currently only assert that it was set.

Contributor Author

Conversely, we could just as well use the primaryTerm on the replicaRequest when failing the shard. That way we don't need to extend the Primary interface with primaryTerm.

Contributor

+1 to explore that option. I always wondered whether we should set the primary terms at the ReplicationOperation level; I didn't do it because it wasn't needed, opting to keep things as simple as possible and leaving it at the assert we have now. With the move to primary-term-based failures we bring the terms into scope, which makes me more inclined to also set them. That said, I'm still on the fence myself, so I will go along with not doing it if you feel differently.

@bleskes (Contributor) commented Aug 1, 2016

Thanks @ywelsch. I started having a look, but I wonder why the complete shard routing for the failing/starting shard needs to be removed in favor of the new ShardAllocationId class. I like the full ShardRouting because it allows us to log debug messages and throw exceptions with more information when needed. What does the extra class buy us?

  * @param shardRouting       the shard to fail
- * @param sourceShardRouting the source shard requesting the failure (must be the shard itself, or the primary shard)
+ * @param primaryTerm        the primary term associated with the primary shard when the shard is failed by the primary. Use 0L if the
@bleskes (Contributor) Aug 1, 2016

Using 0 as a non-value (because we serialize using out.writeVLong) seems brittle to me. I'd rather use -1 and serialize using a normal out.writeLong. With 0 we will always be asking ourselves whether it's a valid value. We could also have two methods here, failReplica and failShard, where failReplica requires a valid primary term and failShard doesn't take one (but uses -1 internally), so users won't need to know about this implementation detail.
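The suggested split could look roughly like this (a hypothetical sketch with invented names and signatures, not the actual Elasticsearch API): failReplica insists on a valid term, while failShard keeps the -1 sentinel as an internal detail.

```java
// Illustrative sketch of the proposed failReplica/failShard split.
// Names, signatures, and return values are made up for demonstration.
final class ShardFailures {
    static final long NO_PRIMARY_TERM = -1L; // -1, not 0: an unambiguous non-value

    // callers that act as a primary must supply a valid term
    static String failReplica(String allocationId, long primaryTerm) {
        if (primaryTerm <= 0) {
            throw new IllegalArgumentException("valid primary term required");
        }
        return "fail " + allocationId + " under term " + primaryTerm;
    }

    // callers without primary authority never see the sentinel
    static String failShard(String allocationId) {
        return "fail " + allocationId + " under term " + NO_PRIMARY_TERM;
    }

    public static void main(String[] args) {
        System.out.println(failReplica("alloc-1", 3)); // prints "fail alloc-1 under term 3"
        System.out.println(failShard("alloc-2"));      // prints "fail alloc-2 under term -1"
    }
}
```

The design point is that the sentinel never leaks into calling code: only the serialization layer and the master-side validation need to know about it.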

@ywelsch (Contributor Author) commented Aug 1, 2016

@bleskes In a follow-up PR, I want to extend ShardStateAction with the capability for primary shards to mark shard copies as stale, some of which don't have a ShardRouting but for which we only know the ShardId and allocation id. The suggested interface gives us this uniform handling and allows batching with shard-failed as well. Furthermore, ShardAllocationId provides a lean interface for AllocationService.applyStartedShards / applyFailedShards, making it clear what information is actually used, and allows code sharing between the two (e.g. to resolve shards) as well as with the to-be-introduced method that removes active allocation ids. Regarding additional debug information, logging the ShardRouting at the sender and using task information to correlate requests at the receiver should be enough, I think?

boolean changed = false;
// as failing primaries also fail associated replicas, we fail replicas first here so that their nodes are added to ignore list
Contributor

why do you feel this is not important?

@bleskes (Contributor) commented Aug 2, 2016

@ywelsch and I discussed this, and I asked to remove ShardAllocationId in favor of doing all resolving in ShardStateAction. This also means that AllocationService becomes strict and can expect callers to supply shard routings that actually exist, rather than leniently trying to resolve what it can.
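The strict contract can be sketched in miniature (a toy model with hypothetical names, not the real AllocationService): routings that are not present in the routing table are rejected outright instead of being resolved or silently ignored.

```java
// Toy model of the "strict" AllocationService contract (illustrative only):
// callers must pass routings that exist; unknown routings are an error,
// pushing the resolve/dedup responsibility up to the caller (ShardStateAction).
import java.util.*;

final class StrictApply {
    static List<String> applyFailedShards(Set<String> routingTable, List<String> failedShards) {
        for (String shard : failedShards) {
            if (!routingTable.contains(shard)) {
                throw new IllegalStateException("shard " + shard + " does not exist in routing table");
            }
        }
        // ... apply the failures to a new routing table here ...
        return failedShards;
    }

    public static void main(String[] args) {
        Set<String> table = Set.of("shard0[P]", "shard0[R]");
        System.out.println(applyFailedShards(table, List.of("shard0[R]"))); // accepted
        try {
            applyFailedShards(table, List.of("shard1[P]")); // unknown routing
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```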

@ywelsch (Contributor Author) commented Aug 3, 2016

@bleskes I've pushed 3c0172b that makes AllocationService strict, expecting callers to supply shard routings that exist in the routing table.

batchResultBuilder.success(task);
} else {
// non-local requests
if (task.primaryTerm > 0 && task.primaryTerm != indexMetaData.primaryTerm(task.shardId.id())) {
Contributor

can we assert in the if clause that the primary term is smaller than the current one?

@bleskes (Contributor) commented Aug 3, 2016

I like a lot how this turned out. I left some minor comments but a few important ones.

@ywelsch (Contributor Author) commented Aug 3, 2016

@bleskes I've pushed another commit addressing review comments. It mainly contains improvements to the logging.

if (task.primaryTerm > 0) {
long currentPrimaryTerm = indexMetaData.primaryTerm(task.shardId.id());
if (currentPrimaryTerm != task.primaryTerm) {
assert currentPrimaryTerm >= task.primaryTerm : "received a primary term with a higher term than in the " +
Contributor

nit: >= -> > (equality is not an option :))
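Put together, the check with the tightened assert might look like this (an illustrative sketch with invented names, not the PR's code): inside the branch where the terms differ, equality is already excluded, so the assert can use a strict `>`.

```java
// Sketch of the master-side staleness check after the review nit
// (hypothetical names; a toy model of the validation logic).
final class TermValidation {
    static boolean isStale(long taskPrimaryTerm, long currentPrimaryTerm) {
        if (taskPrimaryTerm > 0 && taskPrimaryTerm != currentPrimaryTerm) {
            // a request can only carry an older term than the master's cluster
            // state, never a newer one; == is ruled out by the if above,
            // so strict inequality is the right assertion
            assert currentPrimaryTerm > taskPrimaryTerm
                : "received a primary term higher than in the cluster state";
            return true; // reject: the sender is no longer the authoritative primary
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isStale(2, 3)); // older term -> stale -> true
        System.out.println(isStale(3, 3)); // current term -> accepted -> false
        System.out.println(isStale(0, 3)); // no term (not sent by a primary) -> skipped -> false
    }
}
```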

@bleskes (Contributor) commented Aug 4, 2016

LGTM. great work. 🎉

@ywelsch ywelsch force-pushed the enhance/use-primary-term-for-shard-fail-authority branch from 44b144f to 4ee490a Compare August 4, 2016 09:57
@ywelsch ywelsch force-pushed the enhance/use-primary-term-for-shard-fail-authority branch from 4ee490a to 1128707 Compare August 4, 2016 09:59
@ywelsch ywelsch merged commit ede78ad into elastic:master Aug 4, 2016
@ywelsch (Contributor Author) commented Aug 4, 2016

@bleskes Thanks for the review and the suggestion to resolve ShardRouting instances in ShardStateAction! This creates a clear separation of concerns between AllocationService and ShardStateAction.

ywelsch added a commit to ywelsch/elasticsearch that referenced this pull request Aug 8, 2016
…me batch update

PR elastic#19715 made AllocationService less lenient, requiring ShardRouting instances that are passed to its applyStartedShards and applyFailedShards methods to exist in the routing table. As primary shard failures also fail initializing replica shards, concurrent replica shard failures that are handled in the same cluster state update might no longer reference existing replica entries in the routing table. To solve this, PR elastic#19715 ordered the failures so that replica failures are handled before primary failures. There are other failures that affect more than one routing entry, however: when we have a failed shard entry for both a relocation source and its target, then, depending on the order, either one or the other might point to an outdated shard entry. As finding a good order is more difficult than applying the failures, this commit re-adds parts of the ShardRouting re-resolve logic so that the applyFailedShards method can properly handle shard failure batches.
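The re-resolve step can be modeled with a toy routing table (a hypothetical sketch, not the actual implementation): each failure in the batch is looked up again by allocation id against the current, possibly already-mutated state, so earlier failures in the same batch invalidate stale entries instead of corrupting them.

```java
// Toy model of per-failure re-resolution in a batch (illustrative only):
// the routing table mutates as failures apply, so each entry is resolved
// against the current state; entries that no longer exist are skipped,
// which also deduplicates repeated failures for the same allocation id.
import java.util.*;

final class BatchFailures {
    static List<String> applyBatch(Map<String, String> routingTable, List<String> failures) {
        List<String> applied = new ArrayList<>();
        for (String allocationId : failures) {
            String current = routingTable.remove(allocationId); // re-resolve against current state
            if (current != null) {                              // skip stale / duplicate entries
                applied.add(current);
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        Map<String, String> routingTable = new HashMap<>();
        routingTable.put("alloc-p", "primary[shard0]");
        routingTable.put("alloc-r", "replica[shard0]");

        // duplicates are possible in a batch; the second "alloc-r" is ignored
        List<String> applied = applyBatch(routingTable, List.of("alloc-r", "alloc-p", "alloc-r"));
        System.out.println(applied); // prints "[replica[shard0], primary[shard0]]"
    }
}
```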
@lcawl added the :Distributed Indexing/Distributed label and removed the :Allocation label Feb 13, 2018
Labels: :Distributed Indexing/Distributed, >enhancement, v5.0.0-beta1