Use primary terms as authority to fail shards #19715
Conversation
@@ -308,6 +308,11 @@ public static boolean isConflictException(Throwable t) {
    ShardRouting routingEntry();

    /**
     * primary term for this primary (fixed when primary shard is created)
     */
    long primaryTerm();
this also means that we can fold setting the primary term on a request into the ReplicationOperation - we currently only assert that it was set.
conversely, we could as well use the primaryTerm on the replicaRequest when failing the shard. This means that we don't need to extend the `Primary` interface with `primaryTerm`.
+1 to explore that option. Although I have always doubted whether we should set the primary terms at the ReplicationOperation level, I didn't do it because it wasn't needed; I reluctantly opted to keep things as simple as possible and left it at the assert we have now. With the move to primary-term-based failures we bring the terms into scope, which makes me more inclined to also set them. As said, I'm still on the fence about it myself, so I will go along with not doing it if you feel differently.
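A minimal sketch of the alternative being discussed here (the replication operation stamps the term onto the replica request, and the shard-fail path reads it back from that request), using simplified stand-in types rather than the real Elasticsearch classes:

```java
// Simplified stand-ins; not the actual ReplicationOperation / ReplicaRequest classes.
class ReplicaRequestSketch {
    long primaryTerm; // set once per operation by the primary
}

class ReplicationOperationSketch {
    private final long primaryTerm; // fixed when the primary shard was created

    ReplicationOperationSketch(long primaryTerm) {
        this.primaryTerm = primaryTerm;
    }

    void replicate(ReplicaRequestSketch request) {
        // folded in here instead of only asserting that the caller set it
        request.primaryTerm = primaryTerm;
        // ... send the request to each replica ...
    }

    void onReplicaFailure(ReplicaRequestSketch request, String replicaAllocationId) {
        // the term travels with the request, so Primary would not need a primaryTerm() method
        failShard(replicaAllocationId, request.primaryTerm);
    }

    private void failShard(String allocationId, long primaryTerm) {
        System.out.printf("ask master to fail %s, primary term %d%n", allocationId, primaryTerm);
    }
}
```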
Thanks @ywelsch. I started having a look, but I wonder why the complete shard routing for the failing/starting shard needs to be removed in favor of the new ShardAllocationId class. I like the full ShardRouting because it allows us to log more debug information / throw exceptions with more information when needed. What does the extra class buy us?
* @param shardRouting the shard to fail
* @param sourceShardRouting the source shard requesting the failure (must be the shard itself, or the primary shard)
* @param shardRouting the shard to fail
* @param primaryTerm the primary term associated with the primary shard when the shard is failed by the primary. Use 0L if the
Using 0 as a non-value (because we serialize using `out.writeVLong`) seems brittle to me. I'd rather use -1 and serialize using a normal `out.writeLong`. With 0 we will always be asking ourselves if it's a valid value. We could also potentially have two methods here - failReplica and failShard, where failReplica requires a valid primary term and failShard doesn't take one (but uses -1 internally) so users won't need to know about this implementation detail.
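A rough sketch of the suggested failReplica / failShard split, keeping the -1 sentinel as an internal detail; names and types here are simplified stand-ins, not the actual ShardStateAction API:

```java
// Hedged sketch: callers that are not a primary never see the "no term" sentinel.
public class ShardFailureApiSketch {

    public static final long NO_PRIMARY_TERM = -1L; // serialized with writeLong, not writeVLong

    // Called by a primary that failed to replicate a write to one of its replicas.
    public void failReplica(String shardId, String allocationId, long primaryTerm, String message) {
        assert primaryTerm > 0 : "a replica failure must carry a valid primary term";
        sendShardFailed(shardId, allocationId, primaryTerm, message);
    }

    // Called for any other failure (e.g. a shard failing itself); no term is required.
    public void failShard(String shardId, String allocationId, String message) {
        sendShardFailed(shardId, allocationId, NO_PRIMARY_TERM, message);
    }

    private void sendShardFailed(String shardId, String allocationId, long primaryTerm, String message) {
        // transport request to the master would go here
        System.out.printf("fail %s/%s term=%d: %s%n", shardId, allocationId, primaryTerm, message);
    }
}
```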
@bleskes In a follow-up PR, I want to extend
boolean changed = false;
// as failing primaries also fail associated replicas, we fail replicas first here so that their nodes are added to ignore list |
why do you feel this is not important?
@ywelsch discussed this and I asked to remove
@bleskes I've pushed 3c0172b that makes
        batchResultBuilder.success(task);
    } else {
        // non-local requests
        if (task.primaryTerm > 0 && task.primaryTerm != indexMetaData.primaryTerm(task.shardId.id())) {
can we assert in the if clause that the primary term is smaller than the current one?
I like how this turned out a lot. I left some minor comments and a few important ones.
@bleskes I've pushed another commit addressing review comments. It mainly contains improvements to the logging.
if (task.primaryTerm > 0) {
    long currentPrimaryTerm = indexMetaData.primaryTerm(task.shardId.id());
    if (currentPrimaryTerm != task.primaryTerm) {
        assert currentPrimaryTerm >= task.primaryTerm : "received a primary term with a higher term than in the " +
nit: `>=` -> `>` (equality is not an option :))
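For reference, a self-contained sketch of this master-side check with the nit applied (the assert uses > rather than >=); the task and metadata types are simplified stand-ins for the real cluster-state task classes:

```java
// Terms in the cluster state only grow, so a mismatching term from a sender must be stale.
class FailShardTaskSketch {
    final int shardId;
    final long primaryTerm; // 0 means "no term supplied"

    FailShardTaskSketch(int shardId, long primaryTerm) {
        this.shardId = shardId;
        this.primaryTerm = primaryTerm;
    }
}

class ShardFailedExecutorSketch {
    private final long[] currentPrimaryTerms; // per shard id, as found in the index metadata

    ShardFailedExecutorSketch(long[] currentPrimaryTerms) {
        this.currentPrimaryTerms = currentPrimaryTerms;
    }

    boolean shouldApply(FailShardTaskSketch task) {
        if (task.primaryTerm > 0) {
            long currentPrimaryTerm = currentPrimaryTerms[task.shardId];
            if (currentPrimaryTerm != task.primaryTerm) {
                assert currentPrimaryTerm > task.primaryTerm
                        : "received a primary term newer than the one in the cluster state";
                return false; // stale sender: ignore the failure request
            }
        }
        return true; // no term supplied, or term matches the current primary
    }
}
```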
LGTM. great work. 🎉
44b144f to 4ee490a
A primary shard currently instructs the master to fail a replica shard that it fails to replicate writes to before acknowledging the writes to the client. To ensure that the primary instructing the master to fail the replica is still the current primary in the cluster state on the master, it submits not only the identity of the replica shard to fail to the master but also its own shard identity. This can be problematic however when the primary is relocating. After primary relocation handoff but before the primary relocation target is activated, the primary relocation target is replicating writes through the authority of the primary relocation source. This means that the primary relocation target should probably send the identity of the primary relocation source as authority. However, this is not good enough either, as primary shard activation and shard failure instructions can arrive out-of-order. This means that the relocation target would have to send both relocation source and target identity as authority.

Fortunately, there is another concept in the cluster state that represents this joint authority, namely primary terms. The primary term is only increased on initial assignment or when a replica is promoted. It stays the same however when a primary relocates. This commit changes ShardStateAction to rely on primary terms for shard authority.

It also changes the wire format to only transmit ShardId and allocation id of the shard to fail (instead of the full ShardRouting), so that the same action can be used in a subsequent PR to remove allocation ids from the active allocation set for which there exists no ShardRouting in the cluster anymore. Last but not least, this commit also makes AllocationService less lenient, requiring ShardRouting instances that are passed to its applyStartedShards and applyFailedShards methods to exist in the routing table. ShardStateAction, which is calling these methods, now has the responsibility to resolve the ShardRouting objects that are to be started / failed, and remove duplicates.
4ee490a to 1128707
@bleskes Thanks for the review and the suggestion to resolve ShardRouting instances in ShardStateAction! This creates a clear separation of concerns between AllocationService and ShardStateAction.
…me batch update

PR #19715 made AllocationService less lenient, requiring ShardRouting instances that are passed to its applyStartedShards and applyFailedShards methods to exist in the routing table. As primary shard failures also fail initializing replica shards, concurrent replica shard failures that are treated in the same cluster state update might not reference existing replica entries in the routing table anymore. To solve this, PR #19715 ordered the failures by handling replica failures before primary failures. There are other failures that influence more than one routing entry, however. When we have a failed shard entry for both a relocation source and target, then, depending on the order, either one or the other might point to an outdated shard entry. As finding a good order is more difficult than applying the failures, this commit re-adds parts of the ShardRouting re-resolve logic so that the applyFailedShards method can properly treat shard failure batches.
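A rough sketch of the re-resolve idea from this follow-up commit, with simplified types: each failure in the batch is looked up again in the current (mutable) routing state before being applied, so entries already removed by an earlier failure in the same batch are skipped rather than rejected:

```java
import java.util.List;
import java.util.Map;

class BatchFailureSketch {

    // What the master receives per failure: just the shard id and the allocation id.
    record FailedShard(String shardId, String allocationId) {}

    // routing maps allocation id -> current routing entry (a plain String here for brevity)
    static void applyFailures(Map<String, String> routing, List<FailedShard> failures) {
        for (FailedShard failure : failures) {
            String current = routing.get(failure.allocationId());
            if (current == null) {
                // already gone, e.g. failed as a side effect of an earlier entry in this batch
                continue;
            }
            routing.remove(failure.allocationId()); // stands in for applyFailedShards on one entry
        }
    }
}
```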
A primary shard currently instructs the master to fail a replica shard that it fails to replicate writes to before acknowledging the writes to the client. To ensure that the primary instructing the master to fail the replica is still the current primary in the cluster state on the master, it submits not only the identity of the replica shard to fail to the master but also its own shard identity. This can be problematic however when the primary is relocating. After primary relocation handoff but before the primary relocation target is activated, the primary relocation target is replicating writes through the authority of the primary relocation source. This means that the primary relocation target should probably send the identity of the primary relocation source as authority. However, this is not good enough either, as primary shard activation and shard failure instructions can arrive out-of-order. This means that the relocation target would have to send both relocation source and target identity as authority. Fortunately, there is another concept in the cluster state that represents this joint authority, namely primary terms. The primary term is only increased on initial assignment or when a replica is promoted. It stays the same however when a primary relocates.
This PR changes ShardStateAction to rely on primary terms for shard authority. It also changes the wire format to only transmit `ShardId` and allocation id of the shard to fail (instead of the full `ShardRouting`), so that the same action can be used in a subsequent PR to remove allocation ids from the active allocation set for which there exists no `ShardRouting` in the cluster anymore. Last but not least, this PR also makes AllocationService less lenient, requiring ShardRouting instances that are passed to its applyStartedShards and applyFailedShards methods to exist in the routing table. ShardStateAction, which is calling these methods, now has the responsibility to resolve the ShardRouting objects that are to be started / failed, and to remove duplicates.
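To illustrate the slimmed-down wire format described above, a hypothetical minimal shard-failed message carrying only what the master needs; the field names are illustrative and not the actual serialization code:

```java
// Hedged sketch: identifies the shard copy to fail without shipping a full ShardRouting.
class ShardFailedMessageSketch {
    final String index;
    final int shardId;
    final String allocationId; // which copy of the shard to fail
    final long primaryTerm;    // authority of the sender; 0 if the failure is not term-based
    final String message;

    ShardFailedMessageSketch(String index, int shardId, String allocationId,
                             long primaryTerm, String message) {
        this.index = index;
        this.shardId = shardId;
        this.allocationId = allocationId;
        this.primaryTerm = primaryTerm;
        this.message = message;
    }
}
```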