Write shard state metadata as soon as shard is created / initializing #16625
@@ -342,8 +342,6 @@ public void cleanFiles(int totalTranslogOps, Store.MetadataSnapshot sourceMetaDa
// first, we go and move files that were created with the recovery id suffix to
// the actual names, its ok if we have a corrupted index here, since we have replicas
// to recover from in case of a full cluster shutdown just when this code executes...
indexShard().deleteShardState(); // we have to delete it first since even if we fail to rename the shard |
@bleskes I'm not sure what effect removing this has. I removed it because the shard state metadata was written when the shard was created, then deleted again if the shard was a recovery target, and never rewritten afterwards because, from the point of view of IndexShard.persistMetadata(), the shard state metadata had not changed. Now that we write the shard state metadata directly, we know it is up to date before we start recovery (hence no need to delete the shard state?)
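The interaction described in this comment can be illustrated with a minimal sketch. The class, fields, and string allocation ids below are hypothetical stand-ins, not the actual IndexShard code; the sketch assumes that persistMetadata() skips the write whenever the state it is asked to persist equals the state it last wrote.

```java
import java.util.Objects;

// Hypothetical model, not Elasticsearch code: one shard copy on one node.
public class PersistMetadataSketch {
    private String onDisk;         // allocation id stored in the state file, null if the file is absent
    private String lastPersisted;  // what the shard believes it last wrote

    // Assumed behaviour: only write when the state changed from the shard's point of view.
    public void persistMetadata(String allocationId) {
        if (!Objects.equals(allocationId, lastPersisted)) {
            onDisk = allocationId;
            lastPersisted = allocationId;
        }
    }

    // Models deleting the state file: the file goes away, but the shard's
    // in-memory record of what it last wrote is left untouched.
    public void deleteShardState() {
        onDisk = null;
    }

    public String onDisk() {
        return onDisk;
    }
}
```

In this model, calling persistMetadata("alloc-1"), then deleteShardState(), then persistMetadata("alloc-1") again leaves onDisk() returning null: the second call is skipped because nothing changed from the shard's point of view, so the deleted file is never rewritten.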
9b0988e to 8b252b7
@bleskes ping
assert nodeShardState.allocationId() == null : "Allocation id and legacy version cannot be both present";
logger.trace("[{}] on node [{}] has version [{}] of shard", shard, nodeShardState.getNode(), version);
} else {
// shard was already selected in a 3.x cluster as best candidate for recovery but did not make it to STARTED state |
If I understand this correctly, this part is relevant when we assigned a primary after a cluster upgrade and the shard initialized (and wrote a new state file), but we never got around to activating it before crashing again. If that's correct, can you add this to the comment?
correct
Change looks good to me. Left some suggestions and questions re testing.
fe713f0 to ef3f69e
Pushed another commit addressing review comments. Also found a copy-paste bug in a test.
LGTM. Thanks @ywelsch
As we rely on active allocation ids persisted in the cluster state to select the primary shard copy, we can write shard state metadata on the allocated node as soon as the node knows about receiving this shard. This also ensures that in case of primary relocation, when the relocation target is marked as started by the master node, the shard state metadata with the correct allocation id has already been written on the relocation target. Before this change, shard state metadata was only written once the node knows it is marked as started. In case of failures between master marking the node as started and the node receiving and processing this event, the relation between the shard copy on disk and the cluster state could get lost. This means that manual allocation of the shard using the reroute command allocate_stale_primary was necessary. Closes elastic#16625
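The failure window described in this commit message can be sketched as a small model. Everything below is a hypothetical illustration rather than Elasticsearch code: the Policy enum, the method names, and the use of plain strings for allocation ids are assumptions made only to show why writing the metadata at initialization closes the gap.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical model of the timing problem, not Elasticsearch code.
public class ShardStateTimingSketch {
    enum Policy { WRITE_WHEN_STARTED, WRITE_WHEN_INITIALIZING }

    // Returns the allocation id found on disk if the node crashes after the master
    // has marked the shard as started but before the node has processed that event.
    static Optional<String> onDiskAfterCrash(Policy policy, String allocationId) {
        Optional<String> onDisk = Optional.empty();
        // Step 1: the node learns it has been assigned the shard (initializing).
        if (policy == Policy.WRITE_WHEN_INITIALIZING) {
            onDisk = Optional.of(allocationId);
        }
        // Step 2: the master marks the shard as started; the node crashes before
        // processing the event, so a write tied to "started" never happens.
        return onDisk;
    }

    // A copy can be selected as primary only if its on-disk allocation id is among
    // the active allocation ids recorded in the cluster state.
    static boolean canSelectAsPrimary(Optional<String> onDisk, Set<String> activeAllocationIds) {
        return onDisk.map(activeAllocationIds::contains).orElse(false);
    }
}
```

With activeAllocationIds containing "alloc-1", the WRITE_WHEN_STARTED policy leaves nothing on disk to match, so canSelectAsPrimary returns false and the copy would need manual allocation. With WRITE_WHEN_INITIALIZING the id is already on disk and the copy can be matched against the cluster state.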
ef3f69e to d76161d
…e-metadata Write shard state metadata as soon as shard is created / initializing
As we now rely on active allocation ids persisted in the cluster state to select
the primary shard copy, we can write shard state metadata on the allocated node
as soon as the node knows about receiving this shard. This also ensures that
in case of primary relocation, when the relocation target is marked as started
by the master node, the shard state metadata with the correct allocation id has
already been written on the relocation target. Before this change, shard state
metadata was only written once the node knows it is marked as started. In case
of failures between master marking the node as started and the node
receiving and processing this event, the relation between the shard copy on disk
and the cluster state could get lost. This means that manual allocation of
the shard using the reroute command allocate_stale_primary was necessary.
Relates to #14739