Skip to content

[CCR] Record follower index historyUUIDs #34078

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

martijnvg
Copy link
Member

and verify on each bulk shard operation execution that these history UUIDs did not change.

The tricky bit in this change is that adds another async step to the create and follow api (soon to be renamed to follow api).
After the follower index has been created an its shards have started, the history UUIDs of the follower shards need be
fetched and stored in the follower index's custom metadata.

The follow api (soon to be renamed resume follow api) will fail if follower index shard history UUIDs are missing. (leader index shard UUIDs too)

Closes #33956

and verify on each bulk shard operation execution that these history UUIDs did not change.

The tricky bit in this change is that adds another async step to the create and follow api (soon to be renamed to follow api).
After the follower index has been created an its shards have started, the history UUIDs of the follower shards need be
fetched and stored in the follower index's custom metadata.

The follow api (soon to be renamed resume follow api) will fail if follower index shard history UUIDs are missing.
(leader index shard UUIDs too)

Closes elastic#33956
@martijnvg martijnvg added review :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels Sep 26, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@bleskes
Copy link
Contributor

bleskes commented Sep 26, 2018

I'm a bit on the fence about recording the history uuids in the index meta data. I get the need now, but I'm wondering whether long term we want to remove this. I suspect we will transfer history uuids with the future recovery mechanism. This would mean that we need to continuously validate that the history of the source with the one of the target with each operation and we will not need to store anything.

With that in mind, I want to understand if we're viewing this a temporary mechanism (i.e., accept compromises) and if so whether we really want to do it now rather than wait. As said, I'm on the fence myself.

@martijnvg
Copy link
Member Author

Good question. I don't feel comfortable with releasing CCR without the follower index UUID check that this change adds. I agree that this is a temporary solution until there is the new ccr recovery mechanism is added. So when that is added, we can remove the check that this PR is adding?

@bleskes
Copy link
Contributor

bleskes commented Sep 27, 2018

That's my sentiment too. I that case I'm ok with the async extra call and there is no need to invest in cleaner solutions that will take more refactoring (will review shortly)

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. I think we need some more discussions here in the sync

@@ -50,13 +50,15 @@
public static final ParseField MAX_WRITE_BUFFER_SIZE = new ParseField("max_write_buffer_size");
public static final ParseField MAX_RETRY_DELAY = new ParseField("max_retry_delay");
public static final ParseField POLL_TIMEOUT = new ParseField("poll_timeout");
public static final ParseField RECORDED_HISTORY_UUID = new ParseField("recorded_history_uuid");
public static final ParseField RECORDED_LEADER_HISTORY_UUID = new ParseField("recorded_leader_history_uuid");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these properties of the Task - they are properties of the index (meta data)? we should get them from there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We can remove them here and then make the IndexMetaData of the follower available in ShardFollowNodeTask. Are you ok if I do this in a follow up? (it affect the UUIDs of both leader and follower index)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with a follow up

));
}
});
});
} else {
listener.onResponse(new CreateAndFollowIndexAction.Response(true, false, false));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondering if we should throw an exception here. Normally when something isn't acked, if you wait it will go through (or at has a chance of doing it). This one will just get stuck.

super(shardId);
setRefreshPolicy(RefreshPolicy.NONE);
this.expectedHistoryUUID = expectedHistoryUUID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can call this history uuid, as it is the one that the ops belong to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(working towards how we want it in the future)

final List<Translog.Operation> sourceOperations,
final long maxSeqNoOfUpdatesOrDeletes,
final IndexShard primary,
final Logger logger) throws IOException {
if (expectedHistoryUUID.equalsIgnoreCase(primary.getHistoryUUID()) == false) {
throw new IllegalStateException("unexpected history uuid, expected [" + expectedHistoryUUID +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add "shard is likely restored from snapshot or force allocated".

@bleskes
Copy link
Contributor

bleskes commented Oct 1, 2018

I talked this through with @ywelsch and we came to the conclusion that those history uuids of the follower can be captured on the instance level of the ShardFollowNodeTask. They basically mean that the information in that class (which was captured when it was bootstrapped) is safe to use. This implies that we can capture the history uuid when starting the task (like we do with the global checkpoint) and store it in a field of the task instance and use that to validate. Any error there should cause the task to pause and people can use resume to force another task to be started which can safely proceed.

@martijnvg
Copy link
Member Author

@bleskes I've made the following changes:

  • Don't record leader history uuid in the ShardFollowTask, but fetch it from the follower index's custom metadata.
  • Bootstrap the follower history uuid when the shard follow task is started instead of recording it in ShardFollowTask when put follow api is executed.

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! nice work.

@@ -96,6 +96,7 @@
public static final String CCR_THREAD_POOL_NAME = "ccr";
public static final String CCR_CUSTOM_METADATA_KEY = "ccr";
public static final String CCR_CUSTOM_METADATA_LEADER_INDEX_SHARD_HISTORY_UUIDS = "leader_index_shard_history_uuids";
public static final String CCR_CUSTOM_METADATA_FOLLOWER_INDEX_SHARD_HISTORY_UUIDS = "follower_index_shard_history_uuids";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used any more?

@@ -132,8 +132,9 @@ public boolean isCcrAllowed() {
final Client leaderClient = client.getRemoteClusterClient(clusterAlias);
hasPrivilegesToFollowIndices(leaderClient, new String[] {leaderIndex}, e -> {
if (e == null) {
fetchLeaderHistoryUUIDs(leaderClient, leaderIndexMetaData, onFailure, historyUUIDs ->
consumer.accept(historyUUIDs, leaderIndexMetaData));
fetchHistoryUUIDs(leaderClient, leaderIndexMetaData, onFailure, historyUUIDs -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change is unneeded now?

public void fetchLeaderHistoryUUIDs(
final Client leaderClient,
final IndexMetaData leaderIndexMetaData,
public void fetchHistoryUUIDs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert all of this?


IndexMetaData followIndexMetaData = clusterService.state().metaData().index(params.getFollowShardId().getIndex());
Map<String, String> ccrIndexMetadata = followIndexMetaData.getCustomData(Ccr.CCR_CUSTOM_METADATA_KEY);
String[] recordedLeaderShardHistoryUUIDs = extractLeaderShardHistoryUUIDs(ccrIndexMetadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - maybe :

final String leaderHistoryUUID = getLeaderShardHistoryUUID(ccIndexMetadata][params.getLeaderShardId().id())]

instead of these 3 lines?

@@ -174,7 +174,7 @@ private void createFollowerIndex(
listener::onFailure);
// Can't use create index api here, because then index templates can alter the mappings / settings.
// And index templates could introduce settings / mappings that are incompatible with the leader index.
clusterService.submitStateUpdateTask("follow_index_action", new AckedClusterStateUpdateTask<Boolean>(request, handler) {
clusterService.submitStateUpdateTask("create_follow_index", new AckedClusterStateUpdateTask<Boolean>(request, handler) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_following_index ?

@martijnvg martijnvg merged commit 7f5c2f1 into elastic:master Oct 2, 2018
martijnvg added a commit that referenced this pull request Oct 2, 2018
The follower index shard history UUID will be fetched from the indices stats api when the shard follow task starts and will be provided with the bulk shard operation requests. The bulk shard operations api will fail if the provided history uuid is unequal to the actual history uuid.

No longer record the leader history uuid in shard follow task params, but rather use the leader history UUIDs directly from follower index's custom metadata. The resume follow api will remain to fail if leader index shard history UUIDs are missing.

Closes #33956
kcm pushed a commit that referenced this pull request Oct 30, 2018
The follower index shard history UUID will be fetched from the indices stats api when the shard follow task starts and will be provided with the bulk shard operation requests. The bulk shard operations api will fail if the provided history uuid is unequal to the actual history uuid.

No longer record the leader history uuid in shard follow task params, but rather use the leader history UUIDs directly from follower index's custom metadata. The resume follow api will remain to fail if leader index shard history UUIDs are missing.

Closes #33956
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >non-issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants