Skip to content

Associate translog with Lucene index commit for searchable snapshots shards #53459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented Mar 12, 2020

Today searchable snapshot shards can be restored from a snapshot, recovered from a peer or force-allocated as stale primary on data node without copying any files on disk. It works because SearchableSnapshotDirectory does not rely on local files on disk and because every time such a Directory is instantiated an empty translog is associated to it through a new Lucene commit (holds in memory).

The translog/lucene commit association is done within a cluster state update when the IndexShard's directory is created and requires to open an IndexWriter and to create the translog and translog checkpoint files on disk. Opening the IndexWriter causes multiple files to be accessed (to verify checksums and to load field infos) and for some shards this can take a lot of time, causing cluster state applying timeouts. Creating the translog files triggers multiple accesses to disk in order to create or delete directories and fsync files.

This PR moves the translog/lucene commit association out of the Directory instantiation - and therefore out of the cluster state update thread - in order to make it happen in a pre-recovery phase
at the beginning of the recovery process. It introduces a new hook method named preRecovery() that in turns execute the registered IndexEventListeners. This allows the searchable snapshot module to register a specific IndexEventListener that will create a new empty translog with a given translog UUID so that it will be associated with the last lucene commit.

This translog creation will happen when restoring the shard from a snapshot; right before recovering a shard from a peer; and when recovering the shard from the existing store after a node restart or a forced allocation. Associating a new translog with the Lucene index (and not the other way around like it is usually done during recoveries) prevent more Lucene commits to happen (as they required an IndexWriter, which triggers many file accesses).

Relates  #50999

@tlrx tlrx added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Mar 12, 2020
@tlrx tlrx requested review from ywelsch and DaveCTurner March 12, 2020 09:25
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@tlrx
Copy link
Member Author

tlrx commented Mar 12, 2020

@ywelsch and I talked about this via another channel and we'd like to investigate the possibility to add a pre-recovery hook that would allow to create the empty translog file we want without adding too many code paths based on a new index setting like this PR does in this first version. Such a hook could also help to warm up the cache before entering the recovery process, which is also very interesting. I'll investigate this option.

@tlrx
Copy link
Member Author

tlrx commented Mar 12, 2020

@DaveCTurner @ywelsch I revisited this PR by using and extending the existing IndexEventListener infrastructure so that a listener can react to a pre-recovery event. I've documented the new beforeIndexShardRecovery() method, let me know if you find this approach acceptable or not.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one nit, o.w. looking good.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although I'd like Yannick's opinion too since I am not 100% comfortable with the IndexShard lifecycle. I left a couple of comments.

@@ -1313,6 +1313,13 @@ public void close(String reason, boolean flushEngine) throws IOException {
}
}

public void preRecovery() {
if (state != IndexShardState.RECOVERING) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More for my curiosity than something that needs changing: why can we not assert state == IndexShardState.RECOVERING here? The lifecycle of an IndexShard is still a bit opaque to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to assert here, there's no reason to not do it.

return createEmptyTranslog(location, shardId, initialGlobalCheckpoint, primaryTerm, null, channelFactory);
}

public static String createEmptyTranslog(final Path location,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding a big warning in a comment here saying how dangerous it is to specify the translog UUID and that it should only be used for shards that will see no indexing.

I'm also idly wondering about how hard it would be to make this Translog read-only when it's not been created with a fresh UUID. Probably too hard to be worth doing, but thought I'd mention it anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding a big warning in a comment here saying how dangerous it is to specify the translog UUID and that it should only be used for shards that will see no indexing.

Sure, I added some doc in 557d924

I'm also idly wondering about how hard it would be to make this Translog read-only when it's not been created with a fresh UUID. Probably too hard to be worth doing, but thought I'd mention it anyway.

I find the idea interesting but I'm not sure if it worths it; I'd prefer to not create translogs at all if they were not to be used.

@tlrx
Copy link
Member Author

tlrx commented Mar 13, 2020

@elasticmachine update branch

@tlrx tlrx merged commit 4bde03a into elastic:feature/searchable-snapshots Mar 13, 2020
@tlrx tlrx deleted the associate-translog-with-lucene-index branch March 13, 2020 11:46
@tlrx
Copy link
Member Author

tlrx commented Mar 13, 2020

Thanks Yannick and David

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants