Don't create snapshot repositories on the cluster state update thread #9488

bleskes · 2015-01-29T20:54:18Z

Currently, when the master indicates that a new repository needs to be created, the Snapshot and Restore code creates it on the cluster state update thread. This is tricky because creation typically involves network calls, which may slow down cluster state processing. We should do it asynchronously.

imotov · 2015-01-29T22:56:27Z

It sounds like a good idea, but it might be really tricky to implement. Imagine somebody creates and deletes a repository with the same name several times in quick succession, and repository creation takes a long time. We would need some sort of repository creation/destruction pipeline to handle this properly. Even with the pipeline, if somebody performs a snapshot right after repository creation, we will have to hold the snapshot start until a proper repository is created.

I think it might be more prudent to ensure that repository creation doesn't block. After we added repository verification, there is really no good reason for a repository to open a network connection during its initialization; such a connection might never even be used if a node doesn't have any primary shards. @dadoonet, @tlrx thoughts?

tlrx · 2015-01-30T08:39:04Z

I agree with Igor; this sounds like a good idea, but I'm not sure we can implement it correctly right now. In the near future, maybe we could use a task management API like the one described in #6914 to pipe the repository creation/destruction and snapshot/restore tasks?

Also, I agree that the repository verification can take some time. This is usually a quick process, but I experienced some latency/network problems while using it.

I'm wondering if we can keep the repository creation process synchronous (i.e., on the cluster state update thread) and set a repository property like `verification_status: unverified`. Then make the verification process asynchronous, going from `unverified` to `executing` and finally `verified`. We may block snapshot requests on non-verified repositories, as well as blocking repository creation/deletion.
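The status transitions proposed above could be sketched roughly as follows. This is a minimal illustration, not Elasticsearch code: `VerificationStatus`, `TrackedRepository`, `verifyAsync`, and `ensureSnapshotAllowed` are hypothetical names, and the `FAILED` state is an added assumption for the case where verification does not succeed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

// Hypothetical status values; FAILED is an assumption not mentioned in the thread.
enum VerificationStatus { UNVERIFIED, EXECUTING, VERIFIED, FAILED }

// Hypothetical holder for a repository that is registered synchronously
// and verified asynchronously.
final class TrackedRepository {
    private final String name;
    private final AtomicReference<VerificationStatus> status =
        new AtomicReference<>(VerificationStatus.UNVERIFIED);

    // Cheap constructor: no network calls, so it is safe on the update thread.
    TrackedRepository(String name) {
        this.name = name;
    }

    VerificationStatus status() {
        return status.get();
    }

    // Runs the (possibly slow) verifier on the given executor.
    // Only the first caller moves UNVERIFIED -> EXECUTING; later callers
    // just observe the current status.
    CompletableFuture<VerificationStatus> verifyAsync(BooleanSupplier verifier, Executor executor) {
        if (!status.compareAndSet(VerificationStatus.UNVERIFIED, VerificationStatus.EXECUTING)) {
            return CompletableFuture.completedFuture(status.get());
        }
        return CompletableFuture.supplyAsync(() -> {
            boolean ok;
            try {
                ok = verifier.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false;
            }
            VerificationStatus result = ok ? VerificationStatus.VERIFIED : VerificationStatus.FAILED;
            status.set(result);
            return result;
        }, executor);
    }

    // Rejects snapshot requests until verification has succeeded.
    void ensureSnapshotAllowed() {
        VerificationStatus current = status.get();
        if (current != VerificationStatus.VERIFIED) {
            throw new IllegalStateException(
                "repository [" + name + "] is not verified (status: " + current + ")");
        }
    }
}
```

A caller would register the repository, kick off `verifyAsync` on a non-critical thread pool, and call `ensureSnapshotAllowed` before starting a snapshot. How to retry after a failed verification is left open here.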

bleskes · 2015-01-30T13:44:40Z

> I think it might be more prudent to ensure that repository creation doesn't block.

+1 to that. But we may not always control it, especially if people extend it.

> This is usually a quick process, but I experienced some latency/network problems while using it.

For what it's worth, I saw a 4-minute block, which caused all kinds of secondary issues in the cluster.

> I'm wondering if we can keep the repository creation process synchronous (i.e., on the cluster state update thread) and set a repository property like `verification_status: unverified`. Then make the verification process asynchronous, going from `unverified` to `executing` and finally `verified`. We may block snapshot requests on non-verified repositories, as well as blocking repository creation/deletion.

That would be good, but I think it's not too far from having a repository wrapper created by the framework upon cluster state updates and started async. If the repo is deleted while initializing, we can mark the wrapper with a deleted flag, which will cause it to apply the delete code immediately once initialization is completed.
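One way to picture the wrapper idea is the sketch below. It is only an illustration under assumptions: `SlowRepository` and `RepositoryWrapper` are hypothetical types, not part of the Elasticsearch codebase, and error handling for a failed initialization is omitted.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical repository whose setup and teardown may block on network calls.
interface SlowRepository {
    void initialize();
    void delete();
}

// Hypothetical wrapper: created synchronously, initialized asynchronously.
final class RepositoryWrapper {
    private enum State { INITIALIZING, READY, DELETED }

    private final SlowRepository delegate;
    private State state = State.INITIALIZING;  // guarded by "this"
    private boolean deleteRequested = false;   // guarded by "this"

    RepositoryWrapper(SlowRepository delegate) {
        this.delegate = delegate;
    }

    // Called when the cluster state adds the repository; returns immediately
    // and runs the slow initialization on the given executor.
    CompletableFuture<Void> startAsync(Executor executor) {
        return CompletableFuture.runAsync(() -> {
            delegate.initialize();
            boolean deleteNow;
            synchronized (this) {
                deleteNow = deleteRequested;
                state = deleteNow ? State.DELETED : State.READY;
            }
            if (deleteNow) {
                // The repository was removed while it was still initializing.
                delegate.delete();
            }
        }, executor);
    }

    // Called when the cluster state removes the repository.
    void markDeleted() {
        synchronized (this) {
            if (state == State.DELETED) {
                return;
            }
            if (state == State.INITIALIZING) {
                // Defer: the initialization task applies the delete when it finishes.
                deleteRequested = true;
                return;
            }
            state = State.DELETED;
        }
        // Already initialized, so delete right away. A real implementation
        // would probably hand this off to an executor as well.
        delegate.delete();
    }

    synchronized boolean isReady() {
        return state == State.READY;
    }
}
```

The deleted flag means a delete that arrives mid-initialization never races with the setup code; it simply runs once setup is done. Rapid create/delete/create cycles with the same name would still need one wrapper per creation.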

clintongormley · 2016-11-26T18:58:47Z

@imotov does this still need doing?

imotov · 2016-11-28T15:36:46Z

@clintongormley I don't think anything changed in the last year. @tlrx, @abeyad what do you think?

tlrx · 2016-11-28T16:39:42Z

I checked the code again and I don't think anything changed there, so it would be nice to implement one of the suggested solutions.

abeyad · 2016-11-28T17:03:30Z

The S3 and GCE repositories definitely make network calls during initialization (which occurs on the cluster state update thread). I don't see the same for the Azure repository, though I could've missed it. In any case, I like @tlrx's proposed solution of handling repository verification async, but unless we move repository initialization itself out of the cluster state update task, we will have to require that all plugin developers know not to execute any blocking operations, such as network calls, in repository construction.

abeyad · 2016-11-28T17:30:29Z

We could have a custom method like `onInit` that each repository implementation must override. It would contain any blocking logic and be called outside of the cluster state update thread. I will have to think about it some more.

tlrx · 2018-03-21T15:20:30Z

We talked about this today with @ywelsch and this is still something we want to do. We think that the repository could be registered as it is today, but the check for the existence of the filesystem/bucket/whatever could be delayed until the first access to the repository.
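Deferring the expensive part to the first access can be sketched with a lazily initialized holder. The class below is a hypothetical illustration (`LazyRepository` and its `clientFactory` are assumed names, not Elasticsearch APIs); the factory stands in for whatever connects to the filesystem or bucket.

```java
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical repository that is cheap to register and connects on first use.
final class LazyRepository<C> {
    private final String name;
    private final Supplier<C> clientFactory;  // may perform network calls
    private volatile C client;                // null until first access

    // No I/O here, so construction is safe on the cluster state thread.
    LazyRepository(String name, Supplier<C> clientFactory) {
        this.name = Objects.requireNonNull(name);
        this.clientFactory = Objects.requireNonNull(clientFactory);
    }

    String name() {
        return name;
    }

    boolean isInitialized() {
        return client != null;
    }

    // Double-checked locking: the factory runs at most once, on the thread
    // that first needs the repository (for example, a snapshot thread).
    C client() {
        C current = client;
        if (current == null) {
            synchronized (this) {
                current = client;
                if (current == null) {
                    current = Objects.requireNonNull(
                        clientFactory.get(), "clientFactory returned null");
                    client = current;
                }
            }
        }
        return current;
    }
}
```

This only helps if the first call to `client()` happens off the cluster state update thread, so callers still need to respect that.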


original-brownbear · 2018-12-03T11:56:18Z

@ywelsch @tlrx I looked into this a bit and it looks like this has been done already in #31606?

ywelsch · 2018-12-03T12:06:34Z

@original-brownbear A follow-up to this is to make sure the master does not manipulate the list of repositories in the cluster state update task (it calls registerRepository). If that cluster state update fails to be published, the master will end up with a dangling repo. The cleaner way is to have the master just do createRepository, and this validation can even be done before submitting the cluster state update task. The master then updates the list of repositories in the same way as the other nodes, by applying the cluster state update task. Finally, this prevents concurrent writes to the list of repositories (the repositories field in RepositoriesService), as only the single cluster applier service thread will be updating the list of repos.
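The split described above (validate before submitting the update, mutate the list only when applying cluster state) might look something like this sketch. It is a simplified stand-in, not the actual `RepositoriesService`: `RepositoryRegistry`, its `factory` function, and the `validate` method are hypothetical, and repository construction is assumed to be cheap.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical, simplified registry; R is the repository type.
final class RepositoryRegistry<R> {
    // Stand-in for createRepository: builds a repository or throws if invalid.
    private final Function<String, R> factory;
    // Replaced wholesale, and only by applyClusterState.
    private volatile Map<String, R> repositories = Collections.emptyMap();

    RepositoryRegistry(Function<String, R> factory) {
        this.factory = factory;
    }

    // Step 1: runs before the cluster state update is submitted.
    // Builds a throwaway instance to check the settings and never touches
    // the repositories map, so a failed publication leaves nothing dangling.
    void validate(String name) {
        factory.apply(name);
    }

    // Step 2: runs on the single applier thread of every node, master included.
    // This is the only method that changes the repositories map.
    void applyClusterState(Collection<String> namesInClusterState) {
        Map<String, R> next = new HashMap<>();
        for (String name : namesInClusterState) {
            R existing = repositories.get(name);
            next.put(name, existing != null ? existing : factory.apply(name));
        }
        repositories = Collections.unmodifiableMap(next);
    }

    Map<String, R> repositories() {
        return repositories;
    }
}
```

Because the map has a single writer, readers can use the published snapshot without locking, and the master ends up with the same list as every other node.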


original-brownbear · 2018-12-03T15:00:04Z

Done in #36157 I think

* Move `createRepository` call out of cluster state tasks
* Now only `RepositoriesService#applyClusterState` manipulates `this.repositories`
* Closes #9488

bleskes added the :Distributed Coordination/Snapshot/Restore label Jan 29, 2015

imotov self-assigned this Jan 29, 2015

imotov added the >bug label Jan 29, 2015

clintongormley unassigned imotov Dec 4, 2015

clintongormley added the help wanted adoptme label Dec 4, 2015

tlrx added a commit to tlrx/elasticsearch that referenced this issue Jun 4, 2018

Improve repository registration

1dab674

Related elastic#9488

tlrx mentioned this issue Jun 4, 2018

[WIP] Improve repository registration #31070

Closed

original-brownbear self-assigned this Nov 26, 2018

original-brownbear removed their assignment Dec 3, 2018

original-brownbear self-assigned this Dec 3, 2018

original-brownbear mentioned this issue Dec 3, 2018

SNAPSHOT: Repo Creation out of ClusterStateTask #36157

Merged

original-brownbear closed this as completed in #36157 Dec 4, 2018

original-brownbear added a commit that referenced this issue Dec 4, 2018

SNAPSHOT: Repo Creation out of ClusterStateTask (#36157)

3c54b41

* Move `createRepository` call out of cluster state tasks
* Now only `RepositoriesService#applyClusterState` manipulates `this.repositories`
* Closes #9488

original-brownbear added a commit that referenced this issue Dec 4, 2018

SNAPSHOT: Repo Creation out of ClusterStateTask (#36157)

6d8954d

* Move `createRepository` call out of cluster state tasks
* Now only `RepositoriesService#applyClusterState` manipulates `this.repositories`
* Closes #9488
