Don't create snapshot repositories on the cluster state update thread #9488
Comments
It sounds like a good idea, but it might be really tricky to implement. Imagine somebody creates and deletes a repository with the same name several times in quick succession, and repository creation takes a long time. We would need some sort of repository creation/destruction pipeline to handle this properly. Even with the pipeline, if somebody performs a snapshot right after repository creation, we would have to hold the snapshot start until a proper repository is created. I think it might be more prudent to ensure that repository creation doesn't block. After we added repository verification, there is really no good reason for a repository to open a network connection during its initialization: it's possible that such a connection will never even be used if a node doesn't have any primary shards. @dadoonet, @tlrx thoughts? |
I agree with Igor: this sounds like a good idea, but I'm not sure we can implement it correctly right now. In the near future, maybe we could use a task management API like the one described in #6914 to pipe the repository creation/destruction and snapshot/restore tasks? Also, I agree that repository verification can take some time. It is usually a quick process, but I have experienced latency/network problems while using it. I'm wondering if we can keep the repository creation process synchronous (i.e. on the cluster state update thread) and set a repository property like |
+1 to that. But we may not always control it, especially if people extend it.
For what it's worth, I saw a 4-minute block, causing all kinds of secondary issues in the cluster.
That would be good, but I think it's not too far away from having a repository wrapper created by the framework upon cluster state updates and started async. If the repo is deleted while initializing, we can mark the wrapper with a deleted flag, which will cause it to immediately apply the delete code once initialization is completed. |
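A minimal sketch of the wrapper idea above, assuming a generic repository type. `AsyncRepositoryWrapper`, `slowFactory`, `deleteAction`, and `whenReady` are hypothetical names for illustration and are not part of the Elasticsearch code base. The potentially slow factory runs on a separate executor, and a delete request that arrives during initialization is applied as soon as initialization completes:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical wrapper, for illustration only. */
final class AsyncRepositoryWrapper<R> {
    private final AtomicBoolean deleted = new AtomicBoolean(false);
    private final CompletableFuture<R> initialized;
    private final Consumer<R> deleteAction;

    AsyncRepositoryWrapper(Supplier<R> slowFactory, Consumer<R> deleteAction, Executor executor) {
        this.deleteAction = deleteAction;
        // The slow work (e.g. network calls) runs on the given executor,
        // not on the thread that constructs the wrapper.
        this.initialized = CompletableFuture.supplyAsync(slowFactory, executor);
    }

    /** Marks the wrapper deleted; the delete code runs once initialization has completed. */
    void markDeleted() {
        if (deleted.compareAndSet(false, true)) {
            // Runs right away if initialization already finished, otherwise when it
            // finishes. It is skipped if initialization failed.
            initialized.thenAccept(deleteAction);
        }
    }

    boolean isDeleted() {
        return deleted.get();
    }

    /** Completes with the repository, or exceptionally if it was deleted or failed to initialize. */
    CompletableFuture<R> whenReady() {
        return initialized.thenApply(repo -> {
            if (deleted.get()) {
                throw new IllegalStateException("repository was deleted");
            }
            return repo;
        });
    }
}
```

A snapshot that starts right after creation would wait on `whenReady()` instead of blocking the cluster state update thread. This sketch does not address the create/delete/create sequence with the same name mentioned earlier in the thread.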
@imotov does this still need doing? |
@clintongormley I don't think anything changed in the last year. @tlrx, @abeyad what do you think? |
I checked the code again and I don't think anything changed there... so it would be nice to implement any of the suggested solutions. |
The S3 and GCE repositories definitely make network calls during initialization (which occurs on the cluster state update thread). I don't see the same for the Azure repository, though I could've missed it. In any case, I like @tlrx's proposed solution of handling repository verification async, but unless we move repository initialization itself out of the cluster state update task, we will have to require that all plugin developers know not to execute any blocking operations, such as network calls, in repository construction. |
we could have a custom method like |
We talked about this today with @ywelsch and this is still something we want to do. We think that the repository could be registered as it is today, but the check for the existence of the filesystem/bucket/whatever could be delayed until the first access to the repository. |
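A minimal sketch of deferring the expensive work to first access, as described above. `LazyRepository` and `factory` are hypothetical names for illustration, not Elasticsearch classes. Registration only stores the factory, and the backing resource is created the first time `get()` is called:

```java
import java.util.function.Supplier;

/** Hypothetical lazy holder, for illustration only. */
final class LazyRepository<R> {
    private final Supplier<R> factory;
    private volatile R delegate; // null until first access

    LazyRepository(Supplier<R> factory) {
        // Cheap: nothing is opened or contacted here.
        this.factory = factory;
    }

    /** Creates the underlying repository on first call; later calls reuse it. */
    R get() {
        R r = delegate;
        if (r == null) {
            synchronized (this) {
                r = delegate;
                if (r == null) {
                    r = factory.get(); // the potentially slow part
                    delegate = r;
                }
            }
        }
        return r;
    }

    boolean isInitialized() {
        return delegate != null;
    }
}
```

With this shape, constructing the holder on the cluster state update thread stays cheap. The cost moves to whichever thread first calls `get()`, so that first caller should not itself be the cluster state update thread.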
@original-brownbear A follow-up to this is to make sure the master does not manipulate the list of repositories in the cluster state update task (it calls |
* Move `createRepository` call out of cluster state tasks * Now only `RepositoriesService#applyClusterState` manipulates `this.repositories` * Closes elastic#9488
Done in #36157 I think |
* Move `createRepository` call out of cluster state tasks * Now only `RepositoriesService#applyClusterState` manipulates `this.repositories` * Closes #9488
Currently, when the master indicates that a new repository needs to be created, the Snapshot and Restore code creates it while on the cluster state update thread. This is problematic because repository creation typically involves network calls, which may slow down cluster state processing. We should do it async.