Allow reindex API to move documents instead of just copying #17998
Comments
@honzakral I'm struggling to see how this would be useful, especially when dealing with the complexity of both documents existing (or neither document existing) for a period. Could you elaborate on use cases?
The use case I have in mind is a migration from a single index with aliases to a separate index. Let's assume you store user-generated data in one index and use aliases to give the app the illusion of an index-per-user architecture. Now one user proves to be too big to live in a common index and needs to be put into its own index. With aliases this is very easy to do, but you need to move the data: copy it to the newly created index and delete it from the old one. You can do it with

Similar to #17997, it can be viewed as a generalization of the reindex API to allow any bulk operation: not just index, but also delete/update, possibly against different indices. In client code, using
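For illustration, the copy-then-delete sequence described in this comment could be expressed as two REST requests, one to `_reindex` and one to `_delete_by_query`, sharing the same query. This is only a minimal sketch, not the commenter's own code: the `user_id` field, the index names, and the `move_requests` helper are hypothetical, and the function only builds the request bodies rather than sending them.

```python
import json


def move_requests(source_index, dest_index, user_id):
    """Build the two requests for a best-effort 'move': copy, then delete.

    Assumption: documents carry a hypothetical `user_id` keyword field.
    """
    query = {"term": {"user_id": user_id}}
    # Step 1: copy the matching documents into the new index.
    reindex = ("POST", "/_reindex", {
        "source": {"index": source_index, "query": query},
        "dest": {"index": dest_index},
    })
    # Step 2: delete the same documents from the shared index.
    delete = ("POST", f"/{source_index}/_delete_by_query", {"query": query})
    return [reindex, delete]


if __name__ == "__main__":
    for method, path, body in move_requests("users", "user-42", "42"):
        print(method, path, json.dumps(body))
```

The two requests run independently, so nothing here makes the move atomic; a document indexed between the two calls would be deleted without having been copied.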
Aliases make the transition atomic. Doing this doc-by-doc (besides being a much heavier operation) would result in moments when either the same doc is visible in both indices or is visible in neither index (because of the differences in refresh times). This makes life more complex for the user, rather than less.
Well, with aliases you have several options, each with its own set of problems, and none of them is really atomic. You can switch the alias as soon as the empty index is created and then wait for the reindex to populate it; in this case your user sees missing data for a long time. Another option is to point the alias to both indices, but then you need a separate write alias and you still need to solve moving the data: if you use reindex, you will see duplicates until all the data is copied and you remove the original index from the alias. Updates during the transition can also be problematic. Or you can first copy the data and then switch the alias. Here you have to keep track of all the documents that changed after the reindex operation began, apply those updates, and only then switch the alias. This can also be problematic. None of the options is atomic, and there is always room for discrepancy unless you make the application aware of these mechanics, which can mean a lot of code and complexity for a few transitions. The proposed solution is by no means perfect, maybe not even better than the ones described, but it is the simplest to implement and it is idempotent. I also agree that it would be difficult to manage expectations. Another use case would be to help with entity-centric indexing (#17997), where you can just run a "move" with update periodically.
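The last alias-based option (copy first, then switch the alias) ends with a single call to the `_aliases` API, where the remove and add actions are submitted together. A minimal sketch of that request body follows; the alias and index names and the `alias_swap_request` helper are hypothetical, and the sketch deliberately leaves out the hard part mentioned in this comment, namely tracking documents that changed while the copy was running.

```python
import json


def alias_swap_request(alias, old_index, new_index):
    """Build one `_aliases` request that repoints `alias` in a single call.

    Assumption: `alias` currently points at `old_index` (for example as a
    filtered alias on a shared index) and `new_index` is already populated.
    """
    body = {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
    return ("POST", "/_aliases", body)


if __name__ == "__main__":
    method, path, body = alias_swap_request("user-42", "users", "user-42-v1")
    print(method, path, json.dumps(body, indent=2))
```

Only the alias switch itself happens in one step; the data still has to be copied beforehand and cleaned up in the old index afterwards.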
We should not add features that suggest a certain behavior, like atomicity of a move. As Clinton said, building this means that you could see 1, 2 or even 0 results for a given document when querying the two indices (source and target). The bigger issue I see is that you could potentially be stuck with 2 indices, both half broken. I think reindexing should always be able to trash the target index without losing data. We discussed this with a wider audience and decided to close it for now.
Any news here?
I would also be interested in the outcome of the discussion, or at least a best practice for moving big indices into a rollover setup.
A use case we see a lot with users is that they want to move some data out of one index to another. Would it be possible to combine `reindex` with `delete-by-query`, essentially? After a document is indexed in the target index, a delete operation would be issued on the source index.

Of course this couldn't be done atomically, but even on a best-effort basis this would be super useful for a lot of people: essentially executing `reindex` and `delete-by-query` at the same time (on the same point-in-time snapshot of the index) with no additional guarantees beyond those the two operations have individually.
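To make the best-effort semantics concrete, here is a small in-memory simulation of the proposed behavior: take a snapshot of the source, index each document into the target, then delete it from the source. It is a sketch under stated assumptions, not Elasticsearch code: plain dicts stand in for indices, `move_best_effort` is a hypothetical name, and refresh intervals are not modeled (those are what could make a document briefly visible in neither index).

```python
def move_best_effort(source, target):
    """Move documents one by one and record how many copies exist per step.

    `source` and `target` are dicts mapping document id to document body.
    Returns a list of (doc_id, step, copies) tuples.
    """
    trace = []
    # Work on a point-in-time snapshot so later writes are not picked up.
    snapshot = list(source.items())
    for doc_id, doc in snapshot:
        target[doc_id] = doc  # index into the target first
        copies = (doc_id in source) + (doc_id in target)
        trace.append((doc_id, "indexed", copies))
        source.pop(doc_id, None)  # then delete from the source
        copies = (doc_id in source) + (doc_id in target)
        trace.append((doc_id, "deleted", copies))
    return trace


if __name__ == "__main__":
    src = {"a": {"user": 1}, "b": {"user": 1}}
    dst = {}
    for entry in move_best_effort(src, dst):
        print(entry)
```

Between the two steps each document exists in both places, which is the duplicate window raised in the comments; stopping the loop partway leaves the data split across the two indices.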