
Allow reindex API to move documents instead of just copying #17998


Closed
honzakral opened this issue Apr 26, 2016 · 7 comments
Labels
discuss :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. >enhancement

Comments

@honzakral
Contributor

A use case we see a lot is users wanting to move some data from one index to another. Would it be possible to essentially combine reindex with delete-by-query? After a document is indexed in the target index, a delete operation would be issued on the source index.

Of course this couldn't be done atomically, but even on a best-effort basis this would be super useful for a lot of people: essentially executing reindex and delete-by-query at the same time (on the same point-in-time snapshot of the index), with no guarantees beyond those the two operations have individually.

@clintongormley
Contributor

@honzakral I'm struggling to see how this would be useful, especially when dealing with the complexity of both documents existing (or neither document existing) for a period. Could you elaborate on use cases?

@honzakral
Contributor Author

The use case I have in mind is a migration from a single index with aliases to a separate index. Let's assume you store user-generated data in one index and use aliases to give the app an illusion of an index-per-user architecture. Now one user proves to be too big to live in a common index and needs to be put into its own index. With aliases it is very easy to do, but you need to move the data: copy it to the newly created index and delete it from the old one. You can do it with reindex and delete-by-query, but making it one command would be nicer and would also minimize the discrepancies during the process (where a document exists in both the old and new indices).
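For illustration, the two-step approach mentioned above (reindex, then delete-by-query) could be sketched as a pair of request bodies. This is only a sketch: the `user_id` field, the index names, and the helper function names are hypothetical, and nothing here sends a request.

```python
import json


def user_query(user_id):
    """Term query selecting one user's documents (the field name is assumed)."""
    return {"term": {"user_id": user_id}}


def reindex_body(source_index, dest_index, user_id):
    """Body for POST _reindex: copy the user's documents into the new index."""
    return {
        "source": {"index": source_index, "query": user_query(user_id)},
        "dest": {"index": dest_index},
    }


def delete_by_query_body(user_id):
    """Body for POST <source_index>/_delete_by_query: remove the copied documents."""
    return {"query": user_query(user_id)}


if __name__ == "__main__":
    # Hypothetical index names and user id, printed for inspection only.
    print(json.dumps(reindex_body("users-shared", "user-42", 42), indent=2))
    print(json.dumps(delete_by_query_body(42), indent=2))
```

The two requests run independently, so documents written between them can be seen by one query and not the other, which is the discrepancy window discussed in this thread.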

Similar to #17997, it can be viewed as a generalization of the reindex API to allow for any bulk operation: not just index, but also delete/update, and possibly against different indices.

In client code using a scan/bulk combination, this can be achieved simply by adjusting the code that generates the bulk request (both the action and data lines).
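A minimal sketch of that client-side idea, assuming each scan hit is a dict with `_id` and `_source` keys; the function names and index names are hypothetical. For every hit it emits an `index` action (with its data line) against the target index and a `delete` action against the source index, in bulk NDJSON form. It builds the payload only and does not talk to a cluster.

```python
import json


def move_actions(hits, source_index, dest_index):
    """Yield bulk NDJSON lines that index each hit into dest_index
    and delete it from source_index."""
    for hit in hits:
        # Action line plus data line for the copy into the target index.
        yield json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}})
        yield json.dumps(hit["_source"])
        # Action line for the delete from the source index (no data line).
        yield json.dumps({"delete": {"_index": source_index, "_id": hit["_id"]}})


def bulk_payload(hits, source_index, dest_index):
    """Join the lines into a bulk body; every line ends with a newline."""
    return "".join(line + "\n" for line in move_actions(hits, source_index, dest_index))


if __name__ == "__main__":
    sample_hits = [{"_id": "1", "_source": {"user_id": 42, "text": "hello"}}]
    print(bulk_payload(sample_hits, "users-shared", "user-42"), end="")
```

Bulk items succeed or fail independently, so a real client would need to check the per-item results before trusting that a deleted document was actually indexed in the target.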

@clintongormley
Contributor

Aliases make the transition atomic. Doing this doc-by-doc (besides being a much heavier operation) would result in moments when either the same doc is visible in both indices or is visible in neither index (because of the differences in refresh times). This makes life more complex for the user, rather than less.

@honzakral
Contributor Author

Well, with aliases you have several options, each with its own set of problems, and none of them is really atomic. You can switch the alias as soon as the empty index is created and then wait for the reindex to populate it; in this case your user sees missing data for a long time.

Another option is to point the alias to both indices, but then you need a separate write alias and you still have to move the data. If you use reindex, you will see duplicates until all the data is copied and you remove the original index from the alias. Updates during the transition can also be problematic.

Or you can first copy the data and then switch the alias. Here you just have to keep track of all the documents that have changed after the reindex operation began, apply those updates and only then switch the alias. This can also be problematic.
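As a sketch of the final "switch the alias" step in this option, the body for a single `POST _aliases` request could be built as follows. The alias and index names and the helper name are hypothetical; the code only constructs the body and sends nothing.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Body for POST _aliases: remove the alias from the old index and
    add it to the new index in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names, printed for inspection only.
    print(json.dumps(alias_swap_body("user-42", "users-shared", "user-42-v1"), indent=2))
```

The swap itself is one request, but it does nothing about documents that changed in the old index while the copy was running; that bookkeeping is the hard part described above.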

None of the options is atomic, and there is always room for discrepancy unless you make the application aware of these mechanics, which can mean a lot of code and complexity for a few transitions.

This solution is by no means perfect, maybe not even better than the ones described, but it is the simplest to implement and idempotent. I also agree that it would be difficult to manage expectations.

Another use case would be to help with entity-centric indexing (#17997) where you can just run a "move" with update periodically.

@s1monw
Contributor

s1monw commented May 6, 2016

We should not add features that suggest a certain behavior, like atomicity of a move. As Clinton said, building this means that you could see 1, 2 or even 0 results for a given document when querying the two indices (source and target). The bigger issue I see is that you could potentially be stuck with two indices, both half broken. I think with reindexing you should always be able to trash the target index and not lose data. We discussed this with a wider audience and decided to close it for now.

@s1monw s1monw closed this as completed May 6, 2016
@lcawl lcawl added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed :Reindex API labels Feb 13, 2018
@Brentbin

Any news here?

@CONCETO-flow

Would also be interested in the outcome of the discussion, or at least a best practice for moving big indices into a rollover setup.


6 participants