
Allow reindex to do update/upsert operations #17997


Closed
honzakral opened this issue Apr 26, 2016 · 9 comments
Labels: :Distributed Indexing/Reindex (issues relating to reindex that are not caused by issues further down), >enhancement, high hanging fruit

Comments

@honzakral
Contributor

The idea is to extend the reindex API so that it can perform update operations on the target index instead of only index operations.

My use case for this is entity-centric indexing: imagine you have an index containing events and wish to group them by session. With the reindex API it should be possible to read the source events, apply a script (or just extract a field) to get the ID of a target document, and pass the event as a parameter to a specified update script.

@clintongormley
Contributor

@honzakral I had similar ideas for the reindex API way back when, but I'm not convinced that this will be enough for practical entity-centric indexing (though I would be happy to be proven wrong).

Do you have some practical real-world examples of how you would use this?

@honzakral
Contributor Author

My example is web server logs -> web sessions. The update script would go like:

session.start = min(session.start, event.timestamp)
session.end = max(session.end, event.timestamp)
session.length = session.end - session.start
if event.user:
    session.username = event.user.name
    session.subscribed = event.user.subscribed
...

Where session is the aggregated doc and event is the single event. This is a fairly simple example that could easily be done and would be very useful in real life.
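To make the pseudocode above concrete, here is a minimal in-memory sketch in Python. It is not Elasticsearch code and not a proposed API. It assumes timestamps are plain numbers (for example epoch milliseconds) and that each event carries a `session_id` field; both the field name and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional


def update_session(session: Optional[Dict[str, Any]], event: Dict[str, Any]) -> Dict[str, Any]:
    """Fold one event into its session document, creating the session if needed."""
    ts = event["timestamp"]
    if session is None:
        # Upsert case: the first event seen for a session seeds start and end.
        session = {"start": ts, "end": ts}
    else:
        session["start"] = min(session["start"], ts)
        session["end"] = max(session["end"], ts)
    session["length"] = session["end"] - session["start"]

    # Copy user details only when the event actually carries them.
    user = event.get("user")
    if user:
        session["username"] = user["name"]
        session["subscribed"] = user["subscribed"]
    return session


def build_sessions(events: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Route each event to its session by the (hypothetical) session_id field."""
    sessions: Dict[str, Dict[str, Any]] = {}
    for event in events:
        sid = event["session_id"]
        sessions[sid] = update_session(sessions.get(sid), event)
    return sessions
```

The `session is None` branch is the part that a plain update cannot express: the first event for a session has to create the document, which is why upsert semantics come up in the rest of the thread.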

@clintongormley
Contributor

Makes sense. It wouldn't be super-simple fitting this into the reindex functionality because reindex gets a document, but the example you provide would actually need to receive the document as a parameter to a script, and it would need to handle upserts as scripted_upsert.

I'm wondering if reindex is the right place for this, or if we can think of a better dedicated API which makes this job easier.
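For readers unfamiliar with `scripted_upsert`, the sketch below builds the kind of update API request body that the comment above alludes to: the event travels as a script parameter, and `scripted_upsert` asks Elasticsearch to run the script even when the target document does not exist yet. The Painless source is an untested illustration, and the `start`, `end`, and `length` field names are taken from the earlier pseudocode rather than from any real mapping.

```python
from typing import Any, Dict

# Illustrative Painless script (untested sketch): keeps the earliest and latest
# timestamps seen so far and derives the session length from them.
PAINLESS_SOURCE = """
def ts = params.event.timestamp;
if (ctx._source.start == null || ts < ctx._source.start) { ctx._source.start = ts; }
if (ctx._source.end == null || ts > ctx._source.end) { ctx._source.end = ts; }
ctx._source.length = ctx._source.end - ctx._source.start;
"""


def scripted_upsert_body(event: Dict[str, Any]) -> Dict[str, Any]:
    """Build an update request body that passes the event to the script."""
    return {
        # Run the script for both the create and the update case.
        "scripted_upsert": True,
        "script": {
            "lang": "painless",
            "source": PAINLESS_SOURCE.strip(),
            "params": {"event": event},
        },
        # Empty starting document; the script fills it in on first sight.
        "upsert": {},
    }
```

Sending one such request per event is exactly the one-update-per-document pattern that reindex would have to emulate, which hints at why it does not fit the existing implementation neatly.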

@dmarkhas

dmarkhas commented May 4, 2017

Are there any plans to incorporate this in a future release?
We also have a use case for doing updates with reindex, since we're trying to accomplish something similar to the SQL Server MERGE process.

We have an index of raw ingested data, and an index of "processed" data where the processing is really just merging the inbound logs by some key (a field or a set of fields).
Right now we have to read the data out of the raw data index (with Logstash or Spark or whatever), and write it back into the processed index with the update API, which is very wasteful (there are some reasons why we can't ingest data directly into the processed index).
It would be nice if reindex could support updates rather than only overwriting the existing documents.
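The read-then-update workaround described above can be sketched as follows. The function only builds the newline-delimited body for the Bulk API; reading from the raw index and sending the request are left out. The index name, key field, and function name are hypothetical, and `doc_as_upsert` is used here as one possible way to get merge-like behavior, not as what the commenter actually runs.

```python
import json
from typing import Any, Dict, Iterable, List


def bulk_update_lines(raw_docs: Iterable[Dict[str, Any]], target_index: str, key_field: str) -> str:
    """Turn raw documents into a Bulk API body of partial-document upserts."""
    lines: List[str] = []
    for doc in raw_docs:
        # Action line: update the processed document whose ID is the merge key.
        action = {"update": {"_index": target_index, "_id": str(doc[key_field])}}
        # Payload line: merge these fields into the existing document,
        # or create it from them if it does not exist yet.
        payload = {"doc": doc, "doc_as_upsert": True}
        lines.append(json.dumps(action))
        lines.append(json.dumps(payload))
    # A bulk body must end with a newline; an empty input yields an empty body.
    return "\n".join(lines) + "\n" if lines else ""
```

Every document still makes a round trip through the client, which is the overhead the comment calls wasteful.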

@nik9000
Member

nik9000 commented May 4, 2017

I'm not planning on working on this, no.

@markharwood
Contributor

There's an added wrinkle to entity-centric updates that makes them unlike a reindex.
Ideally, multiple events for the same entity are folded into a single update. Think of the flurry of weblog events generated when your browser opens this webpage and requests HTML, CSS, JS, images, etc. For efficiency's sake, we don't want a one-to-one correlation between logged events and update operations on entities.
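The batching idea can be illustrated with a small Python sketch: events are grouped by entity first, and each group is reduced to a single update. This is only an in-memory illustration under assumed field names (`session_id`, `timestamp`); `event_count` is a hypothetical summary field added for the example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def batch_by_entity(events: Iterable[Dict[str, Any]], key_field: str) -> Dict[Any, List[Dict[str, Any]]]:
    """Group events so that each entity is touched once per batch."""
    grouped: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for event in events:
        grouped[event[key_field]].append(event)
    return dict(grouped)


def summarize(batch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Reduce all events of one entity to a single update payload."""
    timestamps = [event["timestamp"] for event in batch]
    return {"start": min(timestamps), "end": max(timestamps), "event_count": len(batch)}


def batched_updates(events: Iterable[Dict[str, Any]], key_field: str = "session_id") -> Dict[Any, Dict[str, Any]]:
    """Produce one update per entity instead of one update per event."""
    return {entity: summarize(batch) for entity, batch in batch_by_entity(events, key_field).items()}
```

With four events spread over two sessions, this yields two updates rather than four, which is the many-to-one shape that a document-by-document reindex does not have.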

@shaharmor

We need this as well to create pre-aggregated indices for bigger time intervals.

@lcawl added the :Distributed Indexing/CRUD label (a catch-all label for issues around indexing, updating and getting a doc by id; not search) and removed the :Reindex API label on Feb 13, 2018
@henningandersen added the :Distributed Indexing/Reindex label (issues relating to reindex that are not caused by issues further down) on Apr 12, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@henningandersen removed the :Distributed Indexing/CRUD label (a catch-all label for issues around indexing, updating and getting a doc by id; not search) on Apr 12, 2019
@henningandersen
Contributor

The primary purpose of reindex is to copy data from one index to another, whether for upgrades, mapping/schema changes, or migration between clusters. These are all one-to-one cases, and reindex assumes as much. Adding aggregation-style functionality into the mix should be carefully considered so as not to unnecessarily complicate both the API and the implementation. We currently think this is better handled separately, as described in #40002. That issue addresses entity-centric indexing; we suggest continuing the conversation there and will therefore close this issue.


9 participants