Allow reindex to do update/upsert operations #17997

honzakral · 2016-04-26T21:00:19Z

The idea is to extend the reindex API so that it can perform update operations on the target index instead of only index operations.

My use case for this is entity-centric indexing: imagine you have an index containing events and wish to group them by session. With the reindex API it should be possible to read the source events, apply a script (or just extract a field) to get the ID of a target document, and pass the event as a parameter to a specified update script.


clintongormley · 2016-04-28T17:36:38Z

@honzakral I had similar ideas for the reindex API way back when, but I'm not convinced that this will be enough for practical entity-centric indexing (but would be happy to be proven wrong).

Do you have some practical real-world examples of how you would use this?

honzakral · 2016-04-29T21:04:12Z

My example is web server logs -> web sessions. The update script would go like:

session.start = min(session.start, event.timestamp)
session.end = max(session.end, event.timestamp)
session.length = session.end - session.start
if event.user:
    session.username = event.user.name
    session.subscribed = event.user.subscribed
...

Where session is the aggregated doc and event is the single event. This is a fairly simple example that could easily be done and would be very useful in real life.
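For illustration, the update logic above can be written as a small runnable Python function. This is only a sketch under assumptions that are not stated in the thread: timestamps are plain numbers (for example epoch seconds), a missing session is represented by `None`, and the function name `apply_event` is made up for this example.

```python
from typing import Optional


def apply_event(session: Optional[dict], event: dict) -> dict:
    """Fold one event into its session document and return the session.

    An existing session dict is updated in place; None means no session yet.
    """
    ts = event["timestamp"]  # assumed numeric, e.g. epoch seconds
    if session is None:
        # First event seen for this session: it defines both ends of the range.
        session = {"start": ts, "end": ts}
    else:
        session["start"] = min(session["start"], ts)
        session["end"] = max(session["end"], ts)
    session["length"] = session["end"] - session["start"]

    user = event.get("user")
    if user:
        # Copy user details when the event carries them.
        session["username"] = user["name"]
        session["subscribed"] = user["subscribed"]
    return session
```

Calling `apply_event(None, first_event)` creates the session, and every later call for the same session widens the time range as needed.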

clintongormley · 2016-05-02T10:42:33Z

Makes sense. It wouldn't be super-simple fitting this into the reindex functionality because reindex gets a document, but the example you provide would actually need to receive the document as a parameter to a script, and it would need to handle upserts as scripted_upsert.

I'm wondering if reindex is the right place for this, or if we can think of a better dedicated API which makes this job easier.
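To make the two requirements above concrete (the event arrives as a script parameter, and the script also runs when the target document does not exist yet), here is a rough sketch of an Update API request body built in Python. The `script`, `upsert` and `scripted_upsert` fields belong to the existing Update API; the `source` key is the name used by recent Elasticsearch versions, and the Painless script, the field names and the helper `build_scripted_upsert` are untested assumptions made for this example.

```python
# Painless source as a plain string. Untested sketch: the start/end/length
# field names follow the session example earlier in this thread.
SESSION_SCRIPT = """
def ts = params.event.timestamp;
if (ctx._source.start == null || ts < ctx._source.start) { ctx._source.start = ts; }
if (ctx._source.end == null || ts > ctx._source.end) { ctx._source.end = ts; }
ctx._source.length = ctx._source.end - ctx._source.start;
"""


def build_scripted_upsert(event: dict) -> dict:
    """Build an Update API body that hands one event to the session script."""
    return {
        # Run the script even when the target document does not exist yet.
        "scripted_upsert": True,
        "script": {
            "lang": "painless",
            "source": SESSION_SCRIPT.strip(),
            "params": {"event": event},
        },
        # Empty starting document; the script fills in every field.
        "upsert": {},
    }
```

The resulting dict would be sent as the body of an update request for the session's document ID. What reindex lacks is a way to produce such a request per source document.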

dmarkhas · 2017-05-04T08:52:11Z

Are there any plans to incorporate this in a future release?
We also have a use case for doing updates with reindex, since we're trying to accomplish something similar to the SQL Server MERGE process.

We have an index of raw ingested data, and an index of "processed" data where the processing is really just merging the inbound logs by some key (a field or a set of fields).
Right now we have to read the data out of the raw data index (with Logstash or Spark or whatever) and write it back into the processed index with the update API, which is very wasteful (there are some reasons why we can't ingest data directly into the processed index).
It would be nice if reindex could update existing documents instead of just overwriting them.
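As a rough sketch of the client-side workaround described above, the following Python builds a Bulk API body of `update` actions keyed by the merge fields. The `update` action and the `doc_as_upsert` flag are part of the existing Bulk and Update APIs; the helper names, the `|` separator in the ID, and the index and field names in the usage note are assumptions made for illustration.

```python
import json
from typing import Iterable


def merge_key(doc: dict, key_fields: list) -> str:
    """Derive the target document ID from the fields that identify a record."""
    return "|".join(str(doc[field]) for field in key_fields)


def bulk_update_lines(docs: Iterable[dict], index: str, key_fields: list) -> str:
    """Render docs as newline-delimited bulk update actions."""
    lines = []
    for doc in docs:
        # Action line: which document in the target index to update.
        lines.append(json.dumps({"update": {"_index": index, "_id": merge_key(doc, key_fields)}}))
        # Payload line: merge the fields into the existing document,
        # or create the document if it is missing.
        lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
    # Every line of a bulk body, including the last, ends with a newline.
    return "".join(line + "\n" for line in lines)
```

For example, `bulk_update_lines(raw_docs, "processed", ["host", "path"])` yields a body that could be posted to the `_bulk` endpoint. The data still has to leave the cluster and come back, which is the waste described above.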

nik9000 · 2017-05-04T16:26:59Z

I'm not planning on working on this, no.

markharwood · 2017-08-11T10:46:45Z

There's an added wrinkle to entity-centric updates that makes it unlike a reindex.
Ideally, multiple events for the same entity are folded into a single update. Think of the flurry of weblog events generated when your browser opens this web page and requests the HTML, CSS, JS, images, etc. For efficiency's sake, we don't want a one-to-one correlation between logged events and update operations on entities.
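A minimal Python sketch of that batching idea: collapse a batch of events into one partial summary per entity, so that each entity needs a single update per batch. The `session_id` and `timestamp` field names, the `count` field and the function name are assumptions for this example, not something specified in the thread.

```python
from typing import Iterable


def summarize_by_session(events: Iterable[dict]) -> dict:
    """Collapse a batch of events into one partial summary per session ID."""
    summaries = {}
    for event in events:
        sid = event["session_id"]  # hypothetical entity key
        ts = event["timestamp"]    # assumed numeric
        summary = summaries.get(sid)
        if summary is None:
            # First event for this entity in the current batch.
            summaries[sid] = {"start": ts, "end": ts, "count": 1}
        else:
            summary["start"] = min(summary["start"], ts)
            summary["end"] = max(summary["end"], ts)
            summary["count"] += 1
    return summaries
```

Each value in the returned dict can then be sent as the parameter of one update per session, instead of one update per logged event.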

shaharmor · 2017-10-08T07:01:41Z

We need this as well to create pre-aggregated indices for bigger time intervals.

elasticmachine · 2019-04-12T07:50:24Z

Pinging @elastic/es-distributed

henningandersen · 2019-07-03T14:34:16Z

The primary purpose of reindex is to copy data from one index to another, whether for upgrades, mapping/schema changes or migration between clusters. These are all one-to-one cases, which is also the assumption built into reindex. Adding aggregation-style functionality into the mix should be carefully considered so that it does not unnecessarily complicate both the API and the implementation. We currently think this is better handled separately, as described in #40002. That issue addresses entity-centric indexing, so we suggest continuing the conversation there and will therefore close this issue.

honzakral added >enhancement :Reindex API labels Apr 26, 2016

honzakral mentioned this issue Apr 26, 2016

Reindex API #15201

Closed

5 tasks

clintongormley added the high hanging fruit label Apr 28, 2016

honzakral mentioned this issue Apr 29, 2016

Allow reindex API to move documents instead of just copying #17998

Closed

clintongormley added discuss and removed discuss labels May 2, 2016

lcawl added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed :Reindex API labels Feb 13, 2018

henningandersen added the :Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down label Apr 12, 2019

henningandersen removed the :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. label Apr 12, 2019

henningandersen added team-discuss and removed team-discuss labels Jul 3, 2019

henningandersen closed this as completed Jul 3, 2019
