
Support ingest pipelines as part of the update api? #17895


Closed
javanna opened this issue Apr 20, 2016 · 29 comments
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement help wanted adoptme

Comments

@javanna
Member

javanna commented Apr 20, 2016

Would it make sense to support ingest pipelines as part of the update api?

Ingest pipelines can already be used in the reindex api as an alternative to scripts. There is some kind of overlap between ingest and scripts: scripts are much more flexible (and safe with painless) but for people that get familiar with ingest it may be more convenient to reuse defined pipelines. Also some of the enrichments made through ingest are not easily performed using scripts (think geoip).
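
For reference, this is roughly what the existing reindex route looks like (a minimal sketch; the index names and the pipeline name `my-pipeline` are placeholders):

POST _reindex
{
  "source": {
    "index": "source-index"
  },
  "dest": {
    "index": "dest-index",
    "pipeline": "my-pipeline"
  }
}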

We can debate that ingest enrichments are really meant to be executed pre-indexing. But while we are reindexing a document, which is what the update api does, maybe it would be handy to be able to use ingest pipelines as well? Or maybe this is a bad idea at all and it would introduce yet another way to update documents other than scripts and partial document. I am on the fence on this myself, would be good to discuss this.

This comment made me think about this. Not sure ingest would help there though.

@javanna javanna added discuss :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Apr 20, 2016
@martijnvg
Member

I'm not sure myself whether this needs to be supported, although it does sound tempting. However, I think it is tricky. Ingest is flexible about where it runs: if a node isn't an ingest node, the pipeline executes elsewhere. That becomes tricky with the update api, since it always executes on the node holding the primary.

@martijnvg
Member

Discussed during fix it friday and this looks like a useful enhancement, but there are corner cases which would make it very tricky to support (the index name or routing being changed during ingestion, or a node not being allowed to run ingest). Therefore I'm closing this issue; we can re-evaluate at a later time if this is still useful and the technical concerns can be fixed easily.

@djschny
Contributor

djschny commented Aug 9, 2016

I feel it is very important to support ingest pipelines as part of the update API. Now that _timestamp has been removed, it was suggested that the alternative is to accomplish this with an ingest node pipeline. Without having this on the _update API, it makes for an incomplete solution. For example consider the following pipeline:

PUT _ingest/pipeline/timestamps
{
  "description": "Adds createdAt and updatedAt style timestamps",
  "processors": [
    {
      "set": {
        "field": "metadata.timestamp_created",
        "value": "{{_ingest.timestamp}}",
        "override": false
      }
    },
    {
      "set": {
        "field": "metadata.timestamp_updated",
        "value": "{{_ingest.timestamp}}",
        "override": true
      }
    }
  ]
}

When doing a partial update, we offer no alternative way to set these timestamps appropriately that I am aware of. A script could probably be used, but isn't reuse and consistency of the pipelines something that we would want?
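
For comparison, the scripted alternative alluded to here would look roughly like this sketch (the index name, document id and endpoint form are placeholders and vary by version; older versions use index/type/id/_update and `inline` instead of `source`; note the timestamp has to be supplied by the client as a param, which is exactly the reuse/consistency concern being raised):

POST my-index/_update/1
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.metadata == null) { ctx._source.metadata = [:] } ctx._source.metadata.timestamp_updated = params.now",
    "params": {
      "now": "2016-08-09T12:00:00Z"
    }
  }
}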

@clintongormley
Contributor

That is an interesting idea... although it'd be a lot slower than just using a script (as the document would have to be converted from JSON -> object -> JSON -> object -> JSON).

As mentioned above, exposing ingest to the update api has some complexity (eg changing _index, _type, _parent, _routing shouldn't be allowed). That said, update-by-query already exposes ingest.
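
For reference, the update-by-query route mentioned here already looks roughly like this (a sketch; the index name and the query are placeholders, and `timestamps` is the pipeline from the earlier comment):

POST my-index/_update_by_query?pipeline=timestamps
{
  "query": {
    "term": {
      "status": "active"
    }
  }
}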

@martijnvg
Member

+1 I think that if we don't allow modifying the _index, _parent or _routing metadata fields, then we can support ingest in the update api and for bulk items in the bulk api.

@ejsmith

ejsmith commented Sep 28, 2016

I feel like having a guaranteed timestamp field is critically important for many use cases such as online reindexing. Telling us to use an ingest pipeline, but then having no way to guarantee that those pipelines are run to ensure that the timestamp is updated is not a good solution IMO.

@clintongormley
Contributor

#17895 (comment) is a good point, especially given the fact that we've removed now() from painless #20766

@ejsmith

ejsmith commented Oct 9, 2016

I really hope the team reconsiders this decision because I think this is really going to hurt the product.

@martijnvg
Member

@ejsmith I think there is consensus about adding support for ingest to the update api.

@clintongormley Let's swap the discuss label for an adoptme label?

@ejsmith

ejsmith commented Oct 10, 2016

I fail to see how that helps. If it's a setting on the update API that means it's optional and I don't have any way to have a guaranteed timestamp that is reliable.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Oct 10, 2016
Update scripts might want to update the documents `_timestamp` but need a notion of `now()`.
Painless doesn't support any notion of now() since it would make scripts non-pure functions. Yet,
in the update case this is a valid value and we can pass it with the context together to allow the
script to record the timestamp the document was updated.

Relates to elastic#17895
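
For illustration, if that change exposes the update time to the script context (as `ctx._now` in epoch milliseconds, as far as I understand it), an update script could set a last-modified field without a client-supplied value, something like this sketch (the index, id and the `last_updated` field name are placeholders; note the value is a long, not a formatted date string like the ingest pipeline above produces):

POST my-index/_update/1
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.last_updated = ctx._now"
  }
}
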
@s1monw
Contributor

s1monw commented Oct 10, 2016

@ejsmith I think #20835 should resolve this issue?

@ejsmith

ejsmith commented Oct 10, 2016

Again, all of those features don't give me a very basic thing which is a guaranteed timestamp on my documents that I can rely on for things like reindexing and other operations. They are all dependent on everyone playing nice and making sure to set the timestamp.

@s1monw
Contributor

s1monw commented Oct 10, 2016

I understand what you mean, but we never had such a thing as a reliable way of getting a timestamp. The reason we are moving away from it is its inconsistency and the wrong expectations people had of it. Now we are trying to provide building blocks to get the desired behavior. They involve work, I understand that, but it's the most flexible and contained way of managing expectations.

@ejsmith

ejsmith commented Oct 10, 2016

So you're saying that the timestamp feature isn't reliable?

@s1monw
Contributor

s1monw commented Oct 10, 2016

We had tons of bugs related to TTL that caused confusion together with _timestamp. Also, if you use an update script and don't explicitly update _timestamp, the document keeps the original timestamp from when it was created, while if you send the document again from the outside (i.e. get -> modify -> update) you'd get a new value for _timestamp. The same is true for reindex, where the value would not be preserved. Sometimes you want that, sometimes you don't, but it's use-case specific and not really reliable. Whichever way you turn it, it's better to have it settable by the user, since we can't do the right thing for everybody.

@niemyjski
Contributor

I'd rather have it be updated all the time than never or not having this feature. It caused way more problems by being removed than just leaving it there.

@ejsmith

ejsmith commented Oct 10, 2016

I don't care about TTL. That makes perfect sense and can easily be handled by outside code. A reliable timestamp, however, can't be handled reliably from the outside. I understand that there were issues with _timestamp in the past, but I guess I don't get why it can't be implemented on the server side. Again, to me it seems like a very critical thing, and pretty much every other database system has that functionality. I know you guys aren't specifically a database, but the system very much works like a document database, and we need to do reindexing and other bulk operations that very much need some sort of reliable way of knowing when documents were last updated.

s1monw added a commit that referenced this issue Oct 10, 2016
Update scripts might want to update the documents `_timestamp` but need a notion of `now()`.
Painless doesn't support any notion of now() since it would make scripts non-pure functions. Yet,
in the update case this is a valid value and we can pass it with the context together to allow the
script to record the timestamp the document was updated.

Relates to #17895
@s1monw
Contributor

s1monw commented Oct 10, 2016

I'd rather have it be updated all the time than never or not having this feature. It caused way more problems by being removed than just leaving it there.

This is again how you used it, but it's not how it was used by the majority. The majority provided the timestamp from the outside, and then it's simply a date field. That is how we should treat it; it was meant for the logging case, nothing else. We are discussing X Y problems here: you want a last_modified timestamp, which isn't what _timestamp was (read the documentation). I am up for discussing a last-modified timestamp that is updated on every change and has clear semantics. That is, whether you like it or not, a new feature and should not be discussed on this issue. Feel free to open a new issue to add this specific field and we'll take it from there. @ejsmith @niemyjski WDYT?

@s1monw
Contributor

s1monw commented Oct 11, 2016

@ejsmith @niemyjski I opened #20859 for this, can you please take a look?

@niemyjski
Contributor

I just ran into a case where I needed to use a pipeline as part of the update api to ensure my data is consistent and up-to-date.

niemyjski added a commit to FoundatioFx/Foundatio.Repositories that referenced this issue Nov 2, 2016
@niemyjski
Contributor

Is there any reason the update api doesn't currently support pipelines? This is pretty huge and I'm already running into real world scenarios where I need this.

@ssunka

ssunka commented Mar 2, 2017

We really need ingest pipelines to work with the update API. It's really a good feature for the Elasticsearch experts to consider.

@icode

icode commented Apr 24, 2017

Is this still not supported?

@aalkilani

+1 We're looking to use the Ingest pipeline to support updates from Logstash and get away from using scripts in Logstash where the entire document gets shipped back to LS for every update. With the amount of data we're processing, this is becoming critical for us and may push us away from using ES altogether.

@brunosdiniz

+1 for this enhancement.

For the use case of ingest attachment pipelines, how are people currently dealing with updates to documents that have been previously indexed based on binary files?

In the scenario I'm exploring, the document has an array of attachments, and having to fetch entire documents just to be able to change a single attribute (via the index api) doesn't really seem like something one should have to resort to with ES at this stage.

Is there any good alternative to ingest attachment plugin for those using Logstash that does not involve ingestion pipelines?

I thought about maybe handling the text extraction in a ruby filter and avoiding the ingestion pipeline altogether, but that seemed too hacky for a scenario I think is quite common for the majority of folks.

Appreciate any insight on this.

@dadoonet
Member

@brunosdiniz I'm pretty sure it will happen at some point in LS. LS is moving some part of the code to Java so integrating ingest-attachment as a codec or a filter should be "easy" in the future.

See elastic/logstash#3815. I started an implementation of this 3 years ago but never got further than the POC.

One thing you could maybe use is my open source project named FSCrawler. It comes with a REST interface which you could call from Logstash with something like this REST Filter plugin (not merged though, so it will require some work I guess).

@consulthys
Contributor

consulthys commented Dec 23, 2017

I'm pretty sure there are many real-world use-case scenarios which would benefit from having the Update API support ingest pipelines.

For instance, I have a very concrete case where documents are coming from a third-party system (which is out of our control). When we index those documents the first time, we send them through an ingest pipeline to apply some transformations (split values, etc) and standardization (field renaming, etc). At a later point in the process, those documents are enriched by various software components under our control (mainly new fields are added). However, those documents are also updated from time to time in the 3rd party system and need to be repulled and re-indexed when that happens. If we simply index the document with the new data again, the current version of the document which contains the additional enriched fields will be erased. Just to illustrate this:

At time T, the document coming from the 3rd-party system looks like:

{ 
     "field1": "data11,data12",              // raw
     "field2": "data2"                       // raw
}

At time T+1, after going through an ingest pipeline, it looks like

{ 
    "field1_renamed": ["data11", "data12"],  // renamed + split
    "field2_renamed": "data2"                // renamed
}

At time T+2, various processes enrich the document which now looks like

{ 
     "field1_renamed": ["data11", "data12"], // same
     "field2_renamed": "data2",              // same
     "custom_field1": "customdata1",         // custom/enriched
     "custom_field2": "customdata2"          // custom/enriched
}

At time T+3, the document is updated in the 3rd-party system and now looks like:

{ 
     "field1": "data3",                      // raw
     "field2": "data4"                       // raw
}

At this point, we need to do two things:

  1. apply the transformations in the ingest pipeline (rename fields, split fields, etc)
  2. keep the custom fields that have been created at T+2

i.e. the new version of the document should look like this:

{ 
     "field1_renamed": "data3",              // renamed
     "field2_renamed": "data4",              // renamed 
     "custom_field1": "customdata1",         // same 
     "custom_field2": "customdata2"          // same
}

I don't think what we're doing above is anything out of the ordinary, I can imagine plenty of use cases along the same lines. Since the Update API doesn't support pipelines, we cannot use them at all and have to reimplement exactly what ingest pipelines do but shove that functionality inside of our software stack to apply the transformation ourselves.

Another possibility would boil down to doing all this with Logstash (rename fields, etc) + using a ruby filter to apply the diff, but having implemented that a few times, I think it's a pretty convoluted solution which won't scale well in our case, because there are many 3rd-party systems and the field renaming is different for each of them (i.e. it's easier to create a pipeline for each of them than a different Logstash config for each of them).

I've thought of using parent/child docs for this, so that our custom fields would be in a parent document and data from 3rd party systems would be in child documents that can be indexed/overridden as usual (hence pipelines would work), but this setup would make queries overkill afterwards.

Yet another idea would be to have a _meta field per document so we can shove our custom fields in there and not override them when indexing a new version, but as far as I know this feature only exists in my head.

In light of this concrete case, I'd appreciate it if anyone has any thoughts on how best to handle this.

@consulthys
Contributor

consulthys commented Jan 15, 2018

For those who are interested, I've figured out yet another way to circumvent this. It involves an additional step, but it works and does exactly what I want.

The idea is to:

  1. first use the Index API to index the data into a temporary index using the ingestion pipeline (to apply the transformations, renaming, etc)
  2. then to read the data from that temporary index
  3. and finally to use the Update API (doc/script upsert) to update the real index with the transformed data read from the temporary index.

Agreed, it would be much easier if the Update API supported ingestion pipelines, and I'm still hoping that it will be the case one day.
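
For anyone wanting to reproduce the three steps above, a rough sketch as concrete requests (index names, ids and the pipeline name `my-transform-pipeline` are placeholders; step 2, reading back the transformed _source, happens in client code, and the exact endpoint form depends on the ES version):

// 1) run the raw document through the pipeline into a temporary index
PUT tmp-index/_doc/1?pipeline=my-transform-pipeline
{
  "field1": "data3",
  "field2": "data4"
}

// 2) GET tmp-index/_doc/1 and read the transformed _source in client code

// 3) merge the transformed fields into the real index without touching the custom fields
POST real-index/_update/1
{
  "doc": {
    "field1_renamed": "data3",
    "field2_renamed": "data4"
  },
  "doc_as_upsert": true
}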

@talevy
Contributor

talevy commented Mar 13, 2018

Closing, since the performance ramifications of exporting the document (at update time) to a separate ingest node are too great to consider implementing this functionality.
