
Support ingest pipelines as part of the update api? #17895


Closed
javanna opened this issue Apr 20, 2016 · 29 comments
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement help wanted adoptme

Comments

@javanna
Member

javanna commented Apr 20, 2016

Would it make sense to support ingest pipelines as part of the update api?

Ingest pipelines can already be used in the reindex api as an alternative to scripts. There is some kind of overlap between ingest and scripts: scripts are much more flexible (and safe with painless) but for people that get familiar with ingest it may be more convenient to reuse defined pipelines. Also some of the enrichments made through ingest are not easily performed using scripts (think geoip).
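
For reference, this is roughly what the existing reindex route looks like (a minimal sketch; the index names and the pipeline name `my-pipeline` are placeholders):

POST _reindex
{
  "source": {
    "index": "source-index"
  },
  "dest": {
    "index": "dest-index",
    "pipeline": "my-pipeline"
  }
}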

We can debate that ingest enrichments are really meant to be executed pre-indexing. But while we are reindexing a document, which is what the update api does, maybe it would be handy to be able to use ingest pipelines as well? Or maybe this is a bad idea at all and it would introduce yet another way to update documents other than scripts and partial document. I am on the fence on this myself, would be good to discuss this.

This comment made me think about this. Not sure ingest would help there though.

@javanna javanna added discuss :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Apr 20, 2016
@martijnvg
Member

I'm not sure myself whether this needs to be supported, although it does sound tempting. However, I think it is tricky. Ingest is flexible about where it runs: if a node isn't an ingest node, the pipeline executes elsewhere. That becomes tricky with the update api, since it always executes on the node holding the primary.

@martijnvg
Member

Discussed during fix it friday and this looks like a useful enhancement, but there are corner cases which would make it very tricky to support (the index name or routing being changed during ingestion, or a node not being allowed to run ingest). Therefore I'm closing this issue; we can re-evaluate at a later time if this is still useful and the technical concerns can be fixed easily.

@djschny
Contributor

djschny commented Aug 9, 2016

I feel it is very important to support ingest pipelines as part of the update API. Now that _timestamp has been removed, it was suggested that the alternative is to accomplish this with an ingest node pipeline. Without having this on the _update API, it makes for an incomplete solution. For example consider the following pipeline:

PUT _ingest/pipeline/timestamps
{
  "description": "Adds createdAt and updatedAt style timestamps",
  "processors": [
    {
      "set": {
        "field": "metadata.timestamp_created",
        "value": "{{_ingest.timestamp}}",
        "override": false
      }
    },
    {
      "set": {
        "field": "metadata.timestamp_updated",
        "value": "{{_ingest.timestamp}}",
        "override": true
      }
    }
  ]
}

When doing a partial update, we offer no alternative way to set these timestamps appropriately that I am aware of. A script could probably be used, but isn't reuse and consistency of the pipelines something that we would want?
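
For comparison, the scripted alternative alluded to here would look roughly like this sketch (the index name, document id and endpoint form are placeholders and vary by version; older versions use index/type/id/_update and `inline` instead of `source`; note the timestamp has to be supplied by the client as a param, which is exactly the reuse/consistency concern being raised):

POST my-index/_update/1
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.metadata == null) { ctx._source.metadata = [:] } ctx._source.metadata.timestamp_updated = params.now",
    "params": {
      "now": "2016-08-09T12:00:00Z"
    }
  }
}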

@clintongormley
Contributor

That is an interesting idea... although it'd be a lot slower than just using a script (as the document would have to be converted from JSON -> object -> JSON -> object -> JSON).

As mentioned above, exposing ingest to the update api has some complexity (eg changing _index, _type, _parent, _routing shouldn't be allowed). That said, update-by-query already exposes ingest.
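
For reference, the update-by-query route mentioned here already looks roughly like this (a sketch; the index name and the query are placeholders, and `timestamps` is the pipeline from the earlier comment):

POST my-index/_update_by_query?pipeline=timestamps
{
  "query": {
    "term": {
      "status": "active"
    }
  }
}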

@martijnvg
Member

+1 I think that if we don't allow modifying the _index, _parent or _routing metadata fields, then we can support ingest in the update api and for bulk items in the bulk api.

@ejsmith

ejsmith commented Sep 28, 2016

I feel like having a guaranteed timestamp field is critically important for many use cases such as online reindexing. Telling us to use an ingest pipeline, but then having no way to guarantee that those pipelines are run to ensure that the timestamp is updated is not a good solution IMO.

@clintongormley
Contributor

#17895 (comment) is a good point, especially given the fact that we've removed now() from painless #20766

@ejsmith

ejsmith commented Oct 9, 2016

I really hope the team reconsiders this decision because I think this is really going to hurt the product.

@martijnvg
Member

@ejsmith I think there is consensus about adding support for ingest to the update api.

@clintongormley Let's swap the discuss label for an adoptme label?

@ejsmith

ejsmith commented Oct 10, 2016

I fail to see how that helps. If it's a setting on the update API that means it's optional and I don't have any way to have a guaranteed timestamp that is reliable.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Oct 10, 2016
Update scripts might want to update the documents `_timestamp` but need a notion of `now()`.
Painless doesn't support any notion of now() since it would make scripts non-pure functions. Yet,
in the update case this is a valid value and we can pass it with the context together to allow the
script to record the timestamp the document was updated.

Relates to elastic#17895
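
For illustration, if that change exposes the update time to the script context (as `ctx._now` in epoch milliseconds, as far as I understand it), an update script could set a last-modified field without a client-supplied value, something like this sketch (the index, id and the `last_updated` field name are placeholders; note the value is a long, not a formatted date string like the ingest pipeline above produces):

POST my-index/_update/1
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.last_updated = ctx._now"
  }
}
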
@s1monw
Contributor

s1monw commented Oct 10, 2016

@ejsmith I think #20835 should resolve this issue?

@ejsmith

ejsmith commented Oct 10, 2016

Again, all of those features don't give me a very basic thing which is a guaranteed timestamp on my documents that I can rely on for things like reindexing and other operations. They are all dependent on everyone playing nice and making sure to set the timestamp.

@s1monw
Contributor

s1monw commented Oct 10, 2016

I understand what you mean, but we never had such a thing as a reliable way of getting a timestamp. The reason we are moving away from it is its inconsistency and the wrong expectations people had of it. Now we are trying to provide building blocks to get the desired behavior. They involve work, I understand that, but it's the most flexible and contained way of managing expectations.

@ejsmith

ejsmith commented Oct 10, 2016

So you're saying that the timestamp feature isn't reliable?

@s1monw
Contributor

s1monw commented Oct 10, 2016

We had tons of bugs related to TTL that caused confusion together with _timestamp. Also, if you use an update script and don't explicitly update _timestamp, the document keeps the original timestamp from when it was created, while if you send the document again from the outside (i.e. get -> modify -> update) you'd get a new value for _timestamp. The same is true for reindex, where the value would not be preserved. Sometimes you want that, sometimes you don't, but it's use-case specific and not really reliable. Whichever way you turn it, it's better to have it settable by the user, since we can't do the right thing for everybody.

@niemyjski
Contributor

I'd rather have it be updated all the time than never or not having this feature. It caused way more problems by being removed than just leaving it there.

@ejsmith

ejsmith commented Oct 10, 2016

I don't care about TTL. That makes perfect sense and can easily be handled by outside code. A reliable timestamp, however, can't be handled reliably from the outside. I understand that there were issues with _timestamp in the past, but I guess I don't get why it can't be implemented on the server side. Again, to me it seems like a very critical thing, and pretty much every other database system has that functionality. I know you guys aren't specifically a database, but the system very much works like a document database, and we need to do reindexing and other bulk operations that very much need some sort of reliable way of knowing when documents were last updated.

s1monw added a commit that referenced this issue Oct 10, 2016
Update scripts might want to update the documents `_timestamp` but need a notion of `now()`.
Painless doesn't support any notion of now() since it would make scripts non-pure functions. Yet,
in the update case this is a valid value and we can pass it with the context together to allow the
script to record the timestamp the document was updated.

Relates to #17895
@s1monw
Contributor

s1monw commented Oct 10, 2016

I'd rather have it be updated all the time than never or not having this feature. It caused way more problems by being removed than just leaving it there.

This is again how you used it, but it's not how it was used by the majority. The majority provided the timestamp from the outside, and then it's simply a date field. That is how we should treat it; it was meant for the logging case, nothing else. We are discussing X Y problems here: you want a last_modified timestamp, which isn't what _timestamp was (read the documentation). I am up for discussing a last-modified timestamp that is updated on every change and has clear semantics. That is, whether you like it or not, a new feature and should not be discussed on this issue. Feel free to open a new issue to add this specific field and we'll take it from there. @ejsmith @niemyjski WDYT?

@s1monw
Contributor

s1monw commented Oct 11, 2016

@ejsmith @niemyjski I opened #20859 for this, can you please take a look?

@niemyjski
Contributor

I just ran into a case where I needed to use a pipeline as part of the update api to ensure my data is consistent and up-to-date.

niemyjski added a commit to FoundatioFx/Foundatio.Repositories that referenced this issue Nov 2, 2016
@niemyjski
Contributor

Is there any reason the update api doesn't currently support pipelines? This is pretty huge and I'm already running into real world scenarios where I need this.

@ssunka

ssunka commented Mar 2, 2017

We really need ingest pipelines to work with the update API. It's really a good feature for the Elasticsearch experts to consider.

@icode

icode commented Apr 24, 2017

Is this still not supported?

@aalkilani

+1 We're looking to use the Ingest pipeline to support updates from Logstash and get away from using scripts in Logstash where the entire document gets shipped back to LS for every update. With the amount of data we're processing, this is becoming critical for us and may push us away from using ES altogether.

@brunosdiniz

+1 for this enhancement.

For the use case of ingest attachment pipelines, how are people currently dealing with updates to documents that have been previously indexed based on binary files?

In the scenario I'm exploring, the document has an array of attachments, and having to fetch entire documents just to be able to change a single attribute (via the index api) doesn't really seem like something one should have to resort to with ES at this stage.

Is there any good alternative to ingest attachment plugin for those using Logstash that does not involve ingestion pipelines?

I thought about maybe handling the text extraction in a ruby filter and avoiding the ingestion pipeline altogether, but that seemed too hacky for a scenario I think is quite common for the majority of folks.

Appreciate any insight on this.

@dadoonet
Member

@brunosdiniz I'm pretty sure it will happen at some point in LS. LS is moving some part of the code to Java so integrating ingest-attachment as a codec or a filter should be "easy" in the future.

See elastic/logstash#3815. I started an implementation of this 3 years ago but never got further than the POC.

One thing you could maybe use is my open source project named FSCrawler. It comes with a REST interface which you could call from Logstash with something like this REST Filter plugin (not merged though, so it will require some work I guess).

@consulthys
Contributor

consulthys commented Dec 23, 2017

I'm pretty sure there are many real-world use-case scenarios which would benefit from having the Update API support ingest pipelines.

For instance, I have a very concrete case where documents are coming from a third-party system (which is out of our control). When we index those documents the first time, we send them through an ingest pipeline to apply some transformations (split values, etc) and standardization (field renaming, etc). At a later point in the process, those documents are enriched by various software components under our control (mainly new fields are added). However, those documents are also updated from time to time in the 3rd party system and need to be repulled and re-indexed when that happens. If we simply index the document with the new data again, the current version of the document which contains the additional enriched fields will be erased. Just to illustrate this:

At time T, the document coming from the 3rd-party system looks like:

{ 
     "field1": "data11,data12",              // raw
     "field2": "data2"                       // raw
}

At time T+1, after going through an ingest pipeline, it looks like

{ 
    "field1_renamed": ["data11", "data12"],  // renamed + split
    "field2_renamed": "data2"                // renamed
}

At time T+2, various processes enrich the document which now looks like

{ 
     "field1_renamed": ["data11", "data12"], // same
     "field2_renamed": "data2",              // same
     "custom_field1": "customdata1",         // custom/enriched
     "custom_field2": "customdata2"          // custom/enriched
}

At time T+3, the document is updated in the 3rd-party system and now looks like:

{ 
     "field1": "data3",                      // raw
     "field2": "data4"                       // raw
}

At this point, we need to do two things:

  1. apply the transformations in the ingest pipeline (rename fields, split fields, etc)
  2. keep the custom fields that have been created at T+2

i.e. the new version of the document should look like this:

{ 
     "field1_renamed": "data3",              // renamed
     "field2_renamed": "data4",              // renamed 
     "custom_field1": "customdata1",         // same 
     "custom_field2": "customdata2"          // same
}

I don't think what we're doing above is anything out of the ordinary, I can imagine plenty of use cases along the same lines. Since the Update API doesn't support pipelines, we cannot use them at all and have to reimplement exactly what ingest pipelines do but shove that functionality inside of our software stack to apply the transformation ourselves.

Another possibility would boil down to doing all this with Logstash (rename fields, etc) + using a ruby filter to apply the diff, but having implemented that a few times, I think it's a pretty convoluted solution which won't scale well in our case, because there are many 3rd-party systems and the field renaming is different for each of them (i.e. it's easier to create a pipeline for each of them than a different Logstash config for each of them).

I've thought of using parent/child docs for this, so that our custom fields would be in a parent document and data from 3rd party systems would be in child documents that can be indexed/overridden as usual (hence pipelines would work), but this setup would make queries overkill afterwards.

Yet another idea would be to have a _meta field per document so we can shove our custom fields in there and not override them when indexing a new version, but as far as I know this feature only exists in my head.

In light of this concrete case, I'd appreciate it if anyone has any thoughts on how best to handle this.

@consulthys
Contributor

consulthys commented Jan 15, 2018

For those who are interested, I've figured out yet another way to circumvent this. It involves an additional step, but it works and does exactly what I want.

The idea is to:

  1. first use the Index API to index the data into a temporary index using the ingestion pipeline (to apply the transformations, renaming, etc)
  2. then to read the data from that temporary index
  3. and finally to use the Update API (doc/script upsert) to update the real index with the transformed data read from the temporary index.

Agreed, it would be much easier if the Update API supported ingestion pipelines, and I'm still hoping that it will be the case one day.
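
For anyone wanting to reproduce the three steps above, a rough sketch as concrete requests (index names, ids and the pipeline name `my-transform-pipeline` are placeholders; step 2, reading back the transformed _source, happens in client code, and the exact endpoint form depends on the ES version):

// 1) run the raw document through the pipeline into a temporary index
PUT tmp-index/_doc/1?pipeline=my-transform-pipeline
{
  "field1": "data3",
  "field2": "data4"
}

// 2) GET tmp-index/_doc/1 and read the transformed _source in client code

// 3) merge the transformed fields into the real index without touching the custom fields
POST real-index/_update/1
{
  "doc": {
    "field1_renamed": "data3",
    "field2_renamed": "data4"
  },
  "doc_as_upsert": true
}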

@talevy
Contributor

talevy commented Mar 13, 2018

Closing, since the performance ramifications of exporting the document (at update time) to a separate ingest node are too great to consider implementing this functionality.
