Ingest processor cannot access _id on autogenerated id #41163

spinscale · 2019-04-12T18:18:57Z

I think this is merely a documentation issue for now. Found at https://discuss.elastic.co/t/accessing-id-in-ingest-pipeline/176503

A document that will have its ID autogenerated obviously has no way of accessing that ID during ingest. However, no error is raised, and the user might not know the correct order of operations.

Elasticsearch version (bin/elasticsearch --version): 7.0.0

Steps to reproduce:

PUT _ingest/pipeline/my_pipeline
{
    "processors": [
      {
        "set" : {
          "field" : "id",
          "value" : "{{_id}}"
        }
      }
    ]
}

DELETE foo

POST foo/_doc?pipeline=my_pipeline&refresh=true
{"foo":"rab"}

# id field will be empty
GET foo/_search


elasticmachine · 2019-04-15T08:15:23Z

Pinging @elastic/es-core-features

chigix · 2019-05-27T07:15:59Z

Hi, may I ask whether a decision has been made on when this issue will be addressed? I would also like to have a field that holds a copy of the generated ID text.

martijnvg · 2019-05-28T09:12:31Z

Agreed, accessing the _id in a pipeline for documents with auto generated ids leads to unexpected behaviour. So this needs to be documented. On top of that, I'm leaning towards also throwing a descriptive error in the case there is no _id present.

martijnvg · 2019-05-30T15:20:44Z

We discussed this issue and failing with a descriptive error is preferred over the current behaviour if the id is missing and a pipeline uses {{_id}}.

jrodewig · 2019-10-07T15:40:47Z

[docs issue triage]

Leaving open. This is still relevant.

hlzhang · 2020-02-03T22:42:07Z

> Agreed, accessing the _id in a pipeline for documents with auto generated ids leads to unexpected behaviour. So this needs to be documented. On top of that, I'm leaning towards also throwing a descriptive error in the case there is no _id present.

I do not agree.

You should at least provide read-only access to the _id field in the pipeline.
We ingest about 1TB of logs per day from hundreds of different entities and we analyze those logs every night. Without read-only access to the _id field, we have to use the expensive scroll API. We cannot use the Search After feature because duplicating the _id value into another field with doc_values enabled is a very slow operation.

I don't know if it's possible to run thousands of scrolls in parallel on tens of TB of data.

There is no elegant way to duplicate the id in the way that https://www.elastic.co/guide/en/elasticsearch/reference/7.5/search-request-body.html#request-body-search-search-after advises: "Instead it is advised to duplicate (client side or with a set ingest processor) the content of the _id field in another field that has doc value enabled and to use this new field as the tiebreaker for the sort."

I could generate a flake id the way Elasticsearch does by developing a Flake Id Logstash plugin, but this would slow down indexing (see: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html).

If I cannot duplicate _id as the official documentation says, Search After is totally useless for me.
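
For readers weighing the "client side" option from the quoted documentation, here is a minimal Python sketch, assuming the client is free to choose its own IDs. It generates the ID before sending and writes it both as the bulk action's _id and as a regular field. The field name `id` and the helper `bulk_lines` are hypothetical, and `uuid4` is used only for illustration; it is not the flake-style ID that Elasticsearch generates itself.

```python
import json
import uuid


def bulk_lines(index, docs, id_field="id"):
    """Build an NDJSON _bulk body where each document carries a copy of its _id.

    `id_field` is a hypothetical field name chosen for this sketch.
    """
    lines = []
    for doc in docs:
        # Assumption: any unique string is acceptable as a client-generated ID.
        doc_id = uuid.uuid4().hex
        action = {"index": {"_index": index, "_id": doc_id}}
        source = dict(doc, **{id_field: doc_id})  # copy of the ID inside the document
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


body = bulk_lines("foo", [{"foo": "rab"}])
print(body)
```

The resulting string would be sent as the body of a `_bulk` request. As noted above, supplying IDs from the client gives up the indexing-speed benefit of auto-generated ids, so this is a trade-off rather than a fix.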

martijnvg · 2020-02-04T12:40:05Z

Maybe we can investigate generating the id prior to doing ingest. Currently generating an id happens after ingest has occurred.
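
The ordering described here can be illustrated with a toy model in Python. It is not Elasticsearch code: `render`, `run_pipeline`, and `index_doc` are hypothetical stand-ins, and the only behaviour borrowed from this issue is that the template yields an empty value when no _id exists yet.

```python
import re
import uuid


def render(template, metadata):
    """Tiny stand-in for template rendering: unknown {{names}} become ''."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(metadata.get(m.group(1), "")), template)


def run_pipeline(source, metadata):
    """Mimics the set processor from the reproduction: id = "{{_id}}"."""
    return dict(source, id=render("{{_id}}", metadata))


def index_doc(source, doc_id=None, generate_before_ingest=False):
    metadata = {}
    if doc_id is None and generate_before_ingest:
        doc_id = uuid.uuid4().hex  # proposed order: the ID exists before ingest runs
    if doc_id is not None:
        metadata["_id"] = doc_id
    source = run_pipeline(source, metadata)
    if doc_id is None:
        doc_id = uuid.uuid4().hex  # current order: the ID is assigned after ingest
    return doc_id, source


current_id, current_doc = index_doc({"foo": "rab"})
proposed_id, proposed_doc = index_doc({"foo": "rab"}, generate_before_ingest=True)
print(repr(current_doc["id"]))            # '' because no _id existed during ingest
print(proposed_doc["id"] == proposed_id)  # True
```

In this model, passing an explicit `doc_id` behaves like the proposed ordering, which matches the observation that the problem only shows up with autogenerated ids.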

HJK181 · 2020-06-24T07:42:36Z

This issue has been open for about a year now and nothing has happened to your documentation, which is clearly wrong!
I implemented the set processor solution suggested in the documentation:

> Instead it is advised to duplicate (client side or with a set ingest processor) the content of the _id field in another field that has doc value enabled and to use this new field as the tiebreaker for the sort.

only to realize now that it was a waste of time, as it is never going to work?! Why haven't you been able to update the documentation for about a year?

markharwood added the :Data Management/Ingest Node, >docs, and >enhancement labels Apr 15, 2019

martijnvg added team-discuss and removed team-discuss labels May 28, 2019

martijnvg mentioned this issue Nov 12, 2019

Improve ingest node usability #48999

Closed

12 tasks

mwilliammyers mentioned this issue Nov 20, 2019

Add skip attribute to input objects graphql-rust/juniper#463

Open

rjernst added the Team:Data Management and Team:Docs labels May 4, 2020

imotov mentioned this issue Jun 4, 2020

Changing auto generated id in ingest pipeline to existing id #57693

Open

jrodewig self-assigned this Jul 23, 2020

jrodewig mentioned this issue Jul 23, 2020

[DOCS] Fix ingest processor docs for autogen doc IDs #60147

Merged

jrodewig closed this as completed in #60147 Jul 27, 2020
