Ingest processor cannot access _id on autogenerated id #41163

Closed
spinscale opened this issue Apr 12, 2019 · 8 comments · Fixed by #60147
Labels
:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), >docs (General docs changes), >enhancement, Team:Data Management (Meta label for data/management team), Team:Docs (Meta label for docs team)

Comments

@spinscale
Contributor

I think this is merely a documentation issue for now. Found at https://discuss.elastic.co/t/accessing-id-in-ingest-pipeline/176503

A document indexed with an autogenerated ID has no way of accessing its ID in the pipeline. However, no error is raised, and the user might not know the correct order of operations.

Elasticsearch version (bin/elasticsearch --version): 7.0.0

Steps to reproduce:

PUT _ingest/pipeline/my_pipeline
{
    "processors": [
      {
        "set" : {
          "field" : "id",
          "value" : "{{_id}}"
        }
      }
    ]
}

DELETE foo

POST foo/_doc?pipeline=my_pipeline&refresh=true
{"foo":"rab"}

# id field will be empty
GET foo/_search
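
For contrast, here is a minimal Python sketch (standard library only) of the same index request with a client-supplied ID, in which case `{{_id}}` is available to the pipeline. The cluster URL `http://localhost:9200` and the helper name `build_index_request` are assumptions for illustration and are not part of the original report.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured test cluster


def build_index_request(index, doc_id, pipeline, doc):
    """Build a PUT request that indexes `doc` under an explicit, client-chosen ID."""
    url = f"{ES_URL}/{index}/_doc/{doc_id}?pipeline={pipeline}&refresh=true"
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# The ID "1" is supplied by the client, so the pipeline can read it as {{_id}}.
req = build_index_request("foo", "1", "my_pipeline", {"foo": "rab"})
print(req.get_method(), req.full_url)
# To actually send it against a running cluster: urllib.request.urlopen(req)
```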
markharwood added the labels :Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), >docs (General docs changes), >enhancement on Apr 15, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@chigix

chigix commented May 27, 2019

Hi, may I ask whether there has been a decision on when this issue will be addressed? I would also like to copy the generated ID text into a field.

@martijnvg
Member

Agreed, accessing the _id in a pipeline for documents with auto generated ids leads to unexpected behaviour. So this needs to be documented. On top of that, I'm leaning towards also throwing a descriptive error in the case there is no _id present.

@martijnvg
Member

We discussed this issue and failing with a descriptive error is preferred over the current behaviour if the id is missing and a pipeline uses {{_id}}.

@jrodewig
Contributor

jrodewig commented Oct 7, 2019

[docs issue triage]

Leaving open. This is still relevant.

@hlzhang

hlzhang commented Feb 3, 2020

> Agreed, accessing the _id in a pipeline for documents with auto generated ids leads to unexpected behaviour. So this needs to be documented. On top of that, I'm leaning towards also throwing a descriptive error in the case there is no _id present.

Do not agree.

You should at least provide read-only access to the _id field in the pipeline.
We ingest about 1TB of logs per day from hundreds of different entities, and we analyze those logs every night. Without read-only access to the _id field, we have to use the expensive scroll API. We cannot use the Search After feature because duplicating the _id value into another field with doc_values enabled is a very slow operation.

I don't know if it's possible to do thousands of scrolls in parallel on tens of TB of data.

There is no elegant way to duplicate the id, even though https://www.elastic.co/guide/en/elasticsearch/reference/7.5/search-request-body.html#request-body-search-search-after says: "Instead it is advised to duplicate (client side or with a set ingest processor) the content of the _id field in another field that has doc value enabled and to use this new field as the tiebreaker for the sort."

I could generate flake IDs as Elasticsearch does by developing a Flake ID Logstash plugin, but this would slow down indexing (see: https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html).

If I cannot duplicate the _id as the official documentation says, search after is totally useless for me.
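
The client-side duplication that the quoted documentation mentions could look roughly like the following. This is a minimal sketch under stated assumptions: IDs are generated on the client with `uuid4` (not Elasticsearch's own flake-style IDs) and copied into a hypothetical `id` field that would need to be mapped with doc values. The helper names `with_client_side_id` and `bulk_body` are illustrative only.

```python
import json
import uuid


def with_client_side_id(doc, id_field="id"):
    """Return (doc_id, copy of doc) where the copy carries the ID in `id_field`."""
    doc_id = uuid.uuid4().hex  # assumption: any unique string is acceptable as _id
    return doc_id, {**doc, id_field: doc_id}


def bulk_body(index, docs):
    """Build an NDJSON _bulk payload with explicit IDs duplicated into each document."""
    lines = []
    for doc in docs:
        doc_id, enriched = with_client_side_id(doc)
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(enriched))
    return "\n".join(lines) + "\n"  # the _bulk API expects a trailing newline


print(bulk_body("foo", [{"foo": "rab"}]))
```

Supplying explicit IDs means Elasticsearch has to check for an existing document with the same ID on each index operation, which is the indexing-speed cost referred to above.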

@martijnvg
Member

Maybe we can investigate generating the id prior to doing ingest. Currently generating an id happens after ingest has occurred.

rjernst added the labels Team:Data Management (Meta label for data/management team), Team:Docs (Meta label for docs team) on May 4, 2020
@HJK181

HJK181 commented Jun 24, 2020

This issue has been open for about a year now and nothing has happened to your documentation, which is clearly wrong!
I implemented the suggested set processor solution from the documentation:

> Instead it is advised to duplicate (client side or with a set ingest processor) the content of the _id field in another field that has doc value enabled and to use this new field as the tiebreaker for the sort.

only to realize now that it was a waste of time, as it is never going to work. Why haven't you been able to update the documentation for about a year?
