Inconsistent results with ingest pipeline _simulate #22825

Closed
anderssynstad opened this issue Jan 26, 2017 · 8 comments
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP

Comments

anderssynstad commented Jan 26, 2017

Elasticsearch version: 5.1.1

Plugins installed: [none]

JVM version: openjdk-8-jre-headless:amd64 8u121-b13-0ubuntu1.16.04.2

OS version: Ubuntu 16.04.1

Description of the problem including expected versus actual behavior: The ingest pipeline appears to give inconsistent results for the following sets of requests:

SET 1:

curl -XPUT localhost:9200/_ingest/pipeline/myingest -d'
{
  "processors": [
    {
      "split": {
        "field": "dev_environment",
        "separator": ";",
        "ignore_missing": true
      },
      "foreach": {
        "field": "dev_environment",
        "ignore_failure": true,
        "processor": {
          "trim": {
            "field": "_ingest._value",
            "ignore_missing": true
          }
        }
      }
    }
  ]
}'

SET 2:

curl -XPOST localhost:9200/_ingest/pipeline/myingest/_simulate -d'
{
  "docs": [
    { "_source": { "dev_environment": "Notepad++; Visual Studio" }},
    { "_source": { "dev_environment": "Notepad++" }},
    { "_source": { "dev_environment": "Sublime; Visual Studio" }},
    { "_source": { }}
  ]
}'

Running SET 1 & 2 gives the expected result. But when you combine the two into SET 3:

SET 3:

curl -XPOST localhost:9200/_ingest/pipeline/_simulate -d'
{
"pipeline": {
  "processors": [
    {
      "split": {
        "field": "dev_environment",
        "separator": ";",
        "ignore_missing": true
      },
      "foreach": {
        "field": "dev_environment",
        "ignore_failure": true,
        "processor": {
          "trim": {
            "field": "_ingest._value",
            "ignore_missing": true
          }
        }
      }
    }
  ]
},
  "docs": [
    { "_source": { "dev_environment": "Notepad++; Visual Studio" }},
    { "_source": { "dev_environment": "Notepad++" }},
    { "_source": { "dev_environment": "Sublime; Visual Studio" }},
    { "_source": { }}
  ]
}'

Running SET 3 seems to work, but the trim processor actually fails. If you remove "ignore_failure": true, you get an exception.

To get SET 3 to work, split and foreach have to be placed in separate processor objects, as done in SET 4.

SET 4:

curl -XPOST localhost:9200/_ingest/pipeline/_simulate -d'
{
"pipeline": {
  "processors": [
    {
      "split": {
        "field": "dev_environment",
        "separator": ";",
        "ignore_missing": true
      }},{
      "foreach": {
        "field": "dev_environment",
        "ignore_failure": true,
        "processor": {
          "trim": {
            "field": "_ingest._value",
            "ignore_missing": true
          }
        }
      }
    }
  ]
},
  "docs": [
    { "_source": { "dev_environment": "Notepad++; Visual Studio" }},
    { "_source": { "dev_environment": "Notepad++" }},
    { "_source": { "dev_environment": "Sublime; Visual Studio" }},
    { "_source": { }}
  ]
}'

SET 4 gives the expected results.
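
For reference, the result that the pipeline is meant to produce can be sketched in plain Python. This is only an emulation of what split followed by foreach/trim does to each document, not Elasticsearch code: the function name is made up for illustration, and a literal str.split stands in for the split processor's separator handling.

```python
# Emulates the intended pipeline: split "dev_environment" on ";",
# then trim every resulting element. Not Elasticsearch code.

def emulate_pipeline(source):
    """Return a copy of `source` after the emulated split + foreach/trim."""
    doc = dict(source)
    field = "dev_environment"

    # split with ignore_missing: leave the document alone if the field is absent
    if field not in doc:
        return doc
    parts = doc[field].split(";")

    # foreach + trim: strip surrounding whitespace from each element
    doc[field] = [part.strip() for part in parts]
    return doc


docs = [
    {"dev_environment": "Notepad++; Visual Studio"},
    {"dev_environment": "Notepad++"},
    {"dev_environment": "Sublime; Visual Studio"},
    {},
]

for source in docs:
    print(emulate_pipeline(source))
```

In the first and third documents the second element comes out as "Visual Studio" without a leading space. When the trim step does not run, that space is left in place, which is the visible symptom of the failure.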

The claim is that SET 1 + 2 should give the same result as SET 3, but they do not. SET 3 requires a syntax change in order to work.

Steps to reproduce:

  1. Run SET 1 + SET 2. Observe that they work as expected.
  2. Run SET 3. Observe that the trim fails to run.
  3. Run SET 4. Observe that the change produces the expected result.

Sorry for potato code snippets.

@clintongormley clintongormley added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >bug labels Jan 27, 2017
@clintongormley
Contributor

@anderssynstad thanks for reporting and for providing such a clear recreation

@clintongormley
Contributor

@talevy the syntax in SET 1 & 3 is incorrect, as split and foreach are in the same object - our parsing is not strict here and should complain
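
A stricter check of the kind described here could look roughly like the following. This is a hypothetical sketch in Python, not Elasticsearch's parser: the function name and the message format are invented, and the only rule it enforces is that each entry in processors is an object with exactly one processor type as its key. It does not look inside nested definitions such as the processor of a foreach.

```python
import json


def find_malformed_processors(pipeline):
    """Return messages for entries that do not define exactly one processor."""
    problems = []
    for index, entry in enumerate(pipeline.get("processors", [])):
        if not isinstance(entry, dict) or len(entry) != 1:
            keys = sorted(entry) if isinstance(entry, dict) else []
            problems.append(
                f"processors[{index}] must contain exactly one processor, found {keys}"
            )
    return problems


# Pipeline body as in SET 1 / SET 3: split and foreach share one object.
combined = json.loads("""
{
  "processors": [
    {
      "split": {"field": "dev_environment", "separator": ";", "ignore_missing": true},
      "foreach": {
        "field": "dev_environment",
        "ignore_failure": true,
        "processor": {"trim": {"field": "_ingest._value", "ignore_missing": true}}
      }
    }
  ]
}
""")

# Pipeline body as in SET 4: one processor per object.
separate = json.loads("""
{
  "processors": [
    {"split": {"field": "dev_environment", "separator": ";", "ignore_missing": true}},
    {
      "foreach": {
        "field": "dev_environment",
        "ignore_failure": true,
        "processor": {"trim": {"field": "_ingest._value", "ignore_missing": true}}
      }
    }
  ]
}
""")

print(find_malformed_processors(combined))
print(find_malformed_processors(separate))
```

With a check like this, the SET 1 and SET 3 bodies would be rejected up front instead of being accepted and behaving differently from SET 4.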

@anderssynstad
Author

Worth noting is that the training documentation and examples for "Advanced Elasticsearch: Data Modeling" should be updated as well. Around page 75, if I remember correctly.

@clintongormley
Contributor

> Worth noting is that the training documentation and examples for "Advanced Elasticsearch: Data Modeling" should be updated as well. Around page 75, if I remember correctly.

/cc @pmusa @djschny

@djschny
Contributor

djschny commented Jan 31, 2017

The ignore_failure is a workaround until ignore_missing is available on the foreach processor. See #22147
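
The difference between the two options can be illustrated with a toy model. The sketch below is plain Python with invented names, not the ingest node implementation: ignore_missing is modelled as skipping only when the field is absent, while ignore_failure is modelled as swallowing any error the processor raises, which is how it can hide an unrelated problem.

```python
def trim(doc, field, ignore_missing=False):
    """Toy trim: strips a string field in place."""
    if field not in doc:
        if ignore_missing:
            return  # only the "field is absent" case is skipped
        raise KeyError(f"field [{field}] not present")
    if not isinstance(doc[field], str):
        raise TypeError(f"field [{field}] is not a string")
    doc[field] = doc[field].strip()


def run(processor, doc, ignore_failure=False, **options):
    """Run a toy processor; return True if it completed, False if an error was swallowed."""
    try:
        processor(doc, **options)
        return True
    except Exception:
        if ignore_failure:
            return False  # every kind of failure is hidden, not just a missing field
        raise


empty_doc = {}
wrong_type_doc = {"name": ["a ", " b"]}

# A missing field is skipped cleanly with ignore_missing.
print(run(trim, empty_doc, field="name", ignore_missing=True))

# A genuine error is silently swallowed with ignore_failure; the document is unchanged.
print(run(trim, wrong_type_doc, ignore_failure=True, field="name"))
print(wrong_type_doc)
```

Without ignore_failure, the second call would raise and make the underlying problem visible.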

@clintongormley
Contributor

@djschny this issue is about bad syntax, not about ignore_failure - @anderssynstad thinks that the training slides include that bad syntax.

@djschny
Contributor

djschny commented Jan 31, 2017

I understand that, @clintongormley. I was adding info on why the ignore_failure is in there, hiding the problem that @anderssynstad called out.

The training slides will be updated regardless.

@talevy
Contributor

talevy commented Mar 15, 2018

Looks like this was resolved, with the conclusion that the cause was syntax issues.
