Ingest: Add conditional per processor #32398

original-brownbear · 2018-07-26T13:22:22Z

Adds a conditional if setting to all processors in a pipeline
closes Conditional pipeline processors #21248

elasticmachine · 2018-07-26T13:22:23Z

Pinging @elastic/es-core-infra

original-brownbear · 2018-07-26T13:23:40Z

Still WIP, missing tests and some cleanup. I'd just like to get the ok from everyone that this is what we're looking for.

With this change you can add an if field (of script type) to any processor; the processor only runs when the script returns true.

E.g.:

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "set" : {
          "if": "ctx.foo == 'bar'",
          "field" : "field2",
          "value" : "_value"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    },
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "rab"
      }
    }
  ]
}

->

{
  "docs": [
    {
      "doc": {
        "_index": "index",
        "_type": "_doc",
        "_id": "id",
        "_source": {
          "field2": "_value",
          "foo": "bar"
        },
        "_ingest": {
          "timestamp": "2018-07-26T13:09:23.228451Z"
        }
      }
    },
    {
      "doc": {
        "_index": "index",
        "_type": "_doc",
        "_id": "id",
        "_source": {
          "foo": "rab"
        },
        "_ingest": {
          "timestamp": "2018-07-26T13:09:23.228473Z"
        }
      }
    }
  ]
}

talevy · 2018-07-26T14:53:05Z

@original-brownbear yup. that is what I had in mind! thanks! I was wondering why we don't extend AbstractProcessor to understand conditionals and bake it into the framework instead of conditionally overriding the type to be "conditional", I feel like it should have the conditional support by default, but only invoked if the if exists. what do you think? I understand this means that the script dependencies may be passed into all the factories, but maybe not?

original-brownbear · 2018-07-26T15:17:58Z

@talevy

I was wondering why we don't extend AbstractProcessor to understand conditionals and bake it into the framework instead

Yea that sounds nice in theory (though just in terms of code quality performance wise it probably doesn't matter in any way, either way you'll have the same number of interface(ish) calls in there).
In practice, see below :(

I understand this means that the script dependencies may be passed into all the factories, but maybe not?

Yea this, but worse yet, you'd have to change the Processor API in one way or another. Either you add a top-level conditionallyExecute that invokes the execute method if the script returns true and change all the high-level calls in the composite/foreach etc. processors to invoke that, or you literally change every implementation to check the conditional via some provided API => huge changes for some aesthetic benefit at best imo.
Also, you'd probably have to change literally every processor factory (you gotta extract the if script field somehow and pass the script to the abstract class still, so that would make this even less practical).

=> I think wrapping the processor as the implementation is probably the smallest/safest change we're going to get here.
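
The wrapping approach described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: SketchProcessor, ConditionalSketchProcessor and ConditionalSketchDemo are hypothetical names, a plain Map stands in for IngestDocument, and a java.util.function.Predicate stands in for the compiled script.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for the ingest Processor interface (not the real Elasticsearch API).
interface SketchProcessor {
    void execute(Map<String, Object> doc);
}

// Wraps any processor and only delegates to it when the condition holds.
// Existing processors and their factories stay untouched.
final class ConditionalSketchProcessor implements SketchProcessor {
    private final Predicate<Map<String, Object>> condition;
    private final SketchProcessor inner;

    ConditionalSketchProcessor(Predicate<Map<String, Object>> condition, SketchProcessor inner) {
        this.condition = condition;
        this.inner = inner;
    }

    @Override
    public void execute(Map<String, Object> doc) {
        if (condition.test(doc)) {
            inner.execute(doc);
        }
    }
}

final class ConditionalSketchDemo {
    // Mirrors the simulate example: set field2 only when foo == 'bar'.
    static Map<String, Object> run(String fooValue) {
        SketchProcessor set = doc -> doc.put("field2", "_value");
        SketchProcessor conditionalSet =
            new ConditionalSketchProcessor(doc -> "bar".equals(doc.get("foo")), set);
        Map<String, Object> doc = new HashMap<>();
        doc.put("foo", fooValue);
        conditionalSet.execute(doc);
        return doc;
    }
}
```

Calling ConditionalSketchDemo.run("bar") yields a document with field2 set, while run("rab") leaves the document unchanged, matching the two simulated docs in the example request.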

talevy · 2018-07-26T16:20:44Z

though just in terms of code quality performance wise it probably doesn't matter in any way

true, I didn't even think it would 😄

OK, this sounds good. One thing we should double-check is that when an exception occurs in either the inner processor or the conditional (if), the correct processor type is shown, since both the wrapping conditional and the inner processor share the same tag.

rjernst · 2018-07-27T07:47:44Z

The general approach here seems ok. My main concern before doing a deeper review is that the object passed into the conditional script needs to be immutable. I think we should look at what variables are available too, as ctx only exists for legacy reasons (because that is what update scripts exposed when ingest was added).

original-brownbear · 2018-07-27T07:52:29Z

@rjernst why don't we just expose a new kind of interface there instead of ctx, we could call it doc or whatever and only make a method .get(String Path) available on it for lookup (we could make that method wrap collections/maps as discussed ... those structures won't really be used much anyway and people will rather look up scalars directly by path so the performance hit probably is irrelevant)?
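
To make the idea of a lookup-only interface concrete, here is a minimal sketch. Everything in it is an assumption for illustration: ReadOnlyDoc is a hypothetical name, the dot-separated path syntax is assumed, and returned collections are only wrapped one level deep.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical read-only view over a document's source; not an Elasticsearch class.
final class ReadOnlyDoc {
    private final Map<String, Object> source;

    ReadOnlyDoc(Map<String, Object> source) {
        this.source = source;
    }

    // Looks up a dot-separated path such as "user.name".
    // Returns null if any step along the path is missing or not a map.
    Object get(String path) {
        Object current = source;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(part);
        }
        // Shallow wrap only: values nested inside the returned collection
        // would need the same treatment to be fully read-only.
        if (current instanceof Map) {
            return Collections.unmodifiableMap((Map<?, ?>) current);
        }
        if (current instanceof List) {
            return Collections.unmodifiableList((List<?>) current);
        }
        return current;
    }
}
```

Scalars are returned as-is since strings and boxed numbers are already immutable, which is the common case the comment above anticipates.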

jakelandis · 2018-07-27T15:44:52Z

I like the 'if' syntax and general direction.

I share @talevy's concern over error messaging. For example, today, if you change your example from "ctx.foo == 'bar'" to "ctx.foo = 'bar'", the following error occurs:

"error": {
        "root_cause": [
          {
            "type": "exception",
            "reason": "java.lang.IllegalArgumentException: ScriptException[runtime error]; nested: ClassCastException[java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean];",
            "header": {
              "processor_type": "conditional"
            }
          }
        ],
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ScriptException[runtime error]; nested: ClassCastException[java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean];",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "ScriptException[runtime error]; nested: ClassCastException[java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean];",
          "caused_by": {
            "type": "script_exception",
            "reason": "runtime error",
            "script_stack": [
              "ctx.foo = 'bar'",
              "^---- HERE"
            ],
            "script": "ctx.foo = 'bar'",
            "lang": "painless",
            "caused_by": {
              "type": "class_cast_exception",
              "reason": "java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean"
            }
          }
        },
        "header": {
          "processor_type": "conditional"
        }
      }
    }

It's a bit misleading where the error is happening due to the conditional processor type. However, adding a tag helps a lot, so maybe it's a non-issue.

...
"header": {
              "processor_type": "conditional",
              "processor_tag": "my_set"
            }

If we add per processor metrics, there would be a similar concern to accurately represent the combination of the inner processor and outer processor in the same metric.

I am a bit concerned by allowing arbitrary processing inside the if condition, meaning the if condition can be much more than just a single true/false expression evaluation. When you combine this change with the ability to call other processors from scripting, it can compound this concern. For example, you could call grok from inside an if and then make your true/false decision based on that. I am not sure if arbitrary processing inside the if is a good or bad thing.

If we don't want to allow that kind of arbitrary processing, could we implement a much simpler DSL that only allows single-expression boolean evaluations? Perhaps a custom subset of painless ?

Also, since we are using scripting for the true/false evaluation, would we also support alternative scripting languages ?

original-brownbear · 2018-07-27T18:49:53Z

@jakelandis

It's a bit misleading where the error is happening due to the conditional processor type. However, adding a tag helps a lot, so maybe it's a non-issue.

I think this isn't an issue; afaik the bucket selector aggregation scripts behave the same way.

Also, since we are using scripting for the true/false evaluation, would we also support alternative scripting languages ?

We can also use the other languages here. Expressions in particular can be made to work like in the bucket selector aggregation scripts: return 1.0 for true and count all other return values as false.
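
As a small illustration of that convention (a sketch only; ExpressionConditionSketch and asBoolean are hypothetical names, not part of Elasticsearch):

```java
// Hypothetical helper, not an Elasticsearch API: maps the numeric result of an
// expression-style script onto a boolean condition.
final class ExpressionConditionSketch {
    // Only an exact 1.0 counts as true; every other value (0.0, 2.5, NaN) is false.
    static boolean asBoolean(double scriptResult) {
        return scriptResult == 1.0;
    }
}
```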

I am not sure if arbitrary processing inside the if is a good or bad thing.

I'm a little torn here too. Obviously something like the LS config "language" has a much lower barrier to entry and is really short to write. Then again ... we already have the ES scripting, it's super powerful and existing users are probably proficient in it to a degree too.
IMO it may be worth exploring adding a simpler conditional language (if not outright the LS one) as an enhancement to this down the road (downside obviously would be that it's yet another thing to support).

That said, one strong argument I would have for "this is a good thing" is that when looking at LS you often see the pattern of:

Run grok on an event and make it set some field depending on a complex regex
Check if that field was set to a specific value in another conditional
Run another grok action setting some other field
And then another conditional checking that field

... those cases would become a lot simpler to implement on the user side with the flexibility of Painless.

Perhaps a custom subset of painless ?

As far as I understand the code this would be very hard. Also, it becomes a very strange situation where you're essentially just arbitrarily taking away flexibility from the user (without making our lives easier in return, if anything constraining what subset of Painless or the other languages we're supporting is more effort than just outright using them).
I'd just address this (if we want to address it even) by maybe suggesting to users to keep things simple in these conditionals (though I don't really see a good argument to put behind that other than aesthetics).

@rjernst what do you think about moving forward with wrapping the ctx (we can rename it next week if you want :)) with an immutable map for now to make some progress here?
Maybe it's easiest to not deviate too much from the script processor here anyway? The more I think about it, the less I like the idea of the conditional not receiving ctx the way a script processor does. Having the if script field work in a totally different way (even though its JSON schema is the same and you'd probably even have both of them on the same screen next to each other when coding things up) is really weird ergonomics imo.

rjernst · 2018-07-30T23:55:30Z

what do you think about moving forward with wrapping the ctx

IMO we should think about what we want the variables to be for the ingest script context long term. If ctx is what we want, then keeping it in conditional ingest scripts is fine. But if we think another script signature would work better long term, then the new conditional script should match the future signature we want. This will make it easier to migrate to the new signature, rather than adding new uses of ctx that will make the ability to deprecate and change to something else more painful.

original-brownbear · 2018-07-31T04:49:56Z

@rjernst

IMO we should think about what we want the variables to be for the ingest script context long term. If ctx is what we want, then keeping it in conditional ingest scripts is fine.

fair point :) IMO, with the way we are approaching the calling of processor from scripts as of right now (having static methods that do what the processors do as opposed to setting up and invoking actual Processor instances like in my POC #32043) ctx (the name is weird and I'd like doc or so better :P but w/e) being a nested map is an ok approach, especially with painless allowing for the map.key lookup syntax here. To me the using of IngestDocument as an input really only made sense if we cared about passing the actual IngestDocument along to other processors. If we don't want to do that, it just complicates the syntax in scripts I guess.
... but we should/could probably have that discussion elsewhere :)

original-brownbear · 2018-08-20T15:40:10Z

@rjernst added the lazy wrapping (urgh, that's quite a bit of code to handle all the cases that we have to look out for because they could leak a mutable view of the Map :)).

Can you take a quick look if that's an approach you're ok with? If you're ok with it then I think we only need some tests for the lazy wrapping (+ whatever you may find) here :)

jakelandis · 2018-08-23T21:23:55Z

@original-brownbear - apologies for the late response... I just made this correlation.

IMO it may be worth exploring adding a simpler conditional language

Beats already has a form of this implemented: https://www.elastic.co/guide/en/beats/filebeat/master/defining-processors.html Is it worth exploring emulating that syntax (in JSON of course)? It removes concerns (and features) of arbitrary processing and eliminates the need for immutable wrappers while providing a familiar DSL.

original-brownbear · 2018-08-24T06:45:05Z

@jakelandis

I (~50%) agree it may be worth exploring a simpler language :)

The upsides definitely are:

- Having an easy DSL for conditionals worked well for LS too (and probably will for Beats as well)
  - I think product wise I like an easy DSL a lot
- It removes the complication of having to deal with mutability
  - Though I'm not so sure how troubling this is with the map wrapping added here, it's slightly annoying code, but the performance hit shouldn't be so bad (or even visible) thanks to escaping, I think.

The downsides are:

- Yet another "language" to maintain and support for years to come
- This would be kind of inconsistent with the bucket selector scripting that is effectively a conditional too https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-bucket-selector-aggregation.html#_syntax_12 wouldn't it?

rjernst · 2018-08-27T18:44:49Z

I don't think we should have another language. This is a huge burden to maintain. We have been trying very hard to reduce the number of languages we must support within elasticsearch. For example, we got rid of groovy, python, and javascript scripting languages, and we have been working towards making painless on par performance-wise with expressions so we can remove that too.

IMO the overhead of wrapping these objects to make them immutable should be small, and the consistency of interacting with documents in the same way across conditionals or general script processors is hugely beneficial.

original-brownbear · 2018-08-27T20:24:07Z

@rjernst sweet, so add tests + be happy with this? :)

rjernst

Thanks @original-brownbear. Tests for the immutability would be good. I left a few more comments as well.

rjernst · 2018-08-27T21:06:16Z

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/ConditionalProcessor.java

+    @Override
+    public void execute(IngestDocument ingestDocument) throws Exception {
+        if (scriptService.compile(condition, ProcessorConditionalScript.CONTEXT)
+            .newInstance(condition.getParams()).execute(new UnmodifiableIngestData(ingestDocument.getSourceAndMetadata()))) {


break this long line into compiling the script in a separate line from executing it?

Sure will do :)

rjernst · 2018-08-27T21:12:14Z

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/IngestCommonPlugin.java

@@ -82,6 +82,7 @@ public IngestCommonPlugin() {
        processors.put(KeyValueProcessor.TYPE, new KeyValueProcessor.Factory());
        processors.put(URLDecodeProcessor.TYPE, new URLDecodeProcessor.Factory());
        processors.put(BytesProcessor.TYPE, new BytesProcessor.Factory());
+        processors.put(ConditionalProcessor.TYPE, new ConditionalProcessor.Factory(parameters.scriptService));


I thought we were only exposing the conditional via each processor as if, not as a processor on its own?

See #32398 (comment) and discussion leading up to it. This is just an implementation to prevent us from having to adjust every processor.

Sorry I don't understand. Can you reiterate? I don't understand why using it from parsing if requires it be generally available. As it is here, it would be available to construct directly by a user, right?

@rjernst

Sure, let me try to make it short:
In order to get the conditional we have to either add a conditional evaluation to every execute method for every processor or make some abstract parent have an execute implementation that holds the conditional evaluation. (I mean there's other options, but I think the two mentioned are the simplest and others will be even more noisy/risky/...)
Either option requires us to change all existing processors and also their factories (as a result of us parsing the configuration in each processor factory).
=> I went for this implementation since it was shortest and didn't have any functional/performance downsides over alternatives anyway (as far as I can see).

As it is here, it would be available to construct directly by a user, right?

Yea true, if you think that's a problem I can prevent that in the parser easily though :) Should I?

Yes I think we should prevent that. But I'm still not understanding why it is necessary for this processor to have a factory or be registered. It can be constructed completely locally, and directly via its ctor, within ConfigurationUtils.

@rjernst that's more of a convenience thing because org.elasticsearch.ingest.ConfigurationUtils#readProcessor(java.util.Map<java.lang.String,org.elasticsearch.ingest.Processor.Factory>, java.lang.String, java.lang.Object) (which is called from like 3 places in prod. code) would have to also get the script factory as an input then (which will trigger a pretty big change if you factor in test code).

Just tried to keep this less noisy again :) => bad idea?, better to add the script factory as an input here?

With what I'm suggesting, the script factory is not needed. I don't think the ConditionalProcessor should be constructed from the script factory at all. Instead, have maybeWrapConditional or something like that which takes the processor we have already constructed, and creates/wraps it with a conditional processor (or returns the input if there is no conditional). There is no reason to have a factory for the conditional processor, just parse the config directly in that method, and construct there based on the config.
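
A rough sketch of what such a helper might look like, under stated assumptions: WrappableProcessor and ConditionalWrapping are hypothetical names, a plain Map stands in for both the processor config and the document, and the compiler function is a stand-in for whatever script compilation the real code would use.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-in for the ingest Processor interface.
interface WrappableProcessor {
    void execute(Map<String, Object> doc);
}

final class ConditionalWrapping {
    // Takes an already constructed processor and wraps it only when the config
    // carries an "if" entry; otherwise the input processor is returned unchanged.
    // No factory or registration is involved.
    static WrappableProcessor maybeWrapConditional(
            Map<String, Object> config,
            WrappableProcessor processor,
            Function<String, Predicate<Map<String, Object>>> compiler) {
        // Consume the key so it does not linger in the processor config.
        Object source = config.remove("if");
        if (source == null) {
            return processor;
        }
        Predicate<Map<String, Object>> condition = compiler.apply(source.toString());
        return doc -> {
            if (condition.test(doc)) {
                processor.execute(doc);
            }
        };
    }
}
```

The helper still needs something that can compile the condition, which is why the discussion that follows turns to passing the script factory into ConfigurationUtils.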

@rjernst you will still need the ScriptFactory as a method input to org.elasticsearch.ingest.ConfigurationUtils#readProcessorConfigs in some form or another won't you?

I 100% agree, that the current solution looks kind of convoluted :), but it was the only way I saw out of blowing up the change-set with passing the ScriptFactory into the methods inside ConfigurationUtils.
Looking at your discussion so far it seems it's probably preferable to go with the bigger changeset + cleaner solution I guess? :)
Sorry again about the confusion I caused here :)

Yeah, you are right, that method would have to take ScriptFactory. But I think this makes sense? And I think the signature changes should be mostly minimal? Maybe I am underestimating the extent, but I think it is fairly well contained to a handful of methods.

When I first tried it, it looked like a fairly massive change :) (but this is also really subjective).
I'll just code it up today and we can take a look, the result is def. nicer than what we have here ... so probably worth it

rjernst · 2018-08-27T21:12:54Z

server/src/main/java/org/elasticsearch/script/ProcessorConditionalScript.java

+/**
+ * A script used by the Ingest Script Processor.
+ */
+public abstract class ProcessorConditionalScript {


Can we call this IngestConditionalScript?

rjernst · 2018-08-27T21:16:39Z

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/ConditionalProcessor.java

+    }
+
+    private static Object wrapUnmodifiable(Object raw) {
+        if (raw instanceof Map) {


Can you add a comment here that these types must match what the json parser can create, and that anything not handled here must be immutable already (eg boxed numerics)?
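
For illustration, a simplified version of such a recursive wrapper with the requested comment might look like the sketch below. Note the assumptions: this version copies eagerly instead of wrapping lazily as the PR does, it only handles Map and List, and UnmodifiableSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UnmodifiableSketch {
    // The container types handled here must match what the JSON parser can create.
    // Anything not handled (strings, boxed numerics, booleans, null) is assumed to
    // be immutable already and is returned as-is.
    static Object wrapUnmodifiable(Object raw) {
        if (raw instanceof Map) {
            Map<Object, Object> copy = new LinkedHashMap<>();
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) raw).entrySet()) {
                copy.put(entry.getKey(), wrapUnmodifiable(entry.getValue()));
            }
            return Collections.unmodifiableMap(copy);
        }
        if (raw instanceof List) {
            List<Object> copy = new ArrayList<>();
            for (Object element : (List<?>) raw) {
                copy.add(wrapUnmodifiable(element));
            }
            return Collections.unmodifiableList(copy);
        }
        return raw;
    }
}
```

Because nested containers are wrapped recursively, a script cannot reach a mutable list or map by drilling into the wrapped document; any write attempt throws UnsupportedOperationException.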

original-brownbear · 2018-08-29T12:58:30Z

@rjernst alright, handled this by passing down the ScriptFactory now and removing the magic around a conditional processor factory and rewriting the config maps :)
Also added some tests around the immutability of the data passed in.
Should be good for another review :)

rjernst

Thanks @original-brownbear! LGTM

rjernst · 2018-08-29T19:39:02Z

server/src/main/java/org/elasticsearch/ingest/ConfigurationUtils.java

            } catch (Exception e) {
                throw newConfigurationException(type, tag, null, e);
            }
        }
        throw newConfigurationException(type, tag, null, "No processor type exists with name [" + type + "]");
    }
+
+    private static Script maybeExtractConditional(Map<String, Object> config) throws IOException {


since you are returning the script, I think this can just be called extractConditional

rjernst · 2018-08-29T19:39:13Z

server/src/main/java/org/elasticsearch/ingest/ConfigurationUtils.java

+                     LoggingDeprecationHandler.INSTANCE, stream)) {
+                return Script.parse(parser);
+            }
+        } else {


No need for an else, it can just be outside the if

original-brownbear · 2018-08-29T20:15:46Z

@rjernst thanks! Will merge once green :)

* Ingest: Add conditional per processor * closes #21248

Ingest: Add conditional per processor

9b6657d

original-brownbear added >enhancement, WIP, :Data Management/Ingest Node, v7.0.0, v6.5.0 labels Jul 26, 2018

original-brownbear requested review from rjernst, talevy and jakelandis July 26, 2018 13:22

original-brownbear added 9 commits August 19, 2018 22:02

Merge remote-tracking branch 'elastic/master' into 21248

e98feb8

deep copy data instead of lazy wrap

8f06954

start tests

713f604

Merge remote-tracking branch 'elastic/master' into 21248

8090a40

start tests

f898188

add tests

3884159

unmodifiable wrappers

fc5d078

unmodifiable wrappers

288899a

unmodifiable wrappers

fe2d826

rjernst reviewed Aug 27, 2018

original-brownbear added 9 commits August 27, 2018 23:34

Merge remote-tracking branch 'elastic/master' into 21248

2b328ff

CR: renamings

9d64863

Merge remote-tracking branch 'elastic/master' into 21248

51deb46

Merge remote-tracking branch 'elastic/master' into 21248

b63e03c

Merge remote-tracking branch 'elastic/master' into 21248

8618ac0

Merge remote-tracking branch 'elastic/master' into 21248

c22ea2e

remove conditional processor factory

0f88489

Add some tests for immutability

881d6c4

fix javadoc

e2f7936

original-brownbear removed the WIP label Aug 29, 2018

jakelandis mentioned this pull request Aug 29, 2018

ingest: Documentation improvements for ingest node #33188

Closed

11 tasks

rjernst approved these changes Aug 29, 2018

original-brownbear added 2 commits August 29, 2018 22:09

Merge remote-tracking branch 'elastic/master' into 21248

cda2bad

CR: Rename + unwrap else

69100ba

original-brownbear merged commit cc4d705 into elastic:master Aug 30, 2018

original-brownbear deleted the 21248 branch August 30, 2018 01:46

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Sep 4, 2018

Ingest: Add conditional per processor (elastic#32398)

ebf1e90

* Ingest: Add conditional per processor * closes elastic#21248

original-brownbear added a commit that referenced this pull request Sep 4, 2018

Ingest: Add conditional per processor (#32398) (#33380)

d7655cc

* Ingest: Add conditional per processor * closes #21248

jakelandis mentioned this pull request Sep 4, 2018

[ingest] Per processor metrics #33387

Closed

ycombinator mentioned this pull request Sep 25, 2018

ES slowlog module improvements elastic/beats#8416

Closed

Mpdreamz mentioned this pull request Dec 13, 2018

[meta] 6.5.0 Release elastic/elasticsearch-net#3457

Closed

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
