Skip to content

Ingest: Add conditional per processor #32398

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Aug 30, 2018

Conversation

original-brownbear
Copy link
Contributor

@original-brownbear original-brownbear commented Jul 26, 2018

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@original-brownbear
Copy link
Contributor Author

Still WIP, missing tests and some cleanup. I'd just like to get the ok from everyone that this is what we're looking for.

With this one you can now add a field if (script type) to any processor and the boolean return of the script will be used.

E.g.:

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "set" : {
          "if": "ctx.foo == 'bar'",
          "field" : "field2",
          "value" : "_value"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    },
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "rab"
      }
    }
  ]
}

->

{
  "docs": [
    {
      "doc": {
        "_index": "index",
        "_type": "_doc",
        "_id": "id",
        "_source": {
          "field2": "_value",
          "foo": "bar"
        },
        "_ingest": {
          "timestamp": "2018-07-26T13:09:23.228451Z"
        }
      }
    },
    {
      "doc": {
        "_index": "index",
        "_type": "_doc",
        "_id": "id",
        "_source": {
          "foo": "rab"
        },
        "_ingest": {
          "timestamp": "2018-07-26T13:09:23.228473Z"
        }
      }
    }
  ]
}

@talevy
Copy link
Contributor

talevy commented Jul 26, 2018

@original-brownbear yup. that is what I had in mind! thanks! I was wondering why we don't extend AbstractProcessor to understand conditionals and bake it into the framework instead of conditionally overriding the type to be "conditional", I feel like it should have the conditional support by default, but only invoked if the if exists. what do you think? I understand this means that the script dependencies may be passed into all the factories, but maybe not?

@original-brownbear
Copy link
Contributor Author

@talevy

I was wondering why we don't extend AbstractProcessor to understand conditionals and bake it into the framework instead

Yea that sounds nice in theory (though just in terms of code quality performance wise it probably doesn't matter in any way, either way you'll have the same number of interface(ish) calls in there).
In practice, see below :(

I understand this means that the script dependencies may be passed into all the factories, but maybe not?

Yea this, but worse yet, you'd have to change the Processor API in one way or another. Either you'll add a top level conditionallyExecute that invokes executemethod if the script returns true and change all the high level calls in composite/foreach etc. processor to invoke that or you'll have to literally change every implementation to check the conditional by some provided API => huge changes for some aesthetic benefit at best imo.
Also, you'd probably have to change literally every processor factory (you gotta extract the if script field somehow and pass the script to the abstract class still, so that would make this even less practical).

=> I think wrapping the processor as the implementation is probably the smallest/safest change we're going to get here.

@talevy
Copy link
Contributor

talevy commented Jul 26, 2018

though just in terms of code quality performance wise it probably doesn't matter in any way

true, I didn't even think it would 😄

OK, this sounds good. One thing we should double-check is that when exceptions occurs in either the internal or conditional (if), the correct processor type is shown, since both the wrapping conditional and the inner processor share the same tag.

@rjernst
Copy link
Member

rjernst commented Jul 27, 2018

The general approach here seems ok. My main concern before doing a deeper review is the object passed into the conditional script needs to be immutable. I think we should look at what variables are available too, as ctx only exists for legacy reasons (because that is what update scripts exposed when ingest was added).

@original-brownbear
Copy link
Contributor Author

@rjernst why don't we just expose a new kind of interface there instead of ctx, we could call it doc or whatever and only make a method .get(String Path) available on it for lookup (we could make that method wrap collections/maps as discussed ... those structures won't really be used much anyway and people will rather look up scalars directly by path so the performance hit probably is irrelevant)?

@jakelandis
Copy link
Contributor

I like the 'if' syntax and general direction.

I share @talevy 's concern over handling messaging. For example today, if you change your example from "ctx.foo == 'bar'" to `"ctx.foo = 'bar'" The following error occurs:

"error": {
        "root_cause": [
          {
            "type": "exception",
            "reason": "java.lang.IllegalArgumentException: ScriptException[runtime error]; nested: ClassCastException[java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean];",
            "header": {
              "processor_type": "conditional"
            }
          }
        ],
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: ScriptException[runtime error]; nested: ClassCastException[java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean];",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "ScriptException[runtime error]; nested: ClassCastException[java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean];",
          "caused_by": {
            "type": "script_exception",
            "reason": "runtime error",
            "script_stack": [
              "ctx.foo = 'bar'",
              "^---- HERE"
            ],
            "script": "ctx.foo = 'bar'",
            "lang": "painless",
            "caused_by": {
              "type": "class_cast_exception",
              "reason": "java.base/java.lang.String cannot be cast to java.base/java.lang.Boolean"
            }
          }
        },
        "header": {
          "processor_type": "conditional"
        }
      }
    }

It's a bit misleading where the error is happening due to the conditional processor type. However, adding a tag helps alot, so maybe it's a non issue.

...
"header": {
              "processor_type": "conditional",
              "processor_tag": "my_set"
            }

If we add per processor metrics, there would be a similar concern to accurately represent the combination of the inner processor and outer processor in the same metric.

I am abit concerned by allowing arbitrary processing inside the if condition. Meaning the if condition can be much more then just a true/false single expression evaluation. When you combine this change with the ability to call other processors from scripting it can compound this concern ? For example, you could call grok from inside an if and then make your true/false decision based on that. I am not sure if arbitrary processing inside the if is a good or bad thing.

If we don't want to allow that kind of arbitrary processing , could implement a much a simpler dsl that only allows single expression boolean evaluations ? Perhaps a custom subset of painless ?

Also, since we are using scripting for the true/false evaluation, would we also support alternative scripting languages ?

@original-brownbear
Copy link
Contributor Author

@jakelandis

It's a bit misleading where the error is happening due to the conditional processor type. However, adding a tag helps alot, so maybe it's a non issue.

I think this isn't an issue afaik the bucket selector aggregation scripts will behave the same way.

Also, since we are using scripting for the true/false evaluation, would we also support alternative scripting languages ?

We can also use the other languages here, expression in particular you can just make work like in the bucket aggregation selector scripts and make it return 1.0 for true and count all other returns as false.

I am not sure if arbitrary processing inside the if is a good or bad thing.

I'm a little torn here too. Obviously something like the LS config "language" has a much lower barrier to entry and is really short to write. Then again ... we already have the ES scripting, it's super powerful and existing users are probably proficient in it to a degree too.
IMO it may be worth exploring adding a simpler conditional language (if not outright the LS one) as an enhancement to this down the road (downside obviously would be that it's yet another thing to support).

That said, one strong argument I would have for "this is a good thing" is that when looking at LS you often see the pattern of:

  1. Run grok on an event and make it set some field depending on a complex regex
  2. Check if that field was set to a specific value in another a conditional
  3. Run another grok action setting some other field
  4. And then another conditional checking that field

... those cases would become a lot simpler to implement on the user side with the flexibility of Painless.

Perhaps a custom subset of painless ?

As far as I understand the code this would be very hard. Also, it becomes a very strange situation where you're essentially just arbitrarily taking away flexibility from the user (without making our lives easier in return, if anything constraining what subset of Painless or the other languages we're supporting is more effort than just outright using them).
I'd just address this (if we want to address it even) by maybe suggesting to users to keep things simple in these conditionals (though I don't really see a good argument to put behind that other than aesthetics).


@rjernst what do you think about moving forward with wrapping the ctx (we can rename it next week if you want :)) with an immutable map for now to make some progress here?
Maybe it's easiest to not deviate too much from the script processor here anyway? The more I think about it the less I like the idea of not passing ctx that works in one way to a script processor to the conditional..Having the if script field work in a totally different way (even though it's JSON schema is the same and you'd probably even have the both of them on the same screen next to each other when coding things up) is is really weird ergonomics imo.

@rjernst
Copy link
Member

rjernst commented Jul 30, 2018

what do you think about moving forward with wrapping the ctx

IMO we should think about what we want the variables to be for the ingest script context long term. If ctx is what we want, then keeping it in conditional ingest scripts is fine. But if we think another script signature would work better long term, then the new conditional script should match the future signature we want. This will make it easier to migrate to the new signature, rather than adding new uses of ctx that will make the ability to deprecate and change to something else more painful.

@original-brownbear
Copy link
Contributor Author

@rjernst

IMO we should think about what we want the variables to be for the ingest script context long term. If ctx is what we want, then keeping it in conditional ingest scripts is fine.

fair point :) IMO, with the way we are approaching the calling of processor from scripts as of right now (having static methods that do what the processors do as opposed to setting up and invoking actual Processor instances like in my POC #32043) ctx (the name is weird and I'd like doc or so better :P but w/e) being a nested map is an ok approach, especially with painless allowing for the map.key lookup syntax here. To me the using of IngestDocument as an input really only made sense if we cared about passing the actual IngestDocument along to other processors. If we don't want to do that, it just complicates the syntax in scripts I guess.
... but we should/could probably have that discussion elsewhere :)

@original-brownbear
Copy link
Contributor Author

@rjernst added the lazy wrapping (urgh, that's quite a bit of code to handle all the cases that we have to look out for because they could leak a mutable view of the Map :)).

Can you take a quick look if that's an approach you're ok with? If you're ok with it then I think we only need some tests for the lazy wrapping (+ whatever your my find) here :)

@jakelandis
Copy link
Contributor

@original-brownbear - apologizes for the late response... I just made this correlation.

IMO it may be worth exploring adding a simpler conditional language

Beats already has a form of this implemented: https://www.elastic.co/guide/en/beats/filebeat/master/defining-processors.html Is it worth exploring emulating that syntax (in Json of course) ? It removes concerns (and features) of arbitrary processing and eliminates the need for immutable wrappers while providing a familiar dsl.

@original-brownbear
Copy link
Contributor Author

@jakelandis

I (~50%) agree it may be worth exploring a simpler language :)

The upsides definitely are:

  • Having an easy DSL for conditionals worked well for LS too (and probably will for Beats as well)
    • I think product wise I like an easy DSL a lot
  • It removes the complication of having to deal with mutability
    • Though I'm not so sure how troubling this is with the map wrapping added here, it's slightly annoying code, but the performance hit shouldn't be so bad (or even visible) thanks to escaping, I think.

The downsides are:

@rjernst
Copy link
Member

rjernst commented Aug 27, 2018

I don't think we should have another language. This is a huge burden to maintain. We have been trying very hard to reduce the number of languages we must support within elasticsearch. For example, we got rid of groovy, python, and javascript scripting languages, and we have been working towards making expressions on par performance-wise with expressions so we can remove that too.

IMO the overhead of wrapping these objects to make them immutable should be small, and the consistency of interacting with documents in the same way across conditionals or general script processors is hugely beneficial.

@original-brownbear
Copy link
Contributor Author

@rjernst sweet, so add tests + be happy with this? :)

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @original-brownbear. Tests for the immutability would be good. I left a few more comments as well.

@Override
public void execute(IngestDocument ingestDocument) throws Exception {
if (scriptService.compile(condition, ProcessorConditionalScript.CONTEXT)
.newInstance(condition.getParams()).execute(new UnmodifiableIngestData(ingestDocument.getSourceAndMetadata()))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break this long line into compiling the script in a separate line from executing it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure will do :)

@@ -82,6 +82,7 @@ public IngestCommonPlugin() {
processors.put(KeyValueProcessor.TYPE, new KeyValueProcessor.Factory());
processors.put(URLDecodeProcessor.TYPE, new URLDecodeProcessor.Factory());
processors.put(BytesProcessor.TYPE, new BytesProcessor.Factory());
processors.put(ConditionalProcessor.TYPE, new ConditionalProcessor.Factory(parameters.scriptService));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were only exposing the conditional via each processor as if, not as a processor on its own?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #32398 (comment) and discussion leading up to it. This is just an implementation to prevent us from having to adjust every processor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I don't understand. Can you reiterate? I don't understand why using it from parsing if requires it be generally available. As it is here, it would be available to construct directly by a user, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjernst

Sure, let me try to make it short:
In order to get the conditional we have to either add a conditional evaluation to every execute method for every processor or make some abstract parent have an execute implementation that holds the conditional evaluation. (I mean there's other options, but I think the two mentioned are the simplest and others will be even more noisy/risky/...)
Either option requires us to change all existing processors and also their factories (as a result of us parsing the configuration in each processor factory).
=> I went for this implementation since it was shortest and didn't have any functional/performance downsides over alternatives anyway (as far as I can see).

As it is here, it would be available to construct directly by a user, right?

Yea true, if you think that's a problem I can prevent that in the parser easily though :) Should I?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think we should prevent that. But I'm still not understanding why it is necessary for this processor to have a factory or be registered. It can be constructed completely locally, and directly via its ctor, within ConfigurationUtils.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjernst that's more of a convenience thing because org.elasticsearch.ingest.ConfigurationUtils#readProcessor(java.util.Map<java.lang.String,org.elasticsearch.ingest.Processor.Factory>, java.lang.String, java.lang.Object) (which is called from like 3 places in prod. code) would to also get the script factory as an input then (which will trigger a pretty big change if you factor in test code).

Just tried to keep this less noisy again :) => bad idea?, better to add the script factory as an input here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With what I'm suggesting, the script factory is not needed. I don't think the ConditionalProcessor should be constructed from the script factory at all. Instead, have maybeWrapConditional or something like that which takes the processor we have already constructed, and creates/wraps it with a conditional processor (or returns the input if there is no conditional). There is no reason to have a factory for the conditional processor, just parse the config directly in that method, and construct there based on the config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjernst you will still need the ScriptFactory as a method input to org.elasticsearch.ingest.ConfigurationUtils#readProcessorConfigs in some form or another won't you?

I 100% agree, that the current solution looks kind of convoluted :), but it was the only way I saw out of blowing up the change-set with passing the ScriptFactory into the methods inside ConfigurationUtils.
Looking at your discussion so far it seems it's probably preferable to go with the bigger changeset + cleaner solution I guess? :)
Sorry again about the confusion I caused here :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are right, that method would have to take ScriptFactory. But I think this makes sense? And I think the signature changes should be mostly minimal? Maybe I am underestimating the extent, but I think it is fairly well contained to a handful of methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first tried it, it looked like a fairly massive change :) (but this is also really subjective).
I'll just code it up today and we can take a look, the result is def. nicer than what we have here ... so probably worth it

/**
* A script used by the Ingest Script Processor.
*/
public abstract class ProcessorConditionalScript {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this IngestConditionalScript?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure :)

}

private static Object wrapUnmodifiable(Object raw) {
if (raw instanceof Map) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here that these types must match what the json parser can create, and that anything not handled here must be immutable already (eg boxed numerics)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure :)

@original-brownbear
Copy link
Contributor Author

@rjernst alright, handled this by passing down the ScriptFactory now and removing the magic around a conditional processor factory and rewriting the config maps :)
Also added some tests around the immutability of the data passed in.
Should be good for another review :)

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @original-brownbear! LGTM

} catch (Exception e) {
throw newConfigurationException(type, tag, null, e);
}
}
throw newConfigurationException(type, tag, null, "No processor type exists with name [" + type + "]");
}

private static Script maybeExtractConditional(Map<String, Object> config) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since you are returning the script, I think this can just be called extractConditional

LoggingDeprecationHandler.INSTANCE, stream)) {
return Script.parse(parser);
}
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for an else, it can just be outside the if

@original-brownbear
Copy link
Contributor Author

@rjernst thanks! Will merge once green :)

@original-brownbear original-brownbear merged commit cc4d705 into elastic:master Aug 30, 2018
@original-brownbear original-brownbear deleted the 21248 branch August 30, 2018 01:46
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Sep 4, 2018
* Ingest: Add conditional per processor
* closes elastic#21248
original-brownbear added a commit that referenced this pull request Sep 4, 2018
* Ingest: Add conditional per processor
* closes #21248
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement v6.5.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Conditional pipeline processors
6 participants