Support dotted field names in ingest processors #96648

Open
felixbarny opened this issue Jun 7, 2023 · 11 comments
Assignees
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@felixbarny
Member

Elasticsearch supports ingestion of JSON documents that have dotted field names and JSON objects.
In most places, both are treated equivalently:

  • During search and aggregation. That's because even if the ES field type is object, the documents are internally stored as a flat list of key/value pairs. See also the object field type docs.
  • Mustache templates, such as the value option in the set processor.
  • The field API, available for scripting, which is in Beta at the time of writing.

However, the IngestDocument#getFieldValue and IngestDocument#setFieldValue methods don't support accessing dotted field names. This makes it impossible to reference a dotted field name with, for example, the set processor's field option.
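To illustrate the limitation, here's a sketch of a pipeline that fails today when the document stores the source field under a flat dotted key (field names are hypothetical, not from the issue):

```json
{
  "processors": [
    {
      "set": {
        "field": "event_type_copy",
        "copy_from": "event.type"
      }
    }
  ]
}
```

For a document like {"event.type": "web"}, getFieldValue interprets event.type as the nested path event → type and cannot find the flat "event.type" key, so the processor fails even though the field is present.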

This issue proposes enhancing the ingest processors to support reading and writing fields in either dotted or nested field notation. The behavior should be aligned with the field API.

See also

@felixbarny felixbarny added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Jun 7, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Jun 7, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@felixbarny
Member Author

felixbarny commented Jun 7, 2023

@eyalkoren once we have this, we should update the JSON pipeline and remove the dot_expander processor:

{
  "dot_expander": {
    "if": "ctx._tmp_json_message != null",
    "field": "*",
    "override": true
  }
},

This will enable ingestion of documents that can't be expanded because they contain conflicting fields, such as foo and foo.bar.
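Such a conflicting document can be illustrated with a minimal example (values are hypothetical): a flat key collides with an object path of the same name, so dot_expander cannot expand foo.bar into an object under foo because foo already holds a scalar.

```json
{
  "foo": "value-a",
  "foo.bar": "value-b"
}
```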

We'll need to test that the JSON processor with the following settings supports merging objects at the root of the doc with dotted fields from the JSON.

"add_to_root": true,
"add_to_root_conflict_strategy": "merge",

Example doc:

{
  "data_stream": {
    "type": "logs",
    "dataset": "generic",
    "namespace": "default"
  },
  "message": "{\"data_stream.dataset\": \"foo\"}"
}

Expected result:

{
  "data_stream": {
    "type": "logs",
    "dataset": "foo",
    "namespace": "default"
  }
}

@felixbarny
Member Author

I had a look at whether it would be possible to use the WriteField API that underpins the scripting Field API in IngestDocument to get/set values.

The biggest semantic mismatch that I noticed is that while the IngestDocument#get/set methods support array indexing, such as foo.0.bar, the WriteField API does not.

@stu-elastic has it been considered to add support for referencing elements in a List via the WriteField API? If we could add that feature, we might be able to use the WriteField API within the IngestDocument#get/set methods and thereby bring more consistency to the different ways of accessing fields during ingest processing.
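The foo.0.bar semantics under discussion can be sketched in a standalone resolver. This is an illustrative simplification, not the actual IngestDocument or WriteField code: numeric path segments are treated as list indices only when the current container is a List.

```java
import java.util.List;
import java.util.Map;

public class PathResolver {
    // Resolve a dotted path against nested Maps/Lists. A numeric segment
    // is treated as a list index when the current container is a List,
    // mirroring the "foo.0.bar" style supported by IngestDocument#get/set.
    public static Object resolve(Object root, String path) {
        Object current = root;
        for (String segment : path.split("\\.")) {
            if (current instanceof Map<?, ?> map) {
                current = map.get(segment);
            } else if (current instanceof List<?> list) {
                // throws NumberFormatException if the segment isn't numeric
                current = list.get(Integer.parseInt(segment));
            } else {
                return null; // cannot descend into a scalar
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of(
            "foo", List.of(Map.of("bar", "baz"))
        );
        System.out.println(resolve(doc, "foo.0.bar")); // prints "baz"
    }
}
```

Note that because the Map branch is checked first, a map with a numeric string key is still addressed as a key, which is exactly the key-versus-index ambiguity mentioned above.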

@stu-elastic
Contributor

@felixbarny we considered that but, for the initial version, decided the semantic ambiguity of accepting indexes, which could also be keys, didn't make sense.

To clarify the requirements, is there a need to accept specific indices, or would being able to indicate the first index be sufficient? If we had some real scripts that demonstrated the need for index addressing, that would be extremely helpful.

Honestly, I rarely see scripts that use an index above zero; beyond that, they seem to process each entry.

cc: @rjernst

@felixbarny
Member Author

I'm not sure how widely used it is to reference an index; I'd assume it isn't. However, it's a feature that's supported by all processors, for example via the copy_from parameter of the set processor.
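For example, a set processor can address the first element of an array with index notation today (a sketch; field names are hypothetical):

```json
{
  "set": {
    "field": "first_tag",
    "copy_from": "tags.0"
  }
}
```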

So if we want to replace the implementation of IngestDocument#getFieldValue, we need to do that in a non-breaking way. If we used the WriteField implementation within IngestDocument#getFieldValue, it would take away the ability to reference indices, which would be a breaking change. So if we want to unify the field access across the field API and ingest processors, I think we'll need to add that feature to the field API/WriteField.

@felixbarny
Member Author

I played a bit with extending WriteField to support array references. As a side effect, it now also supports resolving nested objects that have a parent with a dotted field name, for example ['foo.bar'].baz with field('foo.bar.baz').

With these changes, it seems feasible to use WriteField as the underlying implementation of the several IngestDocument methods, such as getFieldValue, setFieldValue, hasField, removeField, and appendFieldValue. Some exception messages might slightly change in the process, though. Probably not a big deal.

The question is, do we want that and can we do that? I think it makes a lot of sense to have a common way of accessing fields during ingest processing and scripting. It would be hard to explain to users why some features only exist for ingest processors (like the ability to access array elements) and some only for the fields API (such as being able to access dotted field names).

Assuming we agree on that, I think we need to answer the following questions on whether we can do that change:

  • Is it a breaking change for ingest processors to be able to access dotted fields? The first instinct is no, because it's added functionality and aligns the behavior with Mustache scripts and the fields API. However, the semantics change a bit. For example, the rename processor now throws an exception when trying to rename foo to foo.bar for a document like this: {"foo": "baz1", "foo.bar": "baz2"}, as the path foo.bar already exists. Currently, the processor isn't able to resolve the dotted path, but after the change it would be. There are probably other nuanced examples where things behave slightly differently compared to now. We could run the component tests of https://github.com/elastic/integrations to see if any would be impacted by the change.
  • What's the performance impact of that change? I don't think we can get to a point where this has zero impact on performance but we should ensure it's within reasonable bounds.
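The rename case from the first bullet can be sketched as a minimal reproduction (the processor config is illustrative; the document is the one from the comment above):

```json
{
  "rename": {
    "field": "foo",
    "target_field": "foo.bar"
  }
}
```

Applied to {"foo": "baz1", "foo.bar": "baz2"}: today the flat "foo.bar" key is invisible to path resolution, so the rename succeeds; after the change, the target path would resolve to the existing flat field and the processor would throw.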

Any other things that we need to discuss?

@felixbarny
Member Author

I've created a POC that extends WriteField so that it can access array elements and then used it to power the field access in IngestDocument: #96786.

This should get us closer to answers on the questions of whether this is a breaking change and what the performance impact is.

@masseyke
Member

It looks like JsonProcessor is one that doesn't support dots in field names -- https://discuss.elastic.co/t/elastic-json-processor-error-cannot-add-non-map-fields-to-root-of-document/340845

@felixbarny
Member Author

Seems like it only supports fields with dots right now due to ctx.get(fieldName).

I've added a fix to this PR. The same PR also fixes another problem where the JSON processor struggles with dots in field names.

@jeffatfw

jeffatfw commented Sep 2, 2024

I think I ran into this issue when ingesting JSON from a Custom Logs integration. When I went to create an ingest pipeline processor of type JSON appending to root, I got the non-map-to-root error. Once set to use a target field instead of root, it works.

Also, if I move this logic out of the ingest pipeline and into the Custom Configurations section of the Custom Logs integration, which ingests the JSON with this:

json:
  keys_under_root: true
  add_error_key: true
  message_key: message
  overwrite_keys: true

It works properly and the JSON components are on the document root.

I would prefer this to work in the ingest pipeline, as it logically seems to make more sense there, and I can handle errors better in the pipeline compared to on the agent. It would be great if this continued to be worked on!

@flash1293
Contributor

Reviving this work because I think there is a big need for this kind of functionality.

Instead of trying to fix the behavior of existing constructs, we can improve things for newly created pipelines by adding a new function that safely retrieves fields no matter whether they are stored in nested objects or in a dotted field property:

$('my.nested.field', '')

would be equivalent to accessing ctx['my.nested.field'] and ctx.my.nested.field - in case neither can be found, an empty string is returned. This function also doesn't fail in cases where my exists as an object without a property nested. It's conceptually similar to the lodash get function implicitly executed on the ctx object: https://lodash.com/docs#get
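A sketch of how the proposed function might be used in a script processor (the $ helper is the proposal under discussion, not an existing API; the target field name is hypothetical):

```json
{
  "script": {
    "source": "ctx.dataset = $('my.nested.field', '')"
  }
}
```

This would assign the field's value whether the source document contains {"my": {"nested": {"field": ...}}} or a flat "my.nested.field" key, and fall back to the empty string otherwise.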

What do you think @felixbarny @dakrone ?
