Support dotted field names in ingest processors #96648

Open
felixbarny opened this issue Jun 7, 2023 · 11 comments
Assignees
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@felixbarny
Member

Elasticsearch supports ingestion of JSON documents that have dotted field names and JSON objects.
In most places, both are treated equivalently:

  • During search and aggregation. That's because even if the ES field type is object, the documents are internally stored as a flat list of key/value pairs. See also the object field type docs.
  • Mustache templates, such as the value option in the set processor.
  • The field API, available for scripting, which is in Beta at the time of writing.

However, the IngestDocument#getFieldValue and IngestDocument#setFieldValue methods don't support accessing dotted field names. This makes it impossible to reference a dotted field name with, for example, the set processor's field option.
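To illustrate the limitation, here's a sketch of a pipeline that fails today when the document stores the source field under a flat dotted key (field names are hypothetical, not from the issue):

```json
{
  "processors": [
    {
      "set": {
        "field": "event_type_copy",
        "copy_from": "event.type"
      }
    }
  ]
}
```

For a document like {"event.type": "web"}, getFieldValue interprets event.type as the nested path event → type and cannot find the flat "event.type" key, so the processor fails even though the field is present.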

This issue proposes enhancing the ingest processors to support reading and writing fields in either dotted or nested field notation. The behavior should be aligned with the field API.

See also

@felixbarny felixbarny added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Jun 7, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Jun 7, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@felixbarny
Member Author

felixbarny commented Jun 7, 2023

@eyalkoren once we have this, we should update the JSON pipeline and remove the dot_expander processor:

{
  "dot_expander": {
    "if": "ctx._tmp_json_message != null",
    "field": "*",
    "override": true
  }
},

This will enable ingestion of documents that can't be expanded because they contain conflicting fields, such as foo and foo.bar.
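Such a conflicting document can be illustrated with a minimal example (values are hypothetical): a flat key collides with an object path of the same name, so dot_expander cannot expand foo.bar into an object under foo because foo already holds a scalar.

```json
{
  "foo": "value-a",
  "foo.bar": "value-b"
}
```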

We'll need to test that the JSON processor with the following settings supports merging objects at the root of the doc with dotted fields from the JSON.

"add_to_root": true,
"add_to_root_conflict_strategy": "merge",

Example doc:

{
  "data_stream": {
    "type": "logs",
    "dataset": "generic",
    "namespace": "default"
  },
  "message": "{\"data_stream.dataset\": \"foo\"}"
}

Expected result:

{
  "data_stream": {
    "type": "logs",
    "dataset": "foo",
    "namespace": "default"
  }
}

@felixbarny
Member Author

I had a look at whether it would be possible to use the WriteField API that underpins the scripting Field API in IngestDocument to get/set values.

The biggest semantic mismatch that I noticed is that while the IngestDocument#get/set methods support array indexing, such as foo.0.bar, the WriteField API does not.

@stu-elastic has it been considered to add support for referencing elements in a List via the WriteField API? If we could add that feature, we might be able to use the WriteField API within the IngestDocument#get/set methods and thereby bring more consistency to the different ways of accessing fields during ingest processing.
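The foo.0.bar semantics under discussion can be sketched in a standalone resolver. This is an illustrative simplification, not the actual IngestDocument or WriteField code: numeric path segments are treated as list indices only when the current container is a List.

```java
import java.util.List;
import java.util.Map;

public class PathResolver {
    // Resolve a dotted path against nested Maps/Lists. A numeric segment
    // is treated as a list index when the current container is a List,
    // mirroring the "foo.0.bar" style supported by IngestDocument#get/set.
    public static Object resolve(Object root, String path) {
        Object current = root;
        for (String segment : path.split("\\.")) {
            if (current instanceof Map<?, ?> map) {
                current = map.get(segment);
            } else if (current instanceof List<?> list) {
                // throws NumberFormatException if the segment isn't numeric
                current = list.get(Integer.parseInt(segment));
            } else {
                return null; // cannot descend into a scalar
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of(
            "foo", List.of(Map.of("bar", "baz"))
        );
        System.out.println(resolve(doc, "foo.0.bar")); // prints "baz"
    }
}
```

Note that because the Map branch is checked first, a map with a numeric string key is still addressed as a key, which is exactly the key-versus-index ambiguity mentioned above.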

@stu-elastic
Contributor

@felixbarny we considered that but, for the initial version, decided the semantic ambiguity of accepting indexes, which could also be keys, didn't make sense.

To clarify the requirements, is there a need to accept specific indices, or would being able to indicate the first index be sufficient? If we had some real scripts that demonstrated the need for index addressing, that would be extremely helpful.

Honestly, I rarely see scripts that use an index above zero; beyond that, they seem to process each entry.

cc: @rjernst

@felixbarny
Member Author

I'm not sure how widely used it is to reference an index; I'd assume it isn't. However, it's a feature that's supported by all processors, for example via the copy_from parameter of the set processor.
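For example, a set processor can address the first element of an array with index notation today (a sketch; field names are hypothetical):

```json
{
  "set": {
    "field": "first_tag",
    "copy_from": "tags.0"
  }
}
```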

So if we want to replace the implementation of IngestDocument#getFieldValue, we need to do that in a non-breaking way. If we used the WriteField implementation within IngestDocument#getFieldValue, it would take away the ability to reference indices, which would be a breaking change. So if we want to unify the field access across the field API and ingest processors, I think we'll need to add that feature to the field API/WriteField.

@felixbarny
Member Author

I played a bit with extending WriteField to support array references. As a side effect, it now also supports resolving nested objects that have a parent with a dotted field name, for example ['foo.bar'].baz with field('foo.bar.baz').

With these changes, it seems feasible to use WriteField as the underlying implementation of the several IngestDocument methods, such as getFieldValue, setFieldValue, hasField, removeField, and appendFieldValue. Some exception messages might slightly change in the process, though. Probably not a big deal.

The question is, do we want that and can we do that? I think it makes a lot of sense to have a common way of accessing fields during ingest processing and scripting. It would be hard to explain to users why some features only exist for ingest processors (like the ability to access array elements) and some only for the fields API (such as being able to access dotted field names).

Assuming we agree on that, I think we need to answer the following questions on whether we can do that change:

  • Is it a breaking change for ingest processors to be able to access dotted fields? The first instinct is no, because it's added functionality and aligns the behavior with Mustache scripts and the fields API. However, the semantics change a bit. For example, the rename processor now throws an exception when trying to rename foo to foo.bar for a document like this: {"foo": "baz1", "foo.bar": "baz2"}, as the path foo.bar already exists. Currently, the processor isn't able to resolve the dotted path, but after the change it would be. There are probably other nuanced examples where things behave slightly differently compared to now. We could run the component tests of https://github.com/elastic/integrations to see if any would be impacted by the change.
  • What's the performance impact of that change? I don't think we can get to a point where this has zero impact on performance but we should ensure it's within reasonable bounds.
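The rename case from the first bullet can be sketched as a minimal reproduction (the processor config is illustrative; the document is the one from the comment above):

```json
{
  "rename": {
    "field": "foo",
    "target_field": "foo.bar"
  }
}
```

Applied to {"foo": "baz1", "foo.bar": "baz2"}: today the flat "foo.bar" key is invisible to path resolution, so the rename succeeds; after the change, the target path would resolve to the existing flat field and the processor would throw.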

Any other things that we need to discuss?

@felixbarny
Member Author

I've created a POC that extends WriteField so that it can access array elements and then used it to power the field access in IngestDocument: #96786.

This should get us closer to answers on the questions of whether this is a breaking change and what the performance impact is.

@masseyke
Member

It looks like JsonProcessor is one that doesn't support dots in field names -- https://discuss.elastic.co/t/elastic-json-processor-error-cannot-add-non-map-fields-to-root-of-document/340845

@felixbarny
Member Author

Seems like it only supports fields with dots right now due to ctx.get(fieldName).

I've added a fix to this PR. The same PR also fixes another problem where the JSON processor struggles with dots in field names.

@jeffatfw

jeffatfw commented Sep 2, 2024

I think I ran into this issue when ingesting JSON from a Custom Logs integration. When I went to create an ingest pipeline processor of type JSON appending to root, I got the non-map-to-root error. Once set to use a target field instead of root, it works.

Also, if I move this logic out of the ingest pipeline and into the Custom Configurations section of the Custom Logs integration, which ingests the JSON with this:

json:
  keys_under_root: true
  add_error_key: true
  message_key: message
  overwrite_keys: true

It works properly and the JSON components are on the document root.

I would prefer this to work in the ingest pipeline, as it logically seems to make more sense there, and I can handle errors better in the pipeline compared to on the agent. It would be great if this continued to be worked on!

@flash1293
Contributor

Reviving this work because I think there is a big need for this kind of functionality.

Instead of trying to fix the behavior of existing constructs, we can improve things for newly created pipelines by adding a new function that safely retrieves fields no matter whether they are stored in nested objects or in a dotted field property:

$('my.nested.field', '')

would be equivalent to accessing ctx['my.nested.field'] and ctx.my.nested.field - in case neither can be found, an empty string is returned. This function also doesn't fail in cases where my exists as an object without a property nested. It's conceptually similar to the lodash get function implicitly executed on the ctx object: https://lodash.com/docs#get
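A sketch of how the proposed function might be used in a script processor (the $ helper is the proposal under discussion, not an existing API; the target field name is hypothetical):

```json
{
  "script": {
    "source": "ctx.dataset = $('my.nested.field', '')"
  }
}
```

This would assign the field's value whether the source document contains {"my": {"nested": {"field": ...}}} or a flat "my.nested.field" key, and fall back to the empty string otherwise.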

What do you think @felixbarny @dakrone ?
