Support dotted field names in ingest processors #96648
Pinging @elastic/es-data-management (Team:Data Management) |
@eyalkoren once we have this, we should update the JSON pipeline and remove the dot_expander processor: elasticsearch/x-pack/plugin/core/src/main/resources/logs-json-message-pipeline.json Lines 29 to 35 in e29b9c9
This will enable ingestion of documents that can't be expanded because they contain conflicting fields, such as
We'll need to test that the JSON processor with the following settings supports merging objects at the root of the doc with dotted fields from the JSON.
elasticsearch/x-pack/plugin/core/src/main/resources/logs-json-message-pipeline.json Lines 15 to 16 in e29b9c9
Example doc:
{
"data_stream": {
"type": "logs",
"dataset": "generic",
"namespace": "default"
},
"message": "{\"data_stream.dataset\": \"foo\"}"
}
Expected result:
{
"data_stream": {
"type": "logs",
"dataset": "foo",
"namespace": "default"
}
} |
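The merge described in this comment can be modelled outside Elasticsearch. The following is a minimal Python sketch, not the JSON processor's actual implementation: `merge_dotted` is a hypothetical helper, and dropping `message` after parsing is an assumption made only to mirror the expected result above.

```python
import json
from copy import deepcopy


def merge_dotted(root: dict, parsed: dict) -> dict:
    """Merge `parsed` into a copy of `root`, treating dotted keys as paths.

    Hypothetical helper for illustration; a non-object value found in the
    middle of a path is simply replaced, which a real implementation would
    probably treat as a conflict.
    """
    result = deepcopy(root)
    for key, value in parsed.items():
        parts = key.split(".")
        target = result
        # Walk (or create) the intermediate objects for "a.b.c" -> a -> b.
        for part in parts[:-1]:
            child = target.get(part)
            if not isinstance(child, dict):
                child = {}
                target[part] = child
            target = child
        leaf = parts[-1]
        if isinstance(value, dict) and isinstance(target.get(leaf), dict):
            # Both sides are objects: merge recursively instead of overwriting.
            target[leaf] = merge_dotted(target[leaf], value)
        else:
            target[leaf] = value
    return result


doc = {
    "data_stream": {"type": "logs", "dataset": "generic", "namespace": "default"},
    "message": "{\"data_stream.dataset\": \"foo\"}",
}
# Assumption: the original message field is removed once it has been parsed.
parsed = json.loads(doc.pop("message"))
merged = merge_dotted(doc, parsed)
```

In this model only `data_stream.dataset` is overwritten; the sibling keys `type` and `namespace` survive because the dotted key is resolved into the existing object rather than expanded into a new one.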
I had a look at whether it would be possible to use the
The biggest semantic mismatch that I noticed is that while the IngestDocument#get/set methods support array-indexing, such as
@stu-elastic has it been considered to add support for referencing elements in a List via the |
@felixbarny we considered that but, for the initial version, decided that accepting indexes didn't make sense because of the semantic ambiguity: an index could also be a key. To clarify the requirements, is there a need to accept specific indices, or would being able to indicate the first index be sufficient? If we had some real scripts that demonstrated the need for index addressing, that would be extremely helpful. Honestly, I rarely see scripts that use an index above zero; beyond that, they seem to process each entry. cc: @rjernst |
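The ambiguity mentioned here (a numeric path segment could be a list index or a map key) can be shown with a small Python sketch. `resolve` is a hypothetical function written for this illustration; it is not how IngestDocument or the fields API resolves paths.

```python
def resolve(container, path):
    """Walk `path` segment by segment.

    A numeric segment is treated as a list index only when the current value
    is a list; on a dict the very same segment is an ordinary string key.
    """
    current = container
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]
        else:
            current = current[part]
    return current


with_list = {"tags": ["a", "b"]}
with_map = {"tags": {"0": "zero"}}

first_from_list = resolve(with_list, "tags.0")  # "0" acts as a list index
first_from_map = resolve(with_map, "tags.0")    # "0" acts as a map key
```

The same path string `tags.0` yields different kinds of access depending on the shape of the document, which is why the meaning of the segment cannot be decided from the path alone.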
I'm not sure how common it is to reference an index; I'd assume it isn't. However, that's a feature that's supported by all processors, for example the
So if we want to replace the implementation of |
I played a bit with extending
With these changes, it seems feasible to use
The question is, do we want that and can we do that? I think it makes a lot of sense to have a common way of accessing fields during ingest processing and scripting. It would be hard to explain to users why some features only exist for ingest processors (like the ability to access array elements) and some only for the fields API (such as being able to access dotted field names). Assuming we agree on that, I think we need to answer the following questions on whether we can make that change:
Any other things that we need to discuss? |
I've created a POC that extends WriteField so that it can access array elements and then used it to power the field access in IngestDocument: #96786. This should get us closer to answering the questions of whether this is a breaking change and what the performance impact is. |
It looks like JsonProcessor is one that doesn't support dots in field names -- https://discuss.elastic.co/t/elastic-json-processor-error-cannot-add-non-map-fields-to-root-of-document/340845 |
Seems like it only supports fields with dots right now due to I've added a fix to this PR. The same PR also fixes another problem where the JSON processor struggles with dots in field names. |
I think I ran into this issue when ingesting JSON from a Custom Logs integration. When I create an ingest pipeline processor of type JSON that adds to the root, I get the non-map to root error. Once it is set to use a target field instead of root, it works. Also, if I move this logic out of the ingest pipeline and into the Custom Configurations section of the Custom Logs integration that ingests this, with this:
It works properly and the JSON components are on the document root. I would prefer this to work in the ingest pipeline, as it logically seems to make more sense there and I can handle errors better in the pipeline than on the agent. It would be great if work on this continued! |
Reviving this work because I think there is a big need for this kind of functionality. Instead of trying to fix behavior of existing constructs, we can improve the world for newly created pipelines by adding a new function that safely retrieves fields no matter whether they are stored in nested objects or in a dotted field property:
would be equivalent to accessing What do you think @felixbarny @dakrone ? |
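The idea of a lookup that works whether a field is stored as nested objects, as a dotted key, or as a mix of both can be sketched in Python. This is only a model of the behaviour described in this comment: `get_field` is a hypothetical name, and preferring the shortest matching key (so nested objects win when both forms exist) is an arbitrary choice, not a statement about how the proposed function or the fields API resolves conflicts.

```python
_MISSING = object()  # sentinel so that a stored None is still "found"


def _lookup(node, parts):
    if not parts:
        return node
    if not isinstance(node, dict):
        return _MISSING
    # Try "a", then "a.b", then "a.b.c" as a key at this level, and recurse
    # into whatever remains of the path.
    for i in range(1, len(parts) + 1):
        key = ".".join(parts[:i])
        if key in node:
            found = _lookup(node[key], parts[i:])
            if found is not _MISSING:
                return found
    return _MISSING


def get_field(doc, path, default=None):
    """Return the value at `path`, whether nested, dotted, or mixed."""
    found = _lookup(doc, path.split("."))
    return default if found is _MISSING else found


nested = get_field({"a": {"b": {"c": 1}}}, "a.b.c")
dotted = get_field({"a.b.c": 1}, "a.b.c")
mixed = get_field({"a": {"b.c": 1}}, "a.b.c")
absent = get_field({"a": {"b": {}}}, "a.b.c")
```

All three document shapes resolve to the same value, and a missing field returns the default instead of raising.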
Elasticsearch supports ingestion of JSON documents that have dotted field names and JSON objects.
In most places, both are treated equivalently:
object, the documents are internally stored as a flat list of key/value pairs. See also the object field type docs.
value option in the set processor.

However, the IngestDocument#getFieldValue and IngestDocument#setFieldValue methods don't support accessing dotted field names. This makes it impossible to reference a dotted field name with, for example, the set processor's field option.

This issue proposes to enhance the ingest processors to support reading and writing to fields that are either in a dotted or a nested field notation. The behavior should be aligned with the field API.
See also