Support dotted field notations in the reroute processor #96243
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @felixbarny, I've created a changelog YAML for you. |
Hi @felixbarny, I've updated the changelog YAML for you. |
So, I've taken a look through the code, and I'm not entirely sure we should go this route by default. As it stands, the
Fully agree that this is way out of scope, but even going down that path starts to open up a lot of corner cases for how ingest node should be handling dotted field names. For instance, if we're using a set processor to read from a dotted field in order to write it somewhere else, does the destination get a dotted field name or is it in object notation? If a dotted field exists, do we overwrite it with the object notation or vice versa? Is there any way to make use of the dot expander in most cases before running the reroute processor? This would resolve the issue without having to wrestle with the above design concerns. |
I don't like the implicit requirement of having to use the dot expander processor in order to be able to use the reroute processor. We could potentially de-dot |
For what it is worth, none of the existing processors are equipped to handle dotted field names other than the dot expander. Dotted field names are not visible in ingest. I don't think there is consensus to be found on how best to fix ingest so that dotted fields are considered for set and get field operations. Accepting dotted field names when parsing data into Lucene is fine because documents are treated as a loose bag of terms that never need to be rematerialized. Picking between nested json and dotted field names when writing a value to a field in ingest is more complicated because it changes the source. Making assumptions about what the user wants/is ok with is what leads to usability bugs. If you have a pipeline that works on documents with dotted fields, using a dot expander makes it clear that dotted field names will be normalized into nested json and visible to subsequent processors. Between assertions in the ingest processors and the simulate features we offer, I think we're more likely to find issues earlier on in the ingest process with this approach than down the path in the document parsing code at the shard layer. This feels like accepting technical debt, but I'll mark this as team discuss for the DM team in case others disagree with that assessment. |
Linking to a Hadoop issue because I view Ingest Node and Hadoop in a similar light. They are both expected to read and write source data in ways that the rest of Elasticsearch does not concern itself with. |
Using the dedot processor by default seems to go against where we want to land eventually, which is that everything is flattened: #88934
I think we all agree that this is not great and needs to be fixed eventually. It is a problem that has hit us several times in the past. But I see the case here for the reroute processor slightly differently: it has a default behaviour on certain fields. These are special fields targeted at the data stream naming scheme. I consider it acceptable to have special logic for these fields. It becomes especially important as even our own products use them differently, and we should not put any of the burden on the user to figure this out. |
Except for the dot expander processor, which calls
Compared to the general problem of "solving" dots in field names in ingest pipelines, the advantage that we have in the context of the reroute processor is that we're not working with completely generic data. We're dealing with very specific, well-defined ECS fields:
These are defined as objects in ECS and we're already modifying these properties with the reroute processor, so they're not really "sacred" properties that we need to be careful of modifying. It's the reroute processor's job to modify these attributes so that they're in sync with the index name. I don't think it's helpful to insist that incoming documents need to use nested field notation for the Rejecting documents with dotted field names for |
We discussed this in our most recent team meeting and there was general discomfort from the engineers present about dotted field name usage in ingest and around fixing this for just the reroute processor.
Based on this direction, I think it's likely that this will continue to be a problem for other processors. If the plan is to put pipeline logic in the hands of users so that they can manage their own logs parsing logic, and to send dotted field names mixed with nested json, then we are setting them up for frustration if ingest does not handle dotted field names gracefully. We would like to have a plan in place on how to support dotted field names across all processors, especially if we are moving in the direction of all logs mappings being flattened, as this problem will only continue to pop up.
Accepting all logs is absolutely the strategy we want to follow for this feature. The suggestion of using the dot expander earlier in the pipeline would allow for accepting these logs without making special changes for this one case and hopefully would guard against further problems in other processors. We've discussed having json parsing by default for Logs+. We're setting default timestamps on documents based on their arrivals. I was hoping expanding dotted fields would be a feature worth considering in the same light. Doing so would buy us flexibility on how we want to approach the greater problem of dotted field names in ingest. Plus, if we move forward with the change toward being lenient around accepting sub-objects when they are disabled (as described in #88934), then I'm not sure I understand how the dot expander conflicts with the accept all logs objective. Those nested fields would just be flattened at index time. Let's see if we can make some time soon to discuss this further. Dotted field names are an incredibly messy feature that we support and we want to make sure we're going in the right direction here, not just introducing quick fixes as they arise. |
> We would like to have a plan in place on how to support dotted field names across all processors, especially if we are moving in the direction of all logs mappings being flattened, as this problem will only continue to pop up.
I do agree that this problem is bigger than just the reroute processor and ideally, we'd have a consistent approach for all processors.
However, the changes in this PR don't necessarily conflict with that goal. I think we can make progress on this specific issue and the general issue independently. The way the changes are made in this PR doesn't conflict with any general changes to support dotted field notations in ingest processors, and it is aligned with how we're handling dots in field names in the fields API as well as in Mustache templates.
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/RerouteProcessor.java
Using the dot expander on all incoming data conflicts with one of the main motivations for using subobjects:false: we want to fully accept documents that would currently be considered to have an object/scalar mismatch, for example documents that have both the fields foo and foo.bar. Applying the dot expander to such a document will result in an error. |
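To make the object/scalar mismatch concrete, here is a minimal sketch in plain Java. It is not the actual dot expander code; `DotExpandSketch` and `expand` are hypothetical names, and the source document is modelled as a plain `Map`. It only illustrates why a document with both foo and foo.bar cannot be expanded into nested form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the Elasticsearch dot expander implementation.
public class DotExpandSketch {

    /**
     * Moves the value stored under a literal dotted key (e.g. "foo.bar")
     * into nested maps (foo -> bar). Throws if a parent segment of the
     * path already holds a scalar, because a scalar cannot also be an object.
     */
    @SuppressWarnings("unchecked")
    static void expand(Map<String, Object> doc, String dottedKey) {
        String[] parts = dottedKey.split("\\.");
        Map<String, Object> current = doc;
        // Walk (and create) every parent object on the path.
        for (int i = 0; i < parts.length - 1; i++) {
            Object child = current.get(parts[i]);
            if (child == null) {
                child = new LinkedHashMap<String, Object>();
                current.put(parts[i], child);
            } else if (!(child instanceof Map)) {
                throw new IllegalArgumentException(
                    "cannot expand [" + dottedKey + "]: [" + parts[i] + "] is already a scalar");
            }
            current = (Map<String, Object>) child;
        }
        // Only remove the dotted key once the path is known to be usable.
        Object value = doc.remove(dottedKey);
        current.put(parts[parts.length - 1], value);
    }
}
```

With a document containing only foo.bar, the call succeeds and leaves a nested foo object. With a document containing foo (a scalar) and foo.bar, the call throws, which is the kind of document that subobjects:false is meant to accept.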
Based on discussions (both here and externally) as well as how things are heading for other parts of the ingest code in regards to dotted field names (especially the work on the fields api in Painless), I think we'd be good to get this in at this point.
LGTM, thanks for the discussion and iterations!
💚 Backport successful
This PR did not add support for dots in es-hadoop. It just added support for that in the There's another discussion about adding support for dots in more processors: Yet another issue discusses adding support for dots in es-hadoop: |
@graphaelli reported an issue when using the reroute processor on APM data. The reason is that APM Server sets data_stream.dataset as a dotted field name instead of object notation. After routing, the document therefore contains both the dotted and the nested field with differing values, which leads to a conflict for the constant-keyword mapping for data_stream.dataset.

IMO, IngestDocument#setValue and IngestDocument#getValue should always consider both dotted and nested notation, but that's a much bigger change that I don't want to conflate with this enhancement, which is borderline a bug fix.
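As a rough illustration of what "considering both dotted and nested notation" could mean for a read, here is a minimal sketch in plain Java. It is not the code from this PR and not how IngestDocument is implemented; `DottedLookupSketch` and `get` are hypothetical names, and the document is modelled as a plain `Map`.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; not taken from RerouteProcessor or IngestDocument.
public class DottedLookupSketch {

    /**
     * Looks up a path such as "data_stream.dataset". At every level it first
     * tries the remaining path as one literal (dotted) key, then tries each
     * possible split point and descends into nested objects.
     */
    @SuppressWarnings("unchecked")
    static Optional<Object> get(Map<String, Object> doc, String path) {
        if (doc.containsKey(path)) {
            // Literal match, e.g. {"data_stream.dataset": "apm"}.
            return Optional.ofNullable(doc.get(path));
        }
        int dot = path.indexOf('.');
        while (dot >= 0) {
            Object child = doc.get(path.substring(0, dot));
            if (child instanceof Map) {
                // Nested match, e.g. {"data_stream": {"dataset": "generic"}}.
                Optional<Object> found = get((Map<String, Object>) child, path.substring(dot + 1));
                if (found.isPresent()) {
                    return found;
                }
            }
            dot = path.indexOf('.', dot + 1);
        }
        return Optional.empty();
    }
}
```

In this sketch the literal dotted key wins when both notations are present. That precedence is an arbitrary choice made for the example; which notation should win, and how writes should behave, is exactly the design question debated in this thread.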