Support dotted field notations in the reroute processor #96243

felixbarny · 2023-05-22T08:38:43Z

@graphaelli reported an issue when using the reroute processor on APM data. The reason is that APM Server sets data_stream.dataset as a dotted field name instead of object notation. This leads to the document containing both the dotted and the nested field after routing whose values differ which leads to a conflict for the constant-keyword mapping for data_stream.dataset.

IMO, IngestDocument#setValue and IngestDocument#getValue should always consider both dotted and nested notation but that's a much bigger change that I don't want to conflate with this enhancement which is borderline a bug fix.

elasticsearchmachine · 2023-05-22T08:39:07Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2023-05-22T08:39:30Z

Hi @felixbarny, I've created a changelog YAML for you.

elasticsearchmachine · 2023-05-22T13:58:49Z

Hi @felixbarny, I've updated the changelog YAML for you.

…d-notations

jbaiera · 2023-05-24T00:02:04Z

So, I've taken a look through the code, and I'm not entirely sure we should go this route by default. As it stands, the IngestDocument.ctxMap is only used in scripting. All other interactions with it are done through the IngestDocument get and set value methods.

IMO, IngestDocument#setValue and IngestDocument#getValue should always consider both dotted and nested notation but that's a much bigger change that I don't want to conflate with this enhancement which is borderline a bug fix.

Fully agree that this is way out of scope, but even going down that path starts to open up a lot of corner cases for how ingest node should be handling dotted field names. For instance, if we're using a set processor to read from a dotted field in order to write it somewhere else, does the destination get a dotted field name or is it in object notation. If a dotted field exists, do we overwrite it with the object notation or vice versa?

Is there any way to make use of the dot expander in most cases before running the reroute processor? This would resolve the issue without having to wrestle with the above design concerns.

felixbarny · 2023-05-24T11:06:48Z

Is there any way to make use of the dot expander in most cases before running the reroute processor? This would resolve the issue without having to wrestle with the above design concerns.

I don't like the implicit requirement of having to use the dot expander processor in order to be able to use the reroute processor. We could potentially de-dot data_stream.* by default which is effectively what this PR is doing.

jbaiera · 2023-05-24T20:49:44Z

I don't like the implicit requirement of having to use the dot expander processor in order to be able to use the reroute processor.

For what it is worth, none of the existing processors are equipped to handle dotted field names other than the dot expander. Dotted field names are not visible in ingest. I don't think there is consensus to be found on how best to fix ingest so that dotted fields are considered for set and get field operations.

Accepting dotted field names when parsing data into Lucene is fine because documents are treated as a loose bag of terms that never need to be rematerialized. Picking between nested json and dotted field names when writing a value to a field in ingest is more complicated because it changes the source. Making assumptions about what the user wants/is ok with is what leads to usability bugs.

If you have a pipeline that works on documents with dotted fields, using a dot expander makes it clear that dotted field names will be normalized into nested json and visible to subsequent processors. Between assertions in the ingest processors and the simulate features we offer, I think we're more likely to find issues earlier on in the ingest process with this approach than down the path in the document parsing code at the shard layer.

This feels like accepting technical debt, but I'll mark this as team discuss for the DM team in case others disagree with that assessment.

jbaiera · 2023-05-24T20:58:03Z

Linking to a Hadoop issue because I view Ingest Node and Hadoop in similar light. They both are expected to read and write source data in ways that the rest of Elasticsearch does not concern itself with.

elastic/elasticsearch-hadoop#853

ruflin · 2023-05-25T06:49:47Z

Using the dedot processor by default seems to against where we want to land eventually that everything is flattened: #88934

For what it is worth, none of the existing processors are equipped to handle dotted field names other than the dot expander.

I think we all agree that this is not great but needs to be fixed eventually. It is a problem that has hit us several time in the past. But I see the case here for the reroute processor slightly differently: It has a default behaviour on certain fields. These are special fields targeted to the data stream naming scheme. I consider it acceptable to have special logic on these fields. It becomes especially important as even our products use it differently and we should not put any of the burden on the user to figure this out.

felixbarny · 2023-05-25T08:21:57Z

As it stands, the IngestDocument.ctxMap. All other interactions with it are done through the IngestDocument get and set value methods.

Except for the dot expander processor, which calls IngestDocument.getSourceAndMetadata which returns the same ctxMap object as IngestDocument.ctxMap.

Picking between nested json and dotted field names when writing a value to a field in ingest is more complicated because it changes the source. Making assumptions about what the user wants/is ok with is what leads to usability bugs.

Compared to the general problem of "solving" dots in field names in ingest pipelines, the advantage that we have in the context of the reroute processor is that we're not working with completely generic data. We're dealing with very specific, well-defined ECS fields:

data_stream.type
data_stream.dataset
event.dataset
data_stream.namespace

These are defined as objects in ECS and we're already modifying these properties with the reroute processor, so they're not really "sacred" properties that we need to be careful of modifying. It's the reroute processor's job to modify these attributes so that they're in sync with the index name.

I don't think it's helpful to insist that incoming documents need to use nested field notation for the reroute processor to work. People use dots and nested notation interchangeably as ES doesn't differ between the notations once the document is stored. Our solutions generally prefer retrieving information via fields rather than _source as per the general recommendation and with synthetic _source you also can't rely on the exact structure of the source too much.

Rejecting documents with dotted field names for data_stream.* would be completely against the spirit of accepting all logs. Therefore, I think the only path that we have is to accept reading from the dotted field name equivalents and to ensure that the output of the reroute processor is a document that never has both a nested and a dotted field notation data_stream.*. Otherwise, this would trigger an issue with constant keyword collisions that's very hard to understand for users.

jbaiera · 2023-05-26T20:35:16Z

We discussed this in our most recent team meeting and there was general discomfort from the engineers present about dotted field name usage in ingest and around fixing this for just the reroute processor.

Using the dedot processor by default seems to against where we want to land eventually that everything is flattened: #88934

Based on this direction, I think it's likely that this will continue to be a problem for other processors. If the plan is to put pipeline logic in the hands of users so that they can manage their own logs parsing logic, and to send dotted field names mixed with nested json, then we are setting them up for frustration if ingest does not handle dotted field names gracefully. We would like to have a plan in place on how to support dotted field names across all processors, especially if we are moving in the direction of all logs mappings being flattened, as this problem will only continue to pop up.

Rejecting documents with dotted field names for data_stream.* would be completely against the spirit of accepting all logs.

Accepting all logs is absolutely the strategy we want to follow for this feature. The suggestion of using the dot expander earlier in the pipeline would allow for accepting these logs without making special changes for this one case and hopefully would guard against further problems in other processors. We've discussed having json parsing by default for Logs+. We're setting default timestamps on documents based on their arrivals. I was hoping expanding dotted fields would be a feature worth considering in the same light. Doing so would buy us flexibility on how we want to approach the greater problem of dotted field names in ingest.

Plus, if we move forward with the change toward being lenient around accepting sub-objects when they are disabled (as described in #88934), then I'm not sure I understand how the dot expander conflicts with the accept all logs objective. Those nested fields would just be flattened at index time.

Let's see if we can make some time soon to discuss this further. Dotted field names are an incredibly messy feature that we support and we want to make sure we're going in the right direction here, not just introducing quick fixes as they arise.

felixbarny

We would like to have a plan in place on how to support dotted field names across all processors, especially if we are moving in the direction of all logs mappings being flattened, as this problem will only continue to pop up.

I do agree that this problem is bigger than just the reroute processor and ideally, we'd have a consistent approach for all processors.

However, the changes in this PR don't necessarily conflict with that goal. I think we can make progress on this specific issue and the general issue independently. The way the changes are made in this PR don't conflict with any general changes to support dotted field notations in ingest processors and are aligned with how we're handling dots in field names in the fields API as well as Mustache templates.

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/RerouteProcessor.java

felixbarny · 2023-05-31T05:58:57Z

Plus, if we move forward with the change toward being lenient around accepting sub-objects when they are disabled (as described in #88934), then I'm not sure I understand how the dot expander conflicts with the accept all logs objective. Those nested fields would just be flattened at index time.

Using the dot expander on all incoming data conflicts with one of the main motivations for using subobjects:false: We want to fully accept documents that currently would be considered to have an object/scalar mismatch, for example if they have the fields foo and foo.bar. Applying for dot expander on such a document will result in an error.

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/RerouteProcessor.java

…d-notations

jbaiera

Based on discussions (both here and externally) as well as how things are heading for other parts of the ingest code in regards to dotted field names (especially the work on the fields api in Painless), I think we'd be good to get this in at this point.

LGTM, thanks for the discussion and iterations!

elasticsearchmachine · 2023-06-06T19:39:09Z

💚 Backport successful

Status	Branch	Result
✅	8.8

)

zhengminhz · 2023-07-07T10:26:28Z

Caused by: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Found field 'connect2.SSLhandshake'. Fields containing dots ('.') are not supported in es-hadoop at org.elasticsearch.spark.sql.RowValueReader.addToBuffer(RowValueReader.scala:62) at org.elasticsearch.spark.sql.RowValueReader.addToBuffer$(RowValueReader.scala:55) at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:32) at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:118) at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1058) at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:895) at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:608) at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:432) ... 44 more
@felixbarny I still have this when i use 8.8.2 version

felixbarny · 2023-07-07T12:47:12Z

Fields containing dots ('.') are not supported in es-hadoop

This PR did not add support for dots in es-hadoop. It just added support for that in the reroute processor.

There's another discussion about adding support for dots in more processors:

Support dotted field names in ingest processors #96648

Yet another issue discusses adding support for dots in es-hadoop:

Dots in field names exception elasticsearch-hadoop#853

Support dotted field notations in the reroute processor

86c0322

felixbarny added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels May 22, 2023

felixbarny requested a review from jbaiera May 22, 2023 08:38

felixbarny self-assigned this May 22, 2023

elasticsearchmachine added Team:Data Management Meta label for data/management team v8.9.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels May 22, 2023

Update docs/changelog/96243.yaml

48c5831

felixbarny requested a review from eyalkoren May 22, 2023 09:10

felixbarny added >bug v8.8.1 auto-backport-and-merge and removed >enhancement labels May 22, 2023

Update docs/changelog/96243.yaml

d3c55dd

Merge remote-tracking branch 'origin/main' into reroute-support-dotte…

8eea9c0

…d-notations

jbaiera added the team-discuss label May 24, 2023

felixbarny mentioned this pull request May 30, 2023

Use object notation for data_stream fields elastic/apm-data#44

Closed

felixbarny commented May 30, 2023

View reviewed changes

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/RerouteProcessor.java Show resolved Hide resolved

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/RerouteProcessor.java Outdated Show resolved Hide resolved

jbaiera reviewed Jun 5, 2023

View reviewed changes

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/RerouteProcessor.java Outdated Show resolved Hide resolved

felixbarny added 2 commits June 6, 2023 10:04

Merge remote-tracking branch 'origin/main' into reroute-support-dotte…

4a6ef43

…d-notations

Preserve dotted/nested notation from doc

5cfaa3d

bpintea added v8.8.2 and removed v8.8.1 labels Jun 6, 2023

jbaiera approved these changes Jun 6, 2023

View reviewed changes

felixbarny merged commit 2e81bcc into elastic:main Jun 6, 2023

felixbarny mentioned this pull request Jun 6, 2023

[8.8] Support dotted field notations in the reroute processor (#96243) #96628

Merged

felixbarny added a commit to felixbarny/elasticsearch that referenced this pull request Jun 6, 2023

Support dotted field notations in the reroute processor (elastic#96243)

049a8fb

elasticsearchmachine pushed a commit that referenced this pull request Jun 6, 2023

Support dotted field notations in the reroute processor (#96243) (#96628

9a8f301

)

felixbarny mentioned this pull request Jun 7, 2023

Support dotted field names in ingest processors #96648

Open

felixbarny mentioned this pull request Jun 19, 2023

Update documentation for logstash output with the correct defaults elastic/apm-server#10643

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support dotted field notations in the reroute processor #96243

Support dotted field notations in the reroute processor #96243

felixbarny commented May 22, 2023

elasticsearchmachine commented May 22, 2023

elasticsearchmachine commented May 22, 2023

elasticsearchmachine commented May 22, 2023

jbaiera commented May 24, 2023

felixbarny commented May 24, 2023

jbaiera commented May 24, 2023

jbaiera commented May 24, 2023

ruflin commented May 25, 2023

felixbarny commented May 25, 2023

jbaiera commented May 26, 2023

felixbarny left a comment

felixbarny commented May 31, 2023 •

edited

Loading

jbaiera left a comment •

edited

Loading

elasticsearchmachine commented Jun 6, 2023

zhengminhz commented Jul 7, 2023

felixbarny commented Jul 7, 2023

Support dotted field notations in the reroute processor #96243

Support dotted field notations in the reroute processor #96243

Conversation

felixbarny commented May 22, 2023

elasticsearchmachine commented May 22, 2023

elasticsearchmachine commented May 22, 2023

elasticsearchmachine commented May 22, 2023

jbaiera commented May 24, 2023

felixbarny commented May 24, 2023

jbaiera commented May 24, 2023

jbaiera commented May 24, 2023

ruflin commented May 25, 2023

felixbarny commented May 25, 2023

jbaiera commented May 26, 2023

felixbarny left a comment

Choose a reason for hiding this comment

felixbarny commented May 31, 2023 • edited Loading

jbaiera left a comment • edited Loading

Choose a reason for hiding this comment

elasticsearchmachine commented Jun 6, 2023

💚 Backport successful

zhengminhz commented Jul 7, 2023

felixbarny commented Jul 7, 2023

felixbarny commented May 31, 2023 •

edited

Loading

jbaiera left a comment •

edited

Loading