Add the ability for Elasticsearch to calculate and index the length of a string field #65636

Closed
webmat opened this issue Nov 30, 2020 · 6 comments
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@webmat

webmat commented Nov 30, 2020

I would love to have the option of having Elasticsearch calculate the length of a string field, via a mapping setting.

There are many situations where having the string length is useful. The one I’m most interested in is the first:

  • Analyzing potentially suspicious content:
    • detecting DNS tunnelling based on DNS question / answer length
    • long shell command line indicating automated execution & obfuscated payloads
  • Detecting documents that went over keyword’s “ignore_above”
  • Classifying documents by field length ranges (e.g. under 100 char, 100 to 500, 500+)
  • Curating documents (product description under 10 chars, user names over 100 chars)

Calculating the field length via a runtime field is enough for displaying the length on a document that was retrieved some other way. However, each of the situations above needs the length indexed explicitly if we want to avoid prohibitively expensive queries.

It’s currently possible to do this by adding a sister field (e.g. dns.question.name => dns.question.name_length), then calculating the length upon ingestion with an ingest node processor or another method. However, this approach potentially leads to boilerplate code that needs to be repeated in many pipelines.
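
For illustration, the sister-field workaround described above could look roughly like the following. This is a minimal sketch, not a recommended implementation: it only builds the JSON body for an ingest pipeline with a Painless `script` processor, the helper name `dns_name_length_pipeline` is hypothetical, and it assumes the document stores `dns.question.name` as nested objects rather than as a single dotted key.

```python
import json

# Painless source for the script processor. Assumption: the document holds
# dns -> question -> name as nested objects. The null-safe operator (?.)
# leaves documents without the field untouched.
PAINLESS_SOURCE = """
def name = ctx.dns?.question?.name;
if (name instanceof String) {
  ctx.dns.question.name_length = name.length();
}
"""


def dns_name_length_pipeline() -> dict:
    """Build an ingest pipeline body that writes dns.question.name_length."""
    return {
        "description": "Store the length of dns.question.name in a sister field",
        "processors": [
            {"script": {"lang": "painless", "source": PAINLESS_SOURCE}}
        ],
    }


# The resulting body would be sent with PUT _ingest/pipeline/<pipeline-name>.
print(json.dumps(dns_name_length_pipeline(), indent=2))
```

Every additional field that needs a length requires another copy of this script, which is the boilerplate the request is trying to avoid. Note also that Painless `String.length()` follows Java semantics and counts UTF-16 code units, not Unicode code points.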

I’m thinking of two ways this could potentially be implemented. I’d be happy with either:

  • a parameter on text and keyword family fields, leading to a virtual multi-field (e.g. dns.question.name => dns.question.name.length)
  • a distinct data type similar to token_count that counts characters. This could be added explicitly as a multi-field named to the user’s preference.

A recent ECS discussion on DNS question/answer length (ecs#992) was the inspiration for this. If we had such a capability, we would potentially add such a “length” field in a few more places in ECS: DNS, URLs, user agents, process.command_line, etc.

@webmat webmat added >enhancement needs:triage Requires assignment of a team area label labels Nov 30, 2020
@tvernum tvernum added :Search/Search Search-related issues that do not fall into other categories and removed needs:triage Requires assignment of a team area label labels Dec 6, 2020
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Dec 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@jimczi
Contributor

jimczi commented Dec 18, 2020

We have an official plugin that does exactly that:
https://www.elastic.co/guide/en/elasticsearch/plugins/7.x/mapper-size.html
However, I guess you're asking for something that is included in the official distribution by default, @webmat?

@ebeahan
Member

ebeahan commented Dec 18, 2020

The mapper-size plugin provides the _size metadata field which, when enabled, indexes the size in bytes of the original _source field.

My understanding of @webmat's proposal is that it calculates the length of individual fields, rather than the size of the _source field.

@jimczi
Contributor

jimczi commented Dec 18, 2020

Oh, I misread the issue, thanks.

@javanna
Member

javanna commented Mar 3, 2021

Wouldn't this be made possible by #68984? Once an indexed field supports defining a script that is executed at index time, you can calculate the length of the field and index it straight away. Am I missing something?

@javanna
Member

javanna commented Jun 16, 2022

This issue suggests creating a new field type, or automatically indexing the length of a field when requested. As mentioned above, this could be obtained with a runtime field, or by specifying a script for a long field that calculates the length of a given indexed field loaded from doc_values. With that, I am closing this issue; feel free to reopen or add comments if I missed something.
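
As a rough sketch of the two alternatives named in this comment, the snippet below builds index mappings in which a `long` field computes its value from a script reading `dns.question.name` from doc values. The helper `length_mappings` and the field name `name_length` are illustrative assumptions, not part of any Elasticsearch API. With `index_time=True`, the script sits on a regular mapped field and runs when the document is indexed; otherwise the same script is declared under `runtime` and runs at query time.

```python
import json

# Painless script shared by both variants. The size() check skips documents
# that have no value for dns.question.name.
LENGTH_SCRIPT = (
    "if (doc['dns.question.name'].size() > 0) {"
    " emit(doc['dns.question.name'].value.length());"
    " }"
)


def length_mappings(index_time: bool) -> dict:
    """Build mappings with a scripted length field (hypothetical helper)."""
    length_field = {"type": "long", "script": {"source": LENGTH_SCRIPT}}
    question_props = {"name": {"type": "keyword"}}
    mappings = {
        "properties": {
            "dns": {"properties": {"question": {"properties": question_props}}}
        }
    }
    if index_time:
        # Indexed variant: the value is computed once and stored in the index.
        question_props["name_length"] = length_field
    else:
        # Runtime variant: nothing is indexed, the script runs per query.
        mappings["runtime"] = {"dns.question.name_length": length_field}
    return mappings


# Body for an index-creation request (PUT <index-name>).
print(json.dumps({"mappings": length_mappings(index_time=True)}, indent=2))
```

Only the indexed variant addresses the original concern about expensive queries, since a runtime field still evaluates the script for every matching document at search time.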

@javanna javanna closed this as completed Jun 16, 2022

6 participants