Implement stats aggregation for string terms #47468
Merged
14 commits
981807f First commit of the implementation for `string_stats` aggregation (csoulios)
fe35bf8 Muted unit test (csoulios)
f05eb48 Addressed code review comments (csoulios)
5904d18 Merge branch 'master' into feature/string_stats (csoulios)
7f723c6 Merge branch 'master' into feature/string_stats (csoulios)
08cf5e7 Added asciidoc for the string_stats aggregation (csoulios)
82b6eb7 Added missing reference/callout (csoulios)
ee31f20 Merge branch 'master' into feature/string_stats (csoulios)
dc5077a Addressed review comments (csoulios)
3f4ef90 Fix serialization bug that failed caching results (csoulios)
630701c Addressed review comments (csoulios)
699a91e Merge branch 'master' into feature/string_stats (csoulios)
9861659 Used CompensatedSum for computing Kahan Summation (csoulios)
b580cdb Merge branch 'master' into feature/string_stats (csoulios)
217 changes: 217 additions & 0 deletions
docs/reference/aggregations/metrics/string-stats-aggregation.asciidoc
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,217 @@ | ||
[role="xpack"] | ||
[testenv="basic"] | ||
[[search-aggregations-metrics-string-stats-aggregation]] | ||
=== String Stats Aggregation | ||
|
||
A `multi-value` metrics aggregation that computes statistics over string values extracted from the aggregated documents. | ||
These values can be retrieved either from specific `keyword` fields in the documents or can be generated by a provided script. | ||
|
||
The string stats aggregation returns the following results: | ||
|
||
* `count` - The number of non-empty fields counted. | ||
* `min_length` - The length of the shortest term. | ||
* `max_length` - The length of the longest term. | ||
* `avg_length` - The average length computed over all terms. | ||
* `entropy` - The https://en.wikipedia.org/wiki/Entropy_(information_theory)[Shannon Entropy] value computed over all terms collected by | ||
the aggregation. Shannon entropy quantifies the amount of information contained in the field. It is a useful metric for | ||
measuring properties of a data set such as diversity, similarity, and randomness. | ||
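As a rough illustration of what the five results mean, the following Python sketch recomputes them for a list of strings. It is not the Elasticsearch implementation: the function name `string_stats`, the handling of empty input, and the use of Python's `len` (which counts code points) are assumptions made for this example.

```python
import math
from collections import Counter


def string_stats(values):
    """Illustrative re-computation of the five metrics (hypothetical helper)."""
    # Assumption: empty strings are skipped, mirroring "non-empty fields counted".
    terms = [v for v in values if v]
    if not terms:
        # Assumption: what is reported for an empty input is not specified here.
        return {"count": 0, "min_length": None, "max_length": None,
                "avg_length": None, "entropy": 0.0}

    lengths = [len(t) for t in terms]

    # Character frequencies over all collected terms.
    char_counts = Counter("".join(terms))
    total_chars = sum(char_counts.values())

    # Shannon entropy in bits: sum of -p * log2(p) over all characters.
    entropy = sum(-(n / total_chars) * math.log2(n / total_chars)
                  for n in char_counts.values())

    return {
        "count": len(terms),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "entropy": entropy,
    }
```

For two terms made of two equally frequent characters, such as `"aa"` and `"bb"`, the sketch yields an entropy of one bit, which matches the intuition that one yes/no question identifies each character.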
|
||
Assuming the data consists of Twitter messages: | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
POST /twitter/_search?size=0 | ||
{ | ||
"aggs" : { | ||
"message_stats" : { "string_stats" : { "field" : "message.keyword" } } | ||
} | ||
} | ||
-------------------------------------------------- | ||
// TEST[setup:twitter] | ||
|
||
The above aggregation computes the string statistics for the `message` field in all documents. The aggregation type | ||
is `string_stats` and the `field` parameter defines the field of the documents the stats will be computed on. | ||
The above will return the following: | ||
|
||
[source,console-result] | ||
-------------------------------------------------- | ||
{ | ||
... | ||
|
||
"aggregations": { | ||
"message_stats" : { | ||
"count" : 5, | ||
"min_length" : 24, | ||
"max_length" : 30, | ||
"avg_length" : 28.8, | ||
"entropy" : 3.94617750050791 | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/] | ||
|
||
The name of the aggregation (`message_stats` above) also serves as the key by which the aggregation result can be retrieved from | ||
the returned response. | ||
|
||
==== Character distribution | ||
|
||
The computation of the Shannon Entropy value is based on the probability of each character appearing in all terms collected | ||
by the aggregation. To view the probability distribution for all characters, we can add the `show_distribution` (default: `false`) parameter. | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
POST /twitter/_search?size=0 | ||
{ | ||
"aggs" : { | ||
"message_stats" : { | ||
"string_stats" : { | ||
"field" : "message.keyword", | ||
"show_distribution": true <1> | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// TEST[setup:twitter] | ||
|
||
<1> Set the `show_distribution` parameter to `true`, so that the probability distribution for all characters is returned in the results. | ||
|
||
[source,console-result] | ||
-------------------------------------------------- | ||
{ | ||
... | ||
|
||
"aggregations": { | ||
"message_stats" : { | ||
"count" : 5, | ||
"min_length" : 24, | ||
"max_length" : 30, | ||
"avg_length" : 28.8, | ||
"entropy" : 3.94617750050791, | ||
"distribution" : { | ||
" " : 0.1527777777777778, | ||
"e" : 0.14583333333333334, | ||
"s" : 0.09722222222222222, | ||
"m" : 0.08333333333333333, | ||
"t" : 0.0763888888888889, | ||
"h" : 0.0625, | ||
"a" : 0.041666666666666664, | ||
"i" : 0.041666666666666664, | ||
"r" : 0.041666666666666664, | ||
"g" : 0.034722222222222224, | ||
"n" : 0.034722222222222224, | ||
"o" : 0.034722222222222224, | ||
"u" : 0.034722222222222224, | ||
"b" : 0.027777777777777776, | ||
"w" : 0.027777777777777776, | ||
"c" : 0.013888888888888888, | ||
"E" : 0.006944444444444444, | ||
"l" : 0.006944444444444444, | ||
"1" : 0.006944444444444444, | ||
"2" : 0.006944444444444444, | ||
"3" : 0.006944444444444444, | ||
"4" : 0.006944444444444444, | ||
"y" : 0.006944444444444444 | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/] | ||
|
||
The `distribution` object shows the probability of each character appearing in all terms. The characters are sorted by descending probability. | ||
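The relationship between `distribution` and `entropy` can be checked by hand. The Python sketch below copies the probabilities from the example response and applies the standard Shannon entropy formula with a base-2 logarithm; the function name `entropy_bits` is hypothetical, and the base-2 choice is an assumption that happens to agree with the reported value.

```python
import math

# Probabilities copied from the example response above.
distribution = {
    " ": 0.1527777777777778, "e": 0.14583333333333334, "s": 0.09722222222222222,
    "m": 0.08333333333333333, "t": 0.0763888888888889, "h": 0.0625,
    "a": 0.041666666666666664, "i": 0.041666666666666664, "r": 0.041666666666666664,
    "g": 0.034722222222222224, "n": 0.034722222222222224, "o": 0.034722222222222224,
    "u": 0.034722222222222224, "b": 0.027777777777777776, "w": 0.027777777777777776,
    "c": 0.013888888888888888, "E": 0.006944444444444444, "l": 0.006944444444444444,
    "1": 0.006944444444444444, "2": 0.006944444444444444, "3": 0.006944444444444444,
    "4": 0.006944444444444444, "y": 0.006944444444444444,
}


def entropy_bits(probabilities):
    """Shannon entropy in bits: sum of -p * log2(p) (hypothetical helper)."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


total_probability = sum(distribution.values())
entropy = entropy_bits(distribution.values())
print(total_probability, entropy)
```

The probabilities sum to 1 up to floating-point rounding, and the computed entropy agrees with the `entropy` field of the response to within rounding error. Every probability is a multiple of 1/144, which is consistent with 5 terms of average length 28.8.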
|
||
==== Script | ||
|
||
Computing the message string stats based on a script: | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
POST /twitter/_search?size=0 | ||
{ | ||
"aggs" : { | ||
"message_stats" : { | ||
"string_stats" : { | ||
"script" : { | ||
"lang": "painless", | ||
"source": "doc['message.keyword'].value" | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// TEST[setup:twitter] | ||
|
||
This will interpret the `script` parameter as an `inline` script with the `painless` script language and no script parameters. | ||
To use a stored script, use the following syntax: | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
POST /twitter/_search?size=0 | ||
{ | ||
"aggs" : { | ||
"message_stats" : { | ||
"string_stats" : { | ||
"script" : { | ||
"id": "my_script", | ||
"params" : { | ||
"field" : "message.keyword" | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// TEST[setup:twitter,stored_example_script] | ||
|
||
===== Value Script | ||
|
||
We can use a value script to modify the message (e.g. by adding a prefix) and compute the stats over the modified values: | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
POST /twitter/_search?size=0 | ||
{ | ||
"aggs" : { | ||
"message_stats" : { | ||
"string_stats" : { | ||
"field" : "message.keyword", | ||
"script" : { | ||
"lang": "painless", | ||
"source": "params.prefix + _value", | ||
"params" : { | ||
"prefix" : "Message: " | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// TEST[setup:twitter] | ||
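To make the effect of the value script concrete, here is a minimal Python sketch that prepends the same prefix to each value before measuring lengths. The sample messages and the helper name `apply_value_script` are made up for illustration and do not come from the `twitter` test data set.

```python
def apply_value_script(values, prefix):
    """Mimic `params.prefix + _value` for each collected value (hypothetical helper)."""
    return [prefix + value for value in values]


prefix = "Message: "
# Hypothetical sample values, not taken from the real index.
messages = ["some message", "another one"]

original_lengths = [len(m) for m in messages]
prefixed_lengths = [len(m) for m in apply_value_script(messages, prefix)]
shift = len(prefix)
```

Under these assumptions every length shifts by `len(prefix)`, so `min_length`, `max_length` and `avg_length` all grow by the same amount, while `entropy` changes because the prefix characters alter the character distribution.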
|
||
==== Missing value | ||
|
||
The `missing` parameter defines how documents that are missing a value should be treated. | ||
By default, they are ignored, but it is also possible to treat them as if they had a value. | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
POST /twitter/_search?size=0 | ||
{ | ||
"aggs" : { | ||
"message_stats" : { | ||
"string_stats" : { | ||
"field" : "message.keyword", | ||
"missing": "[empty message]" <1> | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// TEST[setup:twitter] | ||
|
||
<1> Documents without a value in the `message` field will be treated as documents that have the value `[empty message]`. |
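The substitution described above can be pictured with a small Python sketch. The document dictionaries and the helper name `collect_values` are hypothetical; the sketch only shows the idea of replacing an absent field with the `missing` value before any statistics are computed.

```python
MISSING = "[empty message]"


def collect_values(docs, field, missing=None):
    """Collect field values, optionally substituting `missing` (hypothetical helper)."""
    values = []
    for doc in docs:
        if field in doc:
            values.append(doc[field])
        elif missing is not None:
            # The document has no value: treat it as if it held `missing`.
            values.append(missing)
        # Otherwise the document is ignored, which is the default behaviour.
    return values


# Hypothetical documents; the second one has no `message` field.
docs = [{"message": "hello"}, {"user": "kimchy"}]

ignored = collect_values(docs, "message")
substituted = collect_values(docs, "message", missing=MISSING)
```

With the default, only one value is collected. With `missing` set, the second document contributes the 15-character placeholder, so it is counted and influences the length and entropy results.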
Whoops, good catch, thanks for the fix :)
And the embarrassing typo below heh :)