Commit b0e12c9

Implement stats aggregation for string terms (#47468)
This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following:

- min_length: The length of the shortest term
- max_length: The length of the longest term
- avg_length: The average length of all terms
- distribution: The probability distribution of all characters appearing in all terms
- entropy: The total Shannon entropy value calculated for all terms

This aggregation has been implemented as an analytics plugin.
1 parent 2723a52 commit b0e12c9

10 files changed: +1155 -12 lines changed

docs/build.gradle

+1-1
@@ -179,7 +179,7 @@ buildRestTests.setups['ledger'] = '''
 {"index":{}}
 {"date": "2015/01/01 00:00:00", "amount": 200, "type": "sale", "description": "something"}
 {"index":{}}
-{"date": "2015/01/01 00:00:00", "amount": 10, "type": "expense", "decription": "another thing"}
+{"date": "2015/01/01 00:00:00", "amount": 10, "type": "expense", "description": "another thing"}
 {"index":{}}
 {"date": "2015/01/01 00:00:00", "amount": 150, "type": "sale", "description": "blah"}
 {"index":{}}

docs/reference/aggregations/metrics.asciidoc

+2
@@ -35,6 +35,8 @@ include::metrics/scripted-metric-aggregation.asciidoc[]
 
 include::metrics/stats-aggregation.asciidoc[]
 
+include::metrics/string-stats-aggregation.asciidoc[]
+
 include::metrics/sum-aggregation.asciidoc[]
 
 include::metrics/tophits-aggregation.asciidoc[]
@@ -0,0 +1,217 @@
+[role="xpack"]
+[testenv="basic"]
+[[search-aggregations-metrics-string-stats-aggregation]]
+=== String Stats Aggregation
+
+A `multi-value` metrics aggregation that computes statistics over string values extracted from the aggregated documents.
+These values can be retrieved either from specific `keyword` fields in the documents or can be generated by a provided script.
+
+The string stats aggregation returns the following results:
+
+* `count` - The number of non-empty fields counted.
+* `min_length` - The length of the shortest term.
+* `max_length` - The length of the longest term.
+* `avg_length` - The average length computed over all terms.
+* `entropy` - The https://en.wikipedia.org/wiki/Entropy_(information_theory)[Shannon Entropy] value computed over all terms collected by
+the aggregation. Shannon entropy quantifies the amount of information contained in the field. It is a very useful metric for
+measuring a wide range of properties of a data set, such as diversity, similarity, randomness etc.
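
To make the listed results concrete, here is a minimal Python sketch of how such statistics could be computed over a list of terms. It is an illustration only, not the plugin's Java implementation: the function name `string_stats` is hypothetical, lengths are counted in Python code points, empty values are skipped, and the entropy is assumed to use a base-2 logarithm over character probabilities.

```python
import math
from collections import Counter


def string_stats(terms):
    """Sketch of count, length statistics and character-level Shannon entropy.

    Assumptions (not taken from the commit): empty strings are skipped,
    lengths are Python code-point counts, and entropy uses log base 2.
    """
    terms = [t for t in terms if t]
    if not terms:
        return {"count": 0, "min_length": 0, "max_length": 0,
                "avg_length": 0.0, "entropy": 0.0}

    lengths = [len(t) for t in terms]

    # Count every character across all terms, then turn counts into probabilities.
    char_counts = Counter("".join(terms))
    total_chars = sum(char_counts.values())
    entropy = -sum((n / total_chars) * math.log2(n / total_chars)
                   for n in char_counts.values())

    return {
        "count": len(terms),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "entropy": entropy,
    }


stats = string_stats(["ab", "abcd"])
```

With the two terms above, the characters `a` and `b` each have probability 1/3 and `c` and `d` each have probability 1/6, so the entropy is roughly 1.918 bits per character.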
+
+Assuming the data consists of Twitter messages:
+
+[source,console]
+--------------------------------------------------
+POST /twitter/_search?size=0
+{
+    "aggs" : {
+        "message_stats" : { "string_stats" : { "field" : "message.keyword" } }
+    }
+}
+--------------------------------------------------
+// TEST[setup:twitter]
+
+The above aggregation computes the string statistics for the `message` field in all documents. The aggregation type
+is `string_stats` and the `field` parameter defines the field of the documents the stats will be computed on.
+The above will return the following:
+
+[source,console-result]
+--------------------------------------------------
+{
+    ...
+
+    "aggregations": {
+        "message_stats" : {
+            "count" : 5,
+            "min_length" : 24,
+            "max_length" : 30,
+            "avg_length" : 28.8,
+            "entropy" : 3.94617750050791
+        }
+    }
+}
+--------------------------------------------------
+// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
+
+The name of the aggregation (`message_stats` above) also serves as the key by which the aggregation result can be retrieved from
+the returned response.
+
+==== Character distribution
+
+The computation of the Shannon Entropy value is based on the probability of each character appearing in all terms collected
+by the aggregation. To view the probability distribution for all characters, we can add the `show_distribution` (default: `false`) parameter.
+
+[source,console]
+--------------------------------------------------
+POST /twitter/_search?size=0
+{
+    "aggs" : {
+        "message_stats" : {
+            "string_stats" : {
+                "field" : "message.keyword",
+                "show_distribution": true  <1>
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TEST[setup:twitter]
+
+<1> Set the `show_distribution` parameter to `true`, so that the probability distribution for all characters is returned in the results.
+
+[source,console-result]
+--------------------------------------------------
+{
+    ...
+
+    "aggregations": {
+        "message_stats" : {
+            "count" : 5,
+            "min_length" : 24,
+            "max_length" : 30,
+            "avg_length" : 28.8,
+            "entropy" : 3.94617750050791,
+            "distribution" : {
+                " " : 0.1527777777777778,
+                "e" : 0.14583333333333334,
+                "s" : 0.09722222222222222,
+                "m" : 0.08333333333333333,
+                "t" : 0.0763888888888889,
+                "h" : 0.0625,
+                "a" : 0.041666666666666664,
+                "i" : 0.041666666666666664,
+                "r" : 0.041666666666666664,
+                "g" : 0.034722222222222224,
+                "n" : 0.034722222222222224,
+                "o" : 0.034722222222222224,
+                "u" : 0.034722222222222224,
+                "b" : 0.027777777777777776,
+                "w" : 0.027777777777777776,
+                "c" : 0.013888888888888888,
+                "E" : 0.006944444444444444,
+                "l" : 0.006944444444444444,
+                "1" : 0.006944444444444444,
+                "2" : 0.006944444444444444,
+                "3" : 0.006944444444444444,
+                "4" : 0.006944444444444444,
+                "y" : 0.006944444444444444
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/]
+
+The `distribution` object shows the probability of each character appearing in all terms. The characters are sorted by descending probability.
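
The character distribution described above can be sketched in a few lines of Python. This is a hedged illustration rather than the plugin's code: `char_distribution` is a hypothetical helper, and the order of characters with equal probability is an artifact of this sketch, since the documentation only states that entries are sorted by descending probability.

```python
from collections import Counter


def char_distribution(terms):
    """Map each character to its probability across all terms.

    Entries are ordered by descending probability. Tie order here follows
    first appearance, which is an assumption of this sketch.
    """
    counts = Counter("".join(terms))
    total = sum(counts.values())
    # most_common() yields (character, count) pairs sorted by descending count;
    # Python dicts preserve that insertion order.
    return {ch: n / total for ch, n in counts.most_common()}


distribution = char_distribution(["aab", "a"])
```

For the two terms above, `a` occurs three times out of four characters and `b` once, so the result maps `a` to 0.75 and `b` to 0.25, in that order.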
+
+==== Script
+
+Computing the message string stats based on a script:
+
+[source,console]
+--------------------------------------------------
+POST /twitter/_search?size=0
+{
+    "aggs" : {
+        "message_stats" : {
+            "string_stats" : {
+                "script" : {
+                    "lang": "painless",
+                    "source": "doc['message.keyword'].value"
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TEST[setup:twitter]
+
+This will interpret the `script` parameter as an `inline` script with the `painless` script language and no script parameters.
+To use a stored script, use the following syntax:
+
+[source,console]
+--------------------------------------------------
+POST /twitter/_search?size=0
+{
+    "aggs" : {
+        "message_stats" : {
+            "string_stats" : {
+                "script" : {
+                    "id": "my_script",
+                    "params" : {
+                        "field" : "message.keyword"
+                    }
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TEST[setup:twitter,stored_example_script]
+
+===== Value Script
+
+We can use a value script to modify the message (e.g., we can add a prefix) and compute the new stats:
+
+[source,console]
+--------------------------------------------------
+POST /twitter/_search?size=0
+{
+    "aggs" : {
+        "message_stats" : {
+            "string_stats" : {
+                "field" : "message.keyword",
+                "script" : {
+                    "lang": "painless",
+                    "source": "params.prefix + _value",
+                    "params" : {
+                        "prefix" : "Message: "
+                    }
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TEST[setup:twitter]
+
+==== Missing value
+
+The `missing` parameter defines how documents that are missing a value should be treated.
+By default, they are ignored, but it is also possible to treat them as if they had a value.
+
+[source,console]
+--------------------------------------------------
+POST /twitter/_search?size=0
+{
+    "aggs" : {
+        "message_stats" : {
+            "string_stats" : {
+                "field" : "message.keyword",
+                "missing": "[empty message]" <1>
+            }
+        }
+    }
+}
+--------------------------------------------------
+// TEST[setup:twitter]
+
+<1> Documents without a value in the `message` field will be treated as documents that have the value `[empty message]`.
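
The effect of the `missing` parameter can be pictured with a small Python sketch. It only models the documented behavior (skip documents without a value unless a substitute is given); the helper name `resolve_values` and the representation of documents as plain dictionaries are assumptions made for illustration.

```python
def resolve_values(docs, field, missing=None):
    """Collect the values of `field`, mimicking the documented `missing` behavior.

    Documents lacking the field are ignored by default; when `missing` is
    given, they contribute that substitute value instead.
    """
    values = []
    for doc in docs:
        value = doc.get(field)
        if value is None:
            if missing is None:
                continue  # default: the document is ignored
            value = missing
        values.append(value)
    return values


docs = [{"message": "hello"}, {"user": "kimchy"}]
ignored = resolve_values(docs, "message")
substituted = resolve_values(docs, "message", missing="[empty message]")
```

In the first call only `hello` is collected, while in the second the document without a `message` contributes `[empty message]`, so its 15 characters would also enter the length and entropy statistics.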
+7-2
@@ -6,10 +6,15 @@
 package org.elasticsearch.xpack.analytics;
 
 import org.elasticsearch.xpack.analytics.cumulativecardinality.CumulativeCardinalityPipelineAggregationBuilder;
+import org.elasticsearch.xpack.analytics.stringstats.StringStatsAggregationBuilder;
 
-public class DataScienceAggregationBuilders {
+public class AnalyticsAggregationBuilders {
 
-    public static CumulativeCardinalityPipelineAggregationBuilder cumulativeCaardinality(String name, String bucketsPath) {
+    public static CumulativeCardinalityPipelineAggregationBuilder cumulativeCardinality(String name, String bucketsPath) {
         return new CumulativeCardinalityPipelineAggregationBuilder(name, bucketsPath);
     }
+
+    public static StringStatsAggregationBuilder stringStats(String name) {
+        return new StringStatsAggregationBuilder(name);
+    }
 }

x-pack/plugin/analytics/src/main/java/org/elasticsearch/xpack/analytics/AnalyticsPlugin.java

+23-9
@@ -11,15 +11,17 @@
 import org.elasticsearch.plugins.ActionPlugin;
 import org.elasticsearch.plugins.Plugin;
 import org.elasticsearch.plugins.SearchPlugin;
-import org.elasticsearch.xpack.core.XPackPlugin;
-import org.elasticsearch.xpack.core.action.XPackInfoFeatureAction;
-import org.elasticsearch.xpack.core.action.XPackUsageFeatureAction;
-import org.elasticsearch.xpack.core.analytics.action.AnalyticsStatsAction;
 import org.elasticsearch.xpack.analytics.action.AnalyticsInfoTransportAction;
 import org.elasticsearch.xpack.analytics.action.AnalyticsUsageTransportAction;
 import org.elasticsearch.xpack.analytics.action.TransportAnalyticsStatsAction;
 import org.elasticsearch.xpack.analytics.cumulativecardinality.CumulativeCardinalityPipelineAggregationBuilder;
 import org.elasticsearch.xpack.analytics.cumulativecardinality.CumulativeCardinalityPipelineAggregator;
+import org.elasticsearch.xpack.analytics.stringstats.InternalStringStats;
+import org.elasticsearch.xpack.analytics.stringstats.StringStatsAggregationBuilder;
+import org.elasticsearch.xpack.core.XPackPlugin;
+import org.elasticsearch.xpack.core.action.XPackInfoFeatureAction;
+import org.elasticsearch.xpack.core.action.XPackUsageFeatureAction;
+import org.elasticsearch.xpack.core.analytics.action.AnalyticsStatsAction;
 
 import java.util.Arrays;
 import java.util.List;
@@ -38,11 +40,23 @@ public AnalyticsPlugin() { }
 
     @Override
     public List<PipelineAggregationSpec> getPipelineAggregations() {
-        return singletonList(new PipelineAggregationSpec(
-            CumulativeCardinalityPipelineAggregationBuilder.NAME,
-            CumulativeCardinalityPipelineAggregationBuilder::new,
-            CumulativeCardinalityPipelineAggregator::new,
-            CumulativeCardinalityPipelineAggregationBuilder::parse));
+        return singletonList(
+            new PipelineAggregationSpec(
+                CumulativeCardinalityPipelineAggregationBuilder.NAME,
+                CumulativeCardinalityPipelineAggregationBuilder::new,
+                CumulativeCardinalityPipelineAggregator::new,
+                CumulativeCardinalityPipelineAggregationBuilder::parse)
+        );
+    }
+
+    @Override
+    public List<AggregationSpec> getAggregations() {
+        return singletonList(
+            new AggregationSpec(
+                StringStatsAggregationBuilder.NAME,
+                StringStatsAggregationBuilder::new,
+                StringStatsAggregationBuilder::parse).addResultReader(InternalStringStats::new)
+        );
     }
 
     @Override
