
Commit 62d096c

szabosteve and lcawl committed
[DOCS] Provides further details on aggregations in datafeeds (#55462)
Co-authored-by: Lisa Cawley <[email protected]>
1 parent f31d3a2 commit 62d096c

File tree

1 file changed: +49 −16 lines changed


docs/reference/ml/anomaly-detection/aggregations.asciidoc

Lines changed: 49 additions & 16 deletions
@@ -15,17 +15,28 @@ TIP: If you use a terms aggregation and the cardinality of a term is high, the
 aggregation might not be effective and you might want to just use the default
 search and scroll behavior.
 
+[discrete]
+[[aggs-limits-dfeeds]]
+==== Requirements and limitations
+
 There are some limitations to using aggregations in {dfeeds}. Your aggregation
 must include a `date_histogram` aggregation, which in turn must contain a `max`
 aggregation on the time field. This requirement ensures that the aggregated data
 is a time series and the timestamp of each bucket is the time of the last record
 in the bucket.
 
+IMPORTANT: The name of the aggregation and the name of the field that the
+aggregation operates on need to match, otherwise the aggregation doesn't work.
+For example, if you use a `max` aggregation on a time field called
+`responsetime`, the name of the aggregation must also be `responsetime`.
+
 You must also consider the interval of the date histogram aggregation carefully.
 The bucket span of your {anomaly-job} must be divisible by the value of the
 `calendar_interval` or `fixed_interval` in your aggregation (with no remainder).
 If you specify a `frequency` for your {dfeed}, it must also be divisible by this
-interval.
+interval. {anomaly-jobs-cap} cannot use date histograms with an interval
+measured in months because the length of the month is not fixed. {dfeeds-cap}
+tolerate weeks or smaller units.
 
 TIP: As a rule of thumb, if your detectors use <<ml-metric-functions,metric>> or
 <<ml-sum-functions,sum>> analytical functions, set the date histogram
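The divisibility rules added in this hunk are easy to misconfigure. As a quick sketch (a hypothetical helper, not part of the Elastic docs or codebase), the check reduces to simple modular arithmetic with all values in seconds:

```python
# Hypothetical illustration of the rule above: the anomaly job's bucket span
# must be divisible by the date histogram interval, and so must the datafeed
# frequency if one is set. All arguments are durations in seconds.
def intervals_compatible(bucket_span, histogram_interval, frequency=None):
    if histogram_interval <= 0:
        return False
    if bucket_span % histogram_interval != 0:
        return False
    if frequency is not None and frequency % histogram_interval != 0:
        return False
    return True

print(intervals_compatible(900, 300))       # 15m span, 5m interval: divides evenly
print(intervals_compatible(900, 240))       # 4m interval leaves a remainder
print(intervals_compatible(900, 300, 600))  # 10m frequency is also divisible
```

For example, a 15-minute bucket span works with a 5-minute `fixed_interval` because 900 % 300 == 0, but not with a 4-minute one.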
@@ -34,6 +45,11 @@ finer, more granular time buckets, which are ideal for this type of analysis. If
 your detectors use <<ml-count-functions,count>> or <<ml-rare-functions,rare>>
 functions, set the interval to the same value as the bucket span.
 
+
+[discrete]
+[[aggs-include-jobs]]
+==== Including aggregations in {anomaly-jobs}
+
 When you create or update an {anomaly-job}, you can include the names of
 aggregations, for example:
 
@@ -85,13 +101,13 @@ PUT _ml/datafeeds/datafeed-farequote
       "time": { <1>
         "max": {"field": "time"}
       },
-      "airline": { <1>
+      "airline": { <2>
        "terms": {
          "field": "airline",
          "size": 100
        },
        "aggregations": {
-        "responsetime": { <1>
+        "responsetime": { <3>
          "avg": {
            "field": "responsetime"
          }
@@ -107,15 +123,23 @@ PUT _ml/datafeeds/datafeed-farequote
 
 <1> In this example, the aggregations have names that match the fields that they
 operate on. That is to say, the `max` aggregation is named `time` and its
-field is also `time`. The same is true for the aggregations with the names
-`airline` and `responsetime`.
+field also needs to be `time`.
+<2> Likewise, the `terms` aggregation is named `airline` and its field is also
+named `airline`.
+<3> Likewise, the `avg` aggregation is named `responsetime` and its field is
+also named `responsetime`.
+
+Your {dfeed} can contain multiple aggregations, but only the ones with names
+that match values in the job configuration are fed to the job.
 
-IMPORTANT: Your {dfeed} can contain multiple aggregations, but only the ones
-with names that match values in the job configuration are fed to the job.
 
-{dfeeds-cap} support complex nested aggregations, this example uses the `derivative`
-pipeline aggregation to find the first order derivative of the counter
-`system.network.out.bytes` for each value of the field `beat.name`.
+[discrete]
+[[aggs-dfeeds]]
+==== Nested aggregations in {dfeeds}
+
+{dfeeds-cap} support complex nested aggregations. This example uses the
+`derivative` pipeline aggregation to find the first order derivative of the
+counter `system.network.out.bytes` for each value of the field `beat.name`.
 
 [source,js]
 ----------------------------------
@@ -154,6 +178,11 @@ pipeline aggregation to find the first order derivative of the counter
 ----------------------------------
 // NOTCONSOLE
 
+
+[discrete]
+[[aggs-single-dfeeds]]
+==== Single bucket aggregations in {dfeeds}
+
 {dfeeds-cap} not only support multi-bucket aggregations, but also single bucket
 aggregations. The following shows two `filter` aggregations, each gathering the
 number of unique entries for the `error` field.
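The `filter` example that this hunk introduces lies outside the diff context. A minimal sketch of the shape the paragraph describes, assuming hypothetical filter terms (`"a"`, `"b"` on an illustrative `status` field) and a `cardinality` sub-aggregation to count unique `error` values; the field names here are placeholders, not the original example:

[source,js]
----------------------------------
"aggregations": {
  "buckets": {
    "date_histogram": {"field": "time", "fixed_interval": "5m"},
    "aggregations": {
      "time": {"max": {"field": "time"}},
      "status_a": {
        "filter": {"term": {"status": "a"}},
        "aggregations": {
          "unique_errors": {"cardinality": {"field": "error"}}
        }
      },
      "status_b": {
        "filter": {"term": {"status": "b"}},
        "aggregations": {
          "unique_errors": {"cardinality": {"field": "error"}}
        }
      }
    }
  }
}
----------------------------------
// NOTCONSOLE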
@@ -201,6 +230,11 @@ number of unique entries for the `error` field.
 ----------------------------------
 // NOTCONSOLE
 
+
+[discrete]
+[[aggs-define-dfeeds]]
+==== Defining aggregations in {dfeeds}
+
 When you define an aggregation in a {dfeed}, it must have the following form:
 
 [source,js]
@@ -239,7 +273,7 @@ When you define an aggregation in a {dfeed}, it must have the following form:
 The top level aggregation must be either a
 {ref}/search-aggregations-bucket.html[bucket aggregation] containing a single
 sub-aggregation that is a `date_histogram` or the top level aggregation is the
-required `date_histogram`.  There must be exactly one `date_histogram`
+required `date_histogram`. There must be exactly one `date_histogram`
 aggregation. For more information, see
 {ref}/search-aggregations-bucket-datehistogram-aggregation.html[Date histogram aggregation].
 
@@ -248,9 +282,9 @@ NOTE: The `time_zone` parameter in the date histogram aggregation must be set to
 
 Each histogram bucket has a key, which is the bucket start time. This key cannot
 be used for aggregations in {dfeeds}, however, because they need to know the
-time of the latest record within a bucket. Otherwise, when you restart a {dfeed},
-it continues from the start time of the histogram bucket and possibly fetches
-the same data twice. The max aggregation for the time field is therefore
+time of the latest record within a bucket. Otherwise, when you restart a
+{dfeed}, it continues from the start time of the histogram bucket and possibly
+fetches the same data twice. The max aggregation for the time field is therefore
 necessary to provide the time of the latest record within a bucket.
 
 You can optionally specify a terms aggregation, which creates buckets for
@@ -280,8 +314,7 @@ GET .../_search {
 By default, {es} limits the maximum number of terms returned to 10000. For high
 cardinality fields, the query might not run. It might return errors related to
 circuit breaking exceptions that indicate that the data is too large. In such
-cases, do not use aggregations in your {dfeed}. For more
-information, see
+cases, do not use aggregations in your {dfeed}. For more information, see
 {ref}/search-aggregations-bucket-terms-aggregation.html[Terms aggregation].
 
 You can also optionally specify multiple sub-aggregations. The sub-aggregations
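The final sentence is truncated by the diff context, but the idea of multiple sub-aggregations can be sketched with fields from the earlier farequote example: the `date_histogram` bucket carries the mandatory `max` on `time` plus any number of sibling metric or bucket sub-aggregations:

[source,js]
----------------------------------
"buckets": {
  "date_histogram": {"field": "time", "fixed_interval": "5m"},
  "aggregations": {
    "time": {"max": {"field": "time"}},
    "responsetime": {"avg": {"field": "responsetime"}},
    "airline": {"terms": {"field": "airline", "size": 100}}
  }
}
----------------------------------
// NOTCONSOLE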
