Commit e2a2089

Adding more docs for delayed data detection (#36738)
* Adding more docs for delayed data detection
1 parent 363c62b commit e2a2089

3 files changed

+55
-7
lines changed

docs/reference/ml/apis/datafeedresource.asciidoc

Lines changed: 6 additions & 4 deletions
@@ -65,9 +65,10 @@ A {dfeed} resource has the following properties:
6565
releases earlier than 6.0.0. For more information, see <<removal-of-types>>.
6666

6767
`delayed_data_check_config`::
68-
(object) Specifies if and with how large a window should the data feed check
69-
for missing data. See <<ml-datafeed-delayed-data-check-config>>.
70-
For example: `{"enabled": true, "check_window": "1h"}`
68+
(object) Specifies whether the data feed checks for missing data and
69+
the size of the window. For example:
70+
`{"enabled": true, "check_window": "1h"}` See
71+
<<ml-datafeed-delayed-data-check-config>>.
7172

7273
[[ml-datafeed-chunking-config]]
7374
==== Chunking Configuration Objects
@@ -97,7 +98,8 @@ A chunking configuration object has the following properties:
9798
The {dfeed} can optionally search over indices that have already been read in
9899
an effort to find if any data has since been added to the index. If missing data
99100
is found, it is a good indication that the `query_delay` option is set too low and
100-
the data is being indexed after the {dfeed} has passed that moment in time.
101+
the data is being indexed after the {dfeed} has passed that moment in time. See
102+
{stack-ov}/ml-delayed-data-detection.html[Working with delayed data].
101103

102104
This check only runs on real-time {dfeeds}
103105
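
As an illustration of the `delayed_data_check_config` property described above, here is a minimal Python sketch that assembles a {dfeed} request body. The `datafeed_body` helper, the `my-job` and `my-index` names, and the `120s` delay are hypothetical values chosen for the example. Only the `{"enabled": true, "check_window": "1h"}` object comes from the documentation itself.

```python
import json


def datafeed_body(job_id, indices, query_delay="120s", check_window="1h"):
    """Build a datafeed body with delayed data checking enabled.

    All argument values are illustrative assumptions, not recommendations.
    """
    return {
        "job_id": job_id,
        "indices": indices,
        # How far behind real time the datafeed queries for data.
        "query_delay": query_delay,
        # The window the datafeed re-checks for documents that arrived late.
        "delayed_data_check_config": {
            "enabled": True,
            "check_window": check_window,
        },
    }


if __name__ == "__main__":
    # Hypothetical job and index names, used only for demonstration.
    print(json.dumps(datafeed_body("my-job", ["my-index"]), indent=2))
```

The body would then be sent with whatever HTTP client or Elasticsearch client you already use. That step is left out here because it depends on your environment.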

docs/reference/ml/configuring.asciidoc

Lines changed: 7 additions & 3 deletions
@@ -32,16 +32,20 @@ The scenarios in this section describe some best practices for generating useful
3232
* <<ml-configuring-url>>
3333
* <<ml-configuring-aggregation>>
3434
* <<ml-configuring-categories>>
35+
* <<ml-configuring-detector-custom-rules>>
3536
* <<ml-configuring-pop>>
3637
* <<ml-configuring-transform>>
37-
* <<ml-configuring-detector-custom-rules>>
38+
* <<ml-delayed-data-detection>>
3839

3940
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/customurl.asciidoc
4041
include::customurl.asciidoc[]
4142

4243
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/aggregations.asciidoc
4344
include::aggregations.asciidoc[]
4445

46+
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/detector-custom-rules.asciidoc
47+
include::detector-custom-rules.asciidoc[]
48+
4549
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/categories.asciidoc
4650
include::categories.asciidoc[]
4751

@@ -51,5 +55,5 @@ include::populations.asciidoc[]
5155
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/transforms.asciidoc
5256
include::transforms.asciidoc[]
5357

54-
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/detector-custom-rules.asciidoc
55-
include::detector-custom-rules.asciidoc[]
58+
:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/docs/reference/ml/delayed-data-detection.asciidoc
59+
include::delayed-data-detection.asciidoc[]

docs/reference/ml/delayed-data-detection.asciidoc

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
1+
[role="xpack"]
2+
[[ml-delayed-data-detection]]
3+
=== Handling delayed data
4+
5+
Delayed data are documents that are indexed late. That is to say, it is data
6+
related to a time that the {dfeed} has already processed.
7+
8+
When you create a datafeed, you can specify a {ref}/ml-datafeed-resource.html[`query_delay`] setting.
9+
This setting enables the datafeed to wait for some time past real-time, which means any "late" data in this period
10+
is fully indexed before the datafeed tries to gather it. However, if the setting is set too low, the datafeed may query
11+
for data before it has been indexed and consequently miss that document. Conversely, if it is set too high,
12+
analysis drifts farther away from real-time. The balance that is struck depends upon each use case and
13+
the environmental factors of the cluster.
14+
15+
==== Why worry about delayed data?
16+
17+
This is a particularly pertinent question. If data are delayed randomly (and consequently missing from analysis),
18+
the results of certain types of functions are largely unaffected. It all evens out in the end
19+
as the delayed data is distributed randomly. An example would be a `mean` metric for a field in a large collection of data.
20+
In this case, checking for delayed data may not provide much benefit. If data are consistently delayed, however, jobs with a `low_count` function may
21+
produce false positives. In this situation, it would be useful to see if data
22+
comes in after an anomaly is recorded so that you can determine a next course of action.
23+
24+
==== How do we detect delayed data?
25+
26+
In addition to the `query_delay` field, there is a
27+
{ref}/ml-datafeed-resource.html#ml-datafeed-delayed-data-check-config[delayed data check config], which enables you to
28+
configure the datafeed to look in the past for delayed data. Every 15 minutes or every `check_window`,
29+
whichever is smaller, the datafeed triggers a document search over the configured indices. This search looks over a
30+
time span with a length of `check_window` ending with the latest finalized bucket. That time span is partitioned into buckets,
31+
whose length equals the bucket span of the associated job. The `doc_count` of those buckets is then compared with the
32+
job's finalized analysis buckets to see whether any data has arrived since the analysis. If there is indeed missing data
33+
due to their ingest delay, the end user is notified.
34+
35+
==== What to do about delayed data?
36+
37+
The most common course of action is simply to do nothing. For many functions and situations, ignoring the data is
38+
acceptable. However, if the amount of delayed data is too great or the situation calls for it, the next course
39+
of action to consider is to increase the `query_delay` of the datafeed. This increased delay allows more time for data to be
40+
indexed. If you have real-time constraints, however, an increased delay might not be desirable.
41+
In that case, you would have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed].
42+
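
The detection steps described under "How do we detect delayed data?" can be sketched roughly as follows. This is an illustrative Python model under stated assumptions, not the actual Elasticsearch implementation: the function names are hypothetical, times are plain epoch seconds, the check window is assumed to be a whole multiple of the bucket span, and the per-bucket document counts are passed in as dictionaries rather than fetched from a cluster.

```python
from datetime import timedelta
from typing import Dict, List

# The documentation states that the check runs every 15 minutes or every
# check_window, whichever is smaller.
DEFAULT_CHECK_INTERVAL = timedelta(minutes=15)


def check_interval(check_window: timedelta) -> timedelta:
    """Return how often the delayed data check would run."""
    return min(DEFAULT_CHECK_INTERVAL, check_window)


def bucket_starts(latest_finalized_end: int, check_window: int,
                  bucket_span: int) -> List[int]:
    """Partition the check window into buckets of length bucket_span.

    The window ends at the end of the latest finalized bucket. All values
    are in seconds; check_window is assumed to be a multiple of bucket_span.
    """
    window_start = latest_finalized_end - check_window
    return list(range(window_start, latest_finalized_end, bucket_span))


def find_missing(index_counts: Dict[int, int],
                 analyzed_counts: Dict[int, int],
                 starts: List[int]) -> Dict[int, int]:
    """Compare current index doc counts with the counts the job analyzed.

    Returns a mapping of bucket start time to the number of documents that
    arrived after the bucket was analyzed.
    """
    missing = {}
    for start in starts:
        delta = index_counts.get(start, 0) - analyzed_counts.get(start, 0)
        if delta > 0:
            missing[start] = delta
    return missing
```

For instance, with a 30-minute window, 10-minute buckets, and a latest finalized bucket ending at second 3600, `bucket_starts(3600, 1800, 600)` yields the bucket starts 1800, 2400, and 3000. Any bucket whose index count now exceeds the analyzed count is reported as containing delayed data.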
