[role="xpack"]
[[ml-delayed-data-detection]]
=== Handling delayed data

Delayed data are documents that are indexed late; that is, they relate to a
time period that the {dfeed} has already processed.
When you create a {dfeed}, you can specify a
{ref}/ml-datafeed-resource.html[`query_delay`] setting. This setting enables
the {dfeed} to wait for some time past real-time, so that any "late" data in
this period is fully indexed before the {dfeed} tries to gather it. If the
value is set too low, however, the {dfeed} may query for data before it has
been indexed and consequently miss those documents. Conversely, if it is set
too high, analysis drifts further from real-time. The right balance depends on
the use case and the environmental factors of the cluster.

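For illustration, a {dfeed} with an explicit `query_delay` might be created as
follows. The {dfeed} ID, job ID, and index name here are hypothetical; only
`query_delay` is the setting under discussion:

[source,console]
--------------------------------------------------
PUT _ml/datafeeds/datafeed-example
{
  "job_id": "example-job",
  "indices": ["my-index"],
  "query_delay": "120s" <1>
}
--------------------------------------------------
<1> Query only for data that is at least two minutes old, which gives late
documents time to be indexed before the {dfeed} gathers them.
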
==== Why worry about delayed data?

This is a pertinent question. If data are delayed randomly (and consequently
missing from analysis), the results of certain types of functions are not
really affected; it all comes out okay in the end because the delayed data is
distributed randomly. An example would be a `mean` metric for a field in a
large collection of data. In this case, checking for delayed data may not
provide much benefit. If data are consistently delayed, however, jobs with a
`low_count` function may produce false positives. In this situation, it is
useful to see whether data arrives after an anomaly is recorded so that you
can determine a next course of action.

==== How do we detect delayed data?

In addition to the `query_delay` field, there is a
{ref}/ml-datafeed-resource.html#ml-datafeed-delayed-data-check-config[delayed data check config],
which enables you to configure the {dfeed} to look in the past for delayed
data. Every 15 minutes or every `check_window`, whichever is smaller, the
{dfeed} triggers a document search over the configured indices. This search
covers a time span of length `check_window` ending with the latest finalized
bucket. That time span is partitioned into buckets whose length equals the
bucket span of the associated job. The `doc_count` of those buckets is then
compared with the job's finalized analysis buckets to see whether any data has
arrived since the analysis. If data is indeed missing due to ingest delay, the
end user is notified.

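For example, delayed data checking could be enabled on a hypothetical {dfeed}
named `datafeed-example` like this (the ID, job, and index names are
placeholders):

[source,console]
--------------------------------------------------
PUT _ml/datafeeds/datafeed-example
{
  "job_id": "example-job",
  "indices": ["my-index"],
  "delayed_data_check_config": {
    "enabled": true,
    "check_window": "2h" <1>
  }
}
--------------------------------------------------
<1> Search the two hours before the latest finalized bucket for documents that
were missed by the original analysis.
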
==== What to do about delayed data?

The most common course of action is to do nothing. For many functions and
situations, ignoring the data is acceptable. However, if the amount of delayed
data is too great or the situation calls for it, the next course of action to
consider is to increase the `query_delay` of the {dfeed}. This increased delay
allows more time for data to be indexed. If you have real-time constraints,
however, an increased delay might not be desirable. In that case, you would
have to {ref}/tune-for-indexing-speed.html[tune for better indexing speed].

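As a sketch, you could raise the `query_delay` of a hypothetical {dfeed} named
`datafeed-example` with the update {dfeed} API. Note that, depending on your
version, the {dfeed} may need to be stopped before this property can be
updated:

[source,console]
--------------------------------------------------
POST _ml/datafeeds/datafeed-example/_update
{
  "query_delay": "300s" <1>
}
--------------------------------------------------
<1> Allow five minutes for late documents to be indexed before the {dfeed}
queries for them.
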