Issue
When a Datafeed is configured, the end user provides a query_delay. If this delay is too small, documents that have not yet been indexed when the Datafeed queries the index(es) are missed.
We currently do a poor job of indicating whether any data was missed and of alerting the user when it was.
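To make the failure mode concrete, here is a minimal sketch in Python. It assumes a simplified model in which each document has an event timestamp and a later moment at which it becomes searchable, and in which the Datafeed searches a bucket once `query_delay` seconds have passed after the bucket's end. The names `Doc` and `visible_count` and all numbers are hypothetical and are not part of the Datafeed implementation.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    timestamp: int      # event time, in seconds
    searchable_at: int  # wall-clock time at which the doc becomes visible to searches


def visible_count(docs, start, end, search_time):
    """Count docs with start <= timestamp < end that are searchable at search_time."""
    return sum(
        1
        for d in docs
        if start <= d.timestamp < end and d.searchable_at <= search_time
    )


# The bucket covers [0, 60). Two docs are indexed quickly; one has 90s of ingest latency.
docs = [Doc(10, 12), Doc(30, 35), Doc(50, 140)]
bucket_start, bucket_end = 0, 60

query_delay = 60  # assumed value, too small for the slowest document

# Simplified assumption: the bucket is searched at bucket_end + query_delay.
seen_by_datafeed = visible_count(docs, bucket_start, bucket_end, bucket_end + query_delay)

# Much later, every document has been indexed.
actual = visible_count(docs, bucket_start, bucket_end, 1000)

missed = actual - seen_by_datafeed
print(f"seen={seen_by_datafeed} actual={actual} missed={missed}")
```

In this toy model the third document only becomes searchable 20 seconds after the Datafeed has already searched the bucket, so the bucket's event_count would stay one lower than the real document count.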
Solution
A proposed solution is for a separate process in real-time Datafeeds to look at past finalized bucket(s) and compare each bucket's event_count with the current actual count of documents that match the user-provided query within that bucket's time window.
To capture bucket discrepancies over an arbitrary number of past buckets, a date_histogram aggregation with interval=bucket_span can be used. Combined with the Datafeed's query, it gives an accurate count of what the event_count SHOULD be given the current data in the index. Then, for each finalized bucket, we compare the event_count to the true count in the matching date_histogram bucket. If the true count is higher than the event_count, that is considered a discrepancy.
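The comparison could be sketched roughly as follows, in Python for illustration only. The function names `build_count_search` and `find_discrepancies`, the input dictionary shapes, and the example field name are assumptions made for this sketch. It also assumes that finalized bucket timestamps line up with the date_histogram keys, and that event_count is directly comparable to a document count (which may not hold for pre-summarized or aggregated input).

```python
def build_count_search(datafeed_query, time_field, bucket_span_ms, start_ms, end_ms):
    """Build a search body: the datafeed query limited to [start_ms, end_ms),
    with one date_histogram bucket per bucket_span."""
    return {
        "size": 0,  # only the aggregation counts are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    datafeed_query,
                    {
                        "range": {
                            time_field: {
                                "gte": start_ms,
                                "lt": end_ms,
                                "format": "epoch_millis",
                            }
                        }
                    },
                ]
            }
        },
        "aggs": {
            "buckets": {
                # "interval" as named in the proposal; newer Elasticsearch
                # versions use "fixed_interval" instead.
                "date_histogram": {
                    "field": time_field,
                    "interval": f"{bucket_span_ms}ms",
                }
            }
        },
    }


def find_discrepancies(finalized_buckets, histogram_buckets):
    """Return buckets whose current doc count exceeds the recorded event_count.

    finalized_buckets: [{"timestamp": epoch_ms, "event_count": int}, ...]
    histogram_buckets: [{"key": epoch_ms, "doc_count": int}, ...]
    """
    actual = {b["key"]: b["doc_count"] for b in histogram_buckets}
    discrepancies = []
    for bucket in finalized_buckets:
        true_count = actual.get(bucket["timestamp"], 0)
        if true_count > bucket["event_count"]:
            discrepancies.append(
                {
                    "timestamp": bucket["timestamp"],
                    "missing": true_count - bucket["event_count"],
                }
            )
    return discrepancies


body = build_count_search({"match_all": {}}, "@timestamp", 300_000, 0, 900_000)
found = find_discrepancies(
    [
        {"timestamp": 0, "event_count": 100},
        {"timestamp": 300_000, "event_count": 80},
        {"timestamp": 600_000, "event_count": 50},
    ],
    [
        {"key": 0, "doc_count": 100},
        {"key": 300_000, "doc_count": 95},
        {"key": 600_000, "doc_count": 50},
    ],
)
print(found)
```

Only buckets where the index now holds more documents than were counted are reported; a lower current count (for example after deletions) is deliberately not treated as a discrepancy, matching the description above.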
If a discrepancy is found, an Audit message should be created suggesting an increase in the query_delay. As more capabilities are added (possibly Annotations?), those could be used to give a better indication of how much data was missed over a given time range.
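As a rough illustration of what such an audit message might contain, the Python sketch below summarizes a list of per-bucket shortfalls into one line of text. The function name `audit_message`, the input shape, and the message wording are all hypothetical; the proposal does not specify them.

```python
def audit_message(discrepancies, bucket_span_ms):
    """Summarize per-bucket shortfalls into one audit message, or None if there are none.

    discrepancies: [{"timestamp": epoch_ms, "missing": int}, ...]
    """
    if not discrepancies:
        return None
    total_missing = sum(d["missing"] for d in discrepancies)
    # Affected time range: start of the earliest bucket to end of the latest one.
    range_start = min(d["timestamp"] for d in discrepancies)
    range_end = max(d["timestamp"] for d in discrepancies) + bucket_span_ms
    return (
        f"Datafeed missed {total_missing} documents in {len(discrepancies)} bucket(s) "
        f"between {range_start} and {range_end} (epoch ms) due to ingest latency; "
        f"consider increasing query_delay."
    )


message = audit_message(
    [{"timestamp": 300_000, "missing": 15}, {"timestamp": 900_000, "missing": 5}],
    bucket_span_ms=300_000,
)
print(message)
```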
droberts195 changed the title from "Determine when data is missing from a bucket due to Ingest latency" to "[ML] Determine when data is missing from a bucket due to Ingest latency" on Nov 2, 2018