[ML] Determine when data is missing from a bucket due to Ingest latency #35131

Closed
benwtrent opened this issue Oct 31, 2018 · 2 comments
benwtrent commented Oct 31, 2018

Issue

When a Datafeed is configured, the end user provides a query_delay. At times this delay is too small; when the Datafeed then pulls data from the index(es), documents that have not yet been indexed are missed.
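To make the failure mode concrete, here is a minimal Python sketch (a toy model, not Elasticsearch code). It assumes the Datafeed searches a time window once, at the window's end plus query_delay, and that each document only becomes searchable some time after its event timestamp. The names `Doc`, `searchable_at`, and `docs_seen_by_datafeed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    event_time: int     # seconds; the timestamp stored in the document
    searchable_at: int  # seconds; when the document became visible to searches


def docs_seen_by_datafeed(docs, window_start, window_end, query_delay):
    """Count docs in [window_start, window_end) that are visible at search time.

    Assumption: the window is searched once, at window_end + query_delay.
    """
    search_time = window_end + query_delay
    return sum(
        1
        for d in docs
        if window_start <= d.event_time < window_end
        and d.searchable_at <= search_time
    )


docs = [
    Doc(event_time=10, searchable_at=12),
    Doc(event_time=50, searchable_at=130),  # indexed 80 seconds late
    Doc(event_time=59, searchable_at=61),
]

# With a 60s delay the late document is not yet searchable and is missed;
# with a 90s delay all three documents are counted.
seen_short = docs_seen_by_datafeed(docs, 0, 60, query_delay=60)
seen_long = docs_seen_by_datafeed(docs, 0, 60, query_delay=90)
```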

We currently do a poor job of indicating whether any data was missed and of alerting the user when it was.

Solution

A proposed solution is for a separate process in real-time Datafeeds to look at past finalized bucket(s) and compare each bucket's event_count with the current actual count of documents that match the user-provided query within that bucket's time window.

To capture bucket discrepancies over an arbitrary number of buckets in the past, we can use a date_histogram aggregation with interval=bucket_span. Used in conjunction with the Datafeed's query, this gives an accurate count of what the event_count SHOULD be given the data currently in the index. Then, for each finalized bucket, we compare the event_count to the document count in the matching date_histogram bucket. If the index holds more documents than the event_count, that is considered a discrepancy.
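The comparison could be sketched roughly as follows, in Python for brevity. This is an illustration under stated assumptions, not the actual implementation: `build_count_query` and `find_discrepancies` are hypothetical helpers, finalized buckets are assumed to be dicts with `timestamp` (epoch millis) and `event_count`, and the histogram buckets are assumed to have the usual `key` / `doc_count` shape of a date_histogram response. The sample numbers are made up.

```python
def build_count_query(datafeed_query, time_field, bucket_span, start_ms, end_ms):
    """Search body: the Datafeed's query restricted to a time range, plus a
    date_histogram whose interval matches the job's bucket_span."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    datafeed_query,
                    {
                        "range": {
                            time_field: {
                                "gte": start_ms,
                                "lt": end_ms,
                                "format": "epoch_millis",
                            }
                        }
                    },
                ]
            }
        },
        "aggs": {
            "buckets": {
                "date_histogram": {"field": time_field, "interval": bucket_span}
            }
        },
    }


def find_discrepancies(finalized_buckets, histogram_buckets):
    """Map bucket timestamp -> number of documents the job did not see."""
    actual = {b["key"]: b["doc_count"] for b in histogram_buckets}
    missing = {}
    for bucket in finalized_buckets:
        ts = bucket["timestamp"]
        diff = actual.get(ts, 0) - bucket["event_count"]
        if diff > 0:  # the index now holds more docs than the job counted
            missing[ts] = diff
    return missing


query = build_count_query({"match_all": {}}, "timestamp", "5m", 0, 900_000)

# Illustrative values only.
finalized = [
    {"timestamp": 0, "event_count": 100},
    {"timestamp": 300_000, "event_count": 80},
    {"timestamp": 600_000, "event_count": 95},
]
histogram = [
    {"key": 0, "doc_count": 100},
    {"key": 300_000, "doc_count": 92},
    {"key": 600_000, "doc_count": 95},
]
discrepancies = find_discrepancies(finalized, histogram)
```

Only buckets where the index count exceeds the event_count are reported; a lower index count (for example, after documents were deleted) is ignored in this sketch.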

If a discrepancy is found, an Audit message should be created suggesting an increase in the query_delay. As more capabilities are added (possibly Annotations?), those could be used to give a better indication of how much data was missed over a given time range.
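As a rough sketch of what such an Audit message might summarize, assuming the discrepancies are available as a mapping from bucket timestamp (epoch millis) to the number of missed documents. The function name and the message wording are hypothetical, not an existing format.

```python
def missing_data_audit_message(missing_by_bucket, query_delay):
    """Return a human-readable warning, or None if no data was missed."""
    if not missing_by_bucket:
        return None
    total = sum(missing_by_bucket.values())
    latest = max(missing_by_bucket)  # most recent affected bucket
    return (
        f"Datafeed missed {total} documents across "
        f"{len(missing_by_bucket)} bucket(s) due to ingest latency; "
        f"latest affected bucket starts at {latest}. "
        f"Consider increasing query_delay (currently {query_delay})."
    )


message = missing_data_audit_message({300_000: 12, 900_000: 3}, query_delay="60s")
```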

@benwtrent benwtrent added >feature :ml Machine learning v6.6.0 labels Oct 31, 2018
@benwtrent benwtrent self-assigned this Oct 31, 2018
@elasticmachine

Pinging @elastic/ml-core

@droberts195 droberts195 changed the title Determine when data is missing from a bucket due to Ingest latency [ML] Determine when data is missing from a bucket due to Ingest latency Nov 2, 2018
@jasontedor jasontedor added v6.7.0 and removed v6.6.0 labels Dec 19, 2018
@benwtrent benwtrent added v6.6.0 and removed v6.6.0 labels Jan 4, 2019
@droberts195 droberts195 added v6.6.0 and removed v6.7.0 labels Jan 7, 2019