-
Notifications
You must be signed in to change notification settings - Fork 25.2k
[ML] Outlier detection should only fetch docs that have the analyzed … #44944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Outlier detection should only fetch docs that have the analyzed … #44944
Conversation
…fields As data frame rows with missing values for analyzed fields are skipped, we can be more efficient by including a query that only picks documents that have values for all analyzed fields. Besides improving the number of documents we go through, we also provide a more accurate measurement of how many rows we need which reduces the memory requirements. This also adds an integration test that runs outlier detection on data with missing fields.
Pinging @elastic/ml-core |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
} | ||
|
||
String id = "test_outlier_detection_with_missing_fields"; | ||
DataFrameAnalyticsConfig config = buildOutlierDetectionAnalytics(id, new String[] {sourceIndex}, sourceIndex + "-results", null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for certainty, you don't need to explicitly set analyzed_fields, because it defaults to all numeric fields?
You could index some docs that have "numeric" but are missing "categorical" to show that missing categorical field doesn't matter and "ml" object is still generated for such docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the categorical field in this case is not included in the analyzed fields. Good idea, I'll do so.
run elasticsearch-ci/packaging-sample |
elastic#44944) As data frame rows with missing values for analyzed fields are skipped, we can be more efficient by including a query that only picks documents that have values for all analyzed fields. Besides improving the number of documents we go through, we also provide a more accurate measurement of how many rows we need which reduces the memory requirements. This also adds an integration test that runs outlier detection on data with missing fields.
elastic#44944) As data frame rows with missing values for analyzed fields are skipped, we can be more efficient by including a query that only picks documents that have values for all analyzed fields. Besides improving the number of documents we go through, we also provide a more accurate measurement of how many rows we need which reduces the memory requirements. This also adds an integration test that runs outlier detection on data with missing fields.
#44944) (#44959) As data frame rows with missing values for analyzed fields are skipped, we can be more efficient by including a query that only picks documents that have values for all analyzed fields. Besides improving the number of documents we go through, we also provide a more accurate measurement of how many rows we need which reduces the memory requirements. This also adds an integration test that runs outlier detection on data with missing fields.
#44944) As data frame rows with missing values for analyzed fields are skipped, we can be more efficient by including a query that only picks documents that have values for all analyzed fields. Besides improving the number of documents we go through, we also provide a more accurate measurement of how many rows we need which reduces the memory requirements. This also adds an integration test that runs outlier detection on data with missing fields.
#44944) (#44960) As data frame rows with missing values for analyzed fields are skipped, we can be more efficient by including a query that only picks documents that have values for all analyzed fields. Besides improving the number of documents we go through, we also provide a more accurate measurement of how many rows we need which reduces the memory requirements. This also adds an integration test that runs outlier detection on data with missing fields.
…fields
As data frame rows with missing values for analyzed fields are skipped,
we can be more efficient by including a query that only picks documents
that have values for all analyzed fields. Besides improving the number
of documents we go through, we also provide a more accurate measurement
of how many rows we need which reduces the memory requirements.
This also adds an integration test that runs outlier detection on data
with missing fields.