[ML] Outlier detection should only fetch docs that have the analyzed … #44944

dimitris-athanasiou · 2019-07-29T09:20:41Z

…fields

As data frame rows with missing values for analyzed fields are skipped,
we can be more efficient by including a query that only picks documents
that have values for all analyzed fields. Besides reducing the number
of documents we go through, this also gives a more accurate count of
how many rows we need, which reduces the memory requirements.

This also adds an integration test that runs outlier detection on data
with missing fields.
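
For illustration, a query of this kind can be written in the Elasticsearch query DSL as a `bool` query with one `exists` filter per analyzed field. The sketch below builds that JSON body with plain Java strings so it runs without any dependencies. The class and method names are hypothetical and this is not the code from this PR, which may well build the query differently.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the Elasticsearch code base.
final class RequiredFieldsQuerySketch {

    // Escapes backslashes and double quotes so a field name is safe inside a JSON string.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds a bool query whose filter clauses require every analyzed field to exist.
    static String build(List<String> analyzedFields) {
        String filters = analyzedFields.stream()
            .map(field -> "{\"exists\":{\"field\":" + quote(field) + "}}")
            .collect(Collectors.joining(","));
        return "{\"bool\":{\"filter\":[" + filters + "]}}";
    }
}
```

Filter clauses do not contribute to scoring, which suits a query whose only job is to exclude documents that would be skipped anyway.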

elasticmachine · 2019-07-29T09:20:43Z

Pinging @elastic/ml-core

przemekwitek

LGTM

przemekwitek · 2019-07-29T09:34:22Z

...rc/test/java/org/elasticsearch/xpack/ml/integration/OutlierDetectionWithMissingFieldsIT.java

+        }
+
+        String id = "test_outlier_detection_with_missing_fields";
+        DataFrameAnalyticsConfig config = buildOutlierDetectionAnalytics(id, new String[] {sourceIndex}, sourceIndex + "-results", null);


Just to be sure: you don't need to explicitly set analyzed_fields because it defaults to all numeric fields?
You could index some docs that have "numeric" but are missing "categorical" to show that a missing categorical field doesn't matter and the "ml" object is still generated for such docs.

Yes, the categorical field in this case is not included in the analyzed fields. Good idea, I'll do so.
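
To make the expectation in this thread concrete, here is a minimal sketch of the rule being discussed: a doc is analyzed only if it has a value for every analyzed field, and fields outside that set do not matter. The class and method names are hypothetical, and the field names "numeric" and "categorical" are taken from the comment above. This is not the test code in the PR.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of which docs are expected to get an "ml" object.
final class MissingFieldsSketch {

    // Returns true when the doc has a non-null value for every analyzed field.
    static boolean isAnalyzed(Map<String, Object> doc, List<String> analyzedFields) {
        for (String field : analyzedFields) {
            if (doc.get(field) == null) {
                return false;
            }
        }
        return true;
    }
}
```

With analyzed fields set to just "numeric", a doc containing only "numeric" is analyzed, while a doc containing only "categorical" is skipped.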

dimitris-athanasiou · 2019-07-29T10:22:52Z

run elasticsearch-ci/packaging-sample

dimitris-athanasiou added >enhancement :ml Machine learning v8.0.0 v7.4.0 v7.3.1 labels Jul 29, 2019

przemekwitek approved these changes Jul 29, 2019

Add missing values for the categorical field

ebada3d

dimitris-athanasiou merged commit 8a21cc8 into elastic:master Jul 29, 2019

dimitris-athanasiou deleted the outlier-detection-should-query-docs-that-have-all-analyzed-fields branch July 29, 2019 13:02

This was referenced Jul 29, 2019

[7.x][ML] Outlier detection should only fetch docs that have the analyzed … #44959

Merged

[7.3][ML] Outlier detection should only fetch docs that have the analyzed … #44960

Merged

codebrain mentioned this pull request Oct 14, 2019

7.4 meta ticket elastic/elasticsearch-net#4133

Closed

56 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
