diff --git a/docs/en/stack/ml/anomaly-detection/images/influencers.jpg b/docs/en/stack/ml/anomaly-detection/images/influencers.jpg new file mode 100644 index 000000000..28091c478 Binary files /dev/null and b/docs/en/stack/ml/anomaly-detection/images/influencers.jpg differ diff --git a/docs/en/stack/ml/anomaly-detection/influencers.asciidoc b/docs/en/stack/ml/anomaly-detection/influencers.asciidoc new file mode 100644 index 000000000..3f71f7ba2 --- /dev/null +++ b/docs/en/stack/ml/anomaly-detection/influencers.asciidoc @@ -0,0 +1,79 @@ +[role="xpack"] +[[ml-influencers]] +=== Influencers + +When anomalous events occur, we want to know why. To determine the cause, +however, you often need a broader knowledge of the domain. If you have +suspicions about which entities in your dataset are likely causing +irregularities, you can identify them as influencers in your {anomaly-jobs}. +That is to say, _influencers_ are fields that you suspect contain information +about someone or something that influences or contributes to anomalies in your +data. + +Influencers can be any field in your data. If you use {dfeeds}, however, the +field must exist in your {dfeed} query or aggregation; otherwise it is not +included in the job analysis. If you use a query in your {dfeed}, there is an +additional requirement: influencer fields must exist in the query results in the +same hit as the detector fields. {dfeeds-cap} process data by paging through the +query results; since search hits cannot span multiple indices or documents, +{dfeeds} have the same limitation. + +Influencers do not need to be fields that are specified in your {anomaly-job} +detectors, though they often are. If you use aggregations in your {dfeed}, it is +possible to use influencers that come from different indices than the detector +fields. However, both indices must have a date field with the same name, which you +specify in the `data_description`.`time_field` property for the {dfeed}. + +Picking an influencer is strongly recommended for the following reasons: + +* It allows you to more easily assign blame for anomalies +* It simplifies and aggregates the results + +If you use {kib}, the job creation wizards can suggest which fields to use as +influencers. The best influencer is the person or thing that you want to blame +for the anomaly. In many cases, users or client IP addresses make excellent +influencers. + +TIP: As a best practice, do not pick too many influencers. For example, you +generally do not need more than three. If you pick many influencers, the results +can be overwhelming and there is a small overhead to the analysis. + +[discrete] +[[ml-influencer-results]] +==== Influencer results + +The influencer results show which entities were anomalous and when. One +influencer result is written per bucket for each influencer that is considered +anomalous. For jobs with more than one detector, these scores provide a powerful +view of the most anomalous entities. + +For example, the `high_sum_total_sales` {anomaly-job} for the eCommerce orders +sample data uses `customer_full_name.keyword` and `category.keyword` as +influencers. You can examine the influencer results with the +{ref}/ml-get-influencer.html[get influencers API]. Alternatively, you can use +the *Anomaly Explorer* in {kib}: + +[role="screenshot"] +image::images/influencers.jpg["Influencers in the {kib} Anomaly Explorer"] + +On the left is a list of the top influencers for all of the detected anomalies +in that same time period. The list includes maximum anomaly scores, which in +this case are aggregated for each influencer, for each bucket, across all +detectors. There is also a total sum of the anomaly scores for each influencer. +You can use this list to help you narrow down the contributing factors and focus +on the most anomalous entities. + +You can also explore swim lanes that correspond to the values of an influencer. +In this example, the swim lanes correspond to the values for the +`customer_full_name.keyword`. By default, the swim lanes are sorted according to +which entity has the maximum anomaly score values. You can click on the sections +in the swim lane to see details about the anomalies that occurred in that time +interval. + +TIP: The anomaly scores that you see in each section of the *Anomaly Explorer* +might differ slightly. This disparity occurs because for each {anomaly-job}, +there are bucket results, influencer results, and record results. Anomaly scores +are generated for each type of result. The anomaly timeline in {kib} uses the +bucket-level anomaly scores. If you view swim lanes by influencer, it uses the +influencer-level anomaly scores, as does the list of top influencers. The list +of anomalies uses the record-level anomaly scores. diff --git a/docs/en/stack/ml/anomaly-detection/job-tips.asciidoc b/docs/en/stack/ml/anomaly-detection/job-tips.asciidoc index e180be33c..30e2d933a 100644 --- a/docs/en/stack/ml/anomaly-detection/job-tips.asciidoc +++ b/docs/en/stack/ml/anomaly-detection/job-tips.asciidoc @@ -56,23 +56,7 @@ duplicates if they have the same `function`, `field_name`, `by_field_name`, [[influencers]] ===== Influencers -When you create an {anomaly-job}, you can specify _influencers_, which are also -sometimes referred to as _key fields_. Picking an influencer is strongly -recommended for the following reasons: - -* It allows you to more easily assign blame for the anomaly -* It simplifies and aggregates the results - -The best influencer is the person or thing that you want to blame for the -anomaly. In many cases, users or client IP addresses make excellent influencers. -Influencers can be any field in your data; they do not need to be fields that -are specified in your detectors, though they often are. - -As a best practice, do not pick too many influencers. For example, you generally -do not need more than three. If you pick many influencers, the results can be -overwhelming and there is a small overhead to the analysis. - -The job creation wizards in {kib} can suggest which fields to use as influencers. +See <>. [[model-memory-limits]] ===== Model memory limits diff --git a/docs/en/stack/ml/anomaly-detection/ml-concepts.asciidoc b/docs/en/stack/ml/anomaly-detection/ml-concepts.asciidoc index 5472a699e..2ed66c854 100644 --- a/docs/en/stack/ml/anomaly-detection/ml-concepts.asciidoc +++ b/docs/en/stack/ml/anomaly-detection/ml-concepts.asciidoc @@ -8,6 +8,7 @@ This section explains the fundamental concepts of the Elastic {ml} * <> * <> * <> +* <> * <> * <> * <> @@ -19,10 +20,12 @@ include::datafeeds.asciidoc[] include::buckets.asciidoc[] +include::influencers.asciidoc[] + include::calendars.asciidoc[] include::rules.asciidoc[] include::architecture.asciidoc[] -include::model-snapshots.asciidoc[] \ No newline at end of file +include::model-snapshots.asciidoc[]