[DOCS] Adds conceptual overview for influencers (#756)

lcawl · lcawl · commit 43d2001176c3 · 2019-12-19T08:29:19.000-08:00
diff --git a/docs/en/stack/ml/anomaly-detection/images/influencers.jpg b/docs/en/stack/ml/anomaly-detection/images/influencers.jpg
diff --git a/docs/en/stack/ml/anomaly-detection/influencers.asciidoc b/docs/en/stack/ml/anomaly-detection/influencers.asciidoc
@@ -0,0 +1,79 @@
+[role="xpack"]
+[[ml-influencers]]
+=== Influencers
+
+When anomalous events occur, you want to know why. To determine the cause,
+however, you often need a broader knowledge of the domain. If you have
+suspicions about which entities in your dataset are likely causing
+irregularities, you can identify them as influencers in your {anomaly-jobs}.
+That is to say, _influencers_ are fields that you suspect contain information
+about someone or something that influences or contributes to anomalies in your
+data.
+
+Influencers can be any field in your data. If you use {dfeeds}, however, the
+field must exist in your {dfeed} query or aggregation; otherwise it is not
+included in the job analysis. If you use a query in your {dfeed}, there is an
+additional requirement: influencer fields must exist in the query results in the
+same hit as the detector fields. {dfeeds-cap} process data by paging through the
+query results; since search hits cannot span multiple indices or documents,
+{dfeeds} have the same limitation.
+
+Influencers do not need to be fields that are specified in your {anomaly-job}
+detectors, though they often are. If you use aggregations in your {dfeed}, it is
+possible to use influencers that come from different indices than the detector
+fields. However, both indices must have a date field with the same name, which you
+specify in the `data_description.time_field` property for the {anomaly-job}.
+
+Picking an influencer is strongly recommended for the following reasons:
+
+* It allows you to more easily assign blame for anomalies
+* It simplifies and aggregates the results
+
+If you use {kib}, the job creation wizards can suggest which fields to use as
+influencers. The best influencer is the person or thing that you want to blame
+for the anomaly. In many cases, users or client IP addresses make excellent
+influencers.
+
+TIP: As a best practice, do not pick too many influencers. For example, you
+generally do not need more than three. If you pick many influencers, the results
+can be overwhelming and there is a small overhead to the analysis.
+
+[discrete]
+[[ml-influencer-results]]
+==== Influencer results
+
+The influencer results show which entities were anomalous and when. One
+influencer result is written per bucket for each influencer that is considered
+anomalous. For jobs with more than one detector, the influencer scores provide a
+powerful view of the most anomalous entities.
+
+For example, the `high_sum_total_sales` {anomaly-job} for the eCommerce orders
+sample data uses `customer_full_name.keyword` and `category.keyword` as
+influencers. You can examine the influencer results with the
+{ref}/ml-get-influencer.html[get influencers API]. Alternatively, you can use
+the *Anomaly Explorer* in {kib}:
+
+[role="screenshot"]
+image::images/influencers.jpg["Influencers in the {kib} Anomaly Explorer"]
+
+On the left is a list of the top influencers for all of the detected anomalies
+in that same time period. The list includes maximum anomaly scores, which in
+this case are aggregated for each influencer, for each bucket, across all
+detectors. There is also a total sum of the anomaly scores for each influencer.
+You can use this list to help you narrow down the contributing factors and focus
+on the most anomalous entities.
+
+You can also explore swim lanes that correspond to the values of an influencer.
+In this example, the swim lanes correspond to the values of the
+`customer_full_name.keyword` field. By default, the swim lanes are sorted
+according to which entity has the maximum anomaly score values. You can click
+on the sections in the swim lane to see details about the anomalies that
+occurred in that time interval.
+
+TIP: The anomaly scores that you see in each section of the *Anomaly Explorer*
+might differ slightly. This disparity occurs because for each {anomaly-job},
+there are bucket results, influencer results, and record results. Anomaly scores
+are generated for each type of result. The anomaly timeline in {kib} uses the
+bucket-level anomaly scores. If you view swim lanes by influencer, they use the
+influencer-level anomaly scores, as does the list of top influencers. The list
+of anomalies uses the record-level anomaly scores.
diff --git a/docs/en/stack/ml/anomaly-detection/job-tips.asciidoc b/docs/en/stack/ml/anomaly-detection/job-tips.asciidoc
@@ -56,23 +56,7 @@ duplicates if they have the same `function`, `field_name`, `by_field_name`,
 [[influencers]]
 ===== Influencers
 
-When you create an {anomaly-job}, you can specify _influencers_, which are also 
-sometimes referred to as _key fields_. Picking an influencer is strongly
-recommended for the following reasons:
-
-* It allows you to more easily assign blame for the anomaly
-* It simplifies and aggregates the results
-
-The best influencer is the person or thing that you want to blame for the
-anomaly. In many cases, users or client IP addresses make excellent influencers.
-Influencers can be any field in your data; they do not need to be fields that
-are specified in your detectors, though they often are.
-
-As a best practice, do not pick too many influencers. For example, you generally
-do not need more than three. If you pick many influencers, the results can be
-overwhelming and there is a small overhead to the analysis.
-
-The job creation wizards in {kib} can suggest which fields to use as influencers.
+See <<ml-influencers>>.
 
 [[model-memory-limits]]
 ===== Model memory limits
diff --git a/docs/en/stack/ml/anomaly-detection/ml-concepts.asciidoc b/docs/en/stack/ml/anomaly-detection/ml-concepts.asciidoc
@@ -8,6 +8,7 @@ This section explains the fundamental concepts of the Elastic {ml}
 * <<ml-jobs>>
 * <<ml-dfeeds>>
 * <<ml-buckets>>
+* <<ml-influencers>>
 * <<ml-calendars>>
 * <<ml-rules>>
 * <<ml-model-snapshots>>
@@ -19,10 +20,12 @@ include::datafeeds.asciidoc[]
 
 include::buckets.asciidoc[]
 
+include::influencers.asciidoc[]
+
 include::calendars.asciidoc[]
 
 include::rules.asciidoc[]
 
 include::architecture.asciidoc[]
 
-include::model-snapshots.asciidoc[]
+include::model-snapshots.asciidoc[]
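
Illustration (not part of the commit): the new `influencers.asciidoc` page describes a list of top influencers that shows a maximum anomaly score and a total sum of scores for each influencer. The Python sketch below shows one way such a summary could be computed from influencer results. The field names `influencer_field_name`, `influencer_field_value`, and `influencer_score` follow the get influencers API output; the sample records, the `top_influencers` helper, and the sort order are hypothetical, and this is not necessarily how {kib} computes the list.

```python
from collections import defaultdict


def top_influencers(results):
    """Summarize influencer results per (field name, field value).

    Returns tuples of (field_name, field_value, max_score, total_score),
    sorted by max_score in descending order.
    """
    max_score = defaultdict(float)
    total_score = defaultdict(float)
    for result in results:
        key = (result["influencer_field_name"], result["influencer_field_value"])
        score = result["influencer_score"]
        max_score[key] = max(max_score[key], score)  # highest score in any bucket
        total_score[key] += score                    # sum across all buckets
    rows = [
        (name, value, max_score[(name, value)], total_score[(name, value)])
        for (name, value) in max_score
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical influencer results: one record per bucket per anomalous influencer.
sample_results = [
    {"timestamp": 1, "influencer_field_name": "customer_full_name.keyword",
     "influencer_field_value": "Customer A", "influencer_score": 75.0},
    {"timestamp": 2, "influencer_field_name": "customer_full_name.keyword",
     "influencer_field_value": "Customer A", "influencer_score": 25.0},
    {"timestamp": 1, "influencer_field_name": "customer_full_name.keyword",
     "influencer_field_value": "Customer B", "influencer_score": 50.0},
]

summary = top_influencers(sample_results)
for name, value, highest, total in summary:
    print(f"{name}={value}: max={highest}, sum={total}")
```

Keying on both the field name and the field value keeps entities from different influencer fields separate, which matters for jobs that configure more than one influencer.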