| 1 | +[role="xpack"] |
| 2 | +[[ml-influencers]] |
| 3 | +=== Influencers |
| 4 | + |
| 5 | +When anomalous events occur, we want to know why. To determine the cause, |
| 6 | +however, you often need a broader knowledge of the domain. If you have |
| 7 | +suspicions about which entities in your dataset are likely causing |
| 8 | +irregularities, you can identify them as influencers in your {anomaly-jobs}. |
| 9 | +That is to say, _influencers_ are fields that you suspect contain information |
| 10 | +about someone or something that influences or contributes to anomalies in your |
| 11 | +data. |
| 12 | + |
| 13 | +Influencers can be any field in your data. If you use {dfeeds}, however, the |
| 14 | +field must exist in your {dfeed} query or aggregation; otherwise it is not |
| 15 | +included in the job analysis. If you use a query in your {dfeed}, there is an |
| 16 | +additional requirement: influencer fields must exist in the query results in the |
| 17 | +same hit as the detector fields. {dfeeds-cap} process data by paging through the |
| 18 | +query results; since search hits cannot span multiple indices or documents, |
| 19 | +{dfeeds} have the same limitation. |
| 20 | + |
| 21 | +Influencers do not need to be fields that are specified in your {anomaly-job} |
| 22 | +detectors, though they often are. If you use aggregations in your {dfeed}, it is |
| 23 | +possible to use influencers that come from different indices than the detector |
| 24 | +fields. However, both indices must have a date field with the same name, which you |
| 25 | +specify in the `data_description.time_field` property of the {anomaly-job}. |
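
To make the relationship between detectors, influencers, and the time field concrete, here is a minimal sketch of a job configuration written as a Python dictionary. The detector function, the `taxful_total_price` field, the `order_date` time field, and the `1h` bucket span are assumptions chosen for illustration, and `influencers_outside_detectors` is a hypothetical helper, not part of any Elastic client. It only lists the influencers that no detector references, which is allowed.

```python
# Illustrative job configuration. The detector, bucket span, and time field
# are assumptions, not values taken from this page.
job_config = {
    "analysis_config": {
        "bucket_span": "1h",
        "detectors": [
            {
                "function": "high_sum",
                "field_name": "taxful_total_price",
                "over_field_name": "customer_full_name.keyword",
            }
        ],
        "influencers": ["customer_full_name.keyword", "category.keyword"],
    },
    # Date field that every source index must share under the same name.
    "data_description": {"time_field": "order_date"},
}

# Detector properties that can hold a field name.
DETECTOR_FIELD_KEYS = (
    "field_name",
    "by_field_name",
    "over_field_name",
    "partition_field_name",
)


def influencers_outside_detectors(config):
    """Return the influencers that are not referenced by any detector."""
    analysis = config["analysis_config"]
    detector_fields = {
        detector[key]
        for detector in analysis["detectors"]
        for key in DETECTOR_FIELD_KEYS
        if key in detector
    }
    return [name for name in analysis["influencers"] if name not in detector_fields]


print(influencers_outside_detectors(job_config))
```

In this sketch, `category.keyword` is an influencer even though no detector uses it.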
| 26 | + |
| 27 | +Picking an influencer is strongly recommended for the following reasons: |
| 28 | + |
| 29 | +* It allows you to more easily assign blame for anomalies |
| 30 | +* It simplifies and aggregates the results |
| 31 | + |
| 32 | +If you use {kib}, the job creation wizards can suggest which fields to use as |
| 33 | +influencers. The best influencer is the person or thing that you want to blame |
| 34 | +for the anomaly. In many cases, users or client IP addresses make excellent |
| 35 | +influencers. |
| 36 | + |
| 37 | +TIP: As a best practice, do not pick too many influencers. For example, you |
| 38 | +generally do not need more than three. If you pick many influencers, the results |
| 39 | +can be overwhelming and there is a small overhead to the analysis. |
| 40 | + |
| 41 | +[discrete] |
| 42 | +[[ml-influencer-results]] |
| 43 | +==== Influencer results |
| 44 | + |
| 45 | +The influencer results show which entities were anomalous and when. One |
| 46 | +influencer result is written per bucket for each influencer that is considered |
| 47 | +anomalous. For jobs with more than one detector, the influencer scores provide a |
| 48 | +powerful view of the most anomalous entities. |
| 49 | + |
| 50 | +For example, the `high_sum_total_sales` {anomaly-job} for the eCommerce orders |
| 51 | +sample data uses `customer_full_name.keyword` and `category.keyword` as |
| 52 | +influencers. You can examine the influencer results with the |
| 53 | +{ref}/ml-get-influencer.html[get influencers API]. Alternatively, you can use |
| 54 | +the *Anomaly Explorer* in {kib}: |
| 55 | + |
| 56 | +[role="screenshot"] |
| 57 | +image::images/influencers.jpg["Influencers in the {kib} Anomaly Explorer"] |
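
If you prefer the API route, the following sketch builds a request for the get influencers API without sending it. The `build_get_influencers_request` helper and the score threshold of 50 are hypothetical. The body parameters (`sort`, `desc`, `influencer_score`, `page`) reflect my understanding of the API reference, so check them against the documentation for your version before relying on them.

```python
import json


def build_get_influencers_request(job_id, min_score=0.0, size=10):
    """Build the method, path, and JSON body for a get influencers call.

    Nothing is sent over the network; pass the result to the HTTP client
    of your choice.
    """
    path = f"/_ml/anomaly_detectors/{job_id}/results/influencers"
    body = {
        "sort": "influencer_score",     # order results by influencer score
        "desc": True,                   # highest scores first
        "influencer_score": min_score,  # skip results below this score
        "page": {"from": 0, "size": size},
    }
    return "GET", path, body


method, path, body = build_get_influencers_request("high_sum_total_sales", min_score=50)
print(method, path)
print(json.dumps(body, indent=2))
```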
| 58 | + |
| 59 | +On the left is a list of the top influencers for all of the detected anomalies |
| 60 | +in the selected time period. The list includes maximum anomaly scores, which in |
| 61 | +this case are aggregated for each influencer, for each bucket, across all |
| 62 | +detectors. There is also a total sum of the anomaly scores for each influencer. |
| 63 | +You can use this list to help you narrow down the contributing factors and focus |
| 64 | +on the most anomalous entities. |
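
The maximum and total scores in that list can be approximated from raw influencer results. The sketch below is an illustration only: the sample results are made up, `summarize_influencers` is a hypothetical helper, and {kib} computes these values with its own aggregations rather than with this code. The `influencer_field_value` and `influencer_score` names follow the influencer result format.

```python
from collections import defaultdict

# Made-up influencer results: one entry per bucket for each anomalous entity.
results = [
    {"influencer_field_value": "Customer A", "timestamp": 1, "influencer_score": 80.0},
    {"influencer_field_value": "Customer A", "timestamp": 2, "influencer_score": 10.0},
    {"influencer_field_value": "Customer B", "timestamp": 1, "influencer_score": 45.0},
    {"influencer_field_value": "Customer B", "timestamp": 2, "influencer_score": 30.0},
    {"influencer_field_value": "Customer B", "timestamp": 3, "influencer_score": 20.0},
]


def summarize_influencers(influencer_results):
    """Return (entity, {"max": ..., "sum": ...}) pairs, highest maximum first."""
    stats = defaultdict(lambda: {"max": 0.0, "sum": 0.0})
    for result in influencer_results:
        entry = stats[result["influencer_field_value"]]
        entry["max"] = max(entry["max"], result["influencer_score"])
        entry["sum"] += result["influencer_score"]
    return sorted(stats.items(), key=lambda item: item[1]["max"], reverse=True)


for entity, scores in summarize_influencers(results):
    print(entity, scores["max"], scores["sum"])
```

Here "Customer A" ranks first by maximum score even though "Customer B" has the larger total, which is why it helps to look at both numbers.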
| 65 | + |
| 66 | +You can also explore swim lanes that correspond to the values of an influencer. |
| 67 | +In this example, the swim lanes correspond to the values for the |
| 68 | +`customer_full_name.keyword`. By default, the swim lanes are sorted according to |
| 69 | +which entity has the maximum anomaly score values. You can click on the sections |
| 70 | +in the swim lane to see details about the anomalies that occurred in that time |
| 71 | +interval. |
| 72 | + |
| 73 | +TIP: The anomaly scores that you see in each section of the *Anomaly Explorer* |
| 74 | +might differ slightly. This disparity occurs because for each {anomaly-job}, |
| 75 | +there are bucket results, influencer results, and record results. Anomaly scores |
| 76 | +are generated for each type of result. The anomaly timeline in {kib} uses the |
| 77 | +bucket-level anomaly scores. If you view swim lanes by influencer, they use the |
| 78 | +influencer-level anomaly scores, as does the list of top influencers. The list |
| 79 | +of anomalies uses the record-level anomaly scores. |