[DOCS] Adds conceptual overview for influencers #756

Merged
merged 7 commits into from
Dec 19, 2019
79 changes: 79 additions & 0 deletions docs/en/stack/ml/anomaly-detection/influencers.asciidoc
@@ -0,0 +1,79 @@
[role="xpack"]
[[ml-influencers]]
=== Influencers

When anomalous events occur, you want to know why. To determine the cause,
however, you often need a broader knowledge of the domain. If you have
suspicions about which entities in your dataset are likely causing
irregularities, you can identify them as influencers in your {anomaly-jobs}.
That is to say, _influencers_ are fields that you suspect contain information
about someone or something that influences or contributes to anomalies in your
data.

Influencers can be any field in your data. If you use {dfeeds}, however, the
field must exist in your {dfeed} query or aggregation; otherwise it is not
included in the job analysis. If you use a query in your {dfeed}, there is an
additional requirement: influencer fields must exist in the query results in the
same hit as the detector fields. {dfeeds-cap} process data by paging through the
query results; since search hits cannot span multiple indices or documents,
{dfeeds} have the same limitation.
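
The same-hit requirement can be checked before you create a {dfeed}. The following is a minimal sketch, not part of any Elastic client library: it assumes search hits shaped like the `hits.hits` array of a search response, with flat top-level fields in `_source`. The helper name `hits_missing_fields` and the field names `responsetime` and `clientip` are hypothetical.

```python
def hits_missing_fields(hits, detector_fields, influencer_fields):
    """Map each hit ID to the detector or influencer fields it lacks.

    Assumes every field is a flat, top-level key in the hit's `_source`.
    """
    required = set(detector_fields) | set(influencer_fields)
    missing = {}
    for hit in hits:
        source = hit.get("_source", {})
        absent = sorted(field for field in required if field not in source)
        if absent:
            missing[hit["_id"]] = absent
    return missing


# Hypothetical hits: the second one has the detector field but not the influencer.
hits = [
    {"_id": "1", "_source": {"responsetime": 42.0, "clientip": "10.0.0.1"}},
    {"_id": "2", "_source": {"responsetime": 13.5}},
]
missing = hits_missing_fields(hits, ["responsetime"], ["clientip"])
print(missing)
```

A hit that appears in the returned mapping cannot contribute influencer information for that field, because the detector field and the influencer field are not in the same hit.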

Influencers do not need to be fields that are specified in your {anomaly-job}
detectors, though they often are. If you use aggregations in your {dfeed}, it is
possible to use influencers that come from different indices than the detector
fields. However, both indices must have a date field with the same name, which you
specify in the `data_description.time_field` property of the {anomaly-job}.

Picking an influencer is strongly recommended for the following reasons:

* It allows you to more easily assign blame for anomalies
* It simplifies and aggregates the results

If you use {kib}, the job creation wizards can suggest which fields to use as
influencers. The best influencer is the person or thing that you want to blame
for the anomaly. In many cases, users or client IP addresses make excellent
influencers.

TIP: As a best practice, do not pick too many influencers. For example, you
generally do not need more than three. If you pick many influencers, the results
can be overwhelming and there is a small overhead to the analysis.
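
As an illustration, influencers are listed in the `analysis_config.influencers` array of the job configuration. The sketch below only builds a request body for the create {anomaly-jobs} API; it does not send it. The helper name `build_job_body` is hypothetical, and the detector definition, the `1h` bucket span, and the `order_date` time field are assumptions about the eCommerce sample job rather than values confirmed by this page.

```python
import json
import warnings


def build_job_body(detectors, influencers, time_field, bucket_span="1h"):
    """Build a request body for the create anomaly detection jobs API."""
    if len(influencers) > 3:
        # Three is the guideline from the tip above, not a limit enforced by the API.
        warnings.warn("More than three influencers can make results hard to read.")
    return {
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": list(detectors),
            "influencers": list(influencers),
        },
        "data_description": {"time_field": time_field},
    }


body = build_job_body(
    detectors=[
        {
            "function": "high_sum",  # assumed detector function
            "field_name": "taxful_total_price",  # assumed field name
            "over_field_name": "customer_full_name.keyword",  # assumed
        }
    ],
    influencers=["customer_full_name.keyword", "category.keyword"],
    time_field="order_date",  # assumed time field
)
print(json.dumps(body, indent=2))
```

In this sketch one influencer is also used in the detector and the other is not, which matches the point that influencers do not need to be detector fields.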

[discrete]
[[ml-influencer-results]]
==== Influencer results

The influencer results show which entities were anomalous and when. One
influencer result is written per bucket for each influencer that is considered
anomalous. For jobs with more than one detector, the influencer scores provide a
combined view of the most anomalous entities across all detectors.

For example, the `high_sum_total_sales` {anomaly-job} for the eCommerce orders
sample data uses `customer_full_name.keyword` and `category.keyword` as
influencers. You can examine the influencer results with the
{ref}/ml-get-influencer.html[get influencers API]. Alternatively, you can use
the *Anomaly Explorer* in {kib}:

[role="screenshot"]
image::images/influencers.jpg["Influencers in the {kib} Anomaly Explorer"]
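
For the API route, the following sketch assembles a get influencers request without sending it, so it makes no assumption about which HTTP client you use. The helper name `influencers_request` and the score threshold of 50 are hypothetical; `sort`, `desc`, `influencer_score`, `start`, and `end` are request body parameters of the get influencers API.

```python
def influencers_request(job_id, min_score=0.0, start=None, end=None):
    """Return (method, path, body) for a get influencers API call."""
    path = f"/_ml/anomaly_detectors/{job_id}/results/influencers"
    body = {
        "sort": "influencer_score",  # order results by influencer score
        "desc": True,  # highest scores first
        "influencer_score": min_score,  # only scores at or above this value
    }
    # The time range is optional; include it only when provided.
    if start is not None:
        body["start"] = start
    if end is not None:
        body["end"] = end
    return "GET", path, body


method, path, body = influencers_request("high_sum_total_sales", min_score=50)
print(method, path)
print(body)
```

The response contains an `influencers` array in which each entry has, among other properties, `influencer_field_name`, `influencer_field_value`, `influencer_score`, and `timestamp`.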

On the left is a list of the top influencers for all of the detected anomalies
in the selected time period. The list includes maximum anomaly scores, which in
this case are aggregated for each influencer, for each bucket, across all
detectors. There is also a total sum of the anomaly scores for each influencer.
You can use this list to help you narrow down the contributing factors and focus
on the most anomalous entities.
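
The maximum and total scores in that list can be approximated from influencer results. The following is a rough sketch under stated assumptions, not a description of how {kib} computes the list: it takes results shaped like the entries returned by the get influencers API and ranks the values of one influencer field. The helper name `top_influencers` and the sample scores are invented for illustration.

```python
from collections import defaultdict


def top_influencers(results, field_name):
    """Rank influencer values by maximum score, tracking the total as well."""
    stats = defaultdict(lambda: {"max": 0.0, "sum": 0.0})
    for result in results:
        if result["influencer_field_name"] != field_name:
            continue
        entry = stats[result["influencer_field_value"]]
        entry["max"] = max(entry["max"], result["influencer_score"])
        entry["sum"] += result["influencer_score"]
    # Sort by maximum score, highest first.
    return sorted(stats.items(), key=lambda item: item[1]["max"], reverse=True)


# Invented sample results: one entry per bucket and influencer value.
results = [
    {"influencer_field_name": "customer_full_name.keyword",
     "influencer_field_value": "Customer A", "influencer_score": 90.0},
    {"influencer_field_name": "customer_full_name.keyword",
     "influencer_field_value": "Customer A", "influencer_score": 10.0},
    {"influencer_field_name": "customer_full_name.keyword",
     "influencer_field_value": "Customer B", "influencer_score": 40.0},
    {"influencer_field_name": "category.keyword",
     "influencer_field_value": "Shoes", "influencer_score": 75.0},
]
ranked = top_influencers(results, "customer_full_name.keyword")
for value, scores in ranked:
    print(value, scores["max"], scores["sum"])
```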

You can also explore swim lanes that correspond to the values of an influencer.
In this example, the swim lanes correspond to the values of the
`customer_full_name.keyword` field. By default, the swim lanes are sorted according to
which entity has the maximum anomaly score values. You can click on the sections
in the swim lane to see details about the anomalies that occurred in that time
interval.

TIP: The anomaly scores that you see in each section of the *Anomaly Explorer*
might differ slightly. This disparity occurs because for each {anomaly-job},
there are bucket results, influencer results, and record results. Anomaly scores
are generated for each type of result. The anomaly timeline in {kib} uses the
bucket-level anomaly scores. If you view swim lanes by influencer, they use the
influencer-level anomaly scores, as does the list of top influencers. The list
of anomalies uses the record-level anomaly scores.
18 changes: 1 addition & 17 deletions docs/en/stack/ml/anomaly-detection/job-tips.asciidoc
@@ -56,23 +56,7 @@ duplicates if they have the same `function`, `field_name`, `by_field_name`,
[[influencers]]
===== Influencers

When you create an {anomaly-job}, you can specify _influencers_, which are also
sometimes referred to as _key fields_. Picking an influencer is strongly
recommended for the following reasons:

* It allows you to more easily assign blame for the anomaly
* It simplifies and aggregates the results

The best influencer is the person or thing that you want to blame for the
anomaly. In many cases, users or client IP addresses make excellent influencers.
Influencers can be any field in your data; they do not need to be fields that
are specified in your detectors, though they often are.

As a best practice, do not pick too many influencers. For example, you generally
do not need more than three. If you pick many influencers, the results can be
overwhelming and there is a small overhead to the analysis.

The job creation wizards in {kib} can suggest which fields to use as influencers.
See <<ml-influencers>>.

[[model-memory-limits]]
===== Model memory limits
5 changes: 4 additions & 1 deletion docs/en/stack/ml/anomaly-detection/ml-concepts.asciidoc
@@ -8,6 +8,7 @@ This section explains the fundamental concepts of the Elastic {ml}
* <<ml-jobs>>
* <<ml-dfeeds>>
* <<ml-buckets>>
* <<ml-influencers>>
* <<ml-calendars>>
* <<ml-rules>>
* <<ml-model-snapshots>>
@@ -19,10 +20,12 @@ include::datafeeds.asciidoc[]

include::buckets.asciidoc[]

include::influencers.asciidoc[]

include::calendars.asciidoc[]

include::rules.asciidoc[]

include::architecture.asciidoc[]

include::model-snapshots.asciidoc[]
include::model-snapshots.asciidoc[]