[ML] Ability to label anomalies #197


Closed
droberts195 opened this issue Sep 4, 2018 · 7 comments

droberts195 (Contributor) commented Sep 4, 2018

The multi-bucket functionality has highlighted that it would be nice to be able to label our anomaly results to say why they were created.

A possible way to do this would be to add a string-array field to certain types of results. It could be called labels, explanation, or maybe there is a better name.

A multi-bucket anomaly might then look like this:

    {
      "job_id": "it-ops-kpi",
      "result_type": "record",
      "probability": 0.00000332668,
      "record_score": 72.9929,
      "initial_record_score": 65.7923,
      "bucket_span": 300,
      "detector_index": 0,
      "is_interim": false,
      "timestamp": 1454944200000,
      "function": "low_sum",
      "function_description": "sum",
      "typical": [
        1806.48
      ],
      "actual": [
        288
      ],
      "field_name": "events_per_min",
      "explanation": [
        "multi-bucket"
      ]
    }

The explanation field contains zero or more strings that indicate why the result was created. There can be many possible reasons, but we should rigorously document the set of strings that can be used, so that people know what to search for.

Should this field be available for both influencer results and record results or just record results?

This change would require a corresponding change to parsing and serialisation on the Java side, and a UI change to make the reasons visible to end users.

Originally it was thought that the same functionality could be used by users to add arbitrary annotations to results, but the current thinking is that it is better to have separate functionality for the two use cases, hence elastic/elasticsearch#33376 has been raised to discuss user annotations.

tveasey (Contributor) commented Sep 4, 2018

+1 on this change. I also think there are significant benefits in keeping this information associated with the result, since that allows one to readily filter results by explanation, which would be very useful functionality.

I started the process of extracting this information, i.e. this field in our results object for a single time series contains information about the sources of the different factors which go into how unusual we think it is in the current time bucket.

Should this field be available for both influencer results and record results or just record results?

There would be some work and thought needed to wire this through the aggregation and influence calculation. I think this is feasible, i.e. you want to find the dominant factors (by feature) generating the aggregate or influencer result. However, in the first instance it would be significantly easier to simply annotate this information on record level results and I think this would be useful functionality.

droberts195 (Contributor, Author) commented:

After further review of the understandability of multi-bucket anomalies in the UI, we are considering this work essential for 6.5. @peteharverson has examples of charts that are very hard to understand without explicit labelling of multi-bucket anomalies.

droberts195 (Contributor, Author) commented:

We had a meeting to discuss this and the thinking has moved on a bit from the original description. We want some mechanism to assign a number representing how strongly an anomaly is multi bucket or single bucket.

For 6.5 we will add a new field multi_bucket_impact* to anomaly record level results.

The value will be on a scale of -5 to +5 where -5 means the anomaly is purely single bucket and +5 means the anomaly is purely multi bucket.

The formula should be something along the lines of:

  • If the multi bucket probability is above the cutoff threshold then -5
  • If the single bucket probability is above the cutoff threshold then +5
  • Otherwise max(min(scale * (log(sb) - log(mb)), 5), -5)

Where sb is the single bucket probability, mb is the multi bucket probability, and scale is chosen such that a ratio of 1000 in the probabilities results in a score of -5 or +5 without truncation, i.e. scale = 5 / log(1000) = 1 / log(pow(1000, 0.2)) = 0.72382413650542. (Alternatively, if we went for 100000 as the ratio instead of 1000 then we could use log10 instead of log in the formula and drop scale altogether, as it would work out as 1.)

* - the name of the field is still up for discussion if someone thinks of a better name in the next couple of weeks
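The proposal above can be sketched in C++ as follows. This is a minimal illustration, not the eventual ml-cpp implementation: the function name and the cutoff value of 0.05 are hypothetical; only the clamping formula and the scale constant come from the comment above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical sketch of the proposed multi_bucket_impact score.
// sb: single bucket probability, mb: multi bucket probability.
// Returns a value in [-5, +5]: -5 means purely single bucket,
// +5 means purely multi bucket.
double multiBucketImpact(double sb, double mb, double cutoff) {
    if (mb > cutoff) {
        // Multi bucket probability is not anomalous: purely single bucket.
        return -5.0;
    }
    if (sb > cutoff) {
        // Single bucket probability is not anomalous: purely multi bucket.
        return 5.0;
    }
    // scale is chosen so that a probability ratio of 1000 saturates the
    // score: scale = 5 / log(1000) ~= 0.72382413650542.
    const double scale = 5.0 / std::log(1000.0);
    return std::max(std::min(scale * (std::log(sb) - std::log(mb)), 5.0), -5.0);
}
```

For example, equal probabilities give an impact of 0, and a ratio of 1000 or more in either direction saturates the score at -5 or +5.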

edsavage (Contributor) commented:

Some (rough) notes on the backend design of anomaly labelling.

A new field named multi_bucket_impact, containing a double value in the range -5.0 to +5.0, will be written to record (leaf) level results only.

The probabilities required to calculate the impact value are currently available in maths::SModelProbabilityResult::s_FeatureProbabilities. This is a vector of probabilities from different features. The probability corresponding to multi bucket analysis is labelled MEAN_FEATURE_LABEL and that for single bucket is labelled BUCKET_FEATURE_LABEL. (more appropriate labels are perhaps needed here). These probabilities need to be aggregated and adjusted in the same manner as is done for the m_Probability member of model::CProbabilityAndInfluenceCalculator.

The CJsonOutputWriter methods addEventRateFields and addMetricFields will be modified to write the new field to results.

In CHierarchicalResultsWriter::writeIndividualResult, retrieve the impact value from the node object and pass it to TResults (as is done for probability).

In the computeProbability method of both CEventRateModel and CMetricModel, retrieve a vector of explanation probabilities from the pFeatures object, calculate the impact value, and add it to the SAnnotatedProbability result object by passing it through to the results builder.

In model::CProbabilityAndInfluenceCalculator::addProbability, retrieve s_FeatureProbabilities from the result object.

Add a map, m_ExplainingProbabilities, of probability aggregators to model::CProbabilityAndInfluenceCalculator, keyed on the same labels as used in s_FeatureProbabilities. The probabilities corresponding to the labels will be added to the appropriate aggregator in the map.
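A highly simplified sketch of such a map of aggregators, assuming nothing beyond the notes above: the class and method names are illustrative only, and the real ml-cpp probability aggregators are more sophisticated than the min() stand-in used here.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Illustrative stand-in for the proposed m_ExplainingProbabilities map:
// per-label probability aggregation, here simplified to taking the minimum.
class ExplainingProbabilities {
public:
    // Add a feature probability under its label,
    // e.g. "BUCKET_FEATURE_LABEL" or "MEAN_FEATURE_LABEL".
    void add(const std::string& label, double probability) {
        auto result = m_Aggregators.emplace(label, probability);
        if (!result.second) {
            result.first->second = std::min(result.first->second, probability);
        }
    }

    // Aggregated probability for a label; 1.0 (not anomalous) if unseen.
    double probability(const std::string& label) const {
        auto it = m_Aggregators.find(label);
        return it == m_Aggregators.end() ? 1.0 : it->second;
    }

private:
    std::map<std::string, double> m_Aggregators;
};
```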

The impact value will be the result of the formula

max(min(f * (ls - lm), 5), -5)

Where

f  = 5 * min(ls, lm) / min(max(ls, lm), -0.001) / log(1000)
ls = log(P(single bucket))
lm = log(P(multi bucket))
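The formula in these notes can be sketched directly in C++ (the function name is hypothetical; the body transcribes the formula above):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical sketch of the impact formula from the design notes.
// pSingle = P(single bucket), pMulti = P(multi bucket), both in (0, 1].
double multiBucketImpact(double pSingle, double pMulti) {
    double ls = std::log(pSingle);
    double lm = std::log(pMulti);
    // f adapts the slope: the smaller both probabilities are, the larger
    // the factor; the -0.001 floor guards against dividing by (near) zero
    // when the larger probability is close to 1.
    double f = 5.0 * std::min(ls, lm)
                   / std::min(std::max(ls, lm), -0.001)
                   / std::log(1000.0);
    // Clamp the result to [-5, +5].
    return std::max(std::min(f * (ls - lm), 5.0), -5.0);
}
```

Note that, unlike the earlier fixed-scale proposal, f grows as both probabilities shrink, so the score saturates at a smaller probability ratio for highly anomalous records.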

edsavage (Contributor) commented Oct 2, 2018

PR elastic/elasticsearch#34233 covers the Java side of things.

edsavage (Contributor) commented Oct 3, 2018

PR #230 implements the required ml-cpp changes.

droberts195 (Contributor, Author) commented:

Closed by #230 and #257.
