[ML] Ability to label anomalies #197
Comments
+1 on this change. Also I think there are significant benefits in keeping the information associated with the result, since this allows one to readily filter results by explanation, which would be very useful functionality. I started the process of extracting this information, i.e. this field in our results object for a single time series contains information about the source of the different factors which go into how unusual we think it is in the current time bucket.
There would be some work and thought needed to wire this through the aggregation and influence calculation. I think this is feasible, i.e. you want to find the dominant factors (by feature) generating the aggregate or influencer result. However, in the first instance it would be significantly easier to simply annotate this information on record-level results, and I think this would be useful functionality.
After further review of the understandability of multi-bucket anomalies in the UI, we are considering this work essential for 6.5. @peteharverson has examples of charts that are very hard to understand without explicit labelling of multi-bucket anomalies.
We had a meeting to discuss this and the thinking has moved on a bit from the original description. We want some mechanism to assign a number representing how strongly an anomaly is multi bucket or single bucket. For 6.5 we will add a new field* whose value will be on a scale of -5 to +5, where -5 means the anomaly is purely single bucket and +5 means the anomaly is purely multi bucket. The formula should be something along the lines of the one given in the design notes below.
* the name of the field is still up for discussion if someone thinks of a better name in the next couple of weeks
Some (rough) notes on the backend design of anomaly labelling:
- A new field named multi_bucket_impact containing a double value in the range -5.0 to +5.0 will be written to record (leaf) level results only.
- The probabilities required to calculate the impact value are currently available in maths::SModelProbabilityResult::s_FeatureProbabilities. This is a vector of probabilities from different features. The probability corresponding to multi bucket analysis is labelled MEAN_FEATURE_LABEL and that for single bucket is labelled BUCKET_FEATURE_LABEL. (More appropriate labels are perhaps needed here.)
- These probabilities need to be aggregated and adjusted in the same manner as is done for the m_Probability member of model::CProbabilityAndInfluenceCalculator.
- The CJsonOutputWriter methods addEventRateFields and addMetricFields will be modified to write the new field to results.
- In CHierarchicalResultsWriter::writeIndividualResult, retrieve the impact value from the node object and pass it to TResults (as is done for probability).
- In the computeProbability method of both CEventRateModel and CMetricModel, retrieve a vector of explanation probabilities from the pFeatures object, perform the calculation of the impact value, and add the impact value to the SAnnotatedProbability result object by way of passing it through to the results builder.
- In model::CProbabilityAndInfluenceCalculator::addProbability, retrieve the s_FeatureProbabilities from the result object.
- Add a map - m_ExplainingProbabilities - of probability aggregators, keyed on the same labels as used in s_FeatureProbabilities, to model::CProbabilityAndInfluenceCalculator. The probabilities corresponding to the labels will be added to the appropriate aggregator in the map.
- The impact value will be the result of the formula max(min(f * (ls - lm), 5), -5), where f = 5 * min(ls, lm) / min(max(ls, lm), -0.001) / log(1000) (see the sketch below).
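For concreteness, here is a minimal, self-contained C++ sketch of the formula in the last bullet. It is not the actual ml-cpp implementation: the function name multiBucketImpact and the idea of passing the two aggregated log-probabilities in directly as arguments are assumptions made purely for illustration.
```cpp
#include <algorithm>
#include <cmath>
#include <iostream>

// Sketch only (not the ml-cpp code): compute the proposed multi_bucket_impact
// from the aggregated log-probabilities of the single bucket feature
// (BUCKET_FEATURE_LABEL) and the multi bucket feature (MEAN_FEATURE_LABEL).
double multiBucketImpact(double logPSingleBucket, double logPMultiBucket) {
    double ls = logPSingleBucket;
    double lm = logPMultiBucket;

    // Scale factor: grows as the more unusual of the two features becomes more
    // extreme. The -0.001 clamp keeps the denominator away from zero when the
    // less unusual probability is close to 1 (log-probability close to 0).
    double f = 5.0 * std::min(ls, lm) / std::min(std::max(ls, lm), -0.001) / std::log(1000.0);

    // ls - lm > 0 when the multi bucket feature is the more unusual one, so
    // positive values indicate a predominantly multi bucket anomaly.
    return std::max(std::min(f * (ls - lm), 5.0), -5.0);
}

int main() {
    // Purely multi bucket: multi bucket probability 1e-6, single bucket ~1.
    std::cout << multiBucketImpact(-0.001, std::log(1e-6)) << '\n'; // ~ +5
    // Purely single bucket: the reverse.
    std::cout << multiBucketImpact(std::log(1e-6), -0.001) << '\n'; // ~ -5
    // Both features equally unusual: impact ~ 0.
    std::cout << multiBucketImpact(std::log(1e-3), std::log(1e-3)) << '\n';
}
```
The clamp at -0.001 keeps the scale factor finite when the less unusual feature's probability is close to 1, and the sign of ls - lm alone decides whether the result is reported as predominantly single bucket or multi bucket.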
PR elastic/elasticsearch#34233 covers the Java side of things.
PR #230 implements the required ml-cpp changes |
The multi-bucket functionality has highlighted that it would be nice to be able to label our anomaly results to say why they were created.
A possible way to do this would be to add a field to certain types of results that is a string array. It could be called labels, explanation, or maybe there is a better name. A multi-bucket anomaly might then look like this:
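(To make the proposal concrete, a hypothetical record result carrying the new field might look something like the sketch below. The surrounding fields are just the familiar record result fields, the job name and values are invented, and the array is shown under the name explanation purely as one of the naming options mentioned above.)
```json
{
  "job_id": "network-traffic",
  "result_type": "record",
  "timestamp": 1533081600000,
  "bucket_span": 900,
  "detector_index": 0,
  "is_interim": false,
  "probability": 1.2e-8,
  "record_score": 92.3,
  "explanation": ["multi_bucket"]
}
```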
The explanation field contains zero or more strings that indicate why the result was created. We can have many possible reasons, but we should be rigorous about documenting what strings can possibly be used so that people who search for them know what to search for.
Should this field be available for both influencer results and record results, or just record results?
This change would require a corresponding change to parsing and serialisation on the Java side, and a UI change to make the reasons visible to end users.
Originally it was thought that the same functionality could be used by users to add arbitrary annotations to results, but the current thinking is that it is better to have separate functionality for the two use cases, hence elastic/elasticsearch#33376 has been raised to discuss user annotations.