[ML] Add multi_bucket_impact label to anomalies #230

edsavage · 2018-10-03T10:49:17Z

Add a label indicating the impact of the multi bucket analysis on the
overall probability. The value is in the range -5 to 5 where -5
indicates a wholly single bucket contribution and 5 a wholly multi
bucket contribution to the final probability.

droberts195

I left a few minor comments.

Tom should confirm it's OK too before merging.

droberts195 · 2018-10-03T11:08:35Z

include/model/CProbabilityAndInfluenceCalculator.h

@@ -71,6 +71,9 @@ class MODEL_EXPORT CProbabilityAndInfluenceCalculator {
    using TStrCRefDouble1VecDouble1VecPrPrVec = std::vector<TStrCRefDouble1VecDouble1VecPrPr>;
    using TStrCRefDouble1VecDouble1VecPrPrVecVec =
        std::vector<TStrCRefDouble1VecDouble1VecPrPrVec>;
+    using TStrDoubleUMap = boost::unordered_map<std::string, double>;
+    using TStrProbabilityAggregatorMap =


Should be TStrProbabilityAggregatorUMap - the U is missing

droberts195 · 2018-10-03T11:21:25Z

lib/model/CProbabilityAndInfluenceCalculator.cc

+        double ls = std::log(std::max(sbProbability, ml::maths::MINUSCULE_PROBABILITY));
+        double lm = std::log(mbProbability);
+
+        double scale = 5.0 * std::min(ls, lm) /


It would be nice to add a constant for the 5.0 as it's used in 3 places and the constant will make clear that it's the same quantity in all 3 places.

+1, and in various other files above. I'd propose adding something like MAXIMUM_MULTI_BUCKET_IMPACT to CModelConfig.

droberts195 · 2018-10-03T11:22:08Z

lib/model/CProbabilityAndInfluenceCalculator.cc

+    if (!this->calculateExplainingProbabilities(explainingProbabilities)) {
+        LOG_INFO(<< "Failed to compute explaining probabilities");
+        return false;
+    } else {


Most of the code omits the else after an early return on error, so I'd drop it here.
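
A minimal sketch of the early-return style being asked for. The functions `computeExplaining` and `calculateImpact` are hypothetical stand-ins written for this example; they are not the functions in the PR.

```cpp
#include <cstddef>
#include <iostream>
#include <map>
#include <string>

using TStrDoubleMap = std::map<std::string, double>;

// Hypothetical stand-in for calculateExplainingProbabilities: it fails when
// there is nothing to report.
bool computeExplaining(bool haveData, TStrDoubleMap& result) {
    if (!haveData) {
        return false;
    }
    result["single_bucket"] = 0.01;
    result["multi_bucket"] = 0.001;
    return true;
}

// No else after the early return on error: the success path simply
// continues at the outer indentation level.
bool calculateImpact(bool haveData, std::size_t& count) {
    TStrDoubleMap explaining;
    if (!computeExplaining(haveData, explaining)) {
        std::cerr << "Failed to compute explaining probabilities\n";
        return false;
    }
    count = explaining.size();
    return true;
}
```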

tveasey

I left a couple of suggestions and a couple of places I think the code could be made safer. Also I think it would be nice to add a unit test for this. It should be possible to define some test data for which the impact should be high. Let's discuss this offline.

tveasey · 2018-10-03T12:16:54Z

lib/model/CProbabilityAndInfluenceCalculator.cc

@@ -49,6 +49,9 @@ using TProbabilityCalculation2Vec = core::CSmallVector<maths_t::EProbabilityCalc
 using TSizeDoublePr = std::pair<std::size_t, double>;
 using TSizeDoublePr1Vec = core::CSmallVector<TSizeDoublePr, 1>;

+const std::string SINGLE_BUCKET_FEATURE_LABEL{"single_bucket"};
+const std::string MULTI_BUCKET_FEATURE_LABEL{"multi_bucket"};


I think we'd be better off defining these once as static constants in maths::CModel. As it is these have to match the locally declared strings in CTimeSeriesModel.cc.

Better still would be to use an enum and update the relevant types to be defined w.r.t. this enum.
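
One way the enum idea could look, as a rough sketch only. The enum name, its values and the `print` helper are assumptions made for illustration, not existing types in maths::CModel.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical enum; the name and enumerators are assumptions.
enum EFeatureProbabilityLabel : std::size_t {
    E_SingleBucket = 0,
    E_MultiBucket = 1,
    E_NumberLabels = 2
};

// A single place to turn a label into text, e.g. for logging or output.
// Only valid for E_SingleBucket and E_MultiBucket.
inline const std::string& print(EFeatureProbabilityLabel label) {
    static const std::array<std::string, 2> NAMES{{"single_bucket", "multi_bucket"}};
    return NAMES[label];
}

// Indexed by the enum, so lookups involve no string hashing or copying.
using TLabelDoubleArray = std::array<double, E_NumberLabels>;
```

With this shape the two label strings are defined once, and callers in different translation units cannot drift out of sync.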

tveasey · 2018-10-03T12:24:32Z

lib/model/CProbabilityAndInfluenceCalculator.cc

+
+    for (const auto& ep : other.m_ExplainingProbabilities) {
+        if (ep.second.calculate(p) && !ep.second.empty()) {
+            auto itr = m_ExplainingProbabilities.find(ep.first);


This should insert if missing, i.e.

    std::tie(itr, added) = m_ExplainingProbabilities.insert(ep);
    if (added == false) {
        itr->second.add(p, weight);
    }
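
A self-contained sketch of the insert-if-missing pattern. `SToyAggregator` and `merge` are hypothetical: the toy aggregator keeps a weighted sum, which is an assumption made to keep the example small, not how the real probability aggregators behave.

```cpp
#include <string>
#include <tuple>
#include <unordered_map>

// Toy aggregator (an assumption for illustration): keeps a weighted sum
// rather than a real joint probability.
struct SToyAggregator {
    void add(double p, double weight) { s_Sum += weight * p; }
    double s_Sum = 0.0;
};

using TStrAggregatorUMap = std::unordered_map<std::string, SToyAggregator>;

// Merge other into target: copy entries that are missing from target,
// otherwise fold the other entry's value into the existing aggregator.
void merge(const TStrAggregatorUMap& other, double weight, TStrAggregatorUMap& target) {
    for (const auto& entry : other) {
        TStrAggregatorUMap::iterator itr;
        bool added;
        std::tie(itr, added) = target.insert(entry);
        if (added == false) {
            itr->second.add(entry.second.s_Sum, weight);
        }
    }
}
```

Using the result of `insert` means each key is looked up once, and a label that exists only in `other` is no longer silently dropped.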

tveasey · 2018-10-03T12:30:59Z

lib/model/CProbabilityAndInfluenceCalculator.cc

@@ -693,18 +715,34 @@ bool CProbabilityAndInfluenceCalculator::addProbability(model_t::EFeature featur
        return false;
    }

+    auto readResult = [&](const maths::SModelProbabilityResult& result) {
+        for (auto fp : result.s_FeatureProbabilities) {


Use const auto& here to avoid copying each element.
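
To illustrate the difference, here is a small sketch with a hypothetical element type that counts its own copies. `SCountedLabel` and both functions are invented for this example.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical element type that counts how often it is copied.
struct SCountedLabel {
    explicit SCountedLabel(std::string label) : s_Label(std::move(label)) {}
    SCountedLabel(const SCountedLabel& other) : s_Label(other.s_Label) { ++s_Copies; }
    std::string s_Label;
    static int s_Copies;
};
int SCountedLabel::s_Copies = 0;

// "auto" copies every element, including its string, on each iteration.
std::size_t totalLengthByValue(const std::vector<SCountedLabel>& labels) {
    std::size_t total = 0;
    for (auto label : labels) {
        total += label.s_Label.size();
    }
    return total;
}

// "const auto&" binds a reference to each element, so nothing is copied.
std::size_t totalLengthByReference(const std::vector<SCountedLabel>& labels) {
    std::size_t total = 0;
    for (const auto& label : labels) {
        total += label.s_Label.size();
    }
    return total;
}
```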

tveasey · 2018-10-03T12:33:28Z

lib/model/CProbabilityAndInfluenceCalculator.cc

@@ -827,6 +859,42 @@ bool CProbabilityAndInfluenceCalculator::calculate(double& probability) const {
    return m_Probability.calculate(probability);
 }

+bool CProbabilityAndInfluenceCalculator::calculateExplainingProbabilities(
+    TStrDoubleUMap& explainingProbabilities) const {


It seems wasteful to copy the strings around here. This would also be fixed if you migrated to using an enum.

tveasey · 2018-10-03T12:39:57Z

lib/model/CProbabilityAndInfluenceCalculator.cc

+        double mbProbability = explainingProbabilities[MULTI_BUCKET_FEATURE_LABEL];
+
+        double ls = std::log(std::max(sbProbability, ml::maths::MINUSCULE_PROBABILITY));
+        double lm = std::log(mbProbability);


This should be bounded as well, i.e. std::log(std::max(mbProbability, ...)), in case of underflow. Also these should use CTools::smallestProbability().
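
A sketch of bounding both logarithms the same way. The local `smallestProbability` below is only a stand-in for CTools::smallestProbability(); using the smallest positive normalised double as the floor is an assumption of this example, and `boundedLog` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Stand-in for CTools::smallestProbability(). Assumption: the smallest
// positive normalised double is a reasonable floor for this sketch.
inline double smallestProbability() {
    return std::numeric_limits<double>::min();
}

// Log of a probability, floored so that a value which has underflowed to
// zero does not produce -infinity.
inline double boundedLog(double probability) {
    return std::log(std::max(probability, smallestProbability()));
}
```

Applying the same helper to both sbProbability and mbProbability keeps the two terms on an equal footing before they are compared.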

tveasey · 2018-10-03T12:47:47Z

lib/model/CProbabilityAndInfluenceCalculator.cc

+    return true;
+}
+
+bool CProbabilityAndInfluenceCalculator::calculateMultiBucketImpact(double& multiBucketImpact) const {


I think we should add some explanation for this function, i.e. something along the lines of "we choose a function such that the impact saturates when one probability is less than the other probability / 1000, or when one probability is close to one, i.e. one factor is not at all anomalous".

Added a brief explanation on aspects of the design of the function calculating the multi_bucket_impact

tveasey

I think this is pretty much there. However, I'd recommend sticking with just one parameter, especially since the other parameter isn't currently being used to limit the calculated impact.

tveasey · 2018-10-04T08:59:20Z

lib/model/CProbabilityAndInfluenceCalculator.cc

+
+    multiBucketImpact = std::max(
+        std::min(scale * (ls - lm), CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT),
+        -1.0 * CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT);


This doesn't observe CAnomalyDetectorModelConfig::MINIMUM_MULTI_BUCKET_IMPACT, i.e. if the two were set differently down the line, the result would still be in the range [-CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT, CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT]. I think this behaviour is fine, since it would be weird to make them different, but having a separate MINIMUM_MULTI_BUCKET_IMPACT confuses matters as a result. I'd just stick with one parameter, maybe MAXIMUM_MULTI_BUCKET_IMPACT_MAGNITUDE, and negate it as appropriate.
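
A minimal sketch of the single-parameter version. The constant name comes from the suggestion above and the value 5.0 from the range in the PR description; `clampImpact` is a hypothetical helper, and the constant is declared at namespace scope here rather than in CAnomalyDetectorModelConfig to keep the example self-contained.

```cpp
#include <algorithm>

// Single named constant; the lower bound is obtained by negating it.
constexpr double MAXIMUM_MULTI_BUCKET_IMPACT_MAGNITUDE{5.0};

// Clamp a raw impact to [-magnitude, +magnitude] using the one constant.
inline double clampImpact(double rawImpact) {
    return std::max(std::min(rawImpact, MAXIMUM_MULTI_BUCKET_IMPACT_MAGNITUDE),
                    -MAXIMUM_MULTI_BUCKET_IMPACT_MAGNITUDE);
}
```

With only one constant there is no way for the upper and lower bounds to be configured inconsistently.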

* Consistent use of named constants
* Account for altered SResults constructor signature in test cases

tveasey

LGTM

Add a label indicating the impact of the multi bucket analysis on the overall probability. The value is in the range -5 to 5 where -5 indicates a wholly single bucket contribution and 5 a wholly multi bucket contribution to the final probability. Backports elastic#230

edsavage added v7.0.0 review :ml v6.5.0 labels Oct 3, 2018

edsavage added 2 commits October 3, 2018 12:01

Corrected formatting

62a3543

Updated change log

19c0961

droberts195 reviewed Oct 3, 2018

tveasey reviewed Oct 3, 2018

edsavage added 2 commits October 3, 2018 16:50

Further attending to review comments

0026e21

Updated code comments

b169b02

edsavage mentioned this pull request Oct 3, 2018

[ML] Ability to label anomalies #197

Closed

tveasey reviewed Oct 4, 2018

edsavage added 2 commits October 4, 2018 11:21

Further addressing code review comments

bdc2bcc

Added test case exercising multi_bucket_impact

e095eb2

tveasey approved these changes Oct 4, 2018

edsavage merged commit a480dde into elastic:master Oct 4, 2018

edsavage mentioned this pull request Oct 5, 2018

[6.5][ML] Add multi_bucket_impact label to anomalies #239

Merged

edsavage deleted the label_anomalies branch October 5, 2018 08:43

peteharverson mentioned this pull request Oct 17, 2018

[ML] Lowers multi-bucket impact thresholds used for anomaly display elastic/kibana#24136

Merged
