-
Notifications
You must be signed in to change notification settings - Fork 64
[ML] Add multi_bucket_impact label to anomalies #230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add a label indicating the impact of the multi bucket analysis on the overall probability. The value is in the range -5 to 5 where -5 indicates a wholly single bucket contribution and 5 a wholly multi bucket contribution to the final probability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few minor comments.
Tom should confirm it's OK too before merging.
@@ -71,6 +71,9 @@ class MODEL_EXPORT CProbabilityAndInfluenceCalculator { | |||
using TStrCRefDouble1VecDouble1VecPrPrVec = std::vector<TStrCRefDouble1VecDouble1VecPrPr>; | |||
using TStrCRefDouble1VecDouble1VecPrPrVecVec = | |||
std::vector<TStrCRefDouble1VecDouble1VecPrPrVec>; | |||
using TStrDoubleUMap = boost::unordered_map<std::string, double>; | |||
using TStrProbabilityAggregatorMap = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be TStrProbabilityAggregatorUMap
- the U
is missing
double ls = std::log(std::max(sbProbability, ml::maths::MINUSCULE_PROBABILITY)); | ||
double lm = std::log(mbProbability); | ||
|
||
double scale = 5.0 * std::min(ls, lm) / |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to add a constant for the 5.0
as it's used in 3 places and the constant will make clear that it's the same quantity in all 3 places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 and in various other files above. I'd propose adding something like MAXIMUM_MULTI_BUCKET_IMPACT CModelConfig.
if (!this->calculateExplainingProbabilities(explainingProbabilities)) { | ||
LOG_INFO(<< "Failed to compute explaining probabilities"); | ||
return false; | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The style of most of the code would be not to have the else
here because it's after an early return due to error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of suggestions and a couple of places I think the code could be made safer. Also I think it would be nice to add a unit test for this. It should be possible to define some test data for which the impact should be high. Let's discuss this offline.
@@ -49,6 +49,9 @@ using TProbabilityCalculation2Vec = core::CSmallVector<maths_t::EProbabilityCalc | |||
using TSizeDoublePr = std::pair<std::size_t, double>; | |||
using TSizeDoublePr1Vec = core::CSmallVector<TSizeDoublePr, 1>; | |||
|
|||
const std::string SINGLE_BUCKET_FEATURE_LABEL{"single_bucket"}; | |||
const std::string MULTI_BUCKET_FEATURE_LABEL{"multi_bucket"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd be better off defining these once as static constants in maths::CModel
. As it is these have to match the locally declared strings in CTimeSeriesModel.cc
.
Better still would be to use an enum and update the relevant types to be defined w.r.t. this enum.
|
||
for (const auto& ep : other.m_ExplainingProbabilities) { | ||
if (ep.second.calculate(p) && !ep.second.empty()) { | ||
auto itr = m_ExplainingProbabilities.find(ep.first); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should insert if missing, i.e.
std::tie(itr, added) = m_ExplainingProbabilities.insert(ep);
if (added == false) {
itr->second.add(p, weight);
}
@@ -693,18 +715,34 @@ bool CProbabilityAndInfluenceCalculator::addProbability(model_t::EFeature featur | |||
return false; | |||
} | |||
|
|||
auto readResult = [&](const maths::SModelProbabilityResult& result) { | |||
for (auto fp : result.s_FeatureProbabilities) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto&
@@ -827,6 +859,42 @@ bool CProbabilityAndInfluenceCalculator::calculate(double& probability) const { | |||
return m_Probability.calculate(probability); | |||
} | |||
|
|||
bool CProbabilityAndInfluenceCalculator::calculateExplainingProbabilities( | |||
TStrDoubleUMap& explainingProbabilities) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems wasteful copy the strings around here. This would also be fixed if you migrated to using an enum.
double mbProbability = explainingProbabilities[MULTI_BUCKET_FEATURE_LABEL]; | ||
|
||
double ls = std::log(std::max(sbProbability, ml::maths::MINUSCULE_PROBABILITY)); | ||
double lm = std::log(mbProbability); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be bounded as well, i.e. std::log(std::max(mbProbability, ...))
, in case of underflow. Also these should use CTools::smallestProbability()
.
return true; | ||
} | ||
|
||
bool CProbabilityAndInfluenceCalculator::calculateMultiBucketImpact(double& multiBucketImpact) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add some explanation for this function, i.e. something along the lines of "we choose a function s.t. the impact saturates when one "probability" < "other probability" / 1000 or one probability is close to one, i.e. one factor is not at all anomalous".
Added a brief explanation on aspects of the design of the function calculating the multi_bucket_impact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is pretty much there. However, I'd recommend sticking with just one parameter, especially since the other parameter isn't currently being used to limit the calculated impact.
|
||
multiBucketImpact = std::max( | ||
std::min(scale * (ls - lm), CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT), | ||
-1.0 * CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't observe the CAnomalyDetectorModelConfig::MINIMUM_MULTI_BUCKET_IMPACT
, i.e. if they were set differently down the line then this would still be in the range [-CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT
, CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT
]. I think this behaviour is fine, it would be weird to make them different, but I think it confuses matters having a separate MINIMUM_MULTI_BUCKET_IMPACT as a result. I'd just stick with one parameter, maybe MAXIMUM_MULTI_BUCKET_IMPACT_MAGNITUDE and negate it as appropriate?
* Consistent use of named constants * Account for altered SResults constructor signature in test cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Add a label indicating the impact of the multi bucket analysis on the overall probability. The value is in the range -5 to 5 where -5 indicates a wholly single bucket contribution and 5 a wholly multi bucket contribution to the final probability. Backports elastic#230
Add a label indicating the impact of the multi bucket analysis on the overall probability. The value is in the range -5 to 5 where -5 indicates a wholly single bucket contribution and 5 a wholly multi bucket contribution to the final probability. Backports #230
Add a label indicating the impact of the multi bucket analysis on the
overall probability. The value is in the range -5 to 5 where -5
indicates a wholly single bucket contribution and 5 a wholly multi
bucket contribution to the final probability.