Skip to content

[ML] Add multi_bucket_impact label to anomalies #230

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 4, 2018

Conversation

edsavage
Copy link
Contributor

@edsavage edsavage commented Oct 3, 2018

Add a label indicating the impact of the multi bucket analysis on the
overall probability. The value is in the range -5 to 5 where -5
indicates a wholly single bucket contribution and 5 a wholly multi
bucket contribution to the final probability.

Add a label indicating the impact of the multi bucket analysis on the
overall probability. The value is in the range -5 to 5 where -5
indicates a wholly single bucket contribution and 5 a wholly multi
bucket contribution to the final probability.
Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few minor comments.

Tom should confirm it's OK too before merging.

@@ -71,6 +71,9 @@ class MODEL_EXPORT CProbabilityAndInfluenceCalculator {
using TStrCRefDouble1VecDouble1VecPrPrVec = std::vector<TStrCRefDouble1VecDouble1VecPrPr>;
using TStrCRefDouble1VecDouble1VecPrPrVecVec =
std::vector<TStrCRefDouble1VecDouble1VecPrPrVec>;
using TStrDoubleUMap = boost::unordered_map<std::string, double>;
using TStrProbabilityAggregatorMap =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be TStrProbabilityAggregatorUMap - the U is missing

double ls = std::log(std::max(sbProbability, ml::maths::MINUSCULE_PROBABILITY));
double lm = std::log(mbProbability);

double scale = 5.0 * std::min(ls, lm) /
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add a constant for the 5.0 as it's used in 3 places and the constant will make clear that it's the same quantity in all 3 places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 and in various other files above. I'd propose adding something like MAXIMUM_MULTI_BUCKET_IMPACT CModelConfig.

if (!this->calculateExplainingProbabilities(explainingProbabilities)) {
LOG_INFO(<< "Failed to compute explaining probabilities");
return false;
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style of most of the code would be not to have the else here because it's after an early return due to error.

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of suggestions and a couple of places I think the code could be made safer. Also I think it would be nice to add a unit test for this. It should be possible to define some test data for which the impact should be high. Let's discuss this offline.

@@ -49,6 +49,9 @@ using TProbabilityCalculation2Vec = core::CSmallVector<maths_t::EProbabilityCalc
using TSizeDoublePr = std::pair<std::size_t, double>;
using TSizeDoublePr1Vec = core::CSmallVector<TSizeDoublePr, 1>;

const std::string SINGLE_BUCKET_FEATURE_LABEL{"single_bucket"};
const std::string MULTI_BUCKET_FEATURE_LABEL{"multi_bucket"};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd be better off defining these once as static constants in maths::CModel. As it is these have to match the locally declared strings in CTimeSeriesModel.cc.

Better still would be to use an enum and update the relevant types to be defined w.r.t. this enum.


for (const auto& ep : other.m_ExplainingProbabilities) {
if (ep.second.calculate(p) && !ep.second.empty()) {
auto itr = m_ExplainingProbabilities.find(ep.first);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should insert if missing, i.e.

std::tie(itr, added) = m_ExplainingProbabilities.insert(ep);
if (added == false) {
    itr->second.add(p, weight);
}

@@ -693,18 +715,34 @@ bool CProbabilityAndInfluenceCalculator::addProbability(model_t::EFeature featur
return false;
}

auto readResult = [&](const maths::SModelProbabilityResult& result) {
for (auto fp : result.s_FeatureProbabilities) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto&

@@ -827,6 +859,42 @@ bool CProbabilityAndInfluenceCalculator::calculate(double& probability) const {
return m_Probability.calculate(probability);
}

bool CProbabilityAndInfluenceCalculator::calculateExplainingProbabilities(
TStrDoubleUMap& explainingProbabilities) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems wasteful copy the strings around here. This would also be fixed if you migrated to using an enum.

double mbProbability = explainingProbabilities[MULTI_BUCKET_FEATURE_LABEL];

double ls = std::log(std::max(sbProbability, ml::maths::MINUSCULE_PROBABILITY));
double lm = std::log(mbProbability);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be bounded as well, i.e. std::log(std::max(mbProbability, ...)), in case of underflow. Also these should use CTools::smallestProbability().

return true;
}

bool CProbabilityAndInfluenceCalculator::calculateMultiBucketImpact(double& multiBucketImpact) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add some explanation for this function, i.e. something along the lines of "we choose a function s.t. the impact saturates when one "probability" < "other probability" / 1000 or one probability is close to one, i.e. one factor is not at all anomalous".

Added a brief explanation on aspects of the design of the function
calculating the multi_bucket_impact
Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is pretty much there. However, I'd recommend sticking with just one parameter, especially since the other parameter isn't currently being used to limit the calculated impact.


multiBucketImpact = std::max(
std::min(scale * (ls - lm), CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT),
-1.0 * CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't observe the CAnomalyDetectorModelConfig::MINIMUM_MULTI_BUCKET_IMPACT, i.e. if they were set differently down the line then this would still be in the range [-CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT, CAnomalyDetectorModelConfig::MAXIMUM_MULTI_BUCKET_IMPACT]. I think this behaviour is fine, it would be weird to make them different, but I think it confuses matters having a separate MINIMUM_MULTI_BUCKET_IMPACT as a result. I'd just stick with one parameter, maybe MAXIMUM_MULTI_BUCKET_IMPACT_MAGNITUDE and negate it as appropriate?

* Consistent use of named constants
* Account for altered SResults constructor signature in test cases
Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@edsavage edsavage merged commit a480dde into elastic:master Oct 4, 2018
edsavage added a commit to edsavage/ml-cpp that referenced this pull request Oct 4, 2018
Add a label indicating the impact of the multi bucket analysis on the
overall probability. The value is in the range -5 to 5 where -5
indicates a wholly single bucket contribution and 5 a wholly multi
bucket contribution to the final probability.

Backports elastic#230
edsavage added a commit that referenced this pull request Oct 5, 2018
Add a label indicating the impact of the multi bucket analysis on the
overall probability. The value is in the range -5 to 5 where -5
indicates a wholly single bucket contribution and 5 a wholly multi
bucket contribution to the final probability.

Backports #230
@edsavage edsavage deleted the label_anomalies branch October 5, 2018 08:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants