
[ML] Improve residual model selection #468


Merged · 9 commits merged into elastic:master on Apr 30, 2019

Conversation

@tveasey (Contributor) commented on Apr 23, 2019

This addresses the remaining issues from #124, which were in fact related to poor model selection. These data sets pose problems because none of the candidate residual models is a particularly good fit for all values. The outcome is that we end up choosing a model with undesirable characteristics. This also interferes with detecting change points correctly.

This PR makes two changes:

  1. It limits the amount a model is penalised by values which are not well fit by any of the candidate models.
  2. It additionally penalises models whose predicted variance is much larger than the data variance: anomaly detection is sensitive to this parameter and, because we consider heavy-tailed distributions, we are susceptible to this sort of error.

I've reviewed the result changes across a range of our QA data sets and, where it has affected results, it is doing a better job. This generally affects data sets where there are some low and some very high values, and also small numbers of values which are very different from typical. In such cases, we will generally use less skewed and lighter-tailed models as a result. This means we end up being more sensitive to values which are different from our predictions. It also makes model selection more stable, so we see fewer changes to the selected model; these events are often clearest when model plot is enabled and are associated with sudden changes in bounds.
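To make change 1 concrete, here is a minimal, self-contained sketch of the idea of capping how much any single value can count against a candidate model. The function and variable names, the cap value and the overall structure are assumptions for illustration, not the code merged in this PR.

#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

// Illustrative sketch of change 1: when accumulating a candidate model's
// selection weight, floor the per-value log Bayes factor versus the best
// candidate. Values which no candidate fits well then contribute a bounded
// penalty rather than dominating model selection.
double cappedLogBayesFactor(double logLikelihood,
                            double bestLogLikelihood,
                            double maximumLogBayesFactor) {
    return std::max(logLikelihood - bestLogLikelihood, -maximumLogBayesFactor);
}

int main() {
    const double maximumLogBayesFactor{5.0}; // Illustrative cap.
    // Pairs of (this candidate's log-likelihood, best candidate's log-likelihood)
    // for three values; the last value is fit badly by every candidate.
    std::vector<std::pair<double, double>> values{{-1.0, -0.5}, {-3.0, -2.0}, {-100.0, -60.0}};
    double logWeight{0.0};
    for (const auto& [candidate, best] : values) {
        logWeight += cappedLogBayesFactor(candidate, best, maximumLogBayesFactor);
    }
    std::cout << "capped log-weight = " << logWeight << '\n';
}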

@edsavage (Contributor) left a comment


LGTM Tom.

I've left a couple of minor questions inline. Also, there are two constructors of COneOfNPrior where m_SampleMoments is not directly initialised; I'm not sure whether this should be of concern, however.

add(maths_t::count(weight), n);
if (failed) {
    LOG_ERROR(<< "Failed to compute log-likelihood");
    LOG_ERROR(<< "samples = " << core::CContainerPrinter::print(samples));
@edsavage (Contributor) commented:

Would it be useful to print out the weights here as well?
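For illustration only, the extra diagnostic being suggested could follow the same pattern as the samples line above, assuming the relevant weights container is in scope at this point (a sketch of the suggestion, not a change made in this PR):

LOG_ERROR(<< "weights = " << core::CContainerPrinter::print(weights));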

used.push_back(use);
varianceMismatchPenalties.push_back(
    -m * MAXIMUM_LOG_BAYES_FACTOR *
    std::max(1.0 - 9.0 * CBasicStatistics::variance(m_SampleMoments) /
@edsavage (Contributor) commented:

I'm curious about the use of the 9.0 factor here; perhaps replacing it with a named constant would aid understanding?

@tveasey (Contributor, Author) replied:

This is the maximum relative error in the estimated variance of the model (vs the sample variance) for which we will not penalise it at all. I named this constant in 7c76c7f.
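For concreteness, here is a minimal sketch of how such a named tolerance might read, together with the penalty it feeds into. The constant names shown, the value chosen for MAXIMUM_LOG_BAYES_FACTOR and the clamping details are assumptions for illustration; the actual name introduced in 7c76c7f and the merged code may differ.

#include <algorithm>
#include <cmath>
#include <iostream>

// Hypothetical name for the 9.0 factor: the largest relative overestimate of
// the sample variance for which a model attracts no penalty at all.
const double MAXIMUM_RELATIVE_VARIANCE_ERROR{9.0};
// Illustrative cap on how strongly this consideration can down-weight a model.
const double MAXIMUM_LOG_BAYES_FACTOR{std::log(1e6)};

// Log-weight penalty (<= 0) for a candidate whose predicted variance is much
// larger than the observed sample variance, scaled by the sample count n.
double varianceMismatchPenalty(double n, double sampleVariance, double modelVariance) {
    double mismatch{1.0 - MAXIMUM_RELATIVE_VARIANCE_ERROR * sampleVariance / modelVariance};
    return -n * MAXIMUM_LOG_BAYES_FACTOR * std::clamp(mismatch, 0.0, 1.0);
}

int main() {
    std::cout << varianceMismatchPenalty(50.0, 1.0, 5.0) << '\n';  // Within tolerance: no penalty.
    std::cout << varianceMismatchPenalty(50.0, 1.0, 90.0) << '\n'; // Heavily penalised.
}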

@tveasey (Contributor, Author) commented on Apr 30, 2019

retest

@tveasey merged commit ad8cab2 into elastic:master on Apr 30, 2019
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request on Apr 30, 2019
@tveasey deleted the residual-model-selection branch on Apr 30, 2019 at 13:19
tveasey added a commit that referenced this pull request on May 1, 2019