[ML] Improve residual model selection #468
Conversation
LGTM Tom.
I've left a couple of minor questions inline. Also, there are two constructors of COneOfNPrior where m_SampleMoments is not directly initialised; I'm not sure whether this should be a concern, however.
add(maths_t::count(weight), n);
if (failed) {
    LOG_ERROR(<< "Failed to compute log-likelihood");
    LOG_ERROR(<< "samples = " << core::CContainerPrinter::print(samples));
Would it be useful to print out the weights here as well?
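For example, a one-line sketch of such a diagnostic, assuming the per-sample weights are in scope under the (hypothetical) name weights:

    LOG_ERROR(<< "weights = " << core::CContainerPrinter::print(weights)); // print the weights alongside the samples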
lib/maths/COneOfNPrior.cc
used.push_back(use);
varianceMismatchPenalties.push_back(
    -m * MAXIMUM_LOG_BAYES_FACTOR *
    std::max(1.0 - 9.0 * CBasicStatistics::variance(m_SampleMoments) /
I'm curious about the use of the 9.0 factor here - perhaps replacing it with a named constant would aid understanding?
This is the maximum relative error in the estimated variance of the model (vs the sample variance) for which we will not penalise it at all. I named this constant in 7c76c7f.
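For reference, a minimal sketch of how the penalty term might read with that factor pulled out into a named constant. The constant name and the modelVariance stand-in for the truncated denominator above are illustrative assumptions, not the actual code from 7c76c7f:

    // Illustration only: MAXIMUM_UNPENALISED_RELATIVE_VARIANCE_ERROR and modelVariance are hypothetical names.
    // While the bracketed mismatch term is non-positive the model receives no penalty;
    // otherwise the penalty scales up towards -m * MAXIMUM_LOG_BAYES_FACTOR.
    const double MAXIMUM_UNPENALISED_RELATIVE_VARIANCE_ERROR{9.0};
    varianceMismatchPenalties.push_back(
        -m * MAXIMUM_LOG_BAYES_FACTOR *
        std::max(1.0 - MAXIMUM_UNPENALISED_RELATIVE_VARIANCE_ERROR *
                           CBasicStatistics::variance(m_SampleMoments) / modelVariance,
                 0.0));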
retest
This addresses the remaining issues from #124, which were in fact related to poor model selection. These data sets pose problems because none of the candidate residual models is a particularly good fit for all values. As a result, we end up choosing a model with undesirable characteristics, which also interferes with detecting change points correctly.
This PR makes two changes:
I've reviewed the result changes across a range of our QA data sets, and where results are affected they are doing a better job. This generally concerns data sets which contain both some low and some very high values, as well as small numbers of values which are very different from typical. In such cases we will generally use less skewed and lighter-tailed models as a result, which means we end up being more sensitive to values that differ from our predictions. This also makes model selection more stable, so we see fewer changes to the selected model; these events are often most visible when model plot is enabled and are associated with sudden changes in bounds.