
[ML] Improve boosted tree training initialisation #686

Merged: 24 commits merged into elastic:master on Sep 26, 2019

Conversation

@tveasey tveasey (Contributor) commented Sep 23, 2019

This makes some changes to initialisation:

  1. Measure the gain and sum curvature in the tree directly to estimate upper bounds on good values for gamma and lambda, respectively.
  2. Search a large range of values down from these initial overestimates, looking for a turning point in the test error as the model transitions from underfit to overfit (sketched below).

The hyperparameter search is centred on the values at this transition. I've reduced the number of hyperparameter optimisation rounds as a result of the improved initialisation.
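
A minimal sketch of the line-search idea, assuming a hypothetical testError callback which trains with the candidate value and returns the resulting test error (all names here are illustrative, not the actual implementation):

#include <cstddef>
#include <functional>

// Hypothetical sketch: starting from an overestimate of a regularisation
// parameter, repeatedly scale it down, retraining each time, and stop when
// the test error starts to rise, i.e. at the turning point where the model
// transitions from underfit to overfit.
double lineSearchForTurningPoint(double overestimate,
                                 double scaleFactor, // e.g. 0.5 per step
                                 std::size_t maxSteps,
                                 const std::function<double(double)>& testError) {
    double best{overestimate};
    double bestError{testError(overestimate)};
    for (std::size_t i = 0; i < maxSteps; ++i) {
        double candidate{best * scaleFactor};
        double error{testError(candidate)};
        if (error > bestError) {
            break; // the error has turned: we've crossed into overfitting
        }
        best = candidate;
        bestError = error;
    }
    return best; // centre the hyperparameter search on this value
}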

This also does a better job of monitoring progress. We explicitly account for the cost of initialisation and update progress after each forest is trained, rather than once per round of the hyperparameter optimisation. Since it is useful to share progress monitoring between the boosted tree factory and the implementation, I've migrated to storing the loop progress monitor on the implementation and persisting and restoring it. This incidentally fixes a bug in progress monitoring on resume.
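
Roughly, the accounting works as follows; this is a hypothetical sketch, not the actual class (only m_TrainingProgress and attach appear in the diff below):

#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical sketch: total work is the initialisation cost plus one unit
// per forest trained; progress is reported after every forest rather than
// once per hyperparameter optimisation round.
class CTrainingProgress {
public:
    CTrainingProgress(std::size_t initializationCost, std::size_t numberForests)
        : m_Total{initializationCost + numberForests} {}
    void attach(std::function<void(double)> recordProgress) {
        m_RecordProgress = std::move(recordProgress);
    }
    void increment(std::size_t amount = 1) {
        m_Done += amount;
        if (m_RecordProgress) {
            m_RecordProgress(static_cast<double>(m_Done) /
                             static_cast<double>(m_Total));
        }
    }
private:
    std::size_t m_Done{0};
    std::size_t m_Total;
    std::function<void(double)> m_RecordProgress;
};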

Finally, I've refactored the regularisation parameters into a single object that better encapsulates them. This anticipates depth based regularisation.
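
Schematically, the encapsulation might look like the following (member names beyond the gamma setter are assumptions; the diffs below only show m_Regularization.gamma):

// Hypothetical sketch of grouping the regularisation parameters in one class
// with chainable setters, rather than keeping them as loose members.
class CRegularization {
public:
    CRegularization& gamma(double gamma) {
        m_Gamma = gamma;
        return *this;
    }
    CRegularization& lambda(double lambda) {
        m_Lambda = lambda;
        return *this;
    }
    double gamma() const { return m_Gamma; }
    double lambda() const { return m_Lambda; }
    // A depth based penalty would be added here as a further member.
private:
    double m_Gamma{0.0};  // penalises the number of leaves
    double m_Lambda{0.0}; // penalises the leaf weight magnitudes
};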

@valeriy42 valeriy42 (Contributor) left a comment

Great idea on improving the search intervals for hyperparameters! Since this code is not trivial, I left a couple of comments to improve readability.

@@ -334,6 +482,8 @@ CBoostedTreeFactory::constructFromString(std::istream& jsonStringStream,
     if (treePtr->acceptRestoreTraverser(traverser) == false || traverser.haveBadState()) {
         throw std::runtime_error{"failed to restore boosted tree"};
     }
+    treePtr->m_Impl->m_TrainingProgress.attach(recordProgress);
@valeriy42 (Contributor) commented:

You are using recordProgress after it was moved on line 479. It cannot end well 😉
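
For context, a minimal standalone illustration of the hazard being flagged (recordProgress here just stands in for the callback in the diff):

#include <functional>
#include <iostream>
#include <utility>

int main() {
    std::function<void()> recordProgress = [] { std::cout << "progress\n"; };

    // After the move, recordProgress is left in a valid but unspecified
    // state; in practice it is empty, and invoking an empty std::function
    // throws std::bad_function_call.
    auto sink = std::move(recordProgress);

    sink();            // fine
    // recordProgress(); // would throw std::bad_function_call
    return 0;
}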

@tveasey (Contributor, Author) commented:

And indeed that was what was causing the test failures!

@tveasey tveasey (Contributor, Author) commented Sep 24, 2019:

(I'm mulling over having constructFromString return the factory, as we do when constructing from parameters. Aesthetically, I don't like the asymmetry. I'm just seeing how it'll work out.)

@tveasey (Contributor, Author) commented:

I updated along the lines of the second comment, which incidentally fixes the use of the moved-from function. I feel like this is cleaner. Let me know what you think.

@tveasey (Contributor, Author) commented:

See e42208c

} catch (const std::exception& e) {
HANDLE_FATAL(<< "Input error: '" << e.what() << "'. Check logs for more details.");
throw std::runtime_error{std::string{"Input error: '"} + e.what() + "'"};
@tveasey (Contributor, Author) commented:

Note that I changed this to throw. HANDLE_FATAL will abort the program, but this error is recoverable since we can just start again. The exception is already handled in the runner classes in the api library.
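
A small sketch of the distinction, with a hypothetical caller standing in for the api-layer runner: a thrown exception can be caught and training restarted, whereas HANDLE_FATAL would terminate the process:

#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for restoring from a persisted state document.
void restoreBoostedTree(const std::string& state) {
    if (state.empty()) { // stand-in for a corrupt or truncated document
        throw std::runtime_error{"Input error: 'failed to restore boosted tree'"};
    }
}

int main() {
    try {
        restoreBoostedTree("");
    } catch (const std::exception& e) {
        // Recoverable: log and start training from scratch instead of aborting.
        std::cerr << e.what() << ". Restarting training from scratch.\n";
    }
    return 0;
}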

@tveasey tveasey (Contributor, Author) commented Sep 24, 2019

Many thanks for the review, @valeriy42, and good suggestions! I think I've now worked through them all; could you take another look?

@valeriy42 valeriy42 (Contributor) left a comment

Great work on improving the readability and introducing the symmetry in the factory. I left a couple of comments; we also discussed further improvements offline.

@@ -228,8 +230,11 @@ bool CDataFrameBoostedTreeRunner::restoreBoostedTree(
return false;
}

-    m_BoostedTree = maths::CBoostedTreeFactory::constructFromString(
-        *inputStream, frame, progressRecorder(), memoryEstimator(), statePersister());
+    m_BoostedTree = maths::CBoostedTreeFactory::constructFromString(*inputStream)
@valeriy42 (Contributor) commented:

Nice! I like the symmetry now. 👍
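
As a toy illustration of that symmetry (every name here except constructFromString is hypothetical): both construction paths return a factory, so callers build the tree the same way in either case:

#include <istream>
#include <sstream>

// Toy sketch: both entry points return a factory, so the call sites read
// symmetrically whether starting from parameters or from persisted state.
class TreeFactory {
public:
    static TreeFactory constructFromParameters(int numberThreads) {
        return TreeFactory{numberThreads};
    }
    static TreeFactory constructFromString(std::istream& state) {
        int numberThreads{1};
        state >> numberThreads; // restore configuration from the state
        return TreeFactory{numberThreads};
    }
    int build() const { return m_NumberThreads; } // stand-in for the tree
private:
    explicit TreeFactory(int numberThreads) : m_NumberThreads{numberThreads} {}
    int m_NumberThreads;
};

int main() {
    int a = TreeFactory::constructFromParameters(4).build();
    std::istringstream state{"8"};
    int b = TreeFactory::constructFromString(state).build();
    return (a == 4 && b == 8) ? 0 : 1;
}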

TVector fallbackInterval{{MIN_REGULARIZER_SCALE, 1.0, MAX_REGULARIZER_SCALE}};
m_TreeImpl->m_Regularization.gamma(m_GammaSearchInterval(MIN_REGULARIZER_INDEX));

double initialLambda{totalCurvaturePerNode};
@valeriy42 (Contributor) commented:

Nice 👍

@tveasey tveasey (Contributor, Author) commented Sep 25, 2019

I also refactored the line search function along the lines we discussed offline in 937df23. I agree this makes the idea clearer: good suggestion! Can you take another look, @valeriy42?

@valeriy42 valeriy42 (Contributor) left a comment

LGTM. I like how the changes made the core of the algorithm much easier to understand while, at the same time, removing "magic numbers" and improving the prediction quality. I left a couple of comments about updating the code comments. You can merge without the need for me to review again.

// These are scales > bestRegularizerScale hence 1 / multiplier.
interval(MAX_REGULARIZER_INDEX) = std::min(
std::pow(1.0 / multiplier, logScaleAtThreeSigma), MAX_REGULARIZER_SCALE);
double threeSigmaInterval{std::sqrt(3.0 * sigma / curvature)};
@valeriy42 (Contributor) commented:

Nice! Maybe you can add a comment on why you need to divide by the curvature.

@tveasey (Contributor, Author) commented:

Right, I have the comment "In particular, we solve curvature * (x - best)^2 = 3 sigma...", which I thought (maybe slightly tangentially) explained this. I feel like that is maybe enough.
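
For reference, the calculation that comment describes: modelling the test error near the best scale as a quadratic with the measured curvature, the offset at which the error grows by three sigma satisfies

curvature * (x - best)^2 = 3 * sigma
    => |x - best| = sqrt(3 * sigma / curvature)

which is the threeSigmaInterval computed above; dividing by the curvature converts an error tolerance into a width in the parameter.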

@tveasey tveasey merged commit 4c03078 into elastic:master Sep 26, 2019
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Sep 26, 2019
@tveasey tveasey deleted the improved-initialisation branch September 26, 2019 14:07
tveasey added a commit that referenced this pull request Sep 26, 2019