[ML] Correct logistic loss function #1032
Conversation
LGTM. Good work on catching this bug! I have just a single small suggestion w.r.t. readability.
```diff
@@ -1220,7 +1220,7 @@ BOOST_AUTO_TEST_CASE(testLogisticRegression) {
     LOG_DEBUG(<< "mean log relative error = "
               << maths::CBasicStatistics::mean(meanLogRelativeError));
-    BOOST_TEST_REQUIRE(maths::CBasicStatistics::mean(meanLogRelativeError) < 0.5);
+    BOOST_TEST_REQUIRE(maths::CBasicStatistics::mean(meanLogRelativeError) < 0.52);
```
Do you have an intuition why the error goes up here?
No! Interestingly though (looking at the debug output) the hyperparameter choices become more stable with this change.
My best guess is that there are multiple ways the code can "learn" to avoid the bug. For example, it'll potentially choose hyperparameter values which result in deeper trees so that it takes the non-buggy path more often. Later trees added to the forest also correct for the bug, since they're more likely to have different predictions for at least one of the rows and so take the non-buggy path.
I've tested on some additional data sets and I haven't seen evidence of any accuracy degradation, but it will be interesting to review the results on our overnights.
So of course there was a subtle undefined behaviour bug with capture by reference. In fact, the result is still slightly worse for one data set in this test, but slightly better for a different one, with the average error being back to almost exactly what it was before.
I do still think the comments above somewhat explain why the previous behaviour wasn't causing a more obvious degradation in accuracy.
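For anyone skimming the thread, here is a minimal, purely illustrative C++ sketch of this general class of capture-by-reference bug; the function name and shape are hypothetical and are not taken from this change:

```cpp
#include <functional>
#include <iostream>

// Hypothetical illustration: a lambda captures a local variable by reference
// and is invoked after that local has gone out of scope, which is undefined
// behaviour.
std::function<double()> makeAdjustedPrediction(double prediction) {
    double adjusted = prediction + 0.5;
    // BUG: 'adjusted' is captured by reference, but it is destroyed when this
    // function returns, so the returned callable reads a dangling reference.
    return [&adjusted] { return adjusted; };
    // FIX: capture by value instead: return [adjusted] { return adjusted; };
}

int main() {
    auto f = makeAdjustedPrediction(1.0);
    std::cout << f() << '\n'; // undefined behaviour: may print anything
}
```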
…n computing node values for classification
retest
This was a bug in the computation of the best leaf weight when predictions for all its training rows are identical. In this case, we should find the weight to add to the current prediction which minimises the cross-entropy.
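For context on what "minimises the cross-entropy" means here, a minimal sketch under simplifying assumptions (labels are 0/1, regularisation and curvature terms are ignored, and the function name is illustrative rather than the actual ml-cpp implementation): if every training row in the leaf carries the same prediction p, then minimising the cross-entropy of sigmoid(p + w) has a closed-form solution, w = logit(n1 / n) - p, where n1 is the number of positive rows and n the total row count.

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Hypothetical sketch: the additive weight w which minimises the cross-entropy
// of sigmoid(p + w) against 0/1 labels when all rows share the prediction p.
//
// Minimising -sum_i [y_i log(s) + (1 - y_i) log(1 - s)] with s = sigmoid(p + w)
// gives s = (# positive labels) / (# labels), i.e. w = logit(fraction) - p.
double bestWeightForIdenticalPredictions(double prediction, const std::vector<int>& labels) {
    double positives = std::count(labels.begin(), labels.end(), 1);
    double fraction = positives / static_cast<double>(labels.size());
    // Clamp to avoid an infinite weight when the leaf is pure.
    fraction = std::clamp(fraction, 1e-6, 1.0 - 1e-6);
    double logit = std::log(fraction / (1.0 - fraction));
    return logit - prediction;
}

int main() {
    std::vector<int> labels{1, 1, 1, 0};
    // Three of four rows are positive, so sigmoid(p + w) should equal 0.75.
    std::cout << bestWeightForIdenticalPredictions(0.2, labels) << '\n';
}
```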