
[ML] Correct logistic loss function #1032


Merged: 5 commits merged into elastic:master on Feb 29, 2020

Conversation

@tveasey tveasey (Contributor) commented Feb 28, 2020

This was a bug in the computation of the best leaf weight when predictions for all its training rows are identical. In this case, we should find the weight to add to the current prediction which minimises the cross-entropy.
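
For context, a minimal sketch of the idea (an illustration with assumed names, not the actual ml-cpp implementation, which also has to account for regularisation): when every row in a leaf carries the same current prediction p, the cross-entropy summed over the leaf's rows is minimised by the additive weight w at which sigmoid(p + w) equals the mean label, i.e. w = logit(mean(y)) - p.

```cpp
// Illustration only (bestLeafWeight is a hypothetical helper, not ml-cpp code).
// With a single distinct prediction p for every training row in a leaf, the
// weight w minimising sum_i crossEntropy(y_i, sigmoid(p + w)) satisfies
// sigmoid(p + w) = mean(y), so w = logit(mean(y)) - p (ignoring regularisation).
#include <algorithm>
#include <cmath>
#include <vector>

double bestLeafWeight(double prediction, const std::vector<double>& labels) {
    double mean = 0.0;
    for (double y : labels) {
        mean += y;
    }
    mean /= static_cast<double>(labels.size());
    // Clamp away from 0 and 1 so the logit stays finite; in practice a
    // regularisation term would typically keep the weight bounded instead.
    mean = std::clamp(mean, 1e-6, 1.0 - 1e-6);
    return std::log(mean / (1.0 - mean)) - prediction;
}
```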

@valeriy42 valeriy42 (Contributor) left a comment

LGTM. Good work on catching this bug! I have just one small suggestion regarding readability.

@@ -1220,7 +1220,7 @@ BOOST_AUTO_TEST_CASE(testLogisticRegression) {

     LOG_DEBUG(<< "mean log relative error = "
               << maths::CBasicStatistics::mean(meanLogRelativeError));
-    BOOST_TEST_REQUIRE(maths::CBasicStatistics::mean(meanLogRelativeError) < 0.5);
+    BOOST_TEST_REQUIRE(maths::CBasicStatistics::mean(meanLogRelativeError) < 0.52);
@valeriy42 valeriy42 (Contributor)

Do you have an intuition why the error goes up here?

@tveasey tveasey (Contributor Author) Feb 28, 2020

No! Interestingly though (looking at the debug) the hyperparameter choices become more stable with this change.

My best guess is that there are multiple ways the code can "learn" to avoid the bug. For example, it may choose hyperparameter values which result in deeper trees, so that it takes the non-buggy path more often. Later trees added to the forest also correct for the bug, since they're more likely to have different predictions for at least one of the leaf's rows and so take the non-buggy path.

I've tested on some additional data sets and I haven't seen evidence of any accuracy degradation, but it will be interesting to review the results on our overnights.

@tveasey tveasey (Contributor Author)

So of course there was a subtle undefined behaviour bug with capture by reference. In fact, the result is still slightly worse for one data set in this test, but slightly better for a different one, with the average error being back to almost exactly what it was before.

I do still think the comments above somewhat explain why the previous behaviour wasn't causing a more obvious degradation in accuracy.
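
For readers less familiar with this class of bug, here is a hypothetical minimal example (not the actual code from this PR) of the undefined behaviour that capturing a local by reference can cause when the lambda outlives the variable:

```cpp
// Hypothetical example only, not the ml-cpp code: a lambda captures a local
// variable by reference and is then called after that variable's lifetime ends.
#include <functional>
#include <iostream>

std::function<double()> makeObjective() {
    double weight = 0.5;
    // BUG: 'weight' is destroyed when this function returns, so the captured
    // reference dangles and invoking the lambda later is undefined behaviour.
    return [&weight] { return weight * weight; };
    // FIX: capture by value instead: return [weight] { return weight * weight; };
}

int main() {
    auto objective = makeObjective();
    std::cout << objective() << '\n'; // reads a dangling reference
    return 0;
}
```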

@tveasey tveasey (Contributor Author) commented Feb 28, 2020

retest

1 similar comment
@tveasey tveasey (Contributor Author) commented Feb 28, 2020

retest

@tveasey tveasey merged commit 1cf9de8 into elastic:master Feb 29, 2020
@tveasey tveasey deleted the classification-loss-bug branch February 29, 2020 09:43
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Feb 29, 2020
tveasey added a commit that referenced this pull request Feb 29, 2020
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Mar 16, 2020
tveasey added a commit that referenced this pull request Mar 17, 2020