[ML] Eagerly discard node statistics for leaves which we will never split #1125
Conversation
- counter_t::E_DFTPMEstimatedPeakMemoryUsage) < 6000000);
- BOOST_TEST_REQUIRE(core::CProgramCounters::counter(counter_t::E_DFTPMPeakMemoryUsage) < 1500000);
+ counter_t::E_DFTPMEstimatedPeakMemoryUsage) < 4500000);
+ BOOST_TEST_REQUIRE(core::CProgramCounters::counter(counter_t::E_DFTPMPeakMemoryUsage) < 1600000);
We're not using more memory; however, we now properly account for the memory used by the container holding the leaf statistics, because we resize it before estimating its memory usage.
LGTM
This is a simple memory optimisation for regression and classification model training.
Since we limit the maximum tree size, we only need retain statistics for the maximum gain "max tree size" - "current tree size" leaves. Since we add one leaf statistic for each node we add to the tree we only ever need at most "maximum tree size" / 2 node statistics in total.
We use an upper bound for the leaf statistics memory estimates, so this also indirectly improves our memory estimate accuracy and thus helps with #1106. For example, in testing this reduced estimates by around 25%.