Skip to content

[ML] Eagerly discard node statistics for leaves which we will never split #1125

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 15, 2020

Conversation

tveasey
Copy link
Contributor

@tveasey tveasey commented Apr 3, 2020

This is a simple memory optimisation for regression and classification model training.

Since we limit the maximum tree size, we only need retain statistics for the maximum gain "max tree size" - "current tree size" leaves. Since we add one leaf statistic for each node we add to the tree we only ever need at most "maximum tree size" / 2 node statistics in total.

We use an upper bound for the leaf statistics memory estimates so this also indirectly helps improve our memory estimate accuracy and so helps with #1106. For example, this reduced estimates by around 25% testing.

counter_t::E_DFTPMEstimatedPeakMemoryUsage) < 6000000);
BOOST_TEST_REQUIRE(core::CProgramCounters::counter(counter_t::E_DFTPMPeakMemoryUsage) < 1500000);
counter_t::E_DFTPMEstimatedPeakMemoryUsage) < 4500000);
BOOST_TEST_REQUIRE(core::CProgramCounters::counter(counter_t::E_DFTPMPeakMemoryUsage) < 1600000);
Copy link
Contributor Author

@tveasey tveasey Apr 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not using more memory; however, we now properly account for the memory used by container for the leaf statistics because we resize it before estimating its memory usage.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tveasey tveasey merged commit cb7dc7e into elastic:master Apr 15, 2020
@tveasey tveasey deleted the improve-memory-usage branch April 15, 2020 09:49
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Apr 17, 2020
tveasey added a commit that referenced this pull request Apr 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants