[ML] Improvements to regression and classification memory handling #995

Closed
3 of 4 tasks
tveasey opened this issue Feb 12, 2020 · 2 comments
Comments

@tveasey
Contributor

tveasey commented Feb 12, 2020

Currently, our upfront memory usage estimate is an upper bound and, in most cases, a significant overestimate. This issue covers a couple of quick wins for improving the estimate:

  • Pass the training percentage: we need to know the number of rows used to train
  • Pass the number of distinct values for each feature: this would let us better estimate how much memory we'll use for aggregate loss derivatives
  • Account for the maximum number of features we will select
  • SHAP's memory usage is not concurrent with the leaf statistics memory usage
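
To make the four points concrete, here is a minimal sketch of how they could each tighten an upper bound. It is not the estimator used in this project: the function name, the cost model, and every constant (`max_buckets_per_feature`, `num_leaves`, `bytes_per_training_row`, `bytes_per_bucket`) are hypothetical placeholders chosen for illustration. It also reads the last bullet as saying that SHAP and leaf statistics memory are not live at the same time, so the larger of the two is counted rather than their sum.

```python
import math
from typing import Sequence


def estimate_peak_bytes(
    num_rows: int,
    training_percent: float,
    distinct_values_per_feature: Sequence[int],
    max_selected_features: int,
    max_buckets_per_feature: int = 256,  # hypothetical cap on buckets per feature
    num_leaves: int = 32,                # hypothetical number of leaves tracked
    bytes_per_training_row: int = 64,    # hypothetical per-row working memory
    bytes_per_bucket: int = 16,          # hypothetical size of one derivative bucket
    shap_bytes: int = 0,                 # hypothetical SHAP working memory
) -> int:
    """Illustrative upper bound on peak training memory, in bytes."""
    # Point 1: only the rows actually used for training contribute.
    training_rows = math.ceil(num_rows * training_percent / 100.0)
    row_bytes = training_rows * bytes_per_training_row

    # Point 3: at most max_selected_features features are used. Assuming the
    # features with the most distinct values are chosen keeps this a bound.
    worst_case = sorted(distinct_values_per_feature, reverse=True)
    worst_case = worst_case[:max_selected_features]

    # Point 2: a feature with few distinct values needs few buckets of
    # aggregate loss derivatives.
    buckets = sum(min(d, max_buckets_per_feature) for d in worst_case)
    leaf_statistics_bytes = num_leaves * buckets * bytes_per_bucket

    # Point 4: SHAP and leaf statistics are assumed not to be live together,
    # so count the larger of the two instead of their sum.
    return row_bytes + max(leaf_statistics_bytes, shap_bytes)
```

For example, `estimate_peak_bytes(1000, 50.0, [2, 3, 1000], 2, shap_bytes=1000)` charges for 500 training rows and for the two features with the most distinct values, one of them capped at 256 buckets.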

A better longer-term strategy would be to estimate a value which training is very unlikely to exceed, rather than a memory upper bound. This would require that we support circuit breaking during training. Since we snapshot state periodically, the user would still be able to retrospectively increase the memory limit and restart the analysis.
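
The circuit-breaking idea can be sketched roughly as follows. This is an illustration only, not this project's training loop: `train`, `Snapshot`, `MemoryLimitExceeded`, and the `memory_used` callback are hypothetical names, and the "work" done per iteration is elided. The loop checks memory against the limit before each step, records a snapshot periodically, and on a breach raises an error carrying the last snapshot so that a later run with a higher limit can resume from it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Snapshot:
    """Hypothetical persisted state: here just the iteration reached."""
    iteration: int


class MemoryLimitExceeded(Exception):
    """Raised when the circuit breaker trips; carries the last snapshot."""

    def __init__(self, snapshot: Snapshot, used: int, limit: int) -> None:
        super().__init__(f"used {used} bytes, limit is {limit} bytes")
        self.snapshot = snapshot


def train(
    num_iterations: int,
    memory_limit: int,
    memory_used: Callable[[int], int],
    snapshot_every: int = 10,
    resume_from: Optional[Snapshot] = None,
) -> Snapshot:
    """Run training iterations, tripping the breaker if memory exceeds the limit."""
    last = resume_from if resume_from is not None else Snapshot(0)
    for i in range(last.iteration, num_iterations):
        used = memory_used(i)
        if used > memory_limit:
            # Circuit breaker: stop rather than exceed the limit.
            raise MemoryLimitExceeded(last, used, memory_limit)
        # ... one unit of training work would happen here ...
        if (i + 1) % snapshot_every == 0:
            last = Snapshot(i + 1)  # periodic state snapshot
    return Snapshot(num_iterations)
```

A caller would catch `MemoryLimitExceeded`, raise the limit, and call `train` again with `resume_from=error.snapshot`, losing at most `snapshot_every - 1` iterations of work.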

@tveasey
Contributor Author

tveasey commented Feb 17, 2020

This is also impacted by #1003.

@tveasey
Contributor Author

tveasey commented Feb 1, 2021

We've done quite a lot of work on memory usage since this issue was created. Whilst it would be possible to refine the estimates if we knew that certain features had relatively few distinct values, this is not a priority at present. We can revisit this if we decide we need better memory estimates in the future.

@tveasey tveasey closed this as completed Feb 1, 2021