[ML] Node can go out of memory during daily maintenance or when listing snapshots #70372
Comments
Pinging @elastic/ml-core (Team:ML)
hendrikmuhs
pushed a commit
to hendrikmuhs/elasticsearch
that referenced
this issue
Mar 15, 2021
… avoid memory overhead. fixes elastic#70372
hendrikmuhs
pushed a commit
that referenced
this issue
Mar 15, 2021
…tiles (#70376) Large jobs with lots of partitions can get very big, retrieving snapshots for such a job can cause a node to go out of memory. With this change do not fetch quantiles when querying for (multiple) modelSnapshots to avoid memory overhead. Quantiles aren't needed for the API's using JobResultsProvider.modelSnapshots(...) fixes #70372
hendrikmuhs
pushed a commit
to hendrikmuhs/elasticsearch
that referenced
this issue
Mar 15, 2021
…tiles (elastic#70376) Large jobs with lots of partitions can get very big, retrieving snapshots for such a job can cause a node to go out of memory. With this change do not fetch quantiles when querying for (multiple) modelSnapshots to avoid memory overhead. Quantiles aren't needed for the API's using JobResultsProvider.modelSnapshots(...) fixes elastic#70372
This was referenced Mar 15, 2021
hendrikmuhs
pushed a commit
that referenced
this issue
Mar 15, 2021
… quantiles (#70381) Large jobs with lots of partitions can get very big, retrieving snapshots for such a job can cause a node to go out of memory. With this change do not fetch quantiles when querying for (multiple) modelSnapshots to avoid memory overhead. Quantiles aren't needed for the API's using JobResultsProvider.modelSnapshots(...) fixes #70372
hendrikmuhs
pushed a commit
that referenced
this issue
Mar 15, 2021
…g quan… (#70385) Large jobs with lots of partitions can get very big, retrieving snapshots for such a job can cause a node to go out of memory. With this change do not fetch quantiles when querying for (multiple) modelSnapshots to avoid memory overhead. Quantiles aren't needed for the API's using JobResultsProvider.modelSnapshots(...) fixes #70372
Unfortunately, deleting the job may not work either, because one of the steps in deleting a job is to get its model snapshots in order to find the IDs of the state documents to be deleted 🤦.
Affected versions: 6.8-7.12
Large jobs with a lot of partitions can get very big. Retrieving snapshots for such a job can cause a node to run out of memory.
This problem can happen:
Mitigation:
Ideally, upgrade to 7.12. If this is not an option: […] `size` parameter with a small value to avoid going OOM). Note that as new snapshots get created, this must be done regularly; alternatively, after the initial cleanup, retain fewer snapshots.
Solution:
It is not necessary to load all data when listing snapshots or when finding candidates to remove during daily maintenance. By using a source filter, we avoid loading unnecessary parts of the snapshots. As a positive side effect, less data is transferred and the model snapshot API should become more responsive in general.
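As a rough illustration of the source-filter idea (not the actual change in `JobResultsProvider`), the sketch below builds a search request body that excludes the `quantiles` field via the `_source.excludes` option of the Elasticsearch search API. The function names, the `job_id` and `timestamp` field names, and the sample document are assumptions made for this example only; `apply_source_excludes` merely mimics, locally and for top-level fields, what the filter does on the server side.

```python
import json


def build_snapshot_listing_query(job_id, size=10, excluded_fields=("quantiles",)):
    """Build a search body that lists model snapshots without large fields.

    Hypothetical helper: the field names `job_id` and `timestamp` are
    assumptions for this sketch, not taken from the issue text.
    """
    return {
        "size": size,  # keep pages small to limit memory use
        "query": {"term": {"job_id": job_id}},
        "sort": [{"timestamp": {"order": "desc"}}],
        # Source filter: the excluded fields are not returned in the hits.
        "_source": {"excludes": list(excluded_fields)},
    }


def apply_source_excludes(doc, excluded_fields=("quantiles",)):
    """Local stand-in for the filter: drop excluded top-level fields."""
    return {k: v for k, v in doc.items() if k not in excluded_fields}


if __name__ == "__main__":
    body = build_snapshot_listing_query("my-large-job", size=5)
    print(json.dumps(body, indent=2))

    # Made-up snapshot document with an oversized quantiles payload.
    snapshot = {
        "job_id": "my-large-job",
        "snapshot_id": "1",
        "quantiles": {"quantile_state": "x" * 100_000},
    }
    slim = apply_source_excludes(snapshot)
    print(len(json.dumps(snapshot)), "->", len(json.dumps(slim)), "bytes")
```

The listing and maintenance code paths only need snapshot metadata such as IDs and timestamps, so dropping the large field at query time keeps the response small without changing what those callers see.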