
[ML] Node can go out of memory during daily maintenance or when listing snapshots #70372

Closed
hendrikmuhs opened this issue Mar 15, 2021 · 2 comments · Fixed by #70376
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team

Comments

@hendrikmuhs

hendrikmuhs commented Mar 15, 2021

Affected versions: 6.8-7.12

Large jobs with a lot of partitions can get very big; retrieving snapshots for such a job can cause a node to go out of memory.

This problem can happen:

  • when using the get model snapshots API, either directly or indirectly triggered by the UI when opening the Model Snapshots tab
  • during daily maintenance, when expired model snapshots are supposed to get deleted (unfortunately the OOM happens in the preparation step, so the error can keep re-appearing)

Mitigation:

The best option is to upgrade to 7.12. If that is not possible:

  • delete the offending job and consider re-creating it in a less resource-hungry way (e.g. by splitting it)
  • increase the heap space of the node (temporarily, until upgrading to 7.12)
  • manually delete snapshots (use the size parameter with a small value to avoid running out of memory); note that because new snapshots keep getting created this has to be repeated regularly, or after the initial cleanup the job should be configured to retain fewer snapshots (see the sketch after this list)
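A minimal sketch of the manual cleanup, using the get model snapshots, delete model snapshot and update job APIs (paths are for 7.x; on 6.8 they live under _xpack/ml instead of _ml). The job name my_big_job, the snapshot ID and the retention value below are placeholders:

```
# Page through the job's snapshots with a small size to keep each response small
GET _ml/anomaly_detectors/my_big_job/model_snapshots?from=0&size=5&sort=timestamp

# Delete a snapshot that is no longer needed
# (the ID is a placeholder; the snapshot currently in use by the job cannot be deleted)
DELETE _ml/anomaly_detectors/my_big_job/model_snapshots/1615766400

# Optionally retain fewer snapshots going forward so the cleanup does not have to be repeated as often
POST _ml/anomaly_detectors/my_big_job/_update
{
  "model_snapshot_retention_days": 1
}
```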

Solution:

It is not necessary to load all of the data when listing snapshots or when finding candidates to remove during daily maintenance. By using a source filter we avoid loading the unnecessary parts of the snapshots. This has the positive side effect that less data is transferred and the model snapshots API should become more responsive in general.
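To illustrate the idea (this is not the code from the fix; the index pattern, the job_id value and the quantiles field name are assumptions used only for illustration), a source filter on a search request excludes the large field from the fetched documents:

```
# Hypothetical hand-written search over the ML results indices that skips the
# (potentially huge) quantiles field of the model snapshot documents
GET .ml-anomalies-*/_search
{
  "query": {
    "term": { "job_id": "my_big_job" }
  },
  "_source": {
    "excludes": [ "quantiles" ]
  },
  "size": 10
}
```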

hendrikmuhs added the >bug and :ml Machine learning labels Mar 15, 2021
elasticmachine added the Team:ML Meta label for the ML team label Mar 15, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

hendrikmuhs self-assigned this Mar 15, 2021
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Mar 15, 2021
hendrikmuhs pushed a commit that referenced this issue Mar 15, 2021
…tiles (#70376)

Large jobs with lots of partitions can get very big; retrieving snapshots
for such a job can cause a node to go out of memory.

With this change we do not fetch quantiles when querying for (multiple)
model snapshots, avoiding the memory overhead. Quantiles aren't needed by
the APIs that use JobResultsProvider.modelSnapshots(...)

fixes #70372
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Mar 15, 2021
…tiles (elastic#70376)
hendrikmuhs pushed a commit that referenced this issue Mar 15, 2021
… quantiles (#70381)
hendrikmuhs pushed a commit that referenced this issue Mar 15, 2021
…g quan… (#70385)
@droberts195
Contributor

If that is not possible:

  • delete the offending job and consider re-creating it in a less resource-hungry way (e.g. by splitting it)

Unfortunately, deleting the job may not work either, as one of the steps in deleting a job is to get its model snapshots in order to find the IDs of the state documents to be deleted 🤦.
