
[ML] Node can go out of memory during daily maintenance or when listing snapshots #70372

Closed
hendrikmuhs opened this issue Mar 15, 2021 · 2 comments · Fixed by #70376
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team

Comments

@hendrikmuhs

hendrikmuhs commented Mar 15, 2021

Affected versions: 6.8-7.12

Large jobs with a lot of partitions can get very big; retrieving snapshots for such a job can cause a node to go out of memory.

This problem can happen:

  • when using the get model snapshots API, either directly or indirectly triggered by the UI when opening the Model Snapshots tab
  • during daily maintenance, when expired model snapshots are supposed to get deleted (unfortunately the OOM happens in the preparation step, so the error can keep re-appearing)

Mitigation:

The best option is to upgrade to 7.12. If that is not possible:

  • delete the offending job and consider re-creating it in a less resource-hungry way (e.g. by splitting it)
  • increase the heap space of the node (temporarily, until upgrading to 7.12)
  • manually delete snapshots (use the size parameter with a small value to avoid running out of memory); note that because new snapshots keep getting created this has to be repeated regularly, or after the initial cleanup the job should be configured to retain fewer snapshots (see the sketch after this list)
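A minimal sketch of the manual cleanup, using the get model snapshots, delete model snapshot and update job APIs (paths are for 7.x; on 6.8 they live under _xpack/ml instead of _ml). The job name my_big_job, the snapshot ID and the retention value below are placeholders:

```
# Page through the job's snapshots with a small size to keep each response small
GET _ml/anomaly_detectors/my_big_job/model_snapshots?from=0&size=5&sort=timestamp

# Delete a snapshot that is no longer needed
# (the ID is a placeholder; the snapshot currently in use by the job cannot be deleted)
DELETE _ml/anomaly_detectors/my_big_job/model_snapshots/1615766400

# Optionally retain fewer snapshots going forward so the cleanup does not have to be repeated as often
POST _ml/anomaly_detectors/my_big_job/_update
{
  "model_snapshot_retention_days": 1
}
```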

Solution:

It is not necessary to load all of the data when listing snapshots or when finding candidates to remove during daily maintenance. By using a source filter we avoid loading the unnecessary parts of the snapshots. This has the positive side effect that less data is transferred and the model snapshots API should become more responsive in general.
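To illustrate the idea (this is not the code from the fix; the index pattern, the job_id value and the quantiles field name are assumptions used only for illustration), a source filter on a search request excludes the large field from the fetched documents:

```
# Hypothetical hand-written search over the ML results indices that skips the
# (potentially huge) quantiles field of the model snapshot documents
GET .ml-anomalies-*/_search
{
  "query": {
    "term": { "job_id": "my_big_job" }
  },
  "_source": {
    "excludes": [ "quantiles" ]
  },
  "size": 10
}
```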

hendrikmuhs added the >bug and :ml Machine learning labels Mar 15, 2021
elasticmachine added the Team:ML Meta label for the ML team label Mar 15, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

hendrikmuhs self-assigned this Mar 15, 2021
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Mar 15, 2021
hendrikmuhs pushed a commit that referenced this issue Mar 15, 2021
…tiles (#70376)

Large jobs with lots of partitions can get very big; retrieving snapshots
for such a job can cause a node to go out of memory.

With this change we do not fetch quantiles when querying for (multiple)
model snapshots, avoiding the memory overhead. Quantiles aren't needed by
the APIs that use JobResultsProvider.modelSnapshots(...)

fixes #70372
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Mar 15, 2021
…tiles (elastic#70376)
hendrikmuhs pushed a commit that referenced this issue Mar 15, 2021
… quantiles (#70381)
hendrikmuhs pushed a commit that referenced this issue Mar 15, 2021
…g quan… (#70385)
@droberts195
Contributor

If that is not possible:

  • delete the offending job and consider re-creating it in a less resource-hungry way (e.g. by splitting it)

Unfortunately, deleting the job may not work either, as one of the steps in deleting a job is to get its model snapshots in order to find the IDs of the state documents to be deleted 🤦.
