[ML] Prevent node potentially going out of memory due to loading quantiles #70376

Merged 1 commit on Mar 15, 2021

hendrikmuhs

Model snapshots for large jobs with many partitions can get very big, and
retrieving snapshots for such a job can cause a node to run out of memory.

With this change, quantiles are no longer fetched when querying for (multiple)
modelSnapshots, which avoids the memory overhead. Quantiles aren't needed by
the APIs that use JobResultsProvider.modelSnapshots(...).

fixes #70372
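
For illustration only, here is a minimal sketch of what "not fetching quantiles" can look like at the search-request level, using Elasticsearch source filtering (`_source` with `excludes`). This is not the code from this PR. The class name, the helper `buildSearchBody`, and the `job_id` query field are assumptions made for the example; only the `quantiles` field name comes from the discussion here.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Elasticsearch: builds a search request body
// that asks for documents of one job but leaves large fields out of _source.
class SnapshotQuerySketch {

    // Assumes jobId and field names contain no characters that need JSON escaping.
    static String buildSearchBody(String jobId, List<String> excludedFields) {
        String excludes = excludedFields.stream()
            .map(f -> "\"" + f + "\"")
            .collect(Collectors.joining(", "));
        return "{\n"
            // Source filtering: the listed fields are dropped from each returned hit.
            + "  \"_source\": { \"excludes\": [" + excludes + "] },\n"
            // Assumed field name for selecting the job's documents.
            + "  \"query\": { \"term\": { \"job_id\": \"" + jobId + "\" } }\n"
            + "}";
    }

    public static void main(String[] args) {
        System.out.println(buildSearchBody("my-job", List.of("quantiles")));
    }
}
```

Excluding the field in the request means the large payload is never returned to the caller, so listing many snapshots only costs the size of their metadata.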

@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Mar 15, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

Contributor

@droberts195 droberts195 left a comment

LGTM

@hendrikmuhs hendrikmuhs merged commit 74feca2 into elastic:master Mar 15, 2021
@hendrikmuhs hendrikmuhs deleted the ml-#70372 branch March 15, 2021 12:56
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this pull request Mar 15, 2021
…tiles (elastic#70376)

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Dec 18, 2023
As a followup to elastic#70376 this change further reduces the number of
places where we fetch the `quantiles` field of model snapshot
documents.

The quantiles can be very large and can cause out-of-memory errors
on small nodes, especially if more than one document containing
quantiles is loaded into memory at one time. The method
`JobManager.validateModelSnapshotIdUpdate` was a place where
two model snapshot documents were being loaded simultaneously,
both with their quantiles unnecessarily included. Following this
change there should be no risk of that method causing an
out-of-memory exception.
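
The pattern described above, loading two snapshot documents for a check without their quantiles, can be sketched roughly as follows. This is a simplified illustration, not the actual `JobManager.validateModelSnapshotIdUpdate` code. The `Snapshot` and `SnapshotStore` types, the `includeQuantiles` flag, and the timestamp comparison used as the validation rule are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

class SnapshotValidationSketch {

    // Hypothetical, simplified stand-in for a model snapshot document.
    static final class Snapshot {
        final String id;
        final long latestRecordTimeMs;
        final String quantiles; // potentially very large; null when not loaded

        Snapshot(String id, long latestRecordTimeMs, String quantiles) {
            this.id = id;
            this.latestRecordTimeMs = latestRecordTimeMs;
            this.quantiles = quantiles;
        }
    }

    // Hypothetical store. A real one would apply the exclusion in the query,
    // so that the quantiles payload is never transferred at all.
    static final class SnapshotStore {
        private final Map<String, Snapshot> docs = new HashMap<>();

        void put(Snapshot s) {
            docs.put(s.id, s);
        }

        Snapshot get(String id, boolean includeQuantiles) {
            Snapshot s = docs.get(id);
            if (s == null || includeQuantiles) {
                return s;
            }
            // Return only the lightweight fields.
            return new Snapshot(s.id, s.latestRecordTimeMs, null);
        }
    }

    // Assumed rule for the example: the candidate must not be older than the
    // current snapshot. Both lookups skip quantiles because the check only
    // needs the timestamps.
    static boolean isValidUpdate(SnapshotStore store, String currentId, String candidateId) {
        Snapshot current = store.get(currentId, false);
        Snapshot candidate = store.get(candidateId, false);
        if (current == null || candidate == null) {
            return false;
        }
        return candidate.latestRecordTimeMs >= current.latestRecordTimeMs;
    }

    public static void main(String[] args) {
        SnapshotStore store = new SnapshotStore();
        store.put(new Snapshot("old", 1_000L, "large-quantiles-state"));
        store.put(new Snapshot("new", 2_000L, "large-quantiles-state"));
        System.out.println(isValidUpdate(store, "old", "new"));
        System.out.println(isValidUpdate(store, "new", "old"));
    }
}
```

Making the caller opt in to quantiles explicitly keeps the cheap path as the default, which matters most where more than one document is held in memory at the same time.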
droberts195 added a commit that referenced this pull request Dec 19, 2023
…103530)

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Dec 19, 2023
…lastic#103530)

elasticsearchmachine pushed a commit that referenced this pull request Dec 19, 2023
…103530) (#103551)

navarone-feekery pushed a commit to navarone-feekery/elasticsearch that referenced this pull request Dec 22, 2023
…lastic#103530)

Labels
>bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v7.12.0, v7.13.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[ML] Node can go out of memory during daily maintainance or when listing snapshots
4 participants