While triaging #69276 I noticed a different problem that could theoretically affect production use of ML.
The master node log contains this section:
[2021-02-19T13:09:20,538][TRACE][o.e.x.m.j.p.JobResultsProvider] [v7.12.0-0] [ml-mappings-upgrade-job] Insufficient history to calculate established memory use
[2021-02-19T13:09:20,540][TRACE][o.e.x.m.j.p.JobResultsProvider] [v7.12.0-0] ES API CALL: search latest model_size_stats for job ml-snapshots-upgrade-job
[2021-02-19T13:09:20,543][DEBUG][o.e.c.c.PublicationTransportHandler] [v7.12.0-0] received diff cluster state version [846] with uuid [5cD6NFvOS0idmNd4tB96hw], diff size [560]
[2021-02-19T13:09:20,557][INFO ][o.e.x.m.a.TransportUpgradeJobModelSnapshotAction] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] sending start upgrade request
[2021-02-19T13:09:20,589][DEBUG][o.e.c.c.C.CoordinatorPublication] [v7.12.0-0] publication ended successfully: Publication{term=6, version=846}
[2021-02-19T13:09:20,592][DEBUG][o.e.x.m.j.JobNodeSelector] [v7.12.0-0] Falling back to allocating job [ml-snapshots-upgrade-job] by job counts because its memory requirement was not available
This implies that two calls to memoryTracker.isRecentlyRefreshed(), made in different parts of the code and assumed to return the same value, actually returned different values: false in SnapshotUpgradeTaskExecutor.getAssignment and true in AbstractJobPersistentTasksExecutor.checkMemoryFreshness.
Unnecessarily falling back to assigning jobs by count rather than by memory is bad because it could lead to ML native processes suffering OOM errors when memory is constrained.
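To make the race concrete, here is a minimal, hypothetical sketch (the class and method names below are illustrative, not the real Elasticsearch code): if a background refresh completes between two reads of the "recently refreshed" flag, the two reads disagree and code written on the assumption that they match takes contradictory branches.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for the memory tracker: a "recently refreshed" flag
// that a background refresh thread can flip at any time.
class MemoryTrackerSketch {
    private volatile Instant lastRefresh = Instant.EPOCH;

    boolean isRecentlyRefreshed() {
        return Duration.between(lastRefresh, Instant.now()).toMinutes() < 1;
    }

    void onRefreshCompleted() {   // invoked from a refresh thread
        lastRefresh = Instant.now();
    }
}

class AssignmentSketch {
    private final MemoryTrackerSketch memoryTracker = new MemoryTrackerSketch();

    void getAssignment() {
        boolean firstRead = memoryTracker.isRecentlyRefreshed();   // e.g. false

        // ... a refresh completes on another thread here ...

        boolean secondRead = memoryTracker.isRecentlyRefreshed();  // now true

        // Any logic that assumes firstRead == secondRead can take contradictory
        // branches, matching the symptom in the log above: one code path decides
        // the tracker is stale while another decides it is fresh.
        if (firstRead != secondRead) {
            System.out.println("race observed: the two reads disagreed");
        }
    }
}
```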
This change fixes a race condition that can occur if the
return value of memoryTracker.isRecentlyRefreshed() changes
between two calls that are assumed to return the same value.
The solution is to just call the method once and pass that
value to the other place where it is needed. Then all related
code makes decisions based on the same view of whether the
memory tracker has been recently refreshed or not.
Fixes #69289
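A minimal sketch of that approach, reusing the hypothetical MemoryTrackerSketch above (again, names and signatures are illustrative, not the actual Elasticsearch classes): the flag is read exactly once and the resulting boolean is passed to every decision that depends on it.

```java
class FixedAssignmentSketch {
    private final MemoryTrackerSketch memoryTracker = new MemoryTrackerSketch();

    void getAssignment() {
        // Read the flag exactly once; every downstream check sees this snapshot.
        boolean isMemoryTrackerRecentlyRefreshed = memoryTracker.isRecentlyRefreshed();

        if (isMemoryTrackerRecentlyRefreshed == false) {
            requestMemoryRefresh();
        }
        checkMemoryFreshness(isMemoryTrackerRecentlyRefreshed);
    }

    // The freshness decision is based on the caller's snapshot rather than on
    // a second, possibly different, read of the tracker.
    private void checkMemoryFreshness(boolean isMemoryTrackerRecentlyRefreshed) {
        if (isMemoryTrackerRecentlyRefreshed == false) {
            // defer the assignment until memory requirements are known instead
            // of silently falling back to allocating by job counts
        }
    }

    private void requestMemoryRefresh() {
        // trigger an asynchronous refresh of the memory tracker (illustrative)
    }
}
```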