While triaging #69276 I noticed a different problem that could theoretically affect production use of ML.
The master node log contains this section:
[2021-02-19T13:09:20,538][TRACE][o.e.x.m.j.p.JobResultsProvider] [v7.12.0-0] [ml-mappings-upgrade-job] Insufficient history to calculate established memory use
[2021-02-19T13:09:20,540][TRACE][o.e.x.m.j.p.JobResultsProvider] [v7.12.0-0] ES API CALL: search latest model_size_stats for job ml-snapshots-upgrade-job
[2021-02-19T13:09:20,543][DEBUG][o.e.c.c.PublicationTransportHandler] [v7.12.0-0] received diff cluster state version [846] with uuid [5cD6NFvOS0idmNd4tB96hw], diff size [560]
[2021-02-19T13:09:20,557][INFO ][o.e.x.m.a.TransportUpgradeJobModelSnapshotAction] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] sending start upgrade request
[2021-02-19T13:09:20,589][DEBUG][o.e.c.c.C.CoordinatorPublication] [v7.12.0-0] publication ended successfully: Publication{term=6, version=846}
[2021-02-19T13:09:20,592][DEBUG][o.e.x.m.j.JobNodeSelector] [v7.12.0-0] Falling back to allocating job [ml-snapshots-upgrade-job] by job counts because its memory requirement was not available
This implies that two calls to memoryTracker.isRecentlyRefreshed(), made in different parts of the code and assumed to return the same value, actually returned different values: false in SnapshotUpgradeTaskExecutor.getAssignment and true in AbstractJobPersistentTasksExecutor.checkMemoryFreshness.
Unnecessarily falling back to assigning jobs by count rather than by memory is bad because it could lead to ML native processes suffering OOM errors when memory is constrained.
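To make the race concrete, here is a minimal, hypothetical sketch (the class and method names below are illustrative, not the real Elasticsearch code): if a background refresh completes between two reads of the "recently refreshed" flag, the two reads disagree and code written on the assumption that they match takes contradictory branches.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for the memory tracker: a "recently refreshed" flag
// that a background refresh thread can flip at any time.
class MemoryTrackerSketch {
    private volatile Instant lastRefresh = Instant.EPOCH;

    boolean isRecentlyRefreshed() {
        return Duration.between(lastRefresh, Instant.now()).toMinutes() < 1;
    }

    void onRefreshCompleted() {   // invoked from a refresh thread
        lastRefresh = Instant.now();
    }
}

class AssignmentSketch {
    private final MemoryTrackerSketch memoryTracker = new MemoryTrackerSketch();

    void getAssignment() {
        boolean firstRead = memoryTracker.isRecentlyRefreshed();   // e.g. false

        // ... a refresh completes on another thread here ...

        boolean secondRead = memoryTracker.isRecentlyRefreshed();  // now true

        // Any logic that assumes firstRead == secondRead can take contradictory
        // branches, matching the symptom in the log above: one code path decides
        // the tracker is stale while another decides it is fresh.
        if (firstRead != secondRead) {
            System.out.println("race observed: the two reads disagreed");
        }
    }
}
```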
This change fixes a race condition that can occur if the
return value of memoryTracker.isRecentlyRefreshed() changes
between two calls that are assumed to return the same value.
The solution is to just call the method once and pass that
value to the other place where it is needed. Then all related
code makes decisions based on the same view of whether the
memory tracker has been recently refreshed or not.
Fixes #69289
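A minimal sketch of that approach, reusing the hypothetical MemoryTrackerSketch above (again, names and signatures are illustrative, not the actual Elasticsearch classes): the flag is read exactly once and the resulting boolean is passed to every decision that depends on it.

```java
class FixedAssignmentSketch {
    private final MemoryTrackerSketch memoryTracker = new MemoryTrackerSketch();

    void getAssignment() {
        // Read the flag exactly once; every downstream check sees this snapshot.
        boolean isMemoryTrackerRecentlyRefreshed = memoryTracker.isRecentlyRefreshed();

        if (isMemoryTrackerRecentlyRefreshed == false) {
            requestMemoryRefresh();
        }
        checkMemoryFreshness(isMemoryTrackerRecentlyRefreshed);
    }

    // The freshness decision is based on the caller's snapshot rather than on
    // a second, possibly different, read of the tracker.
    private void checkMemoryFreshness(boolean isMemoryTrackerRecentlyRefreshed) {
        if (isMemoryTrackerRecentlyRefreshed == false) {
            // defer the assignment until memory requirements are known instead
            // of silently falling back to allocating by job counts
        }
    }

    private void requestMemoryRefresh() {
        // trigger an asynchronous refresh of the memory tracker (illustrative)
    }
}
```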