
[ML] do not exit the worker after warning about failed cleanup #352


Merged
merged 1 commit into from Jan 8, 2019
1 change: 0 additions & 1 deletion lib/api/CForecastRunner.cc
@@ -225,7 +225,6 @@ void CForecastRunner::forecastWorker() {
LOG_WARN(<< "Failed to cleanup temporary data from: "
<< forecastJob.s_TemporaryFolder << " error "
<< errorCode.message());
-                return;
Contributor

I wonder whether it is really a good idea to continue the loop in this case. Are there definitely no side effects from having stale directories at the point we run subsequent forecasts?

If there are definitely no side effects from having failed to clean up, then I guess this is ok. Alternatively, since you delete all pending forecasts at the end of this function, another option might be to set the m_Shutdown flag here, break out of the loop, and check that the worker hasn't shut down in the code which schedules forecasts.

Contributor

I don't think continuing the loop is that bad. Remember that a new process is only started for forecasting if the job is not currently open. So exiting the loop would mean it was not possible to do a forecast for a real-time job without closing and reopening the job, and that could be a major pain for production use cases.

Maybe I'm missing something though. If shutting down the loop is determined to be the best way forward, then the flag needs to be set to true with the mutex locked in this method and checked with the mutex locked in the push method. Otherwise there would be the potential for a race if a forecast is pushed at the same time as cleanup of another forecast is failing.
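For illustration, a minimal sketch of the locking pattern described in this comment, assuming a hypothetical queue class: only the m_Shutdown flag and the push method are taken from the discussion above; the rest is made up and is not the actual CForecastRunner code. The flag is set and checked only while holding the same mutex that guards the pending forecasts, so a push racing with a failing cleanup cannot sneak a job in after shutdown.

```cpp
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for the real forecast job type.
struct ForecastJob {
    std::string s_TemporaryFolder;
};

class ForecastQueueSketch {
public:
    // Called by the thread that schedules forecasts. Returns false if the
    // worker has already shut down, so the caller can report the failure.
    bool push(ForecastJob job) {
        std::lock_guard<std::mutex> lock(m_Mutex);
        if (m_Shutdown) {
            return false;
        }
        m_Queue.push_back(std::move(job));
        return true;
    }

    // Called by the worker thread if it decides to stop. The flag is set
    // under the same mutex that push() takes, so there is no window for a
    // racing push between the check and the shutdown.
    void shutdown() {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Shutdown = true;
        m_Queue.clear();
    }

private:
    std::mutex m_Mutex;
    std::deque<ForecastJob> m_Queue;
    bool m_Shutdown = false;
};
```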

Contributor

Ok, that's a good point. I'd thought the thread was spawned/joined each time a forecast was requested, but it does indeed look like the forecast thread is started only once when the CAnomalyJob object is created. In that case ever exiting this loop - except on process shutdown - seems undesirable.

I just wanted to raise this, since it is quite a significant change in behaviour, and to be sure we'd thought about possible side effects. A couple of things that occurred to me, but that I hadn't checked, are: 1) what do we do about filling up the disk on the write side? 2) can there be any side effects from having stale directories around for subsequent forecasts, i.e. do names definitely not get recycled?

Author

It was never intended to shut down the loop; this is a bug. Even the code comment says:

// not an error: there is also cleanup code on the Java side

and the log level is warning.

The name of a temporary file is also not hardcoded but random: https://github.com/elastic/ml-cpp/blob/master/lib/model/CForecastModelPersist.cc#L34
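For illustration, a minimal sketch of how such a randomised temporary path can be generated with boost::filesystem, which the linked CForecastModelPersist.cc uses; the "forecast-" prefix and the pattern here are assumptions, not the actual format. Because each call produces a fresh name, a stale directory left behind by a failed cleanup cannot collide with the directory chosen for a later forecast.

```cpp
#include <boost/filesystem.hpp>
#include <iostream>

int main() {
    namespace fs = boost::filesystem;
    // unique_path() fills each '%' with a random hex digit, so every call
    // produces a fresh, non-recycled directory name.
    fs::path tempDir =
        fs::temp_directory_path() / fs::unique_path("forecast-%%%%-%%%%");
    fs::create_directories(tempDir);
    std::cout << tempDir.string() << '\n';
    return 0;
}
```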

Contributor

I see that the writing of forecast documents also checks that the disk isn't full. I just wanted to double-check.

}
}
}
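For context, a rough sketch of how the cleanup step behaves after this change, assuming the removal goes through boost::filesystem; the ForecastJob type and the plain std::cerr logging stand in for the real ML types and the LOG_WARN macro shown in the diff. The point is simply that a failed cleanup is logged and the worker moves on to the next job rather than returning.

```cpp
// Illustrative only - not the actual CForecastRunner implementation.
#include <boost/filesystem.hpp>
#include <iostream>
#include <string>

struct ForecastJob {
    std::string s_TemporaryFolder;
};

// Sketch of the cleanup step after the fix: a failure produces a warning
// and the worker loop continues, rather than returning and killing the
// worker thread for the remaining lifetime of the job.
void cleanupForecastJob(const ForecastJob& forecastJob) {
    boost::system::error_code errorCode;
    boost::filesystem::remove_all(forecastJob.s_TemporaryFolder, errorCode);
    if (errorCode) {
        // not an error: there is also cleanup code on the Java side
        std::cerr << "Failed to cleanup temporary data from: "
                  << forecastJob.s_TemporaryFolder << " error "
                  << errorCode.message() << '\n';
        // no return here - the worker carries on with the next job
    }
}
```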