Skip to content

[ML] Add state format version for resilience #726

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Oct 10, 2019

Conversation

valeriy42
Copy link
Contributor

This PR adds the version number of the current release (7.5) to the format of the training state using for resilience. In the case of the version mismatch, the deserialization of the training state will fail and training will start from scratch.

I also took an opportunity to refactor the test for persistence state error handling to have a more granular control of the reasons for failure.

Copy link
Contributor

@edsavage edsavage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

errorInBayesianOptimisationState.flush();

bool throwsExceptions{false};
std::stringstream buffer;
std::streambuf* old = std::cerr.rdbuf(buffer.rdbuf());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should intercept logs like this:

  1. It's making the assumption that default logging goes through std::cerr, which is a breach of encapsulation. For example, I was wondering if I should change it to go through std::clog so that it's buffered.
  2. It's setting the buffer to a local variable which means that if the corresponding std::cerr.rdbuf(old) doesn't get run for some reason then the test suite will core dump in the next test. And when that happens it will be much more work to figure out what went wrong than looking at a nicely formatted list of test failures in Jenkins.

There's an example of how we've previously asserted on which errors are logged in CPoissonMeanConjugateTest::testSampleMarginalLikelihoodInSupportBounds. You could even use the same logging config file in your test as it's also in the maths unit tests.

If you don't like this approach then after 7.5 feature freeze I'd be happy for you to open another PR to add a method to CLogger to tell it to log to a supplied stream instead of the current logging location and then change the unit tests that use the current approach over to that. We could have a catch (...) in the tests that use that to ensure the logger is reset before the test exits for any reason. But please use the current approach for this PR and then do that change (if you think it's worthwhile) separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out, @droberts195. I adjusted the test. Could you please take another look.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the change.

I think you should call ml::core::CLogger::instance().reset() at the end of the test so that we don't have every subsequent test also logging to the file, at least in the case where it succeeds. It's not fatal for every subsequent test to log to the file, but might cause confusion one day so it's best that we don't in the common case.

But if you add that line then the PR LGTM, so go ahead and merge without further review.

@valeriy42 valeriy42 merged commit d782c42 into elastic:master Oct 10, 2019
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Oct 10, 2019
This PR adds the version number of the current release (7.5) to the format of the training state using for resilience. In the case of the version mismatch, the deserialization of the training state will fail and training will start from scratch.

I also took an opportunity to refactor the test for persistence state error handling to have a more granular control of the reasons for failure.
valeriy42 added a commit that referenced this pull request Oct 10, 2019
This PR adds the version number of the current release (7.5) to the format of the training state using for resilience. In the case of the version mismatch, the deserialization of the training state will fail and training will start from scratch.

I also took an opportunity to refactor the test for persistence state error handling to have a more granular control of the reasons for failure.
@valeriy42 valeriy42 deleted the version-state branch May 6, 2020 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants