[ML] Add state format version for resilience #726

valeriy42 · 2019-10-10T11:30:55Z

This PR adds the version number of the current release (7.5) to the format of the training state used for resilience. If the versions do not match, deserialization of the training state fails and training starts from scratch.

I also took the opportunity to refactor the test for persistence state error handling to give more granular control over the reasons for failure.
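
To illustrate the behaviour described above, here is a minimal, self-contained sketch of a version-tagged state round trip. It is not the code from this PR: `STATE_VERSION`, `SToyState`, `persist`, `restore` and `resumeOrRestart` are hypothetical names, and the plain-text encoding is an assumption made only to keep the example short.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical version tag; the PR description says the release number (7.5) is used.
const std::string STATE_VERSION{"7.5"};

// Hypothetical toy state; the real training state is far richer.
struct SToyState {
    int s_Round{0};
};

// Write the version tag first, then the payload.
std::string persist(const SToyState& state) {
    std::ostringstream out;
    out << STATE_VERSION << ' ' << state.s_Round;
    return out.str();
}

// Return std::nullopt if the tag is missing, differs from STATE_VERSION,
// or the payload cannot be parsed.
std::optional<SToyState> restore(const std::string& serialized) {
    std::istringstream in{serialized};
    std::string version;
    SToyState state;
    if (!(in >> version) || version != STATE_VERSION || !(in >> state.s_Round)) {
        return std::nullopt;
    }
    return state;
}

// Resume from the persisted state if it restores, otherwise start from scratch.
SToyState resumeOrRestart(const std::string& serialized) {
    return restore(serialized).value_or(SToyState{});
}
```

With this sketch, `resumeOrRestart("7.4 3")` yields a fresh state because the tag does not match, while `resumeOrRestart(persist(SToyState{3}))` resumes from round 3.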

edsavage

LGTM

droberts195 · 2019-10-10T11:52:07Z

lib/maths/unittest/CBoostedTreeTest.cc

    errorInBayesianOptimisationState.flush();

    bool throwsExceptions{false};
+    std::stringstream buffer;
+    std::streambuf* old = std::cerr.rdbuf(buffer.rdbuf());


I don't think we should intercept logs like this:

It's making the assumption that default logging goes through std::cerr, which is a breach of encapsulation. For example, I was wondering if I should change it to go through std::clog so that it's buffered.

It's setting the buffer to a local variable which means that if the corresponding std::cerr.rdbuf(old) doesn't get run for some reason then the test suite will core dump in the next test. And when that happens it will be much more work to figure out what went wrong than looking at a nicely formatted list of test failures in Jenkins.
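
For illustration only, the dangling-buffer hazard in the second point could be contained with an RAII guard that restores the original buffer on every exit path. This is a sketch under assumptions: `CScopedCerrCapture` and `captureDuring` are hypothetical names, it is not what this PR ended up doing, and it does nothing about the first objection (the test would still assume that logging goes through `std::cerr`).

```cpp
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <streambuf>
#include <string>

// Hypothetical helper: redirects std::cerr into an internal buffer for the
// lifetime of the object and restores the original buffer in the destructor.
class CScopedCerrCapture {
public:
    CScopedCerrCapture() : m_Old{std::cerr.rdbuf(m_Buffer.rdbuf())} {}
    ~CScopedCerrCapture() { std::cerr.rdbuf(m_Old); }

    CScopedCerrCapture(const CScopedCerrCapture&) = delete;
    CScopedCerrCapture& operator=(const CScopedCerrCapture&) = delete;

    std::string captured() const { return m_Buffer.str(); }

private:
    // Declared before m_Old so it is constructed first and destroyed last.
    std::stringstream m_Buffer;
    std::streambuf* m_Old;
};

// Hypothetical test body: writes to std::cerr and optionally exits by exception.
std::string captureDuring(bool shouldThrow) {
    CScopedCerrCapture capture;
    std::cerr << "restore failed";
    if (shouldThrow) {
        throw std::runtime_error("test aborted");
    }
    return capture.captured();
}
```

Because the destructor runs during stack unwinding, `std::cerr` points back at its original buffer even when `captureDuring(true)` throws.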

There's an example of how we've previously asserted on which errors are logged in CPoissonMeanConjugateTest::testSampleMarginalLikelihoodInSupportBounds. You could even use the same logging config file in your test as it's also in the maths unit tests.

If you don't like this approach then after 7.5 feature freeze I'd be happy for you to open another PR to add a method to CLogger to tell it to log to a supplied stream instead of the current logging location and then change the unit tests that use the current approach over to that. We could have a catch (...) in the tests that use that to ensure the logger is reset before the test exits for any reason. But please use the current approach for this PR and then do that change (if you think it's worthwhile) separately.
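
A rough sketch of what that follow-up could look like, using a stand-in class rather than the real logger: `CToyLogger`, `logTo`, `isDefault` and `captureLogs` are hypothetical and are not part of `CLogger`'s API, and defaulting to `std::clog` is an assumption made for the example.

```cpp
#include <iostream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a logger that can be pointed at a supplied stream.
class CToyLogger {
public:
    static CToyLogger& instance() {
        static CToyLogger logger;
        return logger;
    }

    void logTo(std::ostream& stream) { m_Stream = &stream; }
    void reset() { m_Stream = &std::clog; }
    bool isDefault() const { return m_Stream == &std::clog; }
    void error(const std::string& message) { *m_Stream << "ERROR " << message << '\n'; }

private:
    CToyLogger() = default;
    std::ostream* m_Stream{&std::clog};
};

// Run body with log output captured; the catch (...) ensures the logger is
// reset before the local stream goes out of scope, whatever the exit path.
template<typename BODY>
std::string captureLogs(BODY body) {
    std::ostringstream captured;
    CToyLogger::instance().logTo(captured);
    try {
        body();
    } catch (...) {
        CToyLogger::instance().reset();
        throw;
    }
    CToyLogger::instance().reset();
    return captured.str();
}
```

A test could then assert on the returned string instead of reading a log file, and the logger never keeps a pointer to a stream that has been destroyed.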

Thank you for pointing this out, @droberts195. I adjusted the test. Could you please take another look?

droberts195

Thanks for making the change.

I think you should call ml::core::CLogger::instance().reset() at the end of the test so that we don't have every subsequent test also logging to the file, at least in the case where it succeeds. It's not fatal for every subsequent test to log to the file, but it might cause confusion one day, so it's best that we don't in the common case.

But if you add that line then the PR LGTM, so go ahead and merge without further review.

use version for persist/restore and test

c6de705

valeriy42 added >non-issue :ml v8.0.0 v7.5.0 labels Oct 10, 2019

edsavage approved these changes Oct 10, 2019

droberts195 reviewed Oct 10, 2019

review comments

7a5e788

droberts195 approved these changes Oct 10, 2019

added missing CLogger reset

bbe43b2

valeriy42 merged commit d782c42 into elastic:master Oct 10, 2019

valeriy42 mentioned this pull request Oct 23, 2019

[ML] Refactor CLogger to use custom stream #773

Merged

valeriy42 deleted the version-state branch May 6, 2020 11:15
