
Random build failures: Make test failure output on numerical comparisons semi-useful #1477


Closed
TomFinley opened this issue Oct 31, 2018 · 2 comments
Labels: Build (Build related issue), test (related to tests)

Comments


TomFinley commented Oct 31, 2018

Companion issue to #1471, specific to making the test failure messages provide more helpful information than they do right now.

Many of our tests are baseline tests -- that is, they produce an output file, which we then compare against a checked-in baseline file. A while ago we changed the baseline test infrastructure a bit so that, rather than insisting on a perfect match, numbers only have to match within a certain tolerance. This is in principle a positive change, but unfortunately the current code that does it is flawed in ways that compromise its usefulness.

Consider this code, which is where many of the test failures actually occur.

delta = Math.Round(f1 - f2, digitsOfPrecision);
Assert.InRange(delta, -allowedVariance, allowedVariance);
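
Spelled out, here is an annotated copy of those same two lines (a sketch for discussion only; f1 and f2 are the two parsed numbers being compared, and digitsOfPrecision and allowedVariance come from the test's tolerance settings):

// Only the rounded difference survives to this point: the original values,
// the file being compared, and the position within it are already gone.
delta = Math.Round(f1 - f2, digitsOfPrecision);

// xunit's Assert.InRange throws as soon as delta falls outside the band,
// which aborts the whole test method on the first mismatch.
Assert.InRange(delta, -allowedVariance, allowedVariance);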

This code is bad for two reasons...

Probably the most egregious issue is that the code, instead of using the Fail method as we do everywhere else, uses xunit's Assert. This means that when a difference is detected, it fails immediately and stops the test altogether. This is extremely bad: a given test may produce many files to be baselined, and by failing on the first one we lose crucial side information that could give us a more complete picture of what is going on.

This would be especially bad in the situation where we expect that the baselines will be changed, and are running the tests just to produce the new files -- instead of having to run the tests once, we must run them many times, just to get the files that are baselined.

The second reason is that the message produced by this check is totally uninformative.

Even if we wanted to use this InRange method (we don't), it's checking against a range centered around 0. So, for example, when I was working on MulticlassTreeFeaturizedLRTest, this line was failing, reporting that -6 was outside some small range centered around 0. But I couldn't just re-run the test with a breakpoint or something to figure out what was going wrong, because even if Visual Studio for Mac were detecting our tests (for some reason it doesn't, not sure why), the entire problem was that the test was highly non-deterministic -- it had taken me over 100 tries to even get that one repro, and a subsequent run would almost certainly succeed. Yet at the same time I had absolutely no idea what files were even involved.

So first I dug in and found what files would be produced in a successful run. Then I wrote a new test that would actually just compare these files (crucially, without trying to generate new ones!), while printing out which file failed. It turned out to be this one: MulticlassLogisticRegression-TrainTest-iris-tree-featurized-out.txt. It had this line:

L1 regularization selected 78 of 78 weights.

and it should have had this line:

L1 regularization selected 72 of 72 weights.

This whole process took many frustrating minutes.

So now I had found the reason for this -6: because, I guess, 72 - 78 = -6, and obviously letting me know that -6 was not very close to 0 was far, far more important than letting me know what file differed, and where. 😄

And, to top it all off, because it is an xunit assert, all the other files that would have been generated by this test (not just the train-test output but also, e.g., the CV comparison) were not generated, because the test was stopped right there before they could be produced. So on the whole, a very frustrating experience.
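
For what it's worth, a compare-only check of the kind described above amounts to something like the following sketch (the helper name and the console output are illustrative only, not the actual test infrastructure):

using System;
using System.IO;

// Sketch of a compare-only check: diff an already-existing output file against
// its baseline line by line and say where they differ, without regenerating
// anything. Names here are illustrative, not the real test helpers.
internal static class CompareOnlySketch
{
    public static void CompareToBaseline(string baselinePath, string outputPath)
    {
        string[] expected = File.ReadAllLines(baselinePath);
        string[] actual = File.ReadAllLines(outputPath);

        int shared = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
            {
                Console.WriteLine($"Mismatch in '{outputPath}' at line {i + 1}:");
                Console.WriteLine($"  baseline: {expected[i]}");
                Console.WriteLine($"  output:   {actual[i]}");
            }
        }

        if (expected.Length != actual.Length)
            Console.WriteLine(
                $"'{outputPath}' has {actual.Length} lines; baseline has {expected.Length}.");
    }
}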

So:

  1. This comparison should not use an xunit assert; rather, it should use the Fail mechanism, to allow the test to continue running once a difference is detected, since obviously many differences may be observed and in any event we do not want to stop files from being generated. (It can perhaps stop checking that one file, but it shouldn't stop generating files altogether.)

  2. The test message should be enhanced to show additional information (see the sketch after this list), specifically:

    • Which file had the difference,
    • Where in the file, e.g., line and offset?
    • If a numerical comparison, what were the two numbers that were compared? (Not just their difference, the actual numbers)
    • Possibly even some context of the two files (surrounding lines, characters?) to make things more obvious.
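
Putting both points together, the check could look something like the sketch below. The Fail helper here stands in for the existing non-throwing failure mechanism mentioned above; the names and signature are illustrative, not a concrete API proposal.

using System;

// Sketch of the proposed check: record a mismatch without throwing, so the
// test keeps running and later files still get generated, and include the
// file, position, and both actual numbers in the message. Illustrative only.
internal class BaselineCheckSketch
{
    private bool _failed;

    // Stand-in for the test infrastructure's non-throwing Fail mechanism.
    private void Fail(string message)
    {
        _failed = true;
        Console.WriteLine(message);
    }

    public bool CheckNumbersMatch(double expected, double actual, int digitsOfPrecision,
        double allowedVariance, string path, int lineNumber)
    {
        double delta = Math.Round(expected - actual, digitsOfPrecision);
        if (Math.Abs(delta) <= allowedVariance)
            return true;

        Fail($"Baseline mismatch in '{path}' at line {lineNumber}: " +
             $"expected {expected}, got {actual} (delta {delta}, allowed ±{allowedVariance}).");
        return false; // Caller may stop checking this one file, but should keep generating others.
    }
}
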
@TomFinley added the Build and test labels on Oct 31, 2018
@TomFinley (Contributor, Author)

Incidentally I see issue #218 and associated PR #1420, which might address this issue.


sfilipi commented Nov 2, 2018

Thanks for the detailed writeup @TomFinley

@ghost locked as resolved and limited conversation to collaborators on Mar 27, 2022