
Random build failures: Make test failure output on numerical comparisons semi-useful #1477


Closed
TomFinley opened this issue Oct 31, 2018 · 2 comments
Labels: Build (Build related issue), test (related to tests)

Comments


TomFinley commented Oct 31, 2018

Companion issue to #1471, specific to making the test failure messages provide more helpful information than they do right now.

Many of our tests are baseline tests -- that is, they produce an output file, which we then compare against a checked-in baseline file. A while ago we changed the baseline test infrastructure a bit so that, rather than insisting on a perfect match, numbers only have to match within a certain tolerance. This is in principle a positive change, but unfortunately the current code that does it is flawed in ways that compromise its usefulness.

Consider this code, which is where many of the test failures actually occur.

delta = Math.Round(f1 - f2, digitsOfPrecision);
Assert.InRange(delta, -allowedVariance, allowedVariance);
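
Spelled out, here is an annotated copy of those same two lines (a sketch for discussion only; f1 and f2 are the two parsed numbers being compared, and digitsOfPrecision and allowedVariance come from the test's tolerance settings):

// Only the rounded difference survives to this point: the original values,
// the file being compared, and the position within it are already gone.
delta = Math.Round(f1 - f2, digitsOfPrecision);

// xunit's Assert.InRange throws as soon as delta falls outside the band,
// which aborts the whole test method on the first mismatch.
Assert.InRange(delta, -allowedVariance, allowedVariance);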

This code is bad for two reasons...

Probably the most egregious issue is that the code, instead of using the Fail method as we do everywhere else, uses xunit's Assert. This means that when a difference is detected, it fails immediately and stops the test altogether. This is extremely bad: a given test may produce many files to be baselined, and by failing on the first one we lose crucial side information that could give us a more complete picture of what is going on.

This would be especially bad in the situation where we expect that the baselines will be changed, and are running the tests just to produce the new files -- instead of having to run the tests once, we must run them many times, just to get the files that are baselined.

The second reason is that the message produced by this check is totally uninformative.

Even if we wanted to use this InRange method (we don't), it's checking against a range centered around 0. So, for example, when I was working on MulticlassTreeFeaturizedLRTest, this line was failing, reporting that -6 was outside some small range centered around 0. But I couldn't just re-run the test with a breakpoint or something to figure out what was going wrong, because even if Visual Studio for Mac were detecting our tests (for some reason it doesn't, not sure why), the entire problem was that the test was highly non-deterministic -- it had taken me over 100 tries to even get that one repro, and a subsequent run would almost certainly succeed. Yet at the same time I had absolutely no idea what files were even involved.

So first I dug in and found what files would be produced in a successful run. Then I wrote a new test that would actually just compare these files (crucially, without trying to generate new ones!), while printing out which file failed. It turned out to be this one: MulticlassLogisticRegression-TrainTest-iris-tree-featurized-out.txt. It had this line:

L1 regularization selected 78 of 78 weights.

and it should have had this line:

L1 regularization selected 72 of 72 weights.

This whole process took many frustrating minutes.

So now I had found the reason for this -6: because, I guess, 72 - 78 = -6, and obviously letting me know that -6 was not very close to 0 was far, far more important than letting me know what file differed, and where. 😄

And, to top it all off, because it is an xunit assert, all the other files that would have been generated by this test (not just the train-test output but also, e.g., the CV comparison) were not generated, because the test was stopped right there before they could be produced. So on the whole, a very frustrating experience.
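
For what it's worth, a compare-only check of the kind described above amounts to something like the following sketch (the helper name and the console output are illustrative only, not the actual test infrastructure):

using System;
using System.IO;

// Sketch of a compare-only check: diff an already-existing output file against
// its baseline line by line and say where they differ, without regenerating
// anything. Names here are illustrative, not the real test helpers.
internal static class CompareOnlySketch
{
    public static void CompareToBaseline(string baselinePath, string outputPath)
    {
        string[] expected = File.ReadAllLines(baselinePath);
        string[] actual = File.ReadAllLines(outputPath);

        int shared = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
            {
                Console.WriteLine($"Mismatch in '{outputPath}' at line {i + 1}:");
                Console.WriteLine($"  baseline: {expected[i]}");
                Console.WriteLine($"  output:   {actual[i]}");
            }
        }

        if (expected.Length != actual.Length)
            Console.WriteLine(
                $"'{outputPath}' has {actual.Length} lines; baseline has {expected.Length}.");
    }
}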

So:

  1. This comparison should not use an xunit assert; rather, it should use the Fail mechanism, to allow the test to continue running once a difference is detected, since obviously many differences may be observed and in any event we do not want to stop files from being generated. (It can perhaps stop checking that one file, but it shouldn't stop generating files altogether.)

  2. The test message should be enhanced to show additional information (see the sketch after this list), specifically:

    • Which file had the difference,
    • Where in the file, e.g., line and offset?
    • If a numerical comparison, what were the two numbers that were compared? (Not just their difference, the actual numbers)
    • Possibly even some context of the two files (surrounding lines, characters?) to make things more obvious.
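
Putting both points together, the check could look something like the sketch below. The Fail helper here stands in for the existing non-throwing failure mechanism mentioned above; the names and signature are illustrative, not a concrete API proposal.

using System;

// Sketch of the proposed check: record a mismatch without throwing, so the
// test keeps running and later files still get generated, and include the
// file, position, and both actual numbers in the message. Illustrative only.
internal class BaselineCheckSketch
{
    private bool _failed;

    // Stand-in for the test infrastructure's non-throwing Fail mechanism.
    private void Fail(string message)
    {
        _failed = true;
        Console.WriteLine(message);
    }

    public bool CheckNumbersMatch(double expected, double actual, int digitsOfPrecision,
        double allowedVariance, string path, int lineNumber)
    {
        double delta = Math.Round(expected - actual, digitsOfPrecision);
        if (Math.Abs(delta) <= allowedVariance)
            return true;

        Fail($"Baseline mismatch in '{path}' at line {lineNumber}: " +
             $"expected {expected}, got {actual} (delta {delta}, allowed ±{allowedVariance}).");
        return false; // Caller may stop checking this one file, but should keep generating others.
    }
}
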
@TomFinley added the Build and test labels on Oct 31, 2018
@TomFinley (Contributor, Author)

Incidentally I see issue #218 and associated PR #1420, which might address this issue.


sfilipi commented Nov 2, 2018

Thanks for the detailed writeup @TomFinley

@ghost locked as resolved and limited conversation to collaborators on Mar 27, 2022