Investigate the build issues, focusing on tests #1471
We do have a set of builds that should in principle always succeed, kicked off after every merge into `master`. You can see them with the above filters (not sure how to provide a link to the query). ALL of these runs "should" have passed, but didn't for one reason or another. These runs weren't against a PR that had failures; they were against the then-current `master`. There is also an "Analytics" tab where you can see which tests fail the most, and you can slice the results by the same filters. I assume anything less than a 90% pass rate is not acceptable (looking at you, LightGBM tests).
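As a concrete, hypothetical way to pull that list programmatically, here is a minimal sketch that queries the Azure DevOps Build REST API for completed builds of `master` that nevertheless failed. The organization and project values are placeholders, not the project's actual CI settings.

```csharp
// Minimal sketch: list completed-but-failed builds of master via the
// Azure DevOps Build REST API. Organization/project below are placeholders.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class FailedMasterBuilds
{
    static async Task Main()
    {
        const string org = "<organization>";   // placeholder
        const string project = "<project>";    // placeholder

        // Completed builds of master whose result was "failed".
        string url = $"https://dev.azure.com/{org}/{project}/_apis/build/builds" +
                     "?branchName=refs/heads/master" +
                     "&statusFilter=completed" +
                     "&resultFilter=failed" +
                     "&api-version=5.1";

        using (var client = new HttpClient())
        {
            // Public projects can be read anonymously; otherwise attach a PAT header here.
            string json = await client.GetStringAsync(url);
            Console.WriteLine(json); // inspect build ids, definitions, finish times, etc.
        }
    }
}
```

Dumping the raw JSON is enough to start cataloging build ids, definitions, and finish times for the runs that "should" have passed.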
Great, thanks @eerhardt, I wasn't quite sure where to see this information. I am opening specific issues to track each of these investigations, but I will link back to your comment for the "cataloging" issue. In fact, I may as well write that issue right now...
Starting to take note of tests that hang up:
Here are instructions for reproducing the Hosted Mac environment for debugging test failures locally: https://dev.azure.com/aifx/public/_settings/agentqueues?queueId=5&_a=agents
This might be helpful with #1506.
We did improve the tests last year to improve the pass rate. Closing this.
At the time of writing, our build system is plagued by a large number of failing tests and other build issues. This impacts our agility, since an otherwise valid PR can fail the test checks for spurious reasons that have nothing to do with the change, and it in turn leads to a significant waste of resources. The goal is to reduce this spurious failure rate.
However, we are somewhat vexed by a lack of information on why these test failures occur. In particular, trying to reproduce test failures locally has, at least in my experience, very limited success. For example, in my own investigation into the random failures of `MulticlassTreeFeaturizedLRTest` on macOS debug, I was only able to achieve a test failure twice out of some hundreds of runs on a MacBook, and what information I was able to gather was limited. In the seeming absence of the ability to reliably reproduce test failures outside of the build machines, we need more information.
- Publish the test logs as an artifact of the build so that we can gather more information. Random build failures: Publish the test logs #1473.
- Make the error messages from tests, when failures do occur, contain some actually useful information (a sketch of what this could look like follows below). Random build failures: Make test failure output on numerical comparisons semi-useful #1477.
- Create a catalog of failures that occur in builds that in principle should have succeeded (e.g., builds of `master`). This is partially to validate the assumption that tests are the primary problem, as well as to get a sense of which tests are problematic. Random build failures: Catalog the failures #1474.

The preceding is purely information gathering, but at the same time there are some positive steps that can be taken, pending the above.
We already know of some troublesome tests. These should be investigated for the "usual suspects," e.g., failure to set random seeds to a fixed value, a variable number of threads in training processes, and so on, which are known but innocent sources of run-to-run variance. A sketch of pinning these down follows below.
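As an illustration of pinning down those usual suspects, here is a minimal sketch assuming the ML.NET 1.x API surface; the data, column names, and exact option names are illustrative, and the repository's own tests may use different trainers and option spellings.

```csharp
// Sketch: remove the two common sources of run-to-run variance in a training
// test by fixing the MLContext seed and forcing single-threaded training.
using System;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;

class Row
{
    public bool Label { get; set; }
    [VectorType(2)]
    public float[] Features { get; set; }
}

class DeterministicTrainingSketch
{
    static void Main()
    {
        // Fixed seed so components that draw random numbers are reproducible.
        var ml = new MLContext(seed: 1);

        var data = ml.Data.LoadFromEnumerable(new[]
        {
            new Row { Label = true,  Features = new float[] { 1f, 0f } },
            new Row { Label = false, Features = new float[] { 0f, 1f } },
            new Row { Label = true,  Features = new float[] { 1f, 1f } },
            new Row { Label = false, Features = new float[] { 0f, 0f } },
        });

        // Single-threaded training removes nondeterminism from parallel updates.
        var trainer = ml.BinaryClassification.Trainers.SdcaLogisticRegression(
            new SdcaLogisticRegressionBinaryTrainer.Options
            {
                LabelColumnName = "Label",
                FeatureColumnName = "Features",
                NumberOfThreads = 1,
            });

        var model = trainer.Fit(data);
        Console.WriteLine("Trained deterministically (seed = 1, threads = 1).");
    }
}
```

Fixing the `MLContext` seed and forcing single-threaded training are the two knobs most likely to turn an intermittently failing numerical comparison into a deterministic one.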
That the tests seem to fail so readily on the build machines, yet are vexingly difficult to make fail locally, suggests that something about the build environment is different: perhaps a different architecture or different performance characteristics surface timing issues or race conditions that are simply not observed on our more performant developer machines. It may therefore be worthwhile to reproduce the test environment machines exactly (down to the environment, processor, memory, everything) to see whether that yields any clues.
Most vague, but still useful: the nature of the failures, while mysterious, has not been entirely devoid of clues as to potential causes. I may write more about them in a comment later.
/cc @Zruty0 @eerhardt