Change the _maxCalibrationExamples default on CalibratorUtils #5415
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5415 +/- ##
==========================================
- Coverage 74.08% 74.02% -0.07%
==========================================
Files 1019 1019
Lines 190355 190363 +8
Branches 20469 20469
==========================================
- Hits 141033 140914 -119
- Misses 43791 43905 +114
- Partials 5531 5544 +13
@@ -838,7 +838,9 @@ internal static object Create(IHostEnvironment env, ModelLoadContext ctx, object
internal static class CalibratorUtils
{
    // maximum number of rows passed to the calibrator.
    private const int _maxCalibrationExamples = 1000000;
    // if 0, we'll actually look through the whole dataset
    // when training the calibrator
Earlier, you had explained to me that if this value is zero, we would look through the whole dataset, but still only use a million rows (randomly selected) for calibration. Is that correct?
Can you please clarify the exact behavior in comments?
It is correct for PlattCalibrator, but the behavior differs depending on the calibrator. If this is 0, the only thing that happens for all the calibrators is that we'll look through the whole dataset. I explained this further in the other comment I left, so I don't think it's necessary to clarify it more here.
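To make that concrete, here is a minimal, illustrative sketch (not the actual ML.NET/TLC code; the type and member names are invented) of how a calibrator-side store can "process" every example it is handed while only keeping a bounded, randomly chosen subset of them, which is the PlattCalibrator-style behavior described above:

using System;
using System.Collections.Generic;

// Hypothetical bounded store: every example is "processed", but at most
// `capacity` of them are kept, chosen uniformly at random over everything
// seen so far (classic reservoir sampling).
public sealed class BoundedRandomExampleStore
{
    private readonly int _capacity;
    private readonly Random _rand = new Random(1);
    private readonly List<(float Score, bool Label)> _reservoir = new List<(float Score, bool Label)>();
    private long _seen;

    public BoundedRandomExampleStore(int capacity) => _capacity = capacity;

    public void ProcessTrainingExample(float score, bool label)
    {
        _seen++;
        if (_reservoir.Count < _capacity)
        {
            _reservoir.Add((score, label));
            return;
        }
        // Keep the new example with probability capacity / seen, replacing a
        // random existing one, so the reservoir stays a uniform sample.
        long slot = (long)(_rand.NextDouble() * _seen);
        if (slot < _capacity)
            _reservoir[(int)slot] = (score, label);
    }

    public IReadOnlyList<(float Score, bool Label)> Examples => _reservoir;
}

Other calibrators may retain examples differently; as noted, the only behavior common to all of them when maxRows is 0 is that every row of the dataset gets scanned.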
@@ -988,6 +990,10 @@ public static ICalibrator TrainCalibrator(IHostEnvironment env, IChannel ch, ICa
    caliTrainer.ProcessTrainingExample(score, label > 0, weight);
    if (maxRows > 0 && ++num >= maxRows)
    // If maxRows was 0, we'll process all of the rows in the dataset
    // Notice that depending of the calibrator, "processing" means
    // only using N random rows of the ones that where processed
nit, typo: "depending on the calibrator"
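For reference, here is a simplified, self-contained version of the capped feeding loop in the hunk above (the interface and method here are stand-ins, not the real ML.NET signatures). It shows why maxRows = 0 means the whole dataset is processed: the early-exit condition maxRows > 0 && ++num >= maxRows can never become true.

using System.Collections.Generic;

// Stand-in for the calibrator trainer's example sink.
public interface IExampleSink
{
    void ProcessTrainingExample(float score, bool label, float weight);
}

public static class CalibrationFeedSketch
{
    // Feeds examples to the sink. maxRows == 0 means "no cap", so the loop
    // runs over the entire sequence; maxRows == N stops after the first N.
    public static void Feed(IEnumerable<(float Score, bool Label, float Weight)> examples,
                            IExampleSink sink, int maxRows)
    {
        int num = 0;
        foreach (var (score, label, weight) in examples)
        {
            sink.ProcessTrainingExample(score, label, weight);
            if (maxRows > 0 && ++num >= maxRows)
                break;
        }
    }
}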
* Update to Onnxruntime 1.5.1 (#5406)
* Added variables to tests to control Gpu settings
* Added dependency to prerelease
* Updated to 1.5.1
* Remove prerelease feed
* Nit on GPU variables
* Change the _maxCalibrationExamples default on CalibratorUtils (#5415)
* Change the _maxCalibrationExamples default
* Improving comments
* Fix perf regression in ShuffleRows (#5417)
RowShufflingTransformer is using ChannelReader incorrectly. It needs to block waiting for items to read and was Thread.Sleeping in order to wait, but not spin the current core. This caused a major perf regression. The fix is to block synchronously correctly - by calling AsTask() on the ValueTask that is returned from the ChannelReader and block on the Task.
Fix #5416
Co-authored-by: Antonio Velázquez <[email protected]>
Co-authored-by: Eric Erhardt <[email protected]>
As reported offline, ML.NET yielded different results than TLC when training a PlattCalibrator with the same dataset.
Upon further investigation, it turned out that this only happened on datasets with over 1 million rows, and the reason was that when porting the CalibratorUtils class from TLC, a "_maxCalibrationExamples = 1000000" default parameter was added.
Upon reading through the code (in particular CalibratorTrainerBase's ProcessTrainingExample), it turns out that on TLC, TrainCalibrator was called with maxRows = 0, so when training the PlattCalibrator the whole dataset was seen, but only 1M rows were randomly selected to be added to the DataStore. In contrast, on ML.NET that same method was called with maxRows = 1M, so only the first 1M rows were added to the DataStore (instead of randomly selecting them from the complete dataset). This caused bias and undesired results.
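To illustrate the bias described above, here is a toy, self-contained comparison (synthetic data, not ML.NET code): on a dataset that happens to be ordered by label, keeping only the first 1M rows produces a one-class calibration sample, whereas scanning everything and keeping a random 1M-row subset preserves the label balance.

using System;
using System.Linq;

public static class FirstNVersusRandomN
{
    public static void Main()
    {
        const int total = 2_000_000;
        const int cap = 1_000_000;

        // Ordered toy dataset: first half negative, second half positive.
        bool[] labels = Enumerable.Range(0, total).Select(i => i >= total / 2).ToArray();

        // Capped behavior: only the first `cap` rows reach the data store.
        double firstN = labels.Take(cap).Count(l => l) / (double)cap;

        // Scan-everything behavior (approximated by a shuffle): keep a random `cap`-row subset.
        var rand = new Random(1);
        double randomN = labels.OrderBy(_ => rand.Next()).Take(cap).Count(l => l) / (double)cap;

        Console.WriteLine($"positive rate, first {cap} rows : {firstN:P1}");   // ~0 %
        Console.WriteLine($"positive rate, random {cap} rows: {randomN:P1}");  // ~50 %
    }
}

Changing the default so that the whole dataset is looked through puts ML.NET on the second path, matching the TLC behavior described above.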