
Change the _maxCalibrationExamples default on CalibratorUtils #5415


Merged — 2 commits into dotnet:master on Sep 30, 2020

Conversation

antoniovs1029
Member

@antoniovs1029 antoniovs1029 commented Sep 30, 2020

As reported offline, ML.NET yielded different results than TLC when training a PlattCalibrator on the same dataset.

Upon further investigation, it turned out that this only happened on datasets with over 1 million rows, and the reason was that when the CalibratorUtils class was ported from TLC, a "_maxCalibrationExamples = 1000000" default parameter was added.

Reading through the code (in particular CalibratorTrainerBase's ProcessTrainingExample), it turns out that TLC called TrainCalibrator with maxRows = 0, which meant that when training the PlattCalibrator the whole dataset was scanned, but only 1M rows were selected at random to be added to the DataStore. In contrast, ML.NET called that same method with maxRows = 1M, which meant that only the first 1M rows were added to the DataStore (instead of randomly selecting them from the complete dataset). This caused bias and undesired results.
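The two behaviors described above can be sketched with a small reservoir-sampling example (a hypothetical simplification in Python; the function names are made up, and the real selection logic lives in ML.NET's Calibrator.cs):

```python
import random

def take_first(rows, max_rows):
    # ML.NET before this fix: stop reading after max_rows,
    # so only the first max_rows rows ever reach the DataStore.
    return rows[:max_rows]

def reservoir_sample(rows, capacity, seed=0):
    # TLC behavior (maxRows = 0): scan the whole dataset, but keep
    # only `capacity` rows chosen uniformly at random (reservoir sampling).
    rng = random.Random(seed)
    store = []
    for i, row in enumerate(rows):
        if len(store) < capacity:
            store.append(row)
        else:
            j = rng.randrange(i + 1)
            if j < capacity:
                store[j] = row
    return store

# A dataset whose label distribution drifts: the first half
# is all 0s, the second half all 1s.
rows = [0] * 10_000 + [1] * 10_000
biased = take_first(rows, 1_000)        # sees only the early 0s
sampled = reservoir_sample(rows, 1_000) # a mix of 0s and 1s
print(sum(biased), sum(sampled))
```

Truncating at the first N rows only ever sees the early part of the dataset, while the reservoir keeps a roughly uniform sample of the whole dataset — which is why the truncating version produced biased calibration.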

@antoniovs1029 antoniovs1029 requested a review from a team as a code owner September 30, 2020 08:18
@codecov

codecov bot commented Sep 30, 2020

Codecov Report

Merging #5415 into master will decrease coverage by 0.06%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5415      +/-   ##
==========================================
- Coverage   74.08%   74.02%   -0.07%     
==========================================
  Files        1019     1019              
  Lines      190355   190363       +8     
  Branches    20469    20469              
==========================================
- Hits       141033   140914     -119     
- Misses      43791    43905     +114     
- Partials     5531     5544      +13     
Flag Coverage Δ
#Debug 74.02% <ø> (-0.07%) ⬇️
#production 69.77% <ø> (-0.09%) ⬇️
#test 87.71% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/Microsoft.ML.Data/Prediction/Calibrator.cs 81.29% <ø> (+0.08%) ⬆️
...osoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs 83.60% <0.00%> (-7.27%) ⬇️
src/Microsoft.ML.Data/Training/TrainerUtils.cs 66.86% <0.00%> (-3.82%) ⬇️
...crosoft.ML.StandardTrainers/Standard/SdcaBinary.cs 85.23% <0.00%> (-3.33%) ⬇️
...crosoft.ML.StandardTrainers/Optimizer/Optimizer.cs 71.96% <0.00%> (-1.16%) ⬇️
...oft.ML.StandardTrainers/Standard/SdcaMulticlass.cs 91.46% <0.00%> (-1.03%) ⬇️
src/Microsoft.ML.Data/Utils/LossFunctions.cs 66.83% <0.00%> (-0.52%) ⬇️
...StandardTrainers/Standard/LinearModelParameters.cs 66.32% <0.00%> (-0.26%) ⬇️
test/Microsoft.ML.Functional.Tests/ONNX.cs 100.00% <0.00%> (ø)
test/Microsoft.ML.Tests/OnnxConversionTest.cs 96.17% <0.00%> (+<0.01%) ⬆️
... and 5 more

@@ -838,7 +838,9 @@ internal static object Create(IHostEnvironment env, ModelLoadContext ctx, object
internal static class CalibratorUtils
{
// maximum number of rows passed to the calibrator.
private const int _maxCalibrationExamples = 1000000;
// If 0, we'll actually look through the whole dataset
// when training the calibrator
Contributor


Earlier, you had explained to me that if this value is zero, we would look through the whole dataset, but still only use a million rows (randomly selected) for calibration. Is that correct?

Can you please clarify the exact behavior in comments?

Member Author


It is correct for PlattCalibrator, but the behavior differs depending on the calibrator. If this is 0, the only thing that happens for all the calibrators is that we'll look through the whole dataset. I explained this further in the other comment I left, so I don't think it's necessary to clarify it more here.

@@ -988,6 +990,10 @@ public static ICalibrator TrainCalibrator(IHostEnvironment env, IChannel ch, ICa
caliTrainer.ProcessTrainingExample(score, label > 0, weight);

if (maxRows > 0 && ++num >= maxRows)
// If maxRows was 0, we'll process all of the rows in the dataset.
// Notice that depending on the calibrator, "processing" might mean
// only using N random rows of the ones that were processed.
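Read as control flow, the guard means a maxRows of 0 never triggers the early exit, so every row in the dataset gets processed. A minimal Python sketch of that loop (hypothetical names; a list append stands in for ProcessTrainingExample):

```python
def train_calibrator(scores, max_rows=0):
    # Sketch of TrainCalibrator's row loop: process every example,
    # but stop early only when a positive max_rows cap is reached.
    num = 0
    processed = []
    for score in scores:
        processed.append(score)  # stands in for ProcessTrainingExample
        num += 1
        if max_rows > 0 and num >= max_rows:
            break
    return processed

print(len(train_calibrator(range(10))))     # no cap: all 10 rows
print(len(train_calibrator(range(10), 3)))  # cap of 3 rows
```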
Contributor


nit, typo: "depending on the calibrator"

Contributor

@harishsk harishsk left a comment


:shipit:

@antoniovs1029 antoniovs1029 merged commit 57be476 into dotnet:master Sep 30, 2020
frank-dong-ms-zz added a commit that referenced this pull request Oct 8, 2020
* Update to Onnxruntime 1.5.1 (#5406)

* Added variables to tests to control Gpu settings

* Added dependency to prerelease

* Updated to 1.5.1

* Remove prerelease feed

* Nit on GPU variables

* Change the _maxCalibrationExamples default on CalibratorUtils (#5415)

* Change the _maxCalibrationExamples default

* Improving comments

* Fix perf regression in ShuffleRows (#5417)

RowShufflingTransformer was using ChannelReader incorrectly. It needs to block while waiting for items to read, and it was calling Thread.Sleep in order to wait without spinning the current core. This caused a major perf regression.

The fix is to block synchronously in the correct way: call AsTask() on the ValueTask returned from the ChannelReader and block on that Task.

Fix #5416

Co-authored-by: Antonio Velázquez <[email protected]>
Co-authored-by: Eric Erhardt <[email protected]>
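The idea behind that fix — blocking on the producer instead of sleep-polling — can be sketched with a Python analogue in which queue.Queue stands in for the ChannelReader (names and timings are illustrative, not ML.NET's actual code):

```python
import queue
import threading
import time

def consume_polling(q, n):
    # Before the fix: poll with a Thread.Sleep-style delay,
    # wasting a wakeup every time the queue is momentarily empty.
    out = []
    while len(out) < n:
        try:
            out.append(q.get_nowait())
        except queue.Empty:
            time.sleep(0.001)  # analogous to Thread.Sleep
    return out

def consume_blocking(q, n):
    # After the fix: block until an item is available — the
    # analogue of AsTask() plus waiting on the returned Task.
    return [q.get(block=True) for _ in range(n)]

q = queue.Queue()
producer = threading.Thread(target=lambda: [q.put(i) for i in range(5)])
producer.start()
print(consume_blocking(q, 5))  # blocks, no sleep/poll cycles
producer.join()
```

The blocking consumer wakes up exactly when data arrives, while the polling one trades latency against wasted wakeups on every empty check.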
mstfbl pushed a commit to mstfbl/machinelearning that referenced this pull request Nov 12, 2020
mstfbl pushed a commit that referenced this pull request Nov 12, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022