Clean up our auto-caching #1604

Closed · Zruty0 opened this issue Nov 12, 2018 · 8 comments

Labels: API (Issues pertaining the friendly API), usability (Smoothing user interaction or experience)

@Zruty0 (Contributor) commented Nov 12, 2018

Currently, some of our trainers cache the data prior to training, with no way to disable that.

I believe a good incremental step would be to disable all auto-caching and rely on the user to call AppendCacheCheckpoint prior to multi-pass training.

This is not really ideal, since the default setup for multi-pass trainers will train slower. I still think it is better to have a consistent story about our 'smarts' (that is, we have no auto-normalization, no auto-caching, and no auto-calibration), and use extensive documentation (and tooling, in the future) to cover these pitfalls.
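For illustration, a minimal sketch of what that explicit opt-in looks like. This issue predates the 1.x API, so the names below (Sdca, LoadFromTextFile, AppendCacheCheckpoint) are from the surface that later shipped; the HousingData type, column names, and file path are invented for this sketch:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input schema for this sketch.
public class HousingData
{
    [LoadColumn(0)] public float Size { get; set; }
    [LoadColumn(1)] public float Rooms { get; set; }
    [LoadColumn(2)] public float Price { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();
        IDataView trainData = mlContext.Data.LoadFromTextFile<HousingData>("housing.txt");

        var pipeline = mlContext.Transforms
            .Concatenate("Features", nameof(HousingData.Size), nameof(HousingData.Rooms))
            // The user-requested cache checkpoint: rows produced by the transforms
            // above are materialized once, so the multi-pass SDCA trainer below does
            // not re-read and re-transform the text file on every training pass.
            .AppendCacheCheckpoint(mlContext)
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: nameof(HousingData.Price)));

        ITransformer model = pipeline.Fit(trainData);
    }
}
```

Without the AppendCacheCheckpoint call, the same pipeline still trains; it just re-pulls the data from the source on every pass.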

cc @GalOshri @TomFinley @eerhardt

@justinormont (Contributor) commented:
I would prefer we make a good model by default for the user, with auto-normalization & auto-calibration.

Auto-caching is simply a matter of speed, not final model quality, but it is closely tied to overall user happiness.

@Zruty0 (Contributor, Author) commented Nov 13, 2018

This is not the first time we have had this argument.
Again, the reason we don't want to make these auto-smarts part of the core API is that they sometimes make mistakes, and sometimes costly ones:

  • For auto-caching, we may blow up the machine's memory by caching the training data, where a non-cached training would have succeeded (if perhaps more slowly).
  • Auto-caching assumes that the original training data is slow to access, but if it's a memory-backed dataset (or another cache), this is not true. So auto-caching may make training slower AND consume more memory (see the sketch at the end of this comment).
  • Auto-calibration happens on the training set. We assume that most of the time this is OK, given that we only learn 2 parameters, but we have already seen cases where model quality degrades because of it.
  • Auto-normalization may normalize data that is otherwise pretty regular, potentially making the model larger and training slower.
  • Auto-normalization also has the potential for user confusion: the user thinks they merely trained a linear model, but in reality they trained a pair of models (a normalizer followed by the linear model).

Because of the above, we don't want any of these smarts to be part of the core ML.NET API. We need to expose APIs to normalize, cache, and calibrate at the user's request. Our existing smarts can be converted into tooling (VS code analyzer warnings etc.).
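To illustrate the memory-backed point from the list above, a sketch (the Point type and values are invented; LoadFromEnumerable is the in-memory loader from the 1.x API): data loaded from an in-memory enumerable is already cheap to re-iterate, so an automatically inserted cache would only duplicate it.

```csharp
using System.Collections.Generic;
using Microsoft.ML;

// Hypothetical row type for this sketch.
public class Point
{
    public float X;
    public float Y;
}

public static class Demo
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // The rows already live in managed memory; iterating the IDataView just
        // walks the list. An auto-inserted cache here would copy the same rows
        // into a second in-memory store: extra time and roughly double the
        // memory, with no speedup for multi-pass trainers.
        var points = new List<Point>
        {
            new Point { X = 1f, Y = 2f },
            new Point { X = 3f, Y = 4f },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(points);
        // With explicit opt-in, the user simply skips AppendCacheCheckpoint here.
    }
}
```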

@justinormont justinormont added API Issues pertaining the friendly API usability Smoothing user interaction or experience labels Nov 13, 2018
@TomFinley (Contributor) commented:
This feeds back into a general principle (not just for ML.NET) that APIs are best explicit, not implicit. Tools can get away with implicit behavior, APIs should not. Programming against an unfamiliar API is hard enough without having to worry about the API essentially rewriting your program for you and doing things you didn't ask for because it "knows better." Like @Zruty0 I'm actually a little surprised we are still having this argument.

@sfilipi (Member) commented Nov 21, 2018

Checking on the implementation mechanism: will we have configs for normalization/calibration/caching (or do we already have them through TrainContext/TrainInfo), with the default for normalization, calibration, and caching being off?

@Zruty0 (Contributor, Author) commented Nov 21, 2018

No, we just remove all auto-caching, calibration and normalization, period.

Users will be responsible for normalizing the data if needed (via mlContext.Normalize), caching in memory if desired (via mlContext.Data.Cache or pipeline.AppendCacheCheckpoint), and calibrating if desired (after #1622 is done).
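A sketch of those explicit calls, using the names these operations eventually got in the 1.x API (NormalizeMinMax standing in for the mlContext.Normalize mentioned above, AveragedPerceptron as an example of an uncalibrated multi-pass trainer, and the Calibrators catalog for the post-#1622 calibration step); the Row type, column values, and pipeline are invented for illustration:

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type for this sketch.
public class Row
{
    public bool Label;
    [VectorType(2)] public float[] Features;
}

public static class ExplicitSmarts
{
    public static void Main()
    {
        var mlContext = new MLContext();
        IDataView trainData = mlContext.Data.LoadFromEnumerable(new List<Row>
        {
            new Row { Label = true,  Features = new[] { 0.1f, 0.2f } },
            new Row { Label = false, Features = new[] { 0.9f, 0.8f } },
        });

        // Normalization happens only because the user asked for it.
        var pipeline = mlContext.Transforms.NormalizeMinMax("Features")
            // Caching happens only because the user asked for it; the eager
            // IDataView-level alternative is mlContext.Data.Cache(trainData).
            .AppendCacheCheckpoint(mlContext)
            // AveragedPerceptron is multi-pass and its raw output is uncalibrated,
            // so it can benefit from both the cache and a calibrator.
            .Append(mlContext.BinaryClassification.Trainers.AveragedPerceptron());

        ITransformer model = pipeline.Fit(trainData);

        // Calibration happens only because the user asked for it: the calibrator
        // estimator is fit on the scored training data.
        var scored = model.Transform(trainData);
        var calibrator = mlContext.BinaryClassification.Calibrators.Platt().Fit(scored);
    }
}
```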

@GalOshri (Contributor) commented:
Would it be feasible to provide some documentation/hints on when normalization/caching/calibration are important? For example, if a learner today is configured to add normalization, should we update the docs for that learner to suggest that normalization is important? Or perhaps just explain in the docs on normalization in which situations it might be important.

/cc @JRAlexander

@Zruty0 (Contributor, Author) commented Nov 21, 2018

> For example, if a learner today is configured to add normalization,

We already changed that long ago: no learners add normalization, or calibration, anymore. This work item is to also remove auto-caching; everything else is already gone.

The cookbook has a section on normalization.

@wschin self-assigned this Nov 28, 2018
@wschin (Member) commented Dec 3, 2018

It looks like neither mlContext.Data.Cache nor pipeline.AppendCacheCheckpoint works with the dynamic pipeline (the only examples I can find are in CachingTests.cs). Do we have any caching mechanism for the static world?
