Clean up our auto-caching #1604

Closed · Zruty0 opened this issue Nov 12, 2018 · 8 comments

Labels: API (Issues pertaining the friendly API), usability (Smoothing user interaction or experience)

@Zruty0 (Contributor) commented Nov 12, 2018

Currently, some of our trainers cache the data prior to training, with no way to disable that.

I believe a good incremental step would be to disable all auto-caching and rely on the user to call AppendCacheCheckpoint prior to multi-pass training.

This is not really ideal, since the default setup for multi-pass trainers will train slower. I still think it is better to have a consistent story about our 'smarts' (that is, we have no auto-normalization, no auto-caching, and no auto-calibration), and use extensive documentation (and tooling, in the future) to cover these pitfalls.
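For illustration, a minimal sketch of what that explicit opt-in looks like. This issue predates the 1.x API, so the names below (Sdca, LoadFromTextFile, AppendCacheCheckpoint) are from the surface that later shipped; the HousingData type, column names, and file path are invented for this sketch:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input schema for this sketch.
public class HousingData
{
    [LoadColumn(0)] public float Size { get; set; }
    [LoadColumn(1)] public float Rooms { get; set; }
    [LoadColumn(2)] public float Price { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();
        IDataView trainData = mlContext.Data.LoadFromTextFile<HousingData>("housing.txt");

        var pipeline = mlContext.Transforms
            .Concatenate("Features", nameof(HousingData.Size), nameof(HousingData.Rooms))
            // The user-requested cache checkpoint: rows produced by the transforms
            // above are materialized once, so the multi-pass SDCA trainer below does
            // not re-read and re-transform the text file on every training pass.
            .AppendCacheCheckpoint(mlContext)
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: nameof(HousingData.Price)));

        ITransformer model = pipeline.Fit(trainData);
    }
}
```

Without the AppendCacheCheckpoint call, the same pipeline still trains; it just re-pulls the data from the source on every pass.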

cc @GalOshri @TomFinley @eerhardt

@justinormont (Contributor) commented:
I would prefer we make a good model by default for the user, with auto-normalization & auto-calibration.

Auto-caching is simply a matter of speed, not final model quality, but it is closely tied to overall user happiness.

@Zruty0 (Contributor, Author) commented Nov 13, 2018

This is not the first time we have had this argument.
Again, the reason we don't want to make these auto-smarts part of the core API is that they sometimes make mistakes, and sometimes costly ones:

  • For auto-caching, we may blow up the machine's memory by caching the training data, where a non-cached training would have succeeded (if perhaps more slowly).
  • Auto-caching assumes that the original training data is slow to access, but if it's a memory-backed dataset (or another cache), this is not true. So auto-caching may make training slower AND consume more memory (see the sketch at the end of this comment).
  • Auto-calibration happens on the training set. We assume that most of the time this is OK, given that we only learn 2 parameters, but we have already seen cases where model quality degrades because of it.
  • Auto-normalization may normalize data that is otherwise pretty regular, potentially making the model larger and training slower.
  • Auto-normalization also has the potential for user confusion: the user thinks they merely trained a linear model, but in reality they trained a pair of models (a normalizer followed by the linear model).

Because of the above, we don't want any of these smarts to be part of the core ML.NET API. We need to expose APIs to normalize, cache, and calibrate at the user's request. Our existing smarts can be converted into tooling (VS code analyzer warnings etc.).
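To illustrate the memory-backed point from the list above, a sketch (the Point type and values are invented; LoadFromEnumerable is the in-memory loader from the 1.x API): data loaded from an in-memory enumerable is already cheap to re-iterate, so an automatically inserted cache would only duplicate it.

```csharp
using System.Collections.Generic;
using Microsoft.ML;

// Hypothetical row type for this sketch.
public class Point
{
    public float X;
    public float Y;
}

public static class Demo
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // The rows already live in managed memory; iterating the IDataView just
        // walks the list. An auto-inserted cache here would copy the same rows
        // into a second in-memory store: extra time and roughly double the
        // memory, with no speedup for multi-pass trainers.
        var points = new List<Point>
        {
            new Point { X = 1f, Y = 2f },
            new Point { X = 3f, Y = 4f },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(points);
        // With explicit opt-in, the user simply skips AppendCacheCheckpoint here.
    }
}
```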

@justinormont justinormont added API Issues pertaining the friendly API usability Smoothing user interaction or experience labels Nov 13, 2018
@TomFinley (Contributor) commented:
This feeds back into a general principle (not just for ML.NET) that APIs are best explicit, not implicit. Tools can get away with implicit behavior, APIs should not. Programming against an unfamiliar API is hard enough without having to worry about the API essentially rewriting your program for you and doing things you didn't ask for because it "knows better." Like @Zruty0 I'm actually a little surprised we are still having this argument.

@sfilipi (Member) commented Nov 21, 2018

Checking on the implementation mechanism: will we have configs for normalization/calibration/caching (or do we already have them through TrainContext/TrainInfo), with the default for normalization, calibration, and caching being off?

@Zruty0 (Contributor, Author) commented Nov 21, 2018

No, we just remove all auto-caching, calibration and normalization, period.

Users will be responsible for normalizing the data if needed (via mlContext.Normalize), caching in memory if desired (via mlContext.Data.Cache or pipeline.AppendCacheCheckpoint), and calibrating if desired (after #1622 is done).
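A sketch of those explicit calls, using the names these operations eventually got in the 1.x API (NormalizeMinMax standing in for the mlContext.Normalize mentioned above, AveragedPerceptron as an example of an uncalibrated multi-pass trainer, and the Calibrators catalog for the post-#1622 calibration step); the Row type, column values, and pipeline are invented for illustration:

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type for this sketch.
public class Row
{
    public bool Label;
    [VectorType(2)] public float[] Features;
}

public static class ExplicitSmarts
{
    public static void Main()
    {
        var mlContext = new MLContext();
        IDataView trainData = mlContext.Data.LoadFromEnumerable(new List<Row>
        {
            new Row { Label = true,  Features = new[] { 0.1f, 0.2f } },
            new Row { Label = false, Features = new[] { 0.9f, 0.8f } },
        });

        // Normalization happens only because the user asked for it.
        var pipeline = mlContext.Transforms.NormalizeMinMax("Features")
            // Caching happens only because the user asked for it; the eager
            // IDataView-level alternative is mlContext.Data.Cache(trainData).
            .AppendCacheCheckpoint(mlContext)
            // AveragedPerceptron is multi-pass and its raw output is uncalibrated,
            // so it can benefit from both the cache and a calibrator.
            .Append(mlContext.BinaryClassification.Trainers.AveragedPerceptron());

        ITransformer model = pipeline.Fit(trainData);

        // Calibration happens only because the user asked for it: the calibrator
        // estimator is fit on the scored training data.
        var scored = model.Transform(trainData);
        var calibrator = mlContext.BinaryClassification.Calibrators.Platt().Fit(scored);
    }
}
```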

@GalOshri (Contributor) commented:
Would it be feasible to provide some documentation/hints on when normalization/caching/calibration are important? For example, if a learner today is configured to add normalization, should we update the docs for that learner to suggest that normalization is important? Or perhaps just explain in the docs on normalization in which situations it might be important.

/cc @JRAlexander

@Zruty0 (Contributor, Author) commented Nov 21, 2018

> For example, if a learner today is configured to add normalization,

We already changed that long ago: no learners add normalization, or calibration, anymore. This work item is to also remove auto-caching; everything else is already gone.

The cookbook has a section on normalization.

@wschin self-assigned this Nov 28, 2018
@wschin (Member) commented Dec 3, 2018

It looks like neither mlContext.Data.Cache nor pipeline.AppendCacheCheckpoint works with the dynamic pipeline (the only examples I can find are in CachingTests.cs). Do we have any caching mechanism for the static world?
