Remove auto-cache mechanism #1780

Merged
merged 16 commits into from
Dec 6, 2018
67 changes: 66 additions & 1 deletion docs/code/MlNetCookBook.md
@@ -443,10 +443,24 @@ var reader = mlContext.Data.TextReader(ctx => (
// Now read the file (remember though, readers are lazy, so the actual reading will happen when the data is accessed).
var trainData = reader.Read(trainDataPath);

// Sometimes, caching the data in memory after its first access can save loading time when the data is
// used several times. The caching mechanism is also lazy: it only caches data once it has been accessed.
// You could replace all subsequent uses of "trainData" with "cachedTrainData". We keep "trainData" here
// because a caching step, which provides the same functionality, will be inserted into the
// "learningPipeline" below.
var cachedTrainData = trainData.Cache();

// Step two: define the learning pipeline.

// We 'start' the pipeline with the output of the reader.
var learningPipeline = reader.MakeNewEstimator()
// We add a step for caching data in memory so that the downstream iterative training
// algorithm can efficiently scan through the data multiple times. Otherwise, the following
// trainer would read the data from disk on every pass. The caching mechanism uses an on-demand
// strategy: data accessed by any downstream step is cached from its first use onward. In general,
// you only need to add a caching step before a trainable step, because caching is not helpful if
// the data is scanned only once. This step can be removed if you don't have enough memory to hold
// the whole data set.
.AppendCacheCheckpoint()
// Now we can add any 'training steps' to it. In our case we want to 'normalize' the data (rescale to be
// between -1 and 1 for all examples)
.Append(r => (
@@ -486,13 +500,28 @@ var reader = mlContext.Data.TextReader(new TextLoader.Arguments
// Now read the file (remember though, readers are lazy, so the actual reading will happen when the data is accessed).
var trainData = reader.Read(trainDataPath);

// Sometimes, caching the data in memory after its first access can save loading time when the data is
// used several times. The caching mechanism is also lazy: it only caches data once it has been accessed.
// You could replace all subsequent uses of "trainData" with "cachedTrainData". We keep "trainData" here
// because a caching step, which provides the same functionality, will be inserted into the
// "dynamicPipeline" below.
var cachedTrainData = mlContext.Data.Cache(trainData);

// Step two: define the learning pipeline.

// We 'start' the pipeline with the output of the reader.
var dynamicPipeline =
// First 'normalize' the data (rescale to be
// between -1 and 1 for all examples)
mlContext.Transforms.Normalize("FeatureVector")
// We add a step for caching data in memory so that the downstream iterative training
// algorithm can efficiently scan through the data multiple times. Otherwise, the following
// trainer would read the data from disk on every pass. The caching mechanism uses an on-demand
// strategy: data accessed by any downstream step is cached from its first use onward. In general,
// you only need to add a caching step before a trainable step, because caching is not helpful if
// the data is scanned only once. This step can be removed if you don't have enough memory to hold
// the whole data set. Notice that the upstream Transforms.Normalize step scans through the data
// only once, so adding a caching step before it would not help.
.AppendCacheCheckpoint(mlContext)
// Add the SDCA regression trainer.
.Append(mlContext.Regression.Trainers.StochasticDualCoordinateAscent(label: "Target", features: "FeatureVector"));

@@ -595,6 +624,13 @@ var learningPipeline = reader.MakeNewEstimator()
r.Label,
// Concatenate all the features together into one column 'Features'.
Features: r.SepalLength.ConcatWith(r.SepalWidth, r.PetalLength, r.PetalWidth)))
// We add a step for caching data in memory so that the downstream iterative training
// algorithm can efficiently scan through the data multiple times. Otherwise, the following
// trainer would read the data from disk on every pass. The caching mechanism uses an on-demand
// strategy: data accessed by any downstream step is cached from its first use onward. In general,
// you only need to add a caching step before a trainable step, because caching is not helpful if
// the data is scanned only once.
.AppendCacheCheckpoint()
.Append(r => (
r.Label,
// Train the multi-class SDCA model to predict the label using features.
@@ -640,6 +676,8 @@ var dynamicPipeline =
mlContext.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
// Note that the label is text, so it needs to be converted to key.
.Append(mlContext.Transforms.Categorical.MapValueToKey("Label"), TransformerScope.TrainTest)
// Cache data in memory for the steps that follow this caching checkpoint.
.AppendCacheCheckpoint(mlContext)
// Use the multi-class SDCA model to predict the label using features.
.Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent())
// Apply the inverse conversion from 'PredictedLabel' column back to string value.
@@ -741,6 +779,7 @@ var trainData = mlContext.CreateStreamingDataView(churnData);

var dynamicLearningPipeline = mlContext.Transforms.Categorical.OneHotEncoding("DemographicCategory")
.Append(mlContext.Transforms.Concatenate("Features", "DemographicCategory", "LastVisits"))
.AppendCacheCheckpoint(mlContext) // FastTree will benefit from caching data in memory.
.Append(mlContext.BinaryClassification.Trainers.FastTree("HasChurned", "Features", numTrees: 20));

var dynamicModel = dynamicLearningPipeline.Fit(trainData);
@@ -757,6 +796,7 @@ var staticLearningPipeline = staticData.MakeNewEstimator()
.Append(r => (
r.HasChurned,
Features: r.DemographicCategory.OneHotEncoding().ConcatWith(r.LastVisits)))
.AppendCacheCheckpoint() // FastTree will benefit from caching data in memory.
.Append(r => mlContext.BinaryClassification.Trainers.FastTree(r.HasChurned, r.Features, numTrees: 20));

var staticModel = staticLearningPipeline.Fit(staticData);
@@ -813,6 +853,8 @@ var learningPipeline = reader.MakeNewEstimator()
// When the normalizer is trained, the below delegate is going to be called.
// We use it to memorize the scales.
onFit: (scales, offsets) => normScales = scales)))
// Cache the data in memory because the subsequent trainer needs to access it multiple times.
.AppendCacheCheckpoint()
.Append(r => (
r.Label,
// Train the multi-class SDCA model to predict the label using features.
@@ -987,6 +1029,10 @@ var catColumns = data.GetColumn(r => r.CategoricalFeatures).Take(10).ToArray();

// Build several alternative featurization pipelines.
var learningPipeline = reader.MakeNewEstimator()
// Cache data in memory in an on-demand manner. Columns used in any downstream step are
// cached in memory at their first use. This step can be removed if your machine doesn't
// have enough memory.
.AppendCacheCheckpoint()
.Append(r => (
r.Label,
r.NumericalFeatures,
@@ -1070,6 +1116,9 @@ var workclasses = transformedData.GetColumn<float[]>(mlContext, "WorkclassOneHot
var fullLearningPipeline = dynamicPipeline
// Concatenate two of the 3 categorical pipelines, and the numeric features.
.Append(mlContext.Transforms.Concatenate("Features", "NumericalFeatures", "CategoricalBag", "WorkclassOneHotTrimmed"))
// Cache data in memory so that the following trainer will be able to access training examples without
// reading them from disk multiple times.
.AppendCacheCheckpoint(mlContext)
// Now we're ready to train. We chose our FastTree trainer for this classification task.
.Append(mlContext.BinaryClassification.Trainers.FastTree(numTrees: 50));

@@ -1121,6 +1170,10 @@ var messageTexts = data.GetColumn(x => x.Message).Take(20).ToArray();

// Apply various kinds of text operations supported by ML.NET.
var learningPipeline = reader.MakeNewEstimator()
// Cache data in memory in an on-demand manner. Columns used in any downstream step are
// cached in memory at their first use. This step can be removed if your machine doesn't
// have enough memory.
.AppendCacheCheckpoint()
.Append(r => (
// One-stop shop to run the full text featurization.
TextFeatures: r.Message.FeaturizeText(),
@@ -1243,6 +1296,9 @@ var learningPipeline = reader.MakeNewEstimator()
Label: r.Label.ToKey(),
// Concatenate all the features together into one column 'Features'.
Features: r.SepalLength.ConcatWith(r.SepalWidth, r.PetalLength, r.PetalWidth)))
// Add a step for caching data in memory so that the downstream iterative training
// algorithm can efficiently scan through the data multiple times.
.AppendCacheCheckpoint()
.Append(r => (
r.Label,
// Train the multi-class SDCA model to predict the label using features.
@@ -1298,6 +1354,10 @@ var dynamicPipeline =
mlContext.Transforms.Concatenate("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
// Note that the label is text, so it needs to be converted to key.
.Append(mlContext.Transforms.Conversions.MapValueToKey("Label"), TransformerScope.TrainTest)
// Cache data in memory so that the SDCA trainer can randomly access training examples without
// reading them from disk multiple times. Data is cached at its first use in any downstream step.
// Notice that unused parts of the data may not be cached.
.AppendCacheCheckpoint(mlContext)
// Use the multi-class SDCA model to predict the label using features.
.Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent());

@@ -1439,6 +1499,7 @@ public static ITransformer TrainModel(MLContext mlContext, IDataView trainData)
Action<InputRow, OutputRow> mapping = (input, output) => output.Label = input.Income > 50000;
// Construct the learning pipeline.
var estimator = mlContext.Transforms.CustomMapping(mapping, null)
.AppendCacheCheckpoint(mlContext)
.Append(mlContext.BinaryClassification.Trainers.FastTree(label: "Label"));

return estimator.Fit(trainData);
@@ -1480,8 +1541,12 @@ public class CustomMappings
var estimator = mlContext.Transforms.CustomMapping<InputRow, OutputRow>(CustomMappings.IncomeMapping, nameof(CustomMappings.IncomeMapping))
.Append(mlContext.BinaryClassification.Trainers.FastTree(label: "Label"));

// If memory allows, we can cache the data in memory to avoid reading it from the file
// each time it is accessed.
var cachedTrainData = mlContext.Data.Cache(trainData);

// Train the model.
-var model = estimator.Fit(trainData);
+var model = estimator.Fit(cachedTrainData);

// Save the model.
using (var fs = File.Create(modelPath))
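For reference, here is a minimal end-to-end sketch combining the two caching idioms documented above. It is a sketch only: the data path, separator, and column layout are invented, and in practice you would typically pick just one of the two idioms.

var mlContext = new MLContext();

// Hypothetical schema: ten numeric feature columns followed by the regression target.
var reader = mlContext.Data.TextReader(new TextLoader.Arguments
{
    Separator = ",",
    Column = new[]
    {
        new TextLoader.Column("FeatureVector", DataKind.R4, new[] { new TextLoader.Range(0, 9) }),
        new TextLoader.Column("Target", DataKind.R4, 10)
    }
});
var trainData = reader.Read("regression-train.csv"); // lazy: nothing is read yet

// Idiom 1: eagerly wrap the loaded data view in an in-memory cache.
var cachedTrainData = mlContext.Data.Cache(trainData);

// Idiom 2: insert a caching checkpoint into the pipeline, after the one-pass
// normalizer and before the multi-pass SDCA trainer.
var pipeline = mlContext.Transforms.Normalize("FeatureVector")
    .AppendCacheCheckpoint(mlContext)
    .Append(mlContext.Regression.Trainers.StochasticDualCoordinateAscent(label: "Target", features: "FeatureVector"));

var model = pipeline.Fit(cachedTrainData);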
9 changes: 8 additions & 1 deletion docs/samples/Microsoft.ML.Samples/Dynamic/SDCA.cs
@@ -38,11 +38,18 @@ public static void SDCA_BinaryClassification()
// Read the data
var data = reader.Read(dataFile);

// ML.NET doesn't cache the data set by default. Therefore, if one reads a data set from a file and accesses it many times,
// it can be slow due to expensive featurization and disk operations. When the data fits into memory, a solution is to cache
// it in memory. Caching is especially helpful for iterative algorithms that need many passes over the data; since SDCA is
// such an algorithm, we cache here. Inserting a cache step directly into a pipeline is also possible; see the construction
// of the pipeline below.
data = mlContext.Data.Cache(data);

// Step 2: Pipeline
// Featurize the text column through the FeaturizeText API.
// Then append a binary classifier, setting the "Label" column as the label of the dataset, and
// the "Features" column produced by FeaturizeText as the features column.
// the "Features" column produced by FeaturizeText as the features column.
var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
.AppendCacheCheckpoint(mlContext) // Add a data-cache step within a pipeline.
.Append(mlContext.BinaryClassification.Trainers.StochasticDualCoordinateAscent(labelColumn: "Sentiment", featureColumn: "Features", l2Const: 0.001f));

// Step 3: Run Cross-Validation on this pipeline.
14 changes: 14 additions & 0 deletions src/Microsoft.ML.Data/StaticPipe/DataView.cs
@@ -8,6 +8,7 @@
using Microsoft.ML.StaticPipe.Runtime;
using System.Collections.Generic;
using System;
using System.Linq;

namespace Microsoft.ML.StaticPipe
{
@@ -23,6 +24,19 @@ internal DataView(IHostEnvironment env, IDataView view, StaticSchemaShape shape)
AsDynamic = view;
Shape.Check(Env, AsDynamic.Schema);
}

/// <summary>
/// This function returns a <see cref="DataView{TShape}"/> whose columns are all cached in memory.
/// The returned <see cref="DataView{TShape}"/> is almost identical to the source <see cref="DataView{TShape}"/>;
/// the only differences are cache-related properties.
/// </summary>
public DataView<TShape> Cache()
{
// Generate all column indexes in the source data.
var prefetched = Enumerable.Range(0, AsDynamic.Schema.ColumnCount).ToArray();
// Create a cached version of the source data by caching all columns.
return new DataView<TShape>(Env, new CacheDataView(Env, AsDynamic, prefetched), Shape);
}
}

public static class DataViewExtensions
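A hypothetical usage sketch for the new statically-typed Cache() method above (the file name and column layout are invented for illustration):

var reader = mlContext.Data.TextReader(ctx => (
    Label: ctx.LoadBool(0),
    Features: ctx.LoadFloat(1, 9)));
var data = reader.Read("train.tsv"); // lazy: nothing is read yet

// All columns of 'cachedData' are materialized in memory on first access;
// subsequent passes over the data are served from memory instead of disk.
var cachedData = data.Cache();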
9 changes: 9 additions & 0 deletions src/Microsoft.ML.Data/StaticPipe/Estimator.cs
@@ -77,5 +77,14 @@ string NameMap(PipelineColumn col)
return new Estimator<TInShape, TNewOutShape, ITransformer>(Env, est, _inShape, newOut);
}
}

/// <summary>
/// Cache the data produced by this estimator in memory. This appends an extra caching
/// estimator to this estimator chain and returns the extended estimator.
/// </summary>
public Estimator<TInShape, TOutShape, ITransformer> AppendCacheCheckpoint()
{
return new Estimator<TInShape, TOutShape, ITransformer>(Env, AsDynamic.AppendCacheCheckpoint(Env), _inShape, Shape);
}
}
}
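A sketch of how the new AppendCacheCheckpoint() slots into a statically-typed pipeline, mirroring the churn example from the cookbook changes above (staticData, HasChurned, DemographicCategory, and LastVisits are the names used in that example):

var staticLearningPipeline = staticData.MakeNewEstimator()
    .Append(r => (
        r.HasChurned,
        Features: r.DemographicCategory.OneHotEncoding().ConcatWith(r.LastVisits)))
    // Everything upstream of the checkpoint runs once; the iterative FastTree
    // trainer below then scans the cached rows instead of recomputing them.
    .AppendCacheCheckpoint()
    .Append(r => mlContext.BinaryClassification.Trainers.FastTree(r.HasChurned, r.Features, numTrees: 20));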
7 changes: 2 additions & 5 deletions src/Microsoft.ML.Data/Training/TrainerEstimatorBase.cs
@@ -130,11 +130,8 @@ protected virtual void CheckLabelCompatible(SchemaShape.Column labelCol)
protected TTransformer TrainTransformer(IDataView trainSet,
IDataView validationSet = null, IPredictor initPredictor = null)
{
-var cachedTrain = Info.WantCaching ? new CacheDataView(Host, trainSet, prefetch: null) : trainSet;
Contributor:
As requested by @GalOshri in the issue, can we add documentation? Currently the user has no way of knowing whether a specific learner already does its own form of caching, or won't benefit from caching. In line with @GalOshri's request, I think this documentation should be required before making this change.

Contributor:
Let's change the appropriate cookbook samples to illustrate the new pattern with this little caching checkpoint.

In reply to: 237951473 [](ancestors = 237951473)

Contributor @justinormont (Nov 30, 2018):
Yes, updating the example code is a good first step. We should also create a direct list of the components that benefit from caching, along with when they benefit; for instance, "a LinearSVM when the number of iterations is greater than 1".

Another route is perhaps a VS checker that looks at Info.WantCaching and makes recommendations from there? #WontFix

Member Author @wschin (Dec 4, 2018):
A sample and some tests have been modified to use these caching functions. Every caching function now has at least one test. #Resolved

Member Author @wschin (Dec 5, 2018):
I don't think producing such a list is a small task. We need another PR and issue.

In reply to: 237953879 [](ancestors = 237953879)

Member Author @wschin (Dec 5, 2018):
OK, I will do it in the next iteration.

[Update] Done. Please take a look again. Thank you.

In reply to: 237951764 [](ancestors = 237951764,237951473)

Contributor:
Generally, I like to see documentation in the PR. This is all the more true when the user can be surprised by the change and not understand what's different.

Member Author @wschin:
Many usages have been added to the Cookbook.

In reply to: 238949665 [](ancestors = 238949665)

-var cachedValid = Info.WantCaching && validationSet != null ? new CacheDataView(Host, validationSet, prefetch: null) : validationSet;

-var trainRoleMapped = MakeRoles(cachedTrain);
-var validRoleMapped = validationSet == null ? null : MakeRoles(cachedValid);
+var trainRoleMapped = MakeRoles(trainSet);
+var validRoleMapped = validationSet == null ? null : MakeRoles(validationSet);

var pred = TrainModelCore(new TrainContext(trainRoleMapped, validRoleMapped, null, initPredictor));
return MakeTransformer(pred, trainSet.Schema);
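Following the review discussion above about knowing which learners benefit from caching, here is a hedged sketch of how a user might consult the trainer's own hint. It assumes the trainer publicly exposes the TrainerInfo referenced in this diff; the exact surface may differ between versions.

var trainer = mlContext.BinaryClassification.Trainers.StochasticDualCoordinateAscent();

// TrainerInfo.WantCaching is the same flag the removed auto-cache logic consulted.
if (trainer.Info.WantCaching)
{
    // Iterative learner: cache the data view or add AppendCacheCheckpoint before fitting.
}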
32 changes: 16 additions & 16 deletions test/BaselineOutput/Common/OVA/OVA-CV-iris-out.txt
@@ -21,35 +21,35 @@ Confusion table
PREDICTED || 0 | 1 | 2 | Recall
TRUTH ||========================
0 || 21 | 0 | 0 | 1.0000
-1 || 0 | 22 | 8 | 0.7333
+1 || 0 | 20 | 10 | 0.6667
2 || 0 | 0 | 28 | 1.0000
||========================
-Precision ||1.0000 |1.0000 |0.7778 |
-Accuracy(micro-avg): 0.898734
-Accuracy(macro-avg): 0.911111
-Log-loss: 0.372620
-Log-loss reduction: 65.736556
+Precision ||1.0000 |1.0000 |0.7368 |
+Accuracy(micro-avg): 0.873418
+Accuracy(macro-avg): 0.888889
+Log-loss: 0.393949
+Log-loss reduction: 63.775293

Confusion table
||========================
PREDICTED || 0 | 1 | 2 | Recall
TRUTH ||========================
0 || 29 | 0 | 0 | 1.0000
-1 || 0 | 18 | 2 | 0.9000
+1 || 0 | 19 | 1 | 0.9500
2 || 0 | 0 | 22 | 1.0000
||========================
-Precision ||1.0000 |1.0000 |0.9167 |
-Accuracy(micro-avg): 0.971831
-Accuracy(macro-avg): 0.966667
-Log-loss: 0.357704
-Log-loss reduction: 67.051654
+Precision ||1.0000 |1.0000 |0.9565 |
+Accuracy(micro-avg): 0.985915
+Accuracy(macro-avg): 0.983333
+Log-loss: 0.299620
+Log-loss reduction: 72.401815

OVERALL RESULTS
---------------------------------------
-Accuracy(micro-avg): 0.935283 (0.0365)
-Accuracy(macro-avg): 0.938889 (0.0278)
-Log-loss: 0.365162 (0.0075)
-Log-loss reduction: 66.394105 (0.6575)
+Accuracy(micro-avg): 0.929667 (0.0562)
+Accuracy(macro-avg): 0.936111 (0.0472)
+Log-loss: 0.346785 (0.0472)
+Log-loss reduction: 68.088554 (4.3133)

---------------------------------------
Physical memory usage(MB): %Number%
2 changes: 1 addition & 1 deletion test/BaselineOutput/Common/OVA/OVA-CV-iris-rp.txt
@@ -1,4 +1,4 @@
OVA
Accuracy(micro-avg) Accuracy(macro-avg) Log-loss Log-loss reduction /p Learner Name Train Dataset Test Dataset Results File Run Time Physical Memory Virtual Memory Command Line Settings
-0.935283 0.938889 0.365162 66.3941 AvgPer{lr=0.8} OVA %Data% %Output% 99 0 0 maml.exe CV tr=OVA{p=AvgPer{ lr=0.8 }} threads=- norm=No dout=%Output% data=%Data% seed=1 /p:AvgPer{lr=0.8}
+0.929667 0.936111 0.346785 68.08855 AvgPer{lr=0.8} OVA %Data% %Output% 99 0 0 maml.exe CV tr=OVA{p=AvgPer{ lr=0.8 }} threads=- norm=No dout=%Output% data=%Data% seed=1 /p:AvgPer{lr=0.8}
