Remove auto-cache mechanism #1780


Merged: 16 commits, Dec 6, 2018. Changes from 14 commits.
9 changes: 8 additions & 1 deletion docs/samples/Microsoft.ML.Samples/Dynamic/SDCA.cs
@@ -38,11 +38,18 @@ public static void SDCA_BinaryClassification()
// Read the data
var data = reader.Read(dataFile);

// ML.NET doesn't cache the data set by default. Therefore, if one reads a data set from a file and accesses it many times, it can be slow due to
// expensive featurization and disk operations. When the data can fit into memory, a solution is to cache it in memory. Caching is especially
// helpful when working with iterative algorithms which need many data passes. Since SDCA is such an algorithm, we cache here. Inserting a
// cache step in a pipeline is also possible; please see the construction of the pipeline below.
data = mlContext.Data.Cache(data);

// Step 2: Pipeline
// Featurize the text column through the FeaturizeText API.
// Then append a binary classifier, setting the "Label" column as the label of the dataset, and
// the "Features" column produced by FeaturizeText as the features column.
var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
.AppendCacheCheckpoint(mlContext) // Add a data-cache step within a pipeline.
.Append(mlContext.BinaryClassification.Trainers.StochasticDualCoordinateAscent(labelColumn: "Sentiment", featureColumn: "Features", l2Const: 0.001f));

// Step 3: Run Cross-Validation on this pipeline.
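As a companion to the sample above, a hedged sketch of how the cached data and the checkpointed pipeline might feed the cross-validation mentioned in Step 3. The fold count, label column argument, and the exact `CrossValidate` signature are assumptions, not taken from this diff:

```csharp
// Sketch only; assumes the v0.x-era MLContext API shown in this PR.
// "data" is the cached IDataView and "pipeline" the estimator chain from the
// sample above; numFolds and the CrossValidate signature are guesses.
var cvResults = mlContext.BinaryClassification.CrossValidate(
    data, pipeline, numFolds: 5, labelColumn: "Sentiment");

// Because "data" is cached, each fold re-reads featurized rows from memory
// instead of re-parsing and re-featurizing the text file on every pass.
```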
14 changes: 14 additions & 0 deletions src/Microsoft.ML.Data/StaticPipe/DataView.cs
@@ -8,6 +8,7 @@
using Microsoft.ML.StaticPipe.Runtime;
using System.Collections.Generic;
using System;
using System.Linq;

namespace Microsoft.ML.StaticPipe
{
@@ -23,6 +24,19 @@ internal DataView(IHostEnvironment env, IDataView view, StaticSchemaShape shape)
AsDynamic = view;
Shape.Check(Env, AsDynamic.Schema);
}

/// <summary>
/// This function returns a <see cref="DataView{TShape}"/> whose columns are all cached in memory.
/// The returned <see cref="DataView{TShape}"/> is almost identical to the source <see cref="DataView{TShape}"/>;
/// the only differences are cache-related properties.
/// </summary>
public DataView<TShape> Cache()
{
// Generate all column indexes in the source data.
var prefetched = Enumerable.Range(0, AsDynamic.Schema.ColumnCount).ToArray();
// Create a cached version of the source data by caching all columns.
return new DataView<TShape>(Env, new CacheDataView(Env, AsDynamic, prefetched), Shape);
}
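A possible usage of the new statically-typed `Cache()`. This is a hedged sketch: the reader construction and column shape are hypothetical, and only `Cache()` itself comes from this diff:

```csharp
// Hedged sketch of the static-pipe API; the loader schema and file name
// are placeholders.
var mlContext = new MLContext();
var reader = TextLoader.CreateReader(mlContext,
    c => (label: c.LoadBool(0), features: c.LoadFloat(1, 9)));
var data = reader.Read("data.tsv");

// Materialize all columns into an in-memory cache; downstream consumers that
// make many passes over the data (e.g. SDCA) then avoid repeated disk reads
// and re-featurization.
var cachedData = data.Cache();
```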
}

public static class DataViewExtensions
9 changes: 9 additions & 0 deletions src/Microsoft.ML.Data/StaticPipe/Estimator.cs
@@ -77,5 +77,14 @@ string NameMap(PipelineColumn col)
return new Estimator<TInShape, TNewOutShape, ITransformer>(Env, est, _inShape, newOut);
}
}

/// <summary>
/// Cache the data produced by this estimator in memory. This appends an extra caching
/// estimator to this estimator; the resulting estimator chain is returned.
/// </summary>
public Estimator<TInShape, TOutShape, ITransformer> AppendCacheCheckpoint()
{
return new Estimator<TInShape, TOutShape, ITransformer>(Env, AsDynamic.AppendCacheCheckpoint(Env), _inShape, Shape);
}
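A sketch of the matching static-pipe pattern. The surrounding estimator chain is hypothetical; `AppendCacheCheckpoint()` is the parameterless overload added in this diff:

```csharp
// Hedged sketch; "featurization" stands for any static-pipe estimator chain.
// Everything before the checkpoint is computed once and cached in memory.
var cachedPipeline = featurization.AppendCacheCheckpoint();
// An iterative trainer appended after this point reads each of its many
// passes from the in-memory cache rather than recomputing upstream steps.
```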
}
}
15 changes: 5 additions & 10 deletions src/Microsoft.ML.Data/Training/TrainerEstimatorBase.cs
@@ -132,21 +132,16 @@ protected virtual void CheckLabelCompatible(SchemaShape.Column labelCol)
protected TTransformer TrainTransformer(IDataView trainSet,
IDataView validationSet = null, IPredictor initPredictor = null)
{
var cachedTrain = Info.WantCaching ? new CacheDataView(Host, trainSet, prefetch: null) : trainSet;
Contributor: As requested by @GalOshri in the issue, can we add documentation? Currently the user has no way of knowing whether a specific learner already does its own form of caching, or won't benefit from caching. In line with @GalOshri's request, I think this documentation should be required before making this change.

Contributor: Let's change the appropriate cookbook samples to illustrate the new pattern with this little caching checkpoint thing.

In reply to: 237951473 [](ancestors = 237951473)

@justinormont (Contributor, Nov 30, 2018): Yes, updating the example code is a good first step. And we should create a direct list of the components which benefit from caching, along with when they benefit, for instance: "a LinearSVM when the number of iterations is greater than 1".

Another route is perhaps a VS checker which looks at Info.WantCaching and recommends from there? #WontFix

@wschin (Member, Author, Dec 4, 2018): A sample and some tests were modified to use those caching functions. Every caching function now has at least one test. #Resolved

@wschin (Member, Author, Dec 5, 2018): I don't think having a list is a small task. We need another PR and issue.

In reply to: 237953879 [](ancestors = 237953879)

@wschin (Member, Author, Dec 5, 2018): OK, I will do it in the next iteration.

[Update] Done. Please take a look again. Thank you.

In reply to: 237951764 [](ancestors = 237951764,237951473)

Contributor: Generally, I like to see documentation in the PR. This is especially true when the user can be surprised by the change and not understand what's different.

@wschin (Member, Author): Many uses are added to the Cookbook.

In reply to: 238949665 [](ancestors = 238949665)

var trainRoleMapped = MakeRoles(trainSet);

var trainRoles = MakeRoles(cachedTrain);

RoleMappedData validRoles;
RoleMappedData validRoleMapped;

if (validationSet == null)
validRoles = null;
validRoleMapped = null;
else
{
var cachedValid = Info.WantCaching ? new CacheDataView(Host, validationSet, prefetch: null) : validationSet;
validRoles = MakeRoles(cachedValid);
}
validRoleMapped = MakeRoles(validationSet);
@Ivanidzo4ka (Contributor, Dec 4, 2018): Just set it to null as default and change it only if the validation set is != null. #Resolved


var pred = TrainModelCore(new TrainContext(trainRoles, validRoles, null, initPredictor));
var pred = TrainModelCore(new TrainContext(trainRoleMapped, validRoleMapped, null, initPredictor));
return MakeTransformer(pred, trainSet.Schema);
}
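The behavioral change above, removing the implicit `Info.WantCaching` wrapping, means callers now opt in to caching explicitly. A hedged before/after sketch; variable names are placeholders:

```csharp
// Before this PR, TrainTransformer cached implicitly (simplified):
//   var cachedTrain = Info.WantCaching
//       ? new CacheDataView(Host, trainSet, prefetch: null)
//       : trainSet;
// i.e. caching happened silently whenever the trainer asked for it.

// After this PR, the user decides. "trainSet" and "trainer" are placeholders
// for any IDataView and any trainer estimator:
var cachedTrain = mlContext.Data.Cache(trainSet);
var model = trainer.Fit(cachedTrain);
```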

32 changes: 16 additions & 16 deletions test/BaselineOutput/Common/OVA/OVA-CV-iris-out.txt
@@ -21,35 +21,35 @@ Confusion table
PREDICTED || 0 | 1 | 2 | Recall
TRUTH ||========================
0 || 21 | 0 | 0 | 1.0000
1 || 0 | 22 | 8 | 0.7333
1 || 0 | 20 | 10 | 0.6667
2 || 0 | 0 | 28 | 1.0000
||========================
Precision ||1.0000 |1.0000 |0.7778 |
Accuracy(micro-avg): 0.898734
Accuracy(macro-avg): 0.911111
Log-loss: 0.372620
Log-loss reduction: 65.736556
Precision ||1.0000 |1.0000 |0.7368 |
Accuracy(micro-avg): 0.873418
Accuracy(macro-avg): 0.888889
Log-loss: 0.393949
Log-loss reduction: 63.775293

Confusion table
||========================
PREDICTED || 0 | 1 | 2 | Recall
TRUTH ||========================
0 || 29 | 0 | 0 | 1.0000
1 || 0 | 18 | 2 | 0.9000
1 || 0 | 19 | 1 | 0.9500
2 || 0 | 0 | 22 | 1.0000
||========================
Precision ||1.0000 |1.0000 |0.9167 |
Accuracy(micro-avg): 0.971831
Accuracy(macro-avg): 0.966667
Log-loss: 0.357704
Log-loss reduction: 67.051654
Precision ||1.0000 |1.0000 |0.9565 |
Accuracy(micro-avg): 0.985915
Accuracy(macro-avg): 0.983333
Log-loss: 0.299620
Log-loss reduction: 72.401815

OVERALL RESULTS
---------------------------------------
Accuracy(micro-avg): 0.935283 (0.0365)
Accuracy(macro-avg): 0.938889 (0.0278)
Log-loss: 0.365162 (0.0075)
Log-loss reduction: 66.394105 (0.6575)
Accuracy(micro-avg): 0.929667 (0.0562)
Accuracy(macro-avg): 0.936111 (0.0472)
Log-loss: 0.346785 (0.0472)
Log-loss reduction: 68.088554 (4.3133)

---------------------------------------
Physical memory usage(MB): %Number%
2 changes: 1 addition & 1 deletion test/BaselineOutput/Common/OVA/OVA-CV-iris-rp.txt
@@ -1,4 +1,4 @@
OVA
Accuracy(micro-avg) Accuracy(macro-avg) Log-loss Log-loss reduction /p Learner Name Train Dataset Test Dataset Results File Run Time Physical Memory Virtual Memory Command Line Settings
0.935283 0.938889 0.365162 66.3941 AvgPer{lr=0.8} OVA %Data% %Output% 99 0 0 maml.exe CV tr=OVA{p=AvgPer{ lr=0.8 }} threads=- norm=No dout=%Output% data=%Data% seed=1 /p:AvgPer{lr=0.8}
0.929667 0.936111 0.346785 68.08855 AvgPer{lr=0.8} OVA %Data% %Output% 99 0 0 maml.exe CV tr=OVA{p=AvgPer{ lr=0.8 }} threads=- norm=No dout=%Output% data=%Data% seed=1 /p:AvgPer{lr=0.8}
